Skip to content

Conversation

@mjc
Copy link
Owner

@mjc mjc commented Dec 5, 2025

Fix: Dashboard timeout handling and critical architecture improvements

Summary

This PR addresses dashboard query timeout issues and implements several critical improvements to database concurrency, code clarity, and system reliability.

Problem Statement

  • Dashboard queries timing out after 15s when SQLite is busy during high-concurrency operations
  • Encoder breaking single-worker design by spawning multiple processes
  • Videos with Opus/AV1 in filename still being analyzed instead of marked as encoded
  • Confusing function naming causing maintenance issues and bugs

Changes

1. Dashboard Query Timeout Improvements

  • Reduced timeout from 15s → 2s for faster failure detection
  • Added safe query wrappers that return default values on errors
  • Synchronous queries in test mode to prevent Ecto Sandbox ownership issues
  • Nested error handling for DBConnection.ConnectionError in exit tuples

Files: lib/reencodarr_web/live/dashboard_live.ex

2. Encoder Single-Worker Protection

  • Delegated to Encode GenServer instead of direct command execution
  • Removed 500+ lines of duplicated logic from Broadway pipeline
  • Enforces single concurrent encoding to prevent VMAF quality degradation

Files: lib/reencodarr/encoder/broadway.ex

3. Database Concurrency & Retry Logic

  • Created centralized retry module (Reencodarr.Core.Retry)
  • Exponential backoff with jitter (base 100ms, max 5 attempts)
  • Applied to critical operations:
    • Video upserts during sync (2631 videos in batches)
    • VMAF upserts during CRF search
    • State machine transitions
  • Formula: base = (2^attempt * 100ms); jitter = rand(base/2); backoff = base + jitter

Files:

  • lib/reencodarr/core/retry.ex (NEW)
  • lib/reencodarr/media/video_upsert.ex
  • lib/reencodarr/media/video_state_machine.ex
  • lib/reencodarr/ab_av1/crf_search.ex

4. Performance Monitor Idle Spam Fix

  • Only logs when throughput > 0 (prevents "consecutive improvements: 87" spam)
  • 30s adjustment interval for Broadway concurrency tuning

Files: lib/reencodarr/analyzer/broadway/performance_monitor.ex

5. Mediainfo Field Validation

  • Validates bitrate/width/height before :analyzed transition
  • Prevents validation errors like bitrate: {"can't be blank"}
  • Checks required fields in persist_analyzed_state

Files: lib/reencodarr/analyzer/broadway.ex

6. Filename-Based Codec Detection

  • Added has_opus_in_filename? check to decision logic
  • Marks videos as :encoded when "Opus" or "AV1" detected in filename
  • Skips unnecessary mediainfo analysis for already-encoded content

Files: lib/reencodarr/analyzer/broadway.ex

7. Function Naming Refactor (Major Code Quality)

Before (confusing):

  • transition_video_to_analyzed - actually decides :analyzed OR :encoded
  • determine_video_transition_decision - returns decision tuple
  • persist_analyzed_state - DB persistence
  • persist_encoded_state - DB persistence
  • Media.mark_as_analyzed - context wrapper
  • VideoStateMachine.mark_as_analyzed - state machine logic

After (clear):

  • decide_video_processing_path - decides which state path
  • check_encoding_requirements - pure logic, returns {:skip_encoding, reason} or {:needs_encoding, reason}
  • apply_processing_decision - single DB persistence function

Impact: Eliminates maintenance nightmare of 4+ similarly-named functions doing completely different things.

Files:

  • lib/reencodarr/analyzer/broadway.ex
  • test/reencodarr/analyzer/broadway/codec_detection_test.exs

Test Coverage

  • 877 tests, 0 failures
  • Updated all test assertions for new function names
  • Fixed test expectations (dashboard text changes)
  • Added comprehensive edge case tests for Media module
  • Added Rules module coverage tests

Performance Impact

  • Dashboard: 2s timeout prevents long-hanging queries
  • Sync: Retry logic handles 2631 videos without database busy errors
  • CRF Search: VMAF upserts retry on contention
  • Analyzer: Skips mediainfo for Opus/AV1 files (faster processing)

Breaking Changes

None - all changes are internal refactoring or improvements.

Migration Notes

No database migrations required. Retry logic is transparent to existing code.

Related Issues

Fixes issues with:

  • Dashboard timeouts during sync operations
  • Encoder concurrent process violations
  • Videos stuck in analysis with Opus/AV1 already present
  • Code maintainability and function naming confusion

mjc added 29 commits November 10, 2025 11:53
The dashboard polls regularly, so long timeouts are unnecessary and cause
DBConnection errors when SQLite is busy. This change:

- Sets dashboard queries to 2s timeout instead of default 15s
- Catches and silently ignores timeouts in dashboard updates
- Updates VideoQueries functions to accept optional timeout parameter
- Inlines count queries in dashboard to apply timeout directly

Since the dashboard polls every few seconds, skipping an update when the
database is busy is acceptable - the next poll will retry.
Wrap all dashboard queries with rescue/catch to handle timeouts gracefully:
- safe_query/1 returns 0 on timeout for count queries
- safe_query_list/1 returns [] on timeout for list queries
- Prevents error logging when SQLite is busy during sync operations

Dashboard will show default values during timeouts and update on next poll.
…protection

The encoder was running multiple concurrent processes because Broadway
was doing the encoding work directly instead of delegating to the
Encode GenServer. This caused duplicate encodings.

Changes:
- Simplified Encoder.Broadway to match CRF searcher pattern
- Delegate all encoding work to AbAv1.encode/1 (Encode GenServer)
- Remove direct port handling from Broadway (now in GenServer)
- Encode GenServer's available?/0 check ensures only one encode at a time
- Remove empty lock.ex file

This matches the CRF searcher architecture which works correctly.
Build Erlang 28.1.1 from source to get the latest bugfixes that
may address segfault issues seen with earlier OTP 28 builds.
- Prevents DB connection owner exit errors during test cleanup
- Async tasks in tests can outlive the LiveView process causing Sandbox issues
- Added ReencodarrWeb.TaskSupervisor for production async queries
- Removed unnecessary try/catch and Process.alive? checks
- Tests now run cleanly without connection warnings
- Fixed encoder test helper to use fully qualified module name
- Skip MP4 files temporarily due to compatibility issues
- Created SVG favicon with Sonarr/Radarr family aesthetic
- Purple gradient background with radar-style concentric circles
- Scanning arc sweep for dynamic feel
- AV1 triangle and OPUS wave symbols integrated
- Gold accent color matching Radarr theme
- Hooked up to root layout template
- Purple gradient background (from-purple-900 to-purple-800)
- Glassmorphism cards with backdrop blur and white/10 transparency
- Updated all text colors to white/purple shades for contrast
- Progress bars with gradient fills and glow effects
- Larger icon in header (16x16) with drop shadow
- Changed encoder color from green to amber (matching favicon gold)
- Updated sync service component styling
- All buttons now have shadow-lg for depth
Add comprehensive test suite for SharedQueries:

- case_insensitive_like/2: Database-agnostic LIKE queries
- videos_not_matching_exclude_patterns/1: Pattern filtering (small/large lists)
- videos_with_no_chosen_vmafs_query/0: VMAF cleanup queries
- aggregated_stats_query/0: Statistics calculations for dashboard

Coverage: 64.29% → 100% (+35.71%)
Tests: 4 → 20 tests
Use underscores for large number: 12345 → 12_345
- Fix delete_videos_with_path test to use wildcard pattern matching
- Fix delete_unchosen_vmafs test to correctly test videos without chosen VMAFs
- Fix list_chosen_vmafs test to set videos to :crf_searched state
- Fix mark_vmaf_as_chosen test to handle transaction return type
- All 74 Media tests now passing in async mode
…tions

- Add tests for queue operations: get_videos_for_crf_search, get_videos_needing_analysis, list_videos_by_estimated_percent, get_next_for_encoding_by_time, debug_encoding_queue_by_library
- Add tests for library operations: create_library with valid/invalid attrs
- Add tests for vmaf operations: create_vmaf with valid/invalid attrs
- Add tests for bulk operations: reset_all_failures
- Add test for test helpers: test_insert_path
- Fix bug in Media.add_existing_video_messages to handle {:error, :not_found} tuple from get_video_by_path
- Coverage improved from 40.60% to 58.60% (+18%)
…operations

- Add tests for update operations: update_video, update_library, update_vmaf
- Add tests for upsert operations: upsert_vmaf (create/update), batch_upsert_videos
- Add tests for reanalysis operations: force_reanalyze_video, debug_force_analyze_video, reset_failed_videos, reset_all_videos_for_reanalysis, reset_all_videos_to_needs_analysis
- Add tests for invalid audio operations: count_videos_with_invalid_audio_args, reset_videos_with_invalid_audio_args
- Add tests for deletion operations: delete_video, delete_library, delete_vmaf, delete_videos_with_nonexistent_paths
- Add tests for changeset operations: change_video, change_library, change_vmaf
- Add tests for video failure operations: get_video_failures, resolve_video_failures
- Skip test for reset_videos_with_invalid_audio_metadata (PostgreSQL-specific array_length)
- Fix credo warnings: replace length/1 with Enum.empty?/1
- Coverage improved from 58.60% to 70.18% (+11.58%)
- Total: 108 tests (23 new tests added)
- Replace PostgreSQL-specific array_length(?, 1) with SQLite json_array_length(?)
- Fix reset_videos_with_invalid_audio_metadata/0 to work with SQLite JSON arrays
- Unskip previously failing test - now passes with SQLite-compatible query
- This was a critical bug preventing the function from working on SQLite databases
…s, and diagnostics

- Added 19 new tests covering previously untested functions
- Coverage improved from 61.26% to 67.80% (+6.54%)
- Tests for list_videos_awaiting_crf_search, get_next_for_encoding
- Tests for delete_vmafs_for_video, mark_vmaf_as_chosen with string CRF
- Tests for upsert_vmaf savings calculations and state transitions
- Tests for bulk operations: reset_all_videos_for_reanalysis, reset_videos_for_reanalysis_batched
- Tests for reset_all_videos_to_needs_analysis, debug_force_analyze_video
- Tests for test_insert_path diagnostic function with various scenarios
- All 127 tests passing with comprehensive edge case coverage
- Added 3 tests for upsert_video/1: create, update, and error cases
- Tests verify upsert behavior on path conflict
- Coverage remains at 67.80% (130 tests total)
- Delegates like mark_as_reencoded already covered by VideoStateMachine tests
- Added 11 new tests for upsert_vmaf edge cases and savings calculations
- Tests for zero size videos, invalid percent values, percent over 100
- Tests for pre-existing savings, VMAF upsert conflicts
- Added edge case tests for bulk operations (failed videos, bitrate resets)
- Tests for debug_force_analyze_video field resets
- Tests for multiple video failures and failure resolution
- Tests for upsert_video timestamp preservation
- Coverage improved from 67.80% to 68.77% (+0.97%)
- Total: 141 tests passing
- Added error handling test for force_reanalyze_video with non-existent video
- Added test for delete_unchosen_vmafs empty case
- Added test for get_video_by_service_id type handling
- Coverage improved from 68.77% to 69.25% (+0.48%)
- Total: 144 tests passing
- Added 12 new tests for helper functions and transaction paths
- Tests for delete_videos_with_path wildcard patterns and empty matches
- Tests for reset_videos_with_invalid_audio_args edge cases
- Tests for reset_all_failures with no failures and multiple failures
- Tests for get_next_for_encoding_by_time sorting and empty cases
- Tests for list_videos_awaiting_crf_search filtering logic
- Tests for mark_vmaf_as_chosen with non-existent CRF
- Tests for get_videos_in_library filtering
- Tests for count_videos accuracy
- Coverage improved from 69.25% to 69.73% (+0.48%)
- Total: 156 tests passing
…y deduplication

- Remove expensive JSON fragment codec checks from CRF search queries
- Simplify queries to use indexed state column only (v.state == :analyzed)
- Add mark_as_crf_searching() and mark_as_encoding() to persist state transitions
- Call state transitions immediately when CRF search/encode starts
- Prevents Broadway infinite loops by marking videos before port opens
- Dashboard queries now timeout-safe and much faster
- Codec filtering happens at Rules.build_args stage instead of query time
- Created Reencodarr.Core.Retry module for centralized retry logic
- Exponential backoff: 100ms → 200ms → 400ms → 800ms → 1600ms
- Added jitter (up to 50% of base delay) to prevent thundering herd
- Applied retry logic to VideoUpsert.batch_upsert for sync operations
- Applied retry logic to VideoStateMachine.mark_as_analyzed
- Improved analyzer error reporting to include failure reasons
- Use centralized Retry module instead of custom requeue mechanism
- Exponential backoff with jitter for database busy errors
- Skips line after exhausting retries instead of blocking the pipeline
- Removes warning spam in logs during high-concurrency sync operations
- Added catch for nested DBConnection.ConnectionError in exit tuples
- Dashboard now gracefully handles client checkout timeouts during sync
- Returns default values (0 or []) instead of crashing when DB is under heavy load
- Prevents error logs when database connection pool is saturated
- Only log throughput metrics when there's recent activity (avg > 0)
- Prevents spam of 'consecutive improvements' when no batches are being processed
- Performance monitor continues to run but stays quiet when idle
- Refactored nested conditions to satisfy credo depth requirements
…yzed

- Check for bitrate, width, height before transition to analyzed state
- Prevents validation errors when videos have existing mediainfo in DB but missing required fields
- Skips transition and logs when fields are missing instead of failing repeatedly
- Videos with missing fields will be reprocessed through mediainfo pipeline
- Refactored to reduce nesting for credo compliance
- Check has_opus_in_filename? in determine_video_transition_decision
- Was missing from the decision tree causing Opus files to be marked as analyzed
- Add debug logging to track transition decisions
- transition_video_to_analyzed → decide_video_processing_path
  Makes it clear this function decides between :analyzed AND :encoded states

- determine_video_transition_decision → check_encoding_requirements
  Pure function that returns {:skip_encoding, reason} or {:needs_encoding, reason}

- persist_analyzed_state/persist_encoded_state → apply_processing_decision
  Single private function handling DB persistence based on decision

- Updated all tests to use new function names and correct assertions

This eliminates the confusing naming where multiple functions with similar
names (transition_video_to_analyzed, persist_analyzed_state,
Media.mark_as_analyzed, VideoStateMachine.mark_as_analyzed) did completely
different things, causing maintenance nightmares.
Copilot AI review requested due to automatic review settings December 5, 2025 19:41
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses critical architecture improvements focusing on database concurrency, timeout handling, and code clarity. The changes include introducing centralized retry logic with exponential backoff for SQLite "database busy" errors, refactoring the encoder Broadway pipeline to delegate to the Encode GenServer for proper single-worker protection, improving dashboard query resilience with shorter timeouts and safe query wrappers, and comprehensive function renaming in the analyzer for better code maintainability.

Key Changes:

  • Centralized retry module with exponential backoff for database busy errors
  • Dashboard query timeout reduced from 15s to 2s with safe error handling
  • Encoder Broadway delegation to Encode GenServer for single-worker enforcement
  • Function renaming in analyzer for code clarity (determine_video_transition_decisioncheck_encoding_requirements)

Reviewed changes

Copilot reviewed 27 out of 33 changed files in this pull request and generated no comments.

Show a summary per file
File Description
test/reencodarr_web/live/dashboard_v2_live_test.exs Updated test assertions from "Video Processing Dashboard" to "Processing Pipeline"
test/reencodarr_web/controllers/page_controller_test.exs Updated test assertion for dashboard content
test/reencodarr/rules_test.exs Added 400+ lines of comprehensive edge case tests for Rules module
test/reencodarr/media_test.exs Added 2100+ lines of comprehensive test coverage for Media module functions
test/reencodarr/media/video_upsert_test.exs Added edge case tests for video upsert operations
test/reencodarr/media/video_state_machine_test.exs Added validation and transition function tests
test/reencodarr/media/exclude_patterns_test.exs Added tests for query functions and pattern matching
test/reencodarr/analyzer/broadway/codec_detection_test.exs Updated test names and assertions for refactored functions
test/reencodarr/ab_av1/encode_test.exs New empty test file placeholder
priv/static/images/* Added new favicon SVG files and compressed versions
mix.lock Updated dependencies (ecto, gettext, phoenix_ecto, req)
lib/reencodarr_web/live/dashboard_live.ex Added timeout handling, async queries, safe query wrappers
lib/reencodarr_web/components/layouts/root.html.heex Added favicon link
lib/reencodarr/media/video_upsert.ex Integrated retry logic for batch operations
lib/reencodarr/media/video_state_machine.ex Added retry logic and new state transition functions
lib/reencodarr/media/video_queries.ex Added timeout options and simplified codec queries
lib/reencodarr/media.ex Updated for SQLite JSON array length and new state functions
lib/reencodarr/encoder/broadway.ex Refactored to delegate encoding to GenServer
lib/reencodarr/core/retry.ex New centralized retry module with exponential backoff
lib/reencodarr/application.ex Added ReencodarrWeb.TaskSupervisor
lib/reencodarr/analyzer/broadway.ex Renamed functions and added mediainfo field validation
lib/reencodarr/analyzer/processing/pipeline.ex Enhanced error message format with reasons
lib/reencodarr/analyzer/core/file_operations.ex Added MP4 file skip logic
lib/reencodarr/analyzer/broadway/performance_monitor.ex Fixed idle spam by only logging when throughput > 0
lib/reencodarr/ab_av1/encode.ex Mark video as encoding before starting port
lib/reencodarr/ab_av1/crf_search.ex Integrated retry logic for VMAF upserts
lib/reencodarr/ab_av1.ex Added MP4 file skip logic
config/dev.exs Added timeout and queue configuration
flake.nix Updated to build Erlang 28.1.1 from source
flake.lock Updated nixpkgs revision

@mjc mjc merged commit c6d069b into main Dec 5, 2025
7 checks passed
@mjc mjc deleted the fix/dashboard-query-timeout branch December 5, 2025 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants