@danielunderwood
Contributor

No description provided.

danielunderwood and others added 15 commits June 26, 2025 16:53
Add comprehensive development documentation for Claude Code instances
including setup commands, architecture overview, and key components.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add FastAPI, Pydantic for async API framework
- Update aiohttp to ^3.8.0 for better async compatibility
- Update async-timeout to ^4.0.0 for aiohttp compatibility
- Update yarl to ^1.17.0 for aiohttp compatibility
- Update typing-extensions to ^4.6.1 for Pydantic 2.x support
- Create basic FastAPI app with health endpoint and middleware
- Add migration tracking document

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update sentry-sdk to ^1.9.0 for FastAPI integration compatibility
- Implement hybrid routing system to direct traffic between FastAPI/Quart
- Migrate root endpoint (/) from Quart to FastAPI with identical functionality
- Add feature flag system for gradual endpoint migration
- Successfully test FastAPI app import and route detection
- Complete Phase 1 dual setup for incremental migration

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Allow Azure Pipelines to trigger on feature/* branches in addition
to master and develop, enabling builds for feature branch development.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Update Dockerfile to use uvicorn as ASGI server with the new hybrid
FastAPI/Quart application. Maintains backwards compatibility with
port 5001 and binding to all interfaces.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add structlog ^23.2.0 for improved logging capabilities and async
debugging. No dependency conflicts detected.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove manual StreamHandler additions from app.py and provider.py
- Remove conflicting logging.basicConfig() call from server.py
- Clean up unused logging imports
- Reduces duplicate log output when using structlog

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create logging configuration module with structlog integration
- Configure structlog with JSON logging for production and human-readable for debug
- Add async operation logging helper for timing and debugging
- Update FastAPI app to use structured logging
- Integrate with stdlib logging for backwards compatibility

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add slumber.yml for interactive API testing with profiles for local
development and production environments. Includes search endpoints
for testing the API functionality.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Rewrite logging configuration from scratch using structlog best practices:
- Clean JSON output for production (single-line, proper format)
- Human-readable colored output for development
- Proper processor chain following structlog documentation
- Includes: timestamp, level, logger, filename, function, line number
- Eliminates nested JSON and formatting issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Convert all logging.getLogger() calls to use structured logging:
- api.py: Convert to get_logger() and remove manual handlers
- cache.py: Convert to get_logger() and remove manual handlers
- crawler.py: Convert to get_logger() and remove manual handlers
- util.py: Convert to get_logger() and remove manual handlers
- hybrid_app.py: Convert to get_logger()
- provider.py: Convert to get_logger() and remove manual handlers

All modules now use centralized structured logging with:
- Clean JSON output with source location info
- Consistent timestamp and level formatting
- No duplicate log handlers or output
- Enhanced debugging context for async issues

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Created debug_async_operation decorator to instrument provider methods
- Added timing and operation logging for all major provider calls:
  * TheAudioDbProvider: get_artist_images, get_artist_overview, get_by_mbid
  * FanArtTvProvider: get_artist_images, get_album_images, get_by_mbid
  * MusicbrainzDbProvider: get_artists_by_id, get_release_groups_by_id
  * WikipediaProvider: get_artist_overview
  * SolrSearchProvider: search_artist_name, search_album_name
- Provides detailed async performance debugging with provider name, operation, timing, and error info
- Helps identify async bottlenecks and slow external API calls

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
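
For readers following along, the timing decorator described in the commit above could look roughly like this. This is a minimal sketch, not the code in the PR: the decorator signature, the logger name, and `ExampleProvider` are assumptions made for illustration, and it uses stdlib `logging` rather than structlog.

```python
import asyncio
import functools
import logging
import time

logger = logging.getLogger("provider.debug")  # hypothetical logger name


def debug_async_operation(operation: str):
    """Log provider name, operation, elapsed time, and any error for an async method."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            provider = type(self).__name__
            start = time.perf_counter()
            try:
                result = await func(self, *args, **kwargs)
            except Exception as exc:
                elapsed = time.perf_counter() - start
                logger.error("%s.%s failed after %.3fs: %r", provider, operation, elapsed, exc)
                raise  # instrumentation must not swallow provider errors
            elapsed = time.perf_counter() - start
            logger.debug("%s.%s finished in %.3fs", provider, operation, elapsed)
            return result

        return wrapper

    return decorator


class ExampleProvider:  # hypothetical stand-in for a real provider class
    @debug_async_operation("get_artist_overview")
    async def get_artist_overview(self, mbid: str) -> str:
        await asyncio.sleep(0.01)  # simulates a slow external API call
        return f"overview for {mbid}"
```

Re-raising after logging keeps the decorator transparent to callers, so existing error handling in the API layer is unaffected.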
- Add pydantic-settings dependency for type-safe configuration
- Create LoggingSettings with LOG_LEVEL and LOG_FORMAT environment variables
- Support log levels: debug, info, warning/warn, error, critical
- Support log formats: json, text
- Update logging_config.py to use new settings with backward compatibility
- Update fastapi_app.py to use new logging configuration
- Added Python version upgrade to todo list

Environment variables:
- LOG_LEVEL=debug|info|warning|error|critical (default: info)
- LOG_FORMAT=json|text (default: json)

Examples:
- LOG_LEVEL=debug LOG_FORMAT=text - Human readable debug logs
- LOG_LEVEL=info LOG_FORMAT=json - Production JSON logs

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
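
The environment handling described in the commit above can be sketched with the standard library alone. The PR itself uses pydantic-settings; the dataclass below is only a stand-in that shows the documented defaults and the `warn` alias. The `from_env` helper and the case-insensitive matching are assumptions, not confirmed behavior.

```python
import logging
import os
from dataclasses import dataclass

# Accepted values as listed in the commit message; "warn" is an alias for "warning".
_LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "warn": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}
_FORMATS = {"json", "text"}


@dataclass(frozen=True)
class LoggingSettings:
    level: int = logging.INFO
    format: str = "json"

    @classmethod
    def from_env(cls, env=None):
        """Build settings from LOG_LEVEL / LOG_FORMAT (hypothetical helper)."""
        env = os.environ if env is None else env
        raw_level = env.get("LOG_LEVEL", "info").strip().lower()
        raw_format = env.get("LOG_FORMAT", "json").strip().lower()
        if raw_level not in _LEVELS:
            raise ValueError(f"unsupported LOG_LEVEL: {raw_level!r}")
        if raw_format not in _FORMATS:
            raise ValueError(f"unsupported LOG_FORMAT: {raw_format!r}")
        return cls(level=_LEVELS[raw_level], format=raw_format)
```

Passing a plain dict as `env` makes the parsing easy to exercise in tests without touching the real process environment.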
- Document new LOG_LEVEL and LOG_FORMAT environment variables
- Provide examples for development, production, and troubleshooting
- Document provider debug instrumentation capabilities
- Include migration guide from legacy DEBUG configuration

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update certifi from 2022.12.7 to ^2024.0.0 for latest root certificates
- Add explicit SSL context with certifi CA bundle to all aiohttp sessions
- Fix SSL verification errors with external APIs (theaudiodb.com, fanart.tv, etc.)
- Update HttpProvider._get_session() to use ssl.create_default_context()
- Update all crawler.py aiohttp sessions with proper SSL configuration

Resolves SSL certificate verification errors:
"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate"

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@blampe

blampe commented Jun 28, 2025

@danielunderwood what problem is this solving?

The existing code is already compatible with the latest MB schema. The vanilla MB images are also compatible; forking isn't needed to make them work with only album/release indexing.

Why not update the sir container instead? It's still on the 2024 tag/schema.

@danielunderwood
Contributor Author

> @danielunderwood what problem is this solving?
>
> The existing code is already compatible with the latest MB schema. The vanilla MB images are also compatible; forking isn't needed to make them work with only album/release indexing.
>
> Why not update the sir container instead? It's still on the 2024 tag/schema.

The title here is a bit misleading. The larger goal here is to do some experimentation and add debugging (hence the logging changes) on the backend to try to identify and fix the main issue while working on a rewrite. FastAPI has been introduced because I suspect that something weird is going on with async, so I am trying to simplify the code and make the async pieces a bit more clear.

Running locally against our deployed MB mirror and solr, I don't encounter these issues, so debugging is difficult. The debug logging shows that solr connections are working well on search, so solr, which was one of my suspects, doesn't appear to be the issue.

The biggest actual fix so far is updating certifi in a0ab5ca, which resolved some SSL connection issues but did not resolve the overall problem.

@blampe

blampe commented Jun 28, 2025

@danielunderwood see my latest comment in lidarr-testing. Have you confirmed all containers have been updated? Because if sir is out of date things will obviously not work.

The stack is functional without any code changes (https://github.com/blampe/hearring-aid) so you can understand my skepticism.

I can give you the tags and commands to run to put things back into a good state.

danielunderwood and others added 12 commits July 5, 2025 14:23
- Switch from Alpine to Bookworm base image for better compatibility
- Implement multi-stage dependency installation to leverage Docker layer caching
- Copy pyproject.toml and poetry.lock first, then install dependencies
- Use Poetry 1.4.2 to avoid compatibility issues with requests 2.25.1
- Copy application code after dependency installation to maximize cache hits

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ions

- Fix get_artist_info_multi: Replace asyncio.gather with asyncio.wait for timeout support
- Add proper task-to-artist mapping to prevent KeyError on task completion
- Fix "not enough values to unpack" errors with result validation before unpacking
- Add comprehensive error handling for provider failures and empty results

- Fix get_album_search_results: Add timeout to prevent hanging album searches
- Replace asyncio.gather with asyncio.wait for proper timeout handling
- Add task cancellation and error logging for failed album search tasks

- Fix get_release_group_artists: Add return_exceptions and result validation
- Fix get_release_group_info_multi: Add timeout and proper async task handling
- Fix get_overview: Add error handling for providers returning empty results

- Add robust type checking and hasattr validation throughout
- Add detailed logging for debugging async operation failures
- Ensure all async operations have proper timeout and cancellation handling

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ction

- Add execute_async_tasks_with_timeout utility function to reduce code duplication
- Replace repeated asyncio.wait patterns with centralized error handling
- Refactor get_artist_info_multi to use utility for overview and image tasks
- Refactor get_album_search_results to use utility for search tasks
- Remove overly restrictive tuple size validation for flexibility
- Maintain same timeout behavior while significantly reducing code duplication

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…enhanced logging

- Convert /artist/<mbid> endpoint to use execute_async_tasks_with_timeout utility
- Convert /album/<mbid> endpoint to use execute_async_tasks_with_timeout utility
- Add detailed logging for timed out coroutines with function names
- Log specific coroutines being cancelled on timeout for better debugging
- Return 504 Gateway Timeout status when requests exceed 10 second timeout
- Maintain same functionality while preventing hanging requests

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add async_tracker.py module for operation tracking and hanging detection
- Add circuit_breaker.py module for external service failure protection
- Convert remaining asyncio.gather calls to use timeout utility in api.py
- Add async operation tracking to get_artist_info() with context logging
- Add /health/async endpoint to FastAPI app for real-time monitoring
- Replace hanging asyncio.gather in get_release_group_artists and image fetching
- Provide detailed visibility into timed out coroutines and circuit breaker stats

Improves reliability by:
- Preventing cascading failures with circuit breaker pattern
- Tracking all async operations with timing and context
- Identifying hanging operations in real-time
- Enhanced logging with operation names and failure details

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add operation_tracker import to app.py for future use
- Register /health/async endpoint in hybrid_app.py routing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Convert search_all() to use execute_async_tasks_with_timeout instead of asyncio.gather
- Convert search_fingerprint() to use timeout utility for album info gathering
- Add 20s timeout for combined artist/album search operations
- Add 15s timeout for fingerprint-based album searches
- Provide fallback empty results when search operations timeout

Fixes hanging search requests that were timing out without proper error handling.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Increase album info timeout from 10s to 20s for database-heavy operations
- Add async operation tracking to get_release_group_info and get_release_group_info_basic
- Provide better visibility into which part of album fetching is hanging

The previous 10s timeout was too short for album operations that involve multiple
database queries and artist lookups. The new 20s timeout with detailed tracking
will help identify exactly where operations are hanging.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add circuit breaker protection to get_release_groups_by_id database query
- Configure 15s timeout, 3 failure threshold, 30s recovery for database operations
- Prevent cascading failures when MusicBrainz database becomes unresponsive
- Circuit breaker will fail fast after 3 consecutive database timeouts

This addresses the hanging database queries that were causing 504 timeouts
in album info endpoints by providing fail-fast behavior and automatic recovery.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
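
A circuit breaker with the parameters named in the commit above (15s call timeout, 3-failure threshold, 30s recovery) could be sketched like this. It is an illustration under assumptions, not the contents of `circuit_breaker.py`: the class and method names, the injectable `clock`, and the simplified half-open behavior are all hypothetical.

```python
import asyncio
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 call_timeout=15.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.call_timeout = call_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    def is_open(self):
        if self._opened_at is None:
            return False
        # After the recovery window a trial call is allowed through.
        return self._clock() - self._opened_at < self.recovery_timeout

    async def call(self, func, *args, **kwargs):
        if self.is_open():
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = await asyncio.wait_for(func(*args, **kwargs), self.call_timeout)
        except Exception:
            # Timeouts and errors both count as consecutive failures.
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the breaker.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers get `CircuitOpenError` immediately instead of waiting on another database timeout, which is the fail-fast behavior the commit describes.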
…eanup

- Add database connection timeouts: 10s command_timeout and server-side timeouts
- Add background task to force-cleanup operations hanging >30s past their timeout
- Add manual cleanup endpoint POST /health/async/cleanup for emergency cleanup
- Add PostgreSQL server settings for statement_timeout and idle timeouts

Addresses the issue where operations were hanging for 7+ minutes by:
1. Adding strict database-level timeouts
2. Automatically cleaning up tracker entries for operations that exceed timeouts
3. Providing manual cleanup capability for debugging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…dling

- Add comprehensive Pydantic models in models.py for all API responses
- Update circuit_breaker.py to return structured CircuitBreakerInfo models
- Add proper return type annotations to all FastAPI endpoints
- Extract nested response fields into dedicated models (HangingOperationDetail, etc.)
- Add proper HTTP status codes for ArtistNotFoundException and ReleaseGroupNotFoundException
- Improve async tracker to distinguish expected vs unexpected exceptions
- Use ErrorResponse model for consistent error formatting

Benefits:
- Better API documentation through FastAPI's automatic schema generation
- Type safety and validation for all responses
- Consistent error response format
- Cleaner separation between expected exceptions (404) and actual errors (500)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Reorganize models into logical packages: base, monitoring, music
- Use pythonic naming with field aliases for API compatibility
- Migrate /artist/<mbid> endpoint from Quart to FastAPI with full typing
- Add UUID validation and proper error handling
- Maintain cache control headers and filtering compatibility
- Add artist endpoint to hybrid app routing

Model organization:
- models/base.py: ErrorResponse, HealthResponse, InfoResponse
- models/monitoring.py: Circuit breaker and async operation models
- models/music.py: Artist, Album models with proper aliases
- models/__init__.py: Convenient imports

Benefits:
- Better code organization and maintainability
- Full type safety with Pydantic models
- Automatic API documentation generation
- Consistent error handling and validation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
danielunderwood and others added 3 commits July 6, 2025 02:29
… BaseSettings

- Add async_settings.py with AsyncTimeoutSettings using pydantic-settings
- Replace all hardcoded timeouts with configurable values from settings
- Support environment variable overrides with ASYNC_TIMEOUT_ prefix
- Add operation-specific timeouts for different types of operations

Timeout configuration:
- ASYNC_TIMEOUT_ARTIST_INFO: Artist info + albums (default: 15s)
- ASYNC_TIMEOUT_ALBUM_INFO: Album info with DB queries (default: 20s)
- ASYNC_TIMEOUT_SEARCH_ALL: Combined search operations (default: 20s)
- ASYNC_TIMEOUT_DATABASE_QUERY: Database operations (default: 10s)
- ASYNC_TIMEOUT_EXTERNAL_API: External API calls (default: 10s)

Benefits:
- Environment-specific timeout tuning
- Type-safe configuration with Pydantic
- Centralized timeout management
- Easy debugging and performance optimization

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…iagnosis

Implement multi-level debugging strategy to identify and resolve persistent
/artist/<mbid> timeout issues:

**Database Performance Monitoring:**
- Enhanced query profiling with timing and slow query detection
- EXPLAIN ANALYZE for critically slow queries (>90% timeout threshold)
- Connection pool monitoring with acquisition metrics
- Database metrics available via /debug/database endpoint

**Granular Async Operation Tracking:**
- Step-by-step tracking of get_artist_info operations:
  - database_artist_lookup: Main DB query
  - artist_overviews_batch: Wikipedia/Wikidata fetching
  - artist_images_primary/secondary: Image provider calls
- Enhanced async tracker with operation context and timeouts
- Real-time hanging operation monitoring

**Provider-Specific Performance Analysis:**
- Enhanced debug_async_operation decorator with timeout categorization
- Provider operation timing with 70% threshold warnings
- Automatic timeout assignment based on operation type
- Provider context in all async operations

**Debug Endpoints:**
- /debug/artist/<mbid>: Comprehensive artist debugging info
- /debug/database: Real-time database performance metrics
- /debug/operations/hanging: Current hanging operations detail
- /health/async: Enhanced with circuit breaker and failure tracking

**Supporting Infrastructure:**
- db_monitor.py: Database connection and query performance tracking
- DEBUGGING_GUIDE.md: Step-by-step troubleshooting documentation
- Enhanced hybrid app routing for debug endpoints

This infrastructure provides exact visibility into where the 45-second timeout
is being consumed, enabling targeted performance optimization.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…_date

The InfoResponse model expects replication_date as a string, but data_vintage()
returns a datetime object. Convert to ISO format string to match the model.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
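
The conversion described in the last commit amounts to calling `isoformat()` on the datetime before it reaches the response model. A small sketch, where the helper name and the sample timestamp are made up for illustration:

```python
from datetime import datetime, timezone


def replication_date_as_string(vintage: datetime) -> str:
    """Convert a datetime to the ISO 8601 string a str-typed field expects."""
    return vintage.isoformat()


# Hypothetical timestamp, not real replication data.
example = replication_date_as_string(datetime(2025, 7, 6, 2, 29, tzinfo=timezone.utc))
# example == "2025-07-06T02:29:00+00:00"
```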