Convert YouTube videos into editable sheet music using AI.
Status: Live and operational! (for best results, use the local deployment)
- Backend: https://calebhan-rescored.hf.space
- Frontend: https://rescored.vercel.app
Rescored transcribes YouTube videos to professional-quality music notation:
- Submit a YouTube URL (see the client sketch after this list)
- AI processing extracts the audio, separates instruments, and transcribes it to MIDI
- Edit the notation in an interactive editor
- Export as MIDI
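As a rough illustration of that flow, here is a minimal client sketch in Python. The endpoint paths (`/jobs`, `/ws/{job_id}`), payload shape, and message fields are illustrative assumptions, not the documented API; see the backend code for the real contract.

```python
# Minimal client sketch. Endpoint paths, payload, and message fields
# below are illustrative assumptions, not the documented API.
import asyncio
import json

import requests
import websockets

BACKEND = "https://calebhan-rescored.hf.space"

async def transcribe(youtube_url: str) -> None:
    # Submit the YouTube URL (hypothetical endpoint and payload).
    resp = requests.post(f"{BACKEND}/jobs", json={"url": youtube_url})
    resp.raise_for_status()
    job_id = resp.json()["job_id"]  # assumed response field

    # Follow progress over WebSocket (hypothetical path and schema).
    ws_url = BACKEND.replace("https", "wss") + f"/ws/{job_id}"
    async with websockets.connect(ws_url) as ws:
        async for raw in ws:
            msg = json.loads(raw)
            print(msg.get("stage"), msg.get("progress"))
            if msg.get("status") == "done":
                break

asyncio.run(transcribe("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```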
Tech Stack:
- Backend: Python/FastAPI + Celery + Redis
- Frontend: React + VexFlow (notation) + Tone.js (playback)
- ML Pipeline (schematic sketch below):
  - BS-RoFormer (vocal removal) → Demucs (6-stem separation)
  - YourMT3+ + ByteDance ensemble → BiLSTM refinement (96.1% F1 on piano)
  - Audio preprocessing + confidence filtering
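The stage order matters: separation runs before transcription, and refinement runs last. A schematic of that ordering, where the function bodies are stand-in stubs rather than the project's real module API:

```python
# Schematic of the stage ordering only. The stage names mirror the
# pipeline above; the bodies are stand-in stubs, not the real models.
def bs_roformer_remove_vocals(audio):  # stage 1: vocal removal
    return audio

def demucs_separate(audio):            # stage 2: 6-stem separation
    return {"piano": audio, "vocals": audio}  # 4 more stems in reality

def ensemble_transcribe(stem):         # stage 3: YourMT3+ + ByteDance voting
    return [("C4", 0.0, 0.5, 0.9)]     # (pitch, onset, offset, confidence)

def confidence_filter(notes, threshold=0.5):  # stage 4: drop low-confidence notes
    return [n for n in notes if n[3] >= threshold]

def bilstm_refine(notes):              # stage 5: neural post-processing
    return notes

stems = demucs_separate(bs_roformer_remove_vocals("song.wav"))
notes = bilstm_refine(confidence_filter(ensemble_transcribe(stems["piano"])))
print(notes)
```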
For local development setup and installation instructions, see CONTRIBUTING.md
- YouTube URL input with validation and health checks
- Multi-instrument transcription (piano, guitar, bass, drums, vocals, other)
- Advanced source separation (BS-RoFormer + Demucs 6-stem)
- Ensemble transcription (YourMT3+ + ByteDance voting system)
- BiLSTM neural refinement (96.1% F1 accuracy on piano)
- Audio preprocessing (noise reduction, spectral denoising)
- Confidence filtering (frame-level ByteDance scores)
- Interactive notation editor with VexFlow rendering
- Multi-instrument tabs (switch between transcribed instruments)
- Playback controls (play/pause, tempo adjust, loop)
- Real-time progress via WebSocket
- MIDI export (download transcribed notation; see the export sketch after this list)
- Grand staff support (treble + bass clefs)
- Chord detection and rendering
- Note selection and highlighting
- Advanced editing (copy/paste, drag-to-reposition, undo/redo)
- PDF export
- Articulations and dynamics notation
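For MIDI export, pretty_midi (listed under Acknowledgments) is a natural fit for assembling the output file. A minimal sketch with made-up note data; this is not the project's actual export code:

```python
# Minimal MIDI-export sketch using pretty_midi; the note list is
# made-up example data, not output from the real pipeline.
import pretty_midi

# (pitch name, start seconds, end seconds) for a C major arpeggio
notes = [("C4", 0.0, 0.5), ("E4", 0.5, 1.0), ("G4", 1.0, 1.5)]

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # program 0 = Acoustic Grand Piano
for name, start, end in notes:
    pitch = pretty_midi.note_name_to_number(name)
    piano.notes.append(pretty_midi.Note(velocity=80, pitch=pitch, start=start, end=end))
pm.instruments.append(piano)
pm.write("transcription.mid")
```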
rescored/
├── backend/ # Python/FastAPI backend
│ ├── main.py # REST API + WebSocket server
│ ├── tasks.py # Celery background workers
│ ├── pipeline.py # Audio processing pipeline
│ ├── app_config.py # Configuration settings
│ ├── app_utils.py # Utility functions
│ ├── audio_preprocessor.py # Audio enhancement pipeline
│ ├── ensemble_transcriber.py # Multi-model voting system
│ ├── confidence_filter.py # Post-processing filters
│ ├── key_filter.py # Music theory filters
│ ├── requirements.txt # Python dependencies (including tests)
│ ├── tests/ # Test suite (59 tests, 27% coverage)
│ │ ├── test_api.py # API endpoint tests
│ │ ├── test_pipeline.py # Pipeline component tests
│ │ ├── test_tasks.py # Celery task tests
│ │ └── test_utils.py # Utility function tests
│ └── ymt/ # YourMT3+ model and wrappers
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── store/ # Zustand state management
│ │ └── api/ # API client
│ └── package.json # Node dependencies
├── docs/ # Comprehensive documentation
│ ├── backend/ # Backend implementation guides
│ ├── frontend/ # Frontend implementation guides
│ ├── architecture/ # System design documents
│ └── research/ # ML model comparisons
├── logs/ # Runtime logs (created by start.sh)
├── storage/ # YouTube cookies and temp files
├── start.sh # Start all services
├── stop.sh # Stop all services
└── docker-compose.yml # Docker setup (optional)
Production Configuration (Phase 1.3 - Ensemble + BiLSTM):
- Piano transcription: 96.1% F1 score (evaluated on MAESTRO test set)
- Full pipeline: YourMT3+ + ByteDance ensemble → Confidence filtering → BiLSTM refinement
- Includes audio preprocessing, two-stage source separation, and neural post-processing
- Enabled by default in `app_config.py`
Alternative Configurations:
- Ensemble only (no BiLSTM): 93.6% F1 - faster, still very accurate
- YourMT3+ only: ~85% F1 - generalist model
- basic-pitch (fallback): ~70% F1 - lightweight backup
The interactive editor is designed to make fixing remaining errors easy regardless of which transcription configuration is used.
Hardware Requirements:
- BiLSTM refinement: ~100MB checkpoint, works on CPU/GPU/MPS
- ByteDance ensemble: ~4GB VRAM (may fall back to YourMT3+ only on systems with limited GPU memory)
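"Works on CPU/GPU/MPS" implies a device-fallback step when loading the checkpoint. A plausible sketch with PyTorch; the checkpoint path is a placeholder, not the project's actual file:

```python
# Plausible device-fallback logic for loading the BiLSTM checkpoint.
# The checkpoint path is a placeholder, not the project's actual file.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")   # NVIDIA GPU
    if torch.backends.mps.is_available():
        return torch.device("mps")    # Apple Silicon
    return torch.device("cpu")        # universal fallback

device = pick_device()
state = torch.load("bilstm_checkpoint.pt", map_location=device)  # ~100MB
```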
Evaluated on MAESTRO test set (177 piano recordings):
| Configuration | F1 Score | Precision | Recall | Description |
|---|---|---|---|---|
| Baseline | 93.1% | 89.7% | 96.8% | Ensemble only (YourMT3+ + ByteDance) |
| Phase 1.1 (Confidence) | 93.6% | 91.5% | 95.7% | + ByteDance confidence filtering |
| Phase 1.2 (TTA) | 81.0% | 70.9% | 94.8% | + Test-time augmentation (broken) |
| Phase 1.3 (BiLSTM) | 96.1% | 96.7% | 95.5% | Ensemble + Confidence + BiLSTM |
| Phase 1.3b (BiLSTM only) | 96.0% | 95.4% | 96.6% | YourMT3+ → BiLSTM (no ensemble) |
| Phase 1.3c (ByteDance + BiLSTM) | 96.0% | 96.3% | 95.7% | ByteDance → BiLSTM (no ensemble) |
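F1 is the harmonic mean of precision and recall, so each row can be sanity-checked directly; for the Phase 1.3 row:

```python
# Sanity-check the Phase 1.3 row: F1 is the harmonic mean of P and R.
precision, recall = 0.967, 0.955
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # 0.961, matching the reported 96.1%
```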
✅ What Worked:
- BiLSTM refinement (+2.5% F1): the neural post-processor improves transcription from 93.6% → 96.1% F1
  - Phase 1.3 (Ensemble + Confidence + BiLSTM): 96.1% F1 (best configuration)
  - Phase 1.3b (YourMT3+ → BiLSTM): 96.0% F1 (simpler, nearly as good)
  - Phase 1.3c (ByteDance → BiLSTM): 96.0% F1 (simpler, nearly as good)
  - All three BiLSTM variants perform nearly identically (~96% F1)
  - The BiLSTM successfully learns timing corrections and false-positive filtering
  - Reliability: chunked processing handles long sequences (7000+ notes) that exceed cuDNN LSTM limits (see the sketch after this list)
- Confidence filtering (+0.5% F1): ByteDance's frame-level confidence scores are used to filter out low-confidence notes
- Ensemble voting (93.1% → 93.6%): combining YourMT3+ (generalist) and ByteDance (piano specialist) with asymmetric weights
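A minimal sketch of the chunking idea with a toy PyTorch BiLSTM; sizes, features, and chunk length are illustrative, and the real refiner's architecture may differ:

```python
# Toy illustration of chunked BiLSTM inference over a long note
# sequence; sizes and features are illustrative, not the real model's.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)
sequence = torch.randn(1, 7000, 8)  # 7000 notes, 8 features each

CHUNK = 1024
outputs = []
with torch.no_grad():
    for start in range(0, sequence.shape[1], CHUNK):
        chunk = sequence[:, start:start + CHUNK, :]
        out, _ = lstm(chunk)         # process one bounded chunk at a time
        outputs.append(out)
refined = torch.cat(outputs, dim=1)  # (1, 7000, 32)
print(refined.shape)
```

A production version would likely overlap adjacent chunks and blend the boundary predictions, since a hard cut loses the bidirectional context at chunk edges.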
❌ What Failed:
- Test-Time Augmentation (-12.6% F1): pitch-shift/time-stretch augmentations produce misaligned predictions
  - 67-72% of notes appear in only 1 of 5 augmentations
  - Vote counting filtered out too many correct predictions
  - Precision dropped dramatically (91.5% → 70.9%)
  - Root cause: augmentations change model behavior non-linearly, not just adding noise
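That failure mode is easy to reproduce in miniature: if most notes appear under only one augmentation, any majority-vote threshold discards them. A toy illustration with invented data mirroring the 1-of-5 statistic above:

```python
# Toy reproduction of the TTA failure mode: when most correct notes
# appear under only one augmentation, majority voting discards them.
from collections import Counter

# Note IDs predicted under each of 5 augmentations (invented data:
# notes 1-3 are stable, notes 4-10 each surface in a single run).
augmentation_predictions = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 6, 7},
    {1, 2, 3, 8},
    {1, 2, 3, 9, 10},
]

votes = Counter(n for preds in augmentation_predictions for n in preds)
kept = {n for n, v in votes.items() if v >= 3}  # majority threshold
print(kept)  # {1, 2, 3} -- the 7 unstable (but possibly correct) notes are gone
```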
Current Production Setup (Phase 1.3):
- Configuration: Ensemble + Confidence + BiLSTM → 96.1% F1
- Enabled in `app_config.py`:

  ```python
  use_ensemble_transcription = True
  use_bytedance_confidence = True
  enable_bilstm_refinement = True
  enable_tta = False  # Disabled (proven ineffective)
  ```
- Full pipeline: YourMT3+ + ByteDance ensemble → Confidence filtering → BiLSTM refinement
- Processing time: ~2-3 minutes per song on GPU
Alternative Configurations (96.0% F1):
- Phase 1.3b (YourMT3+ → BiLSTM): Simpler, faster, no ByteDance loading
- Phase 1.3c (ByteDance → BiLSTM): Piano specialist path
- Both achieve nearly identical accuracy with reduced complexity
Key Insight:
- BiLSTM post-processing was the breakthrough: +2.5% F1 improvement (93.6% → 96.1%)
- All BiLSTM variants (1.3, 1.3b, 1.3c) perform nearly identically at ~96% F1
- This suggests BiLSTM is the key component, not the upstream transcriber
- Simpler pipelines (1.3b, 1.3c) may be preferable for production due to lower complexity
For Future Research:
- Investigate why all BiLSTM variants achieve ~96% regardless of upstream model
- Try training BiLSTM with more epochs (current: 50, suggested: 100)
- Explore Phase 2 (D3RM diffusion refinement) for potential 97-99% F1
- Piano transcription with 96.1% F1 (ensemble + confidence filtering + BiLSTM)
- Two-stage source separation (BS-RoFormer + Demucs)
- Audio preprocessing pipeline
- Enhanced confidence filtering (+0.5% F1)
- BiLSTM neural refinement (+2.5% F1)
- Vocal transcription support (piano + vocals)
- Basic editing capabilities
- MusicXML export
- Test suite (59 tests, 27% coverage)
- Benchmark evaluation on MAESTRO dataset (177 examples)
- Production deployment with optimal configuration
- Try training BiLSTM with 100 epochs (currently 50, may reach ~97% F1)
- Simplify to Phase 1.3b (YourMT3+ → BiLSTM) for faster processing
- Investigate why BiLSTM achieves 96% regardless of upstream model
- Multi-instrument transcription beyond piano+vocals
- Grand staff notation (treble + bass)
- Advanced editing (copy/paste, undo/redo, multi-select)
- MIDI export improvements
- PDF export
- Articulations and dynamics
- User accounts and authentication
- Cloud storage integration
- Job history and saved transcriptions
- Collaboration features
MIT License - see LICENSE for details.
- YourMT3+ (KAIST) - Multi-instrument music transcription (Paper)
- ByteDance Piano Transcription - Piano-specific CNN+BiGRU model (GitHub)
- BS-RoFormer - Vocal removal for cleaner separation (GitHub)
- Demucs (Meta AI Research) - 6-stem audio source separation (Paper)
- audio-separator - BS-RoFormer wrapper and audio processing utilities
- librosa - Audio preprocessing and feature extraction
- madmom - Beat tracking and tempo detection
- pretty_midi - MIDI file manipulation
- VexFlow - Music notation rendering in SVG/Canvas
- Tone.js - Web audio synthesis and playback
Note: This is an educational project. Users are responsible for copyright compliance when transcribing YouTube content.