## TL;DR
Fine-tuned EmbeddingGemma-300M — the embedding model powering QMD search — on 420 Smriti coding sessions. Generated 1,700 training triplets using Gemini 2.0 Flash, trained on a free-tier Colab T4 GPU after failing on local M3 Pro (MPS OOM). Result: accuracy 87.3% → 91.5% (+4.2pp), margin +43% relative. The model now understands domain terms like "LoRA rank", "RRF fusion", and "OpenFGA" instead of treating them as generic text.
## The Idea

QMD uses a generic 300M-parameter embedding model. It doesn't know what "LoRA rank" means, or that "RRF" is about search fusion, or that when you say "auth" you mean OpenFGA — not OAuth. `smriti recall` and `smriti search` suffer because of this vocabulary mismatch.
Fine-tuning on actual sessions teaches the model our vocabulary. We generate (query, relevant passage, hard negative) triplets from real sessions, then train the model to push relevant results closer together and irrelevant ones apart.
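Concretely, this is the MultipleNegativesRankingLoss objective (the CachedMNRL loss used in the pipeline below is a memory-efficient version of it); a sketch of the standard formulation, with similarity scale factor $s$:

$$
\mathcal{L}(q, p) = -\log \frac{\exp\big(s \cdot \cos(q, p)\big)}{\exp\big(s \cdot \cos(q, p)\big) + \sum_{n} \exp\big(s \cdot \cos(q, n)\big)}
$$

where $n$ ranges over the hard negative and the other in-batch passages: minimizing the loss raises $\cos(q, p)$ while pushing every $\cos(q, n)$ down.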
## Timeline
| When | What |
|---|---|
| Feb 12, 4:44 PM | Built the full pipeline: export sessions → generate triplets → validate → train → eval → convert to GGUF. First commit 29df52b. |
| Feb 12, evening | Tried Ollama (qwen3:8b) for triplet generation. Too slow for 420 sessions — would take hours locally. |
| Feb 12–13 | Switched to Gemini 2.0 Flash API. Fast and cheap. Generated 2,069 raw triplets → 1,700 after validation/dedup. |
| Feb 13, morning | Attempted local training on M3 Pro (18GB). OOM immediately with seq_length: 512, batch_size: 8. Reduced batch size, seq length, disabled fp16, switched loss function. Still OOM. |
| Feb 13, ~10:00 AM | Pivoted to Google Colab (T4 GPU, 15GB VRAM, free tier) |
| Feb 13, 10:00–10:44 AM | 6+ failed Colab runs. T4 OOM with initial settings. Progressively lowered seq_length (512→256→128), added gradient checkpointing, tuned mini_batch_size, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| Feb 13, 10:44 AM | First successful training run. Commit 6af8a2b. |
| Feb 13, shortly after | Evaluation: accuracy 87.3% → 91.5%, margin +43% relative. |
## What Failed & What Fixed It
| Failure | Root Cause | Fix |
|---|---|---|
| Ollama triplet generation too slow | qwen3:8b running locally on CPU, 420 sessions | Switched to Gemini 2.0 Flash API |
| MPS OOM on M3 Pro (18GB) | `seq_length: 512`, `batch_size: 8`, fp16 on MPS | Reduced to `seq_length: 256`, `batch_size: 2`, disabled fp16, added gradient accumulation |
| Still OOM on MPS after reductions | MPS memory management fundamentally limited for training | Pivoted to Colab T4 |
| T4 OOM on Colab (attempts 1–6) | `seq_length: 256`, no gradient checkpointing, mini_batch too large | `seq_length: 128`, gradient checkpointing, `mini_batch_size: 4`, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
## The Pipeline

```
smriti DB (420 sessions)
  → export_sessions.py → sessions.jsonl (7.9 MB)
  → generate_triplets.py (Gemini 2.0 Flash) → triplets.jsonl (2,069 triplets)
  → validate_data.py → train.jsonl (1,700) + val.jsonl (165)
  → train.py (sentence-transformers + CachedMNRL loss) → fine-tuned model
  → eval.py → metrics comparison
  → convert_gguf.py → GGUF for QMD
```
Each triplet contains:
- Query: a 2–8 word search query (what a user would type into `smriti search`)
- Positive: a 50–300 word relevant passage from the session
- Hard negative: a passage from the same conversation that's topically related but answers a different question
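For illustration, a hypothetical JSONL line (the content is made up, and the exact field names are an assumption, not copied from the dataset):

```json
{"query": "rrf fusion weights", "positive": "We tuned the RRF k constant so BM25 and vector ranks blend evenly ... (50–300 words)", "negative": "The GGUF conversion kept failing on the tokenizer config ... (same session, different topic)"}
```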
Train/val split is by session (not by triplet) to prevent data leakage.
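A minimal sketch of what a session-level split looks like (field and file names here are assumptions, not necessarily what validate_data.py does):

```python
import json
import random

def split_by_session(triplets, val_fraction=0.1, seed=42):
    """Hold out whole sessions so no session contributes to both splits."""
    sessions = sorted({t["session_id"] for t in triplets})
    random.Random(seed).shuffle(sessions)
    val_ids = set(sessions[: max(1, int(len(sessions) * val_fraction))])
    train = [t for t in triplets if t["session_id"] not in val_ids]
    val = [t for t in triplets if t["session_id"] in val_ids]
    return train, val

with open("triplets.jsonl") as f:
    triplets = [json.loads(line) for line in f]
train, val = split_by_session(triplets)
```

Splitting by triplet instead would leak near-duplicate passages from the same session into validation and inflate the scores.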
## Results

| Metric | Base Model | Fine-Tuned | Change |
|---|---|---|---|
| Accuracy | 0.8727 | 0.9152 | +0.0424 (+4.9%) |
| Margin | 0.1716 | 0.2452 | +0.0736 (+42.9%) |
| Positive Sim | 0.5608 | 0.5226 | -0.0382 |
| Negative Sim | 0.3893 | 0.2774 | -0.1119 |
Both positive and negative similarity dropped, but negative similarity fell almost three times as far (0.39 → 0.28 vs 0.56 → 0.52). The model learned to push irrelevant results away while keeping relevant ones close. This is exactly what you want for retrieval — fewer false positives, cleaner separation.
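For reference, all four metrics can be computed directly from the validation triplets; a sketch (assuming normalized embeddings and cosine similarity, not necessarily what eval.py does):

```python
from sentence_transformers import SentenceTransformer

def triplet_metrics(model_name: str, triplets: list[dict]) -> dict:
    model = SentenceTransformer(model_name)
    enc = lambda key: model.encode([t[key] for t in triplets], normalize_embeddings=True)
    q, p, n = enc("query"), enc("positive"), enc("negative")
    pos_sim = (q * p).sum(axis=1)  # cosine similarity (embeddings are unit-norm)
    neg_sim = (q * n).sum(axis=1)
    return {
        "accuracy": float((pos_sim > neg_sim).mean()),  # positive beats hard negative
        "margin": float((pos_sim - neg_sim).mean()),
        "positive_sim": float(pos_sim.mean()),
        "negative_sim": float(neg_sim.mean()),
    }
```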
## Final Working Colab Config

| Parameter | Value |
|---|---|
| `max_seq_length` | 128 |
| `per_device_train_batch_size` | 4 |
| `gradient_accumulation_steps` | 16 (effective batch = 64) |
| `mini_batch_size` (CachedMNRL) | 4 |
| `num_train_epochs` | 3 |
| `learning_rate` | 2e-5 |
| `gradient_checkpointing` | true |
| `fp16` | true |
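Stitched together, the config roughly corresponds to this sentence-transformers setup (a sketch assuming the base model is google/embeddinggemma-300m on Hugging Face and the triplets are plain JSONL; the hyperparameters mirror the table above):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # set before CUDA init

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("google/embeddinggemma-300m")
model.max_seq_length = 128  # longest the T4 tolerated

# Columns are consumed in order: (anchor, positive, negative)
train_ds = load_dataset("json", data_files="train.jsonl", split="train")

# CachedMNRL: MNRL with gradient caching, so the effective batch can exceed VRAM
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch = 64
    learning_rate=2e-5,
    fp16=True,
    gradient_checkpointing=True,
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```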
## What's Next

The end state isn't a separate repo — it's `smriti finetune`:
- `smriti finetune` — Subcommand that retrains the embedding model on accumulated sessions. Run after a week of coding, on a cron, or as a post-ingest hook.
- `smriti finetune --incremental` — Don't retrain from scratch. Keep the last checkpoint and continue on new sessions only. The model accumulates knowledge over time.
- `smriti finetune --team` — Pull sessions from teammates via `smriti sync`, train a shared model. The team's collective vocabulary becomes the model's vocabulary.
- Reranker fine-tuning — QMD uses a 0.6B reranker (Qwen3-Reranker). Same triplet data, different training objective. Would compound the embedding improvements.
- Automatic quality signals — Use implicit signals from actual usage (clicked results = positive, reformulated queries = hard negatives) instead of synthetic LLM-generated triplets.
- Per-project adapters — Train project-specific LoRA adapters (~8MB each) that QMD swaps based on the active project (a rough sketch follows this list).
- Scheduled retraining — Weekly cron that runs `smriti finetune --incremental --deploy`. Search silently gets better every Monday.
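The per-project adapter idea might look like this with PEFT (entirely speculative; the target module names depend on the EmbeddingGemma architecture, and the paths are placeholders):

```python
from peft import LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# Wrap the underlying transformer; only the adapter weights train
model[0].auto_model = get_peft_model(model[0].auto_model, lora)

# ... train as above, then save just the adapter (a few MB) per project
model[0].auto_model.save_pretrained("adapters/my-project")
```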