
Fine-tuned EmbeddingGemma-300M on Smriti sessions — journey, results, and next steps #17

@ashu17706

TL;DR

Fine-tuned EmbeddingGemma-300M, the embedding model powering QMD search, on 420 Smriti coding sessions. Generated 1,700 training triplets using Gemini 2.0 Flash, then trained on a free-tier Colab T4 GPU after local training on an M3 Pro failed (MPS OOM). Result: accuracy 87.3% → 91.5% (+4.2pp), margin +43% relative. The model now understands domain terms like "LoRA rank", "RRF fusion", and "OpenFGA" instead of treating them as generic text.

The Idea

QMD uses a generic 300M-parameter embedding model. It doesn't know what "LoRA rank" means, or that "RRF" is about search fusion, or that when you say "auth" you mean OpenFGA — not OAuth. smriti recall and smriti search suffer because of this vocabulary mismatch.

Fine-tuning on actual sessions teaches the model our vocabulary. We generate (query, relevant passage, hard negative) triplets from real sessions, then train the model to pull each query closer to its relevant passage in embedding space and push irrelevant passages away.
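
Concretely, the objective is an in-batch contrastive loss: each query must rank its own positive above every other passage in the batch, hard negatives included. A simplified sketch of what sentence-transformers' MultipleNegativesRankingLoss computes (the scale of 20 is the library default; the function name is mine):

    import torch
    import torch.nn.functional as F

    def mnrl_loss(q: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor,
                  scale: float = 20.0) -> torch.Tensor:
        # q, pos, neg: (batch, dim) L2-normalized embeddings.
        # Candidates for each query: all positives plus all hard negatives;
        # the correct answer for query i is its own positive at index i.
        candidates = torch.cat([pos, neg], dim=0)           # (2*batch, dim)
        scores = scale * q @ candidates.T                   # scaled cosine sims
        labels = torch.arange(q.size(0), device=q.device)   # i-th positive wins
        return F.cross_entropy(scores, labels)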

Timeline

When                      What
Feb 12, 4:44 PM           Built the full pipeline: export sessions → generate triplets → validate → train → eval → convert to GGUF. First commit 29df52b.
Feb 12, evening           Tried Ollama (qwen3:8b) for triplet generation. Too slow for 420 sessions; it would have taken hours locally.
Feb 12–13                 Switched to the Gemini 2.0 Flash API. Fast and cheap. Generated 2,069 raw triplets → 1,700 after validation/dedup.
Feb 13, morning           Attempted local training on the M3 Pro (18 GB). Immediate OOM with seq_length: 512, batch_size: 8. Reduced batch size and seq length, disabled fp16, switched loss function. Still OOM.
Feb 13, ~10:00 AM         Pivoted to Google Colab (T4 GPU, 15 GB VRAM, free tier).
Feb 13, 10:00–10:44 AM    6+ failed Colab runs. T4 OOM with initial settings. Progressively lowered seq_length (512 → 256 → 128), added gradient checkpointing, tuned mini_batch_size, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
Feb 13, 10:44 AM          First successful training run. Commit 6af8a2b.
Feb 13, shortly after     Evaluation: accuracy 87.3% → 91.5%, margin +43% relative.

What Failed & What Fixed It

  • Ollama triplet generation too slow. Root cause: qwen3:8b running locally on CPU against 420 sessions. Fix: switched to the Gemini 2.0 Flash API.
  • MPS OOM on the M3 Pro (18 GB). Root cause: seq_length: 512, batch_size: 8, and fp16 on MPS. Fix: reduced to seq_length: 256, batch_size: 2, disabled fp16, added gradient accumulation.
  • Still OOM on MPS after reductions. Root cause: MPS memory management is fundamentally limited for training workloads. Fix: pivoted to a Colab T4.
  • T4 OOM on Colab (attempts 1–6). Root cause: seq_length: 256, no gradient checkpointing, mini_batch_size too large. Fix: seq_length: 128, gradient checkpointing, mini_batch_size: 4, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
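
A note on that last fix: PYTORCH_CUDA_ALLOC_CONF must be in the environment before PyTorch makes its first CUDA allocation, so in a Colab cell it goes before the torch import. A minimal sketch:

    import os

    # Let the CUDA caching allocator grow memory segments instead of failing
    # on fragmentation-induced OOM; must be set before torch touches CUDA.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # safe to import (and allocate) from here on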

The Pipeline

smriti DB (420 sessions)
    → export_sessions.py → sessions.jsonl (7.9 MB)
    → generate_triplets.py (Gemini 2.0 Flash) → triplets.jsonl (2,069 triplets)
    → validate_data.py → train.jsonl (1,700) + val.jsonl (165)
    → train.py (sentence-transformers + CachedMNRL loss) → fine-tuned model
    → eval.py → metrics comparison
    → convert_gguf.py → GGUF for QMD
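
As an illustration of the triplet-generation step, here is a minimal sketch using the google-generativeai package. The prompt wording and function name are mine, not the repo's generate_triplets.py; real code would add retries, rate limiting, and stricter JSON validation:

    import json
    import os

    import google.generativeai as genai

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-2.0-flash")

    PROMPT = """From the coding session below, emit a JSON list of triplets.
    Each triplet has: "query" (a 2-8 word search query), "positive" (a 50-300
    word relevant passage), and "negative" (a passage from the same session
    that is topically related but answers a different question).

    Session:
    {session}"""

    def generate_triplets(session_text: str) -> list[dict]:
        response = model.generate_content(PROMPT.format(session=session_text))
        # Assumes the model returns bare JSON; validate_data.py filters bad rows.
        return json.loads(response.text)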

Each triplet contains:

  • Query: 2–8 word search query (what a user would type into smriti search)
  • Positive: 50–300 word relevant passage from the session
  • Hard negative: A passage from the same conversation that's topically related but answers a different question
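
A made-up example of what one line of triplets.jsonl looks like (the passages here are invented for illustration, not taken from the dataset):

    {"query": "fix LoRA rank OOM", "positive": "Dropping the LoRA rank from 64 to 16 cut adapter memory enough to train on the T4 ...", "negative": "We also tuned the RRF fusion weights that combine BM25 and vector scores ..."}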

Train/val split is by session (not by triplet) to prevent data leakage.
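
In code terms, the split groups triplets by their source session before sampling. A sketch of the idea (the session_id field and fixed seed are assumptions, not the repo's validate_data.py):

    import random

    def split_by_session(triplets: list[dict], val_fraction: float = 0.1):
        # Split on session IDs rather than individual triplets, so no session
        # contributes rows to both train and val (that would leak vocabulary).
        sessions = sorted({t["session_id"] for t in triplets})
        random.Random(42).shuffle(sessions)
        n_val = max(1, int(len(sessions) * val_fraction))
        val_ids = set(sessions[:n_val])
        train = [t for t in triplets if t["session_id"] not in val_ids]
        val = [t for t in triplets if t["session_id"] in val_ids]
        return train, val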

Results

                    Base Model    Fine-Tuned    Change
Accuracy            0.8727        0.9152        +0.0424 (+4.9%)
Margin              0.1716        0.2452        +0.0736 (+42.9%)
Positive Sim        0.5608        0.5226        -0.0382
Negative Sim        0.3893        0.2774        -0.1119

Both positive and negative similarity dropped, but negative similarity dropped roughly three times as much (0.39 → 0.28 vs. 0.56 → 0.52). The model learned to push irrelevant results far apart while keeping relevant ones close, which is exactly what you want for retrieval: fewer false positives, cleaner separation.
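
For reference, this is how the four numbers are typically computed over the validation triplets (a sketch, not the repo's eval.py): accuracy is the fraction of triplets where the query sits closer to its positive than to its hard negative, and margin is the mean gap between the two similarities.

    from sentence_transformers import SentenceTransformer

    def eval_triplets(model: SentenceTransformer, triplets: list[dict]) -> dict:
        q = model.encode([t["query"] for t in triplets], normalize_embeddings=True)
        p = model.encode([t["positive"] for t in triplets], normalize_embeddings=True)
        n = model.encode([t["negative"] for t in triplets], normalize_embeddings=True)
        pos_sim = (q * p).sum(axis=1)  # cosine similarity (embeddings are unit-norm)
        neg_sim = (q * n).sum(axis=1)
        return {
            "accuracy": float((pos_sim > neg_sim).mean()),
            "margin": float((pos_sim - neg_sim).mean()),
            "positive_sim": float(pos_sim.mean()),
            "negative_sim": float(neg_sim.mean()),
        }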

Final Working Colab Config

Parameter                       Value
max_seq_length                  128
per_device_train_batch_size     4
gradient_accumulation_steps     16 (effective batch = 64)
mini_batch_size (CachedMNRL)    4
num_train_epochs                3
learning_rate                   2e-5
gradient_checkpointing          true
fp16                            true
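
Put together, a hedged sketch of how this config maps onto a sentence-transformers (v3+) training script. The HF model id google/embeddinggemma-300m and the column layout are assumptions; the hyperparameters mirror the table:

    from datasets import load_dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

    model = SentenceTransformer("google/embeddinggemma-300m")
    model.max_seq_length = 128

    # Columns must be ordered (anchor, positive, negative) for the loss.
    train_ds = load_dataset("json", data_files="train.jsonl", split="train")

    # CachedMNRL processes each batch in mini-batches of 4 to bound peak memory.
    loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)

    args = SentenceTransformerTrainingArguments(
        output_dir="embeddinggemma-smriti",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,  # effective batch = 64
        learning_rate=2e-5,
        fp16=True,
        gradient_checkpointing=True,
    )

    SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_ds, loss=loss
    ).train()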

What's Next

The end state isn't a separate repo — it's smriti finetune:

  • smriti finetune — Subcommand that retrains the embedding model on accumulated sessions. Run after a week of coding, on a cron, or as a post-ingest hook.
  • smriti finetune --incremental — Don't retrain from scratch. Keep the last checkpoint and continue on new sessions only. The model accumulates knowledge over time.
  • smriti finetune --team — Pull sessions from teammates via smriti sync, train a shared model. The team's collective vocabulary becomes the model's vocabulary.
  • Reranker fine-tuning — QMD uses a 0.6B reranker (Qwen3-Reranker). Same triplet data, different training objective. Would compound the embedding improvements.
  • Automatic quality signals — Use implicit signals from actual usage (clicked results = positive, reformulated queries = hard negatives) instead of synthetic LLM-generated triplets.
  • Per-project adapters — Train project-specific LoRA adapters (~8MB each) that QMD swaps based on active project.
  • Scheduled retraining — Weekly cron that runs smriti finetune --incremental --deploy. Search silently gets better every Monday.

Repo

https://github.com/zero8dotdev/smriti-getting-smarter
