## TL;DR
Fine-tuned EmbeddingGemma-300M — the embedding model powering QMD search — on 420 Smriti coding sessions. Generated 1,700 training triplets using Gemini 2.0 Flash, trained on a free-tier Colab T4 GPU after failing on local M3 Pro (MPS OOM). Result: accuracy 87.3% → 91.5% (+4.2pp), margin +43% relative. The model now understands domain terms like "LoRA rank", "RRF fusion", and "OpenFGA" instead of treating them as generic text.
## The Idea

QMD uses a generic 300M-parameter embedding model. It doesn't know what "LoRA rank" means, or that "RRF" is about search fusion, or that when you say "auth" you mean OpenFGA — not OAuth. `smriti recall` and `smriti search` suffer because of this vocabulary mismatch.
Fine-tuning on actual sessions teaches the model our vocabulary. We generate (query, relevant passage, hard negative) triplets from real sessions, then train the model to push relevant results closer together and irrelevant ones apart.
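Concretely, this is the MultipleNegativesRankingLoss objective (the CachedMNRL loss used in the pipeline below is a memory-efficient version of it); a sketch of the standard formulation, with similarity scale factor $s$:

$$
\mathcal{L}(q, p) = -\log \frac{\exp\big(s \cdot \cos(q, p)\big)}{\exp\big(s \cdot \cos(q, p)\big) + \sum_{n} \exp\big(s \cdot \cos(q, n)\big)}
$$

where $n$ ranges over the hard negative and the other in-batch passages: minimizing the loss raises $\cos(q, p)$ while pushing every $\cos(q, n)$ down.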
## Timeline
| When | What |
|---|---|
| Feb 12, 4:44 PM | Built the full pipeline: export sessions → generate triplets → validate → train → eval → convert to GGUF. First commit 29df52b. |
| Feb 12, evening | Tried Ollama (qwen3:8b) for triplet generation. Too slow for 420 sessions — would take hours locally. |
| Feb 12–13 | Switched to Gemini 2.0 Flash API. Fast and cheap. Generated 2,069 raw triplets → 1,700 after validation/dedup. |
| Feb 13, morning | Attempted local training on M3 Pro (18GB). OOM immediately with seq_length: 512, batch_size: 8. Reduced batch size, seq length, disabled fp16, switched loss function. Still OOM. |
| Feb 13, ~10:00 AM | Pivoted to Google Colab (T4 GPU, 15GB VRAM, free tier) |
| Feb 13, 10:00–10:44 AM | 6+ failed Colab runs. T4 OOM with initial settings. Progressively lowered seq_length (512→256→128), added gradient checkpointing, tuned mini_batch_size, set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. |
| Feb 13, 10:44 AM | First successful training run. Commit 6af8a2b. |
| Feb 13, shortly after | Evaluation: accuracy 87.3% → 91.5%, margin +43% relative. |
## What Failed & What Fixed It
| Failure | Root Cause | Fix |
|---|---|---|
| Ollama triplet generation too slow | qwen3:8b running locally on CPU, 420 sessions | Switched to Gemini 2.0 Flash API |
| MPS OOM on M3 Pro (18GB) | `seq_length: 512`, `batch_size: 8`, fp16 on MPS | Reduced to `seq_length: 256`, `batch_size: 2`, disabled fp16, added gradient accumulation |
| Still OOM on MPS after reductions | MPS memory management fundamentally limited for training | Pivoted to Colab T4 |
| T4 OOM on Colab (attempts 1–6) | `seq_length: 256`, no gradient checkpointing, mini_batch too large | `seq_length: 128`, gradient checkpointing, `mini_batch_size: 4`, `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
## The Pipeline

```
smriti DB (420 sessions)
  → export_sessions.py → sessions.jsonl (7.9 MB)
  → generate_triplets.py (Gemini 2.0 Flash) → triplets.jsonl (2,069 triplets)
  → validate_data.py → train.jsonl (1,700) + val.jsonl (165)
  → train.py (sentence-transformers + CachedMNRL loss) → fine-tuned model
  → eval.py → metrics comparison
  → convert_gguf.py → GGUF for QMD
```
Each triplet contains:
- Query: a 2–8 word search query (what a user would type into `smriti search`)
- Positive: a 50–300 word relevant passage from the session
- Hard negative: a passage from the same conversation that's topically related but answers a different question
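For illustration, a hypothetical JSONL line (the content is made up, and the exact field names are an assumption, not copied from the dataset):

```json
{"query": "rrf fusion weights", "positive": "We tuned the RRF k constant so BM25 and vector ranks blend evenly ... (50–300 words)", "negative": "The GGUF conversion kept failing on the tokenizer config ... (same session, different topic)"}
```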
Train/val split is by session (not by triplet) to prevent data leakage.
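A minimal sketch of what a session-level split looks like (field and file names here are assumptions, not necessarily what validate_data.py does):

```python
import json
import random

def split_by_session(triplets, val_fraction=0.1, seed=42):
    """Hold out whole sessions so no session contributes to both splits."""
    sessions = sorted({t["session_id"] for t in triplets})
    random.Random(seed).shuffle(sessions)
    val_ids = set(sessions[: max(1, int(len(sessions) * val_fraction))])
    train = [t for t in triplets if t["session_id"] not in val_ids]
    val = [t for t in triplets if t["session_id"] in val_ids]
    return train, val

with open("triplets.jsonl") as f:
    triplets = [json.loads(line) for line in f]
train, val = split_by_session(triplets)
```

Splitting by triplet instead would leak near-duplicate passages from the same session into validation and inflate the scores.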
## Results

| Metric | Base Model | Fine-Tuned | Change |
|---|---|---|---|
| Accuracy | 0.8727 | 0.9152 | +0.0424 (+4.9%) |
| Margin | 0.1716 | 0.2452 | +0.0736 (+42.9%) |
| Positive Sim | 0.5608 | 0.5226 | -0.0382 |
| Negative Sim | 0.3893 | 0.2774 | -0.1119 |
Both positive and negative similarity dropped, but negative similarity fell almost three times as far (0.39 → 0.28 vs 0.56 → 0.52). The model learned to push irrelevant results away while keeping relevant ones close. This is exactly what you want for retrieval — fewer false positives, cleaner separation.
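For reference, all four metrics can be computed directly from the validation triplets; a sketch (assuming normalized embeddings and cosine similarity, not necessarily what eval.py does):

```python
from sentence_transformers import SentenceTransformer

def triplet_metrics(model_name: str, triplets: list[dict]) -> dict:
    model = SentenceTransformer(model_name)
    enc = lambda key: model.encode([t[key] for t in triplets], normalize_embeddings=True)
    q, p, n = enc("query"), enc("positive"), enc("negative")
    pos_sim = (q * p).sum(axis=1)  # cosine similarity (embeddings are unit-norm)
    neg_sim = (q * n).sum(axis=1)
    return {
        "accuracy": float((pos_sim > neg_sim).mean()),  # positive beats hard negative
        "margin": float((pos_sim - neg_sim).mean()),
        "positive_sim": float(pos_sim.mean()),
        "negative_sim": float(neg_sim.mean()),
    }
```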
## Final Working Colab Config

| Parameter | Value |
|---|---|
| `max_seq_length` | 128 |
| `per_device_train_batch_size` | 4 |
| `gradient_accumulation_steps` | 16 (effective batch = 64) |
| `mini_batch_size` (CachedMNRL) | 4 |
| `num_train_epochs` | 3 |
| `learning_rate` | 2e-5 |
| `gradient_checkpointing` | true |
| `fp16` | true |
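Stitched together, the config roughly corresponds to this sentence-transformers setup (a sketch assuming the base model is google/embeddinggemma-300m on Hugging Face and the triplets are plain JSONL; the hyperparameters mirror the table above):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # set before CUDA init

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import SentenceTransformerTrainingArguments

model = SentenceTransformer("google/embeddinggemma-300m")
model.max_seq_length = 128  # longest the T4 tolerated

# Columns are consumed in order: (anchor, positive, negative)
train_ds = load_dataset("json", data_files="train.jsonl", split="train")

# CachedMNRL: MNRL with gradient caching, so the effective batch can exceed VRAM
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,  # effective batch = 64
    learning_rate=2e-5,
    fp16=True,
    gradient_checkpointing=True,
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```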
## What's Next

The end state isn't a separate repo — it's `smriti finetune`:
- `smriti finetune` — Subcommand that retrains the embedding model on accumulated sessions. Run after a week of coding, on a cron, or as a post-ingest hook.
- `smriti finetune --incremental` — Don't retrain from scratch. Keep the last checkpoint and continue on new sessions only. The model accumulates knowledge over time.
- `smriti finetune --team` — Pull sessions from teammates via `smriti sync`, train a shared model. The team's collective vocabulary becomes the model's vocabulary.
- Reranker fine-tuning — QMD uses a 0.6B reranker (Qwen3-Reranker). Same triplet data, different training objective. Would compound the embedding improvements.
- Automatic quality signals — Use implicit signals from actual usage (clicked results = positive, reformulated queries = hard negatives) instead of synthetic LLM-generated triplets.
- Per-project adapters — Train project-specific LoRA adapters (~8MB each) that QMD swaps based on the active project (a rough sketch follows this list).
- Scheduled retraining — Weekly cron that runs `smriti finetune --incremental --deploy`. Search silently gets better every Monday.
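The per-project adapter idea might look like this with PEFT (entirely speculative; the target module names depend on the EmbeddingGemma architecture, and the paths are placeholders):

```python
from peft import LoraConfig, get_peft_model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])

# Wrap the underlying transformer; only the adapter weights train
model[0].auto_model = get_peft_model(model[0].auto_model, lora)

# ... train as above, then save just the adapter (a few MB) per project
model[0].auto_model.save_pretrained("adapters/my-project")
```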