Structured Memory Pipeline — Full Roadmap #12

## What

Transform Smriti from flat text ingestion to a **structured, queryable memory pipeline** — where every tool call, file edit, git operation, error, and thinking block is parsed, typed, stored in sidecar tables, and available for analytics, search, and team sharing.

## Why

Smriti currently drops more than 80% of the structured data in AI coding sessions. A Claude Code transcript contains tool calls with typed inputs, file diffs, command outputs, git operations, token costs, and thinking blocks, but the flat text parser reduces all of this to a single string. As a result:

- **No file tracking**: Can't answer "what files did I edit this week?"
- **No error analysis**: Can't find sessions where builds failed or tests broke
- **No cost visibility**: No token/cost tracking across sessions or projects
- **No git correlation**: Can't link sessions to commits, branches, or PRs
- **No cross-agent view**: Different agents (Claude, Cline, Aider) can't share a unified memory
- **No security layer**: Secrets in sessions get shared without redaction

This roadmap addresses all of these gaps across 5 phases.

## Sub-Issues

- #5 **[DONE]** Enriched Claude Code Parser — Structured block extraction, 13 block types, 6 sidecar tables
- #6 Cline + Aider Agent Parsers — New agent support for unified cross-tool memory
- #7 Auto-Ingestion Watch Daemon — `smriti watch` with fs.watch for real-time ingestion
- #8 Enhanced Search & Analytics on Structured Data — Query sidecar tables, activity timelines, cost tracking
- #9 Secret Redaction & Policy Engine — Detect and redact secrets before storage and sharing
- #10 Telemetry & Metrics Collection — Local-only opt-in usage metrics
- #11 Real User Testing & Performance Validation — Benchmarks, stress tests, security tests
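
Since #7 is still planned, the following is only a sketch of how per-file debouncing for a watcher might work, not the `smriti watch` implementation. `fs.watch` can report several events for a single write, so the idea is to ingest a file only after it has been quiet for a set interval. The class name `PathDebouncer` and its methods are hypothetical, and timestamps are passed in explicitly so the logic can be tested without real timers.

```typescript
// Hypothetical helper: tracks the last change time per path and releases
// a path only after it has been quiet for `quietMs` milliseconds.
class PathDebouncer {
  private lastTouched: Map<string, number> = new Map();
  private quietMs: number;

  constructor(quietMs: number) {
    this.quietMs = quietMs;
  }

  // Record a change event for `path` observed at time `now` (ms).
  touch(path: string, now: number): void {
    this.lastTouched.set(path, now);
  }

  // Return, and forget, every path with no events for at least quietMs.
  takeReady(now: number): string[] {
    const ready: string[] = [];
    this.lastTouched.forEach((touchedAt, path) => {
      if (now - touchedAt >= this.quietMs) ready.push(path);
    });
    ready.forEach((path) => this.lastTouched.delete(path));
    return ready;
  }
}
```

A watcher callback would call `touch` on each event and a periodic tick would call `takeReady`, handing each returned path to the ingester exactly once per burst of writes.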

## Phase Overview

| Phase | Deliverable | Status |
|-------|------------|--------|
| **Phase 1** | Enriched Claude Code Parser (#5) | **Done** — 13 block types, 6 sidecar tables, 142 tests |
| **Phase 2** | Cline + Aider Parsers (#6) | Planned |
| **Phase 3** | Watch Daemon (#7) + Search & Analytics (#8) | Planned |
| **Phase 4** | Secret Redaction & Policy (#9) | Planned |
| **Phase 5** | Telemetry (#10) + Testing & Perf (#11) | Planned |

## Storage Inventory

Complete map of every data type, where it lives, and whether it's indexed:

| Data | Source | Table | Key Columns | Indexed? |
|------|--------|-------|-------------|----------|
| Session text (FTS) | All agents | `memory_fts` (QMD) | content | FTS5 full-text |
| Session metadata | Ingestion | `smriti_session_meta` | session_id, agent_id, project_id | Yes (agent, project) |
| Project registry | Path derivation | `smriti_projects` | id, path, description | PK |
| Agent registry | Seed data | `smriti_agents` | id, parser, log_pattern | PK |
| Tool usage | Block extraction | `smriti_tool_usage` | message_id, tool_name, success, duration_ms | Yes (session, tool_name) |
| File operations | Block extraction | `smriti_file_operations` | message_id, operation, file_path, project_id | Yes (session, path) |
| Commands | Block extraction | `smriti_commands` | message_id, command, exit_code, is_git | Yes (session, is_git) |
| Git operations | Block extraction | `smriti_git_operations` | message_id, operation, branch, pr_url | Yes (session, operation) |
| Errors | Block extraction | `smriti_errors` | message_id, error_type, message | Yes (session, type) |
| Token costs | Metadata accumulation | `smriti_session_costs` | session_id, model, input/output/cache tokens, cost | PK |
| Category tags (session) | Categorization | `smriti_session_tags` | session_id, category_id, confidence, source | Yes (category) |
| Category tags (message) | Categorization | `smriti_message_tags` | message_id, category_id, confidence, source | Yes (category) |
| Category taxonomy | Seed data | `smriti_categories` | id, name, parent_id | PK |
| Share tracking | Team sharing | `smriti_shares` | session_id, content_hash, author | Yes (hash) |
| Vector embeddings | `smriti embed` | `content_vectors` + `vectors_vec` (QMD) | content_hash, embedding | Virtual table |
| Telemetry events | Opt-in collection | `~/.smriti/telemetry.json` | timestamp, event, data | N/A (JSONL file) |
| Structured blocks | Block extraction | `memory_messages.metadata.blocks` (JSON) | MessageBlock[] | No (JSON blob) |
| Message metadata | Parsing | `memory_messages.metadata` (JSON) | cwd, gitBranch, model, tokenUsage | No (JSON blob) |
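
As an illustration of what the sidecar tables make possible, the sketch below summarizes rows shaped like `smriti_tool_usage` into per-tool call counts, failure counts, and average duration. The `ToolUsageRow` interface is inferred from the Key Columns above and is an assumption; the actual schema lives in `src/db.ts` and may differ (for example, SQLite would likely store `success` as 0/1). The aggregation runs in memory to keep the example self-contained; in practice the same result could come from a `GROUP BY tool_name` query.

```typescript
// Assumed row shape, based on the Key Columns listed for smriti_tool_usage.
interface ToolUsageRow {
  message_id: string;
  tool_name: string;
  success: boolean;
  duration_ms: number | null; // null when no duration was recorded
}

interface ToolSummary {
  calls: number;
  failures: number;
  avgDurationMs: number | null; // null when no row had a duration
}

function summarizeToolUsage(rows: ToolUsageRow[]): Map<string, ToolSummary> {
  const totals = new Map<string, { calls: number; failures: number; sum: number; timed: number }>();
  for (const row of rows) {
    const t = totals.get(row.tool_name) ?? { calls: 0, failures: 0, sum: 0, timed: 0 };
    t.calls += 1;
    if (!row.success) t.failures += 1;
    if (row.duration_ms !== null) {
      t.sum += row.duration_ms;
      t.timed += 1;
    }
    totals.set(row.tool_name, t);
  }
  const summary = new Map<string, ToolSummary>();
  totals.forEach((t, name) => {
    summary.set(name, {
      calls: t.calls,
      failures: t.failures,
      avgDurationMs: t.timed > 0 ? t.sum / t.timed : null,
    });
  });
  return summary;
}
```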

## Block Type Reference

The 13 `MessageBlock` types extracted during ingestion:

| Block Type | Fields | Stored In |
|-----------|--------|-----------|
| `text` | text | FTS (via plainText) |
| `thinking` | thinking, budgetTokens | JSON blob only |
| `tool_call` | toolId, toolName, input | `smriti_tool_usage` |
| `tool_result` | toolId, success, output, error, durationMs | Updates tool_usage success |
| `file_op` | operation, path, diff, pattern | `smriti_file_operations` |
| `command` | command, cwd, exitCode, stdout, stderr, isGit | `smriti_commands` |
| `search` | searchType, pattern, path, url, resultCount | JSON blob only |
| `git` | operation, branch, message, files, prUrl, prNumber | `smriti_git_operations` |
| `error` | errorType, message, retryable | `smriti_errors` |
| `image` | mediaType, path, dataHash | JSON blob only |
| `code` | language, code, filePath, lineStart | JSON blob only |
| `system_event` | eventType, data | Cost accumulation |
| `control` | controlType, command | JSON blob only |
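
The table above maps naturally onto a discriminated union. The sketch below covers a subset of the block types and routes each block to the table named in the "Stored In" column. It is an approximation: the real definitions are in `src/ingest/types.ts`, the optional/required split of fields is guessed, and `sidecarTableFor` and `groupBySidecarTable` are hypothetical helper names.

```typescript
// Subset of block types; field names follow the table, optionality is assumed.
type MessageBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string; budgetTokens?: number }
  | { type: "tool_call"; toolId: string; toolName: string; input: unknown }
  | { type: "tool_result"; toolId: string; success: boolean; output?: string; error?: string; durationMs?: number }
  | { type: "file_op"; operation: string; path: string; diff?: string; pattern?: string }
  | { type: "command"; command: string; isGit: boolean; cwd?: string; exitCode?: number; stdout?: string; stderr?: string }
  | { type: "git"; operation: string; branch?: string; message?: string; files?: string[]; prUrl?: string; prNumber?: number }
  | { type: "error"; errorType: string; message: string; retryable?: boolean };

// Returns the sidecar table for a block, or null when the block has none
// (text goes to FTS via plainText; thinking stays in the JSON blob).
function sidecarTableFor(block: MessageBlock): string | null {
  switch (block.type) {
    case "tool_call":
    case "tool_result":
      return "smriti_tool_usage";
    case "file_op":
      return "smriti_file_operations";
    case "command":
      return "smriti_commands";
    case "git":
      return "smriti_git_operations";
    case "error":
      return "smriti_errors";
    case "text":
    case "thinking":
      return null;
  }
}

// Groups blocks by destination table so each table can be written in one batch.
function groupBySidecarTable(blocks: MessageBlock[]): Map<string, MessageBlock[]> {
  const groups = new Map<string, MessageBlock[]>();
  for (const block of blocks) {
    const table = sidecarTableFor(block);
    if (table === null) continue;
    const bucket = groups.get(table) ?? [];
    bucket.push(block);
    groups.set(table, bucket);
  }
  return groups;
}
```

Because the `switch` is exhaustive over the union, adding a new block type without a routing decision becomes a compile-time error under strict TypeScript settings.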

## Real User Testing Plan

| Scenario | What to Measure | Risk if Untested |
|----------|----------------|-----------------|
| Fresh install + first ingest | Time-to-first-search, error quality | Bad first impression, confusing errors |
| 500+ sessions accumulated | Search latency, DB file size, `smriti status` accuracy | Performance cliff after months of use |
| Multi-project workspace | Project ID derivation accuracy, cross-project search | Wrong project attribution for sessions |
| Team sharing (2+ devs) | Sync conflicts, dedup accuracy, content hash stability | Duplicate or lost knowledge articles |
| Long-running session (4+ hrs) | Memory during ingest, block count accuracy, cost tracking | OOM or missed data at end of session |
| Rapid session creation | Watch daemon debouncing, no duplicate ingestion | Double-counting sessions |
| Agent switch mid-task | Cross-agent file tracking, unified timeline | Gaps in activity log |
| Secret in session | Detection rate, redaction completeness, share blocking | Leaked credentials in `.smriti/` |
| Large JSONL file (50MB+) | Parse time, memory usage, incremental ingest | Crash or multi-minute ingest |
| Corrupt/truncated files | Error messages, graceful skip, no data loss | Silent data corruption |
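
For the corrupt/truncated file scenario, one possible approach to "graceful skip" is to parse JSONL line by line and record failures instead of aborting. The sketch below shows that idea; `parseJsonlLenient` and `JsonlResult` are hypothetical names, not part of Smriti. It takes the whole file as a string, so it does not address the memory concerns of the 50MB+ scenario, which would need a streaming reader.

```typescript
interface JsonlResult {
  records: unknown[];
  skipped: { line: number; reason: string }[]; // 1-based line numbers
}

// Parses each non-blank line as JSON; malformed lines are reported, not thrown.
function parseJsonlLenient(text: string): JsonlResult {
  const result: JsonlResult = { records: [], skipped: [] };
  text.split("\n").forEach((raw, index) => {
    const line = raw.trim(); // also drops a trailing \r from CRLF files
    if (line === "") return; // blank lines are not treated as errors
    try {
      result.records.push(JSON.parse(line));
    } catch (err) {
      result.skipped.push({
        line: index + 1,
        reason: err instanceof Error ? err.message : String(err),
      });
    }
  });
  return result;
}
```

Reporting the line number of each skipped record gives the user an actionable error message while the valid records on either side are still ingested.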

## Configuration Reference

| Env Var | Default | Phase | Description |
|---------|---------|-------|-------------|
| `QMD_DB_PATH` | `~/.cache/qmd/index.sqlite` | — | Database path |
| `CLAUDE_LOGS_DIR` | `~/.claude/projects` | 1 | Claude Code logs |
| `CODEX_LOGS_DIR` | `~/.codex` | — | Codex CLI logs |
| `SMRITI_PROJECTS_ROOT` | `~/zero8.dev` | 1 | Projects root for ID derivation |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | — | Ollama endpoint |
| `QMD_MEMORY_MODEL` | `qwen3:8b-tuned` | — | Ollama model for synthesis |
| `SMRITI_CLASSIFY_THRESHOLD` | `0.5` | — | LLM classification trigger |
| `SMRITI_AUTHOR` | `$USER` | — | Git author for team sharing |
| `SMRITI_WATCH_DEBOUNCE_MS` | `2000` | 3 | Watch daemon debounce interval |
| `SMRITI_TELEMETRY` | `0` | 5 | Enable telemetry collection |
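
A minimal sketch of how these variables could be resolved against their defaults is shown below. The `SmritiConfig` shape, `loadConfig`, and `numberOr` are hypothetical names. Three behaviors are assumptions rather than documented facts: telemetry counts as enabled only when the value is exactly `1`, the author falls back to `"unknown"` when neither `SMRITI_AUTHOR` nor `USER` is set, and `~` in paths is left unexpanded.

```typescript
type Env = Record<string, string | undefined>;

interface SmritiConfig {
  dbPath: string;
  claudeLogsDir: string;
  codexLogsDir: string;
  projectsRoot: string;
  ollamaHost: string;
  memoryModel: string;
  classifyThreshold: number;
  author: string;
  watchDebounceMs: number;
  telemetryEnabled: boolean;
}

// Parses a numeric env value, falling back when it is missing or not a finite number.
function numberOr(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function loadConfig(env: Env): SmritiConfig {
  return {
    dbPath: env.QMD_DB_PATH ?? "~/.cache/qmd/index.sqlite",
    claudeLogsDir: env.CLAUDE_LOGS_DIR ?? "~/.claude/projects",
    codexLogsDir: env.CODEX_LOGS_DIR ?? "~/.codex",
    projectsRoot: env.SMRITI_PROJECTS_ROOT ?? "~/zero8.dev",
    ollamaHost: env.OLLAMA_HOST ?? "http://127.0.0.1:11434",
    memoryModel: env.QMD_MEMORY_MODEL ?? "qwen3:8b-tuned",
    classifyThreshold: numberOr(env.SMRITI_CLASSIFY_THRESHOLD, 0.5),
    author: env.SMRITI_AUTHOR ?? env.USER ?? "unknown", // "unknown" is an assumed fallback
    watchDebounceMs: numberOr(env.SMRITI_WATCH_DEBOUNCE_MS, 2000),
    telemetryEnabled: env.SMRITI_TELEMETRY === "1", // assumed: only "1" enables telemetry
  };
}
```

Passing the environment in as an argument (for example `loadConfig(process.env)`) keeps the function easy to test with plain objects.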

## Current State

Phase 1 is complete:
- 13 structured block types defined in `src/ingest/types.ts`
- Block extraction engine in `src/ingest/blocks.ts`
- Enriched Claude parser in `src/ingest/claude.ts`
- 6 sidecar tables in `src/db.ts` with indexes and insert helpers
- 142 tests passing, 415 expect() calls across 9 test files
