Description
What
Transform Smriti from flat text ingestion to a structured, queryable memory pipeline — where every tool call, file edit, git operation, error, and thinking block is parsed, typed, stored in sidecar tables, and available for analytics, search, and team sharing.
Why
Currently Smriti drops 80%+ of the structured data in AI coding sessions. A Claude Code transcript contains tool calls with typed inputs, file diffs, command outputs, git operations, token costs, and thinking blocks — but the flat text parser reduces all of this to a single string. This means:
- No file tracking: Can't answer "what files did I edit this week?"
- No error analysis: Can't find sessions where builds failed or tests broke
- No cost visibility: No token/cost tracking across sessions or projects
- No git correlation: Can't link sessions to commits, branches, or PRs
- No cross-agent view: Different agents (Claude, Cline, Aider) can't share a unified memory
- No security layer: Secrets in sessions get shared without redaction
This roadmap addresses all of these gaps across 5 phases.
Sub-Issues
- [DONE] #5 Enriched Claude Code Parser: structured block extraction, 13 block types, 6 sidecar tables
- #6 Cline + Aider Agent Parsers: new agent support for unified cross-tool memory
- #7 Auto-Ingestion Watch Daemon: `smriti watch` with fs.watch for real-time ingestion
- #8 Enhanced Search & Analytics on Structured Data: query sidecar tables, activity timelines, cost tracking
- #9 Secret Redaction & Policy Engine: detect and redact secrets before storage and sharing
- #10 Telemetry & Metrics Collection: local-only opt-in usage metrics
- #11 Real User Testing & Performance Validation: benchmarks, stress tests, security tests
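The watch daemon (#7) is still planned, so the following is only a minimal sketch of how per-file debouncing could work, not the project's implementation. `PathDebouncer`, `touch`, and `drainReady` are hypothetical names. Timestamps are passed in explicitly so the logic can be exercised without real timers.

```typescript
// Hypothetical helper: remembers the last change time per path and reports
// the paths that have been quiet for at least `debounceMs`.
class PathDebouncer {
  private lastSeen = new Map<string, number>();
  private debounceMs: number;

  constructor(debounceMs: number) {
    this.debounceMs = debounceMs;
  }

  // Record a change event for `path` observed at `nowMs`.
  touch(path: string, nowMs: number): void {
    this.lastSeen.set(path, nowMs);
  }

  // Return and forget every path whose last change is old enough to ingest.
  drainReady(nowMs: number): string[] {
    const ready: string[] = [];
    this.lastSeen.forEach((seenAt, path) => {
      if (nowMs - seenAt >= this.debounceMs) {
        ready.push(path);
      }
    });
    for (const path of ready) {
      this.lastSeen.delete(path);
    }
    return ready;
  }
}
```

In a daemon, an fs.watch callback would presumably call `touch(path, Date.now())` and a periodic timer would call `drainReady(Date.now())`. Because each path is drained once per quiet period, a burst of writes to one session file leads to a single ingest.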
Phase Overview
| Phase | Deliverable | Status |
|---|---|---|
| Phase 1 | Enriched Claude Code Parser (#5) | Done — 13 block types, 6 sidecar tables, 142 tests |
| Phase 2 | Cline + Aider Parsers (#6) | Planned |
| Phase 3 | Watch Daemon (#7) + Search & Analytics (#8) | Planned |
| Phase 4 | Secret Redaction & Policy (#9) | Planned |
| Phase 5 | Telemetry (#10) + Testing & Perf (#11) | Planned |
Storage Inventory
Complete map of every data type, where it lives, and whether it's indexed:
| Data | Source | Table | Key Columns | Indexed? |
|---|---|---|---|---|
| Session text (FTS) | All agents | `memory_fts` (QMD) | `content` | FTS5 full-text |
| Session metadata | Ingestion | `smriti_session_meta` | `session_id`, `agent_id`, `project_id` | Yes (agent, project) |
| Project registry | Path derivation | `smriti_projects` | `id`, `path`, `description` | PK |
| Agent registry | Seed data | `smriti_agents` | `id`, `parser`, `log_pattern` | PK |
| Tool usage | Block extraction | `smriti_tool_usage` | `message_id`, `tool_name`, `success`, `duration_ms` | Yes (session, tool_name) |
| File operations | Block extraction | `smriti_file_operations` | `message_id`, `operation`, `file_path`, `project_id` | Yes (session, path) |
| Commands | Block extraction | `smriti_commands` | `message_id`, `command`, `exit_code`, `is_git` | Yes (session, is_git) |
| Git operations | Block extraction | `smriti_git_operations` | `message_id`, `operation`, `branch`, `pr_url` | Yes (session, operation) |
| Errors | Block extraction | `smriti_errors` | `message_id`, `error_type`, `message` | Yes (session, type) |
| Token costs | Metadata accumulation | `smriti_session_costs` | `session_id`, `model`, input/output/cache tokens, cost | PK |
| Category tags (session) | Categorization | `smriti_session_tags` | `session_id`, `category_id`, `confidence`, `source` | Yes (category) |
| Category tags (message) | Categorization | `smriti_message_tags` | `message_id`, `category_id`, `confidence`, `source` | Yes (category) |
| Category taxonomy | Seed data | `smriti_categories` | `id`, `name`, `parent_id` | PK |
| Share tracking | Team sharing | `smriti_shares` | `session_id`, `content_hash`, `author` | Yes (hash) |
| Vector embeddings | `smriti embed` | `content_vectors` + `vectors_vec` (QMD) | `content_hash`, `embedding` | Virtual table |
| Telemetry events | Opt-in collection | `~/.smriti/telemetry.json` | `timestamp`, `event`, `data` | N/A (JSONL file) |
| Structured blocks | Block extraction | `memory_messages.metadata.blocks` (JSON) | `MessageBlock[]` | No (JSON blob) |
| Message metadata | Parsing | `memory_messages.metadata` (JSON) | `cwd`, `gitBranch`, `model`, `tokenUsage` | No (JSON blob) |
Block Type Reference
The 13 MessageBlock types extracted during ingestion:
| Block Type | Fields | Stored In |
|---|---|---|
| `text` | `text` | FTS (via `plainText`) |
| `thinking` | `thinking`, `budgetTokens` | JSON blob only |
| `tool_call` | `toolId`, `toolName`, `input` | `smriti_tool_usage` |
| `tool_result` | `toolId`, `success`, `output`, `error`, `durationMs` | Updates `success` in `smriti_tool_usage` |
| `file_op` | `operation`, `path`, `diff`, `pattern` | `smriti_file_operations` |
| `command` | `command`, `cwd`, `exitCode`, `stdout`, `stderr`, `isGit` | `smriti_commands` |
| `search` | `searchType`, `pattern`, `path`, `url`, `resultCount` | JSON blob only |
| `git` | `operation`, `branch`, `message`, `files`, `prUrl`, `prNumber` | `smriti_git_operations` |
| `error` | `errorType`, `message`, `retryable` | `smriti_errors` |
| `image` | `mediaType`, `path`, `dataHash` | JSON blob only |
| `code` | `language`, `code`, `filePath`, `lineStart` | JSON blob only |
| `system_event` | `eventType`, `data` | Cost accumulation |
| `control` | `controlType`, `command` | JSON blob only |
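The block table can be read as a discriminated union plus a routing function. The sketch below is an illustration, not the contents of `src/ingest/types.ts`. Field names come from the table, but every field type and optionality marker is an assumption, and `sidecarTableFor` is a hypothetical helper.

```typescript
// Field names follow the Block Type Reference; all types are assumed.
type MessageBlock =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string; budgetTokens?: number }
  | { type: "tool_call"; toolId: string; toolName: string; input: unknown }
  | { type: "tool_result"; toolId: string; success: boolean; output?: string; error?: string; durationMs?: number }
  | { type: "file_op"; operation: string; path: string; diff?: string; pattern?: string }
  | { type: "command"; command: string; cwd?: string; exitCode?: number; stdout?: string; stderr?: string; isGit: boolean }
  | { type: "search"; searchType: string; pattern?: string; path?: string; url?: string; resultCount?: number }
  | { type: "git"; operation: string; branch?: string; message?: string; files?: string[]; prUrl?: string; prNumber?: number }
  | { type: "error"; errorType: string; message: string; retryable?: boolean }
  | { type: "image"; mediaType: string; path?: string; dataHash?: string }
  | { type: "code"; language: string; code: string; filePath?: string; lineStart?: number }
  | { type: "system_event"; eventType: string; data: unknown }
  | { type: "control"; controlType: string; command?: string };

// Hypothetical helper: the sidecar table a block contributes to, or null
// when the block lives only in FTS or the JSON blob.
function sidecarTableFor(block: MessageBlock): string | null {
  switch (block.type) {
    case "tool_call":
    case "tool_result": // updates the success flag of the matching tool call
      return "smriti_tool_usage";
    case "file_op":
      return "smriti_file_operations";
    case "command":
      return "smriti_commands";
    case "git":
      return "smriti_git_operations";
    case "error":
      return "smriti_errors";
    case "system_event":
      // Assumption: "Cost accumulation" refers to smriti_session_costs.
      return "smriti_session_costs";
    case "text":
    case "thinking":
    case "search":
    case "image":
    case "code":
    case "control":
      return null;
    default: {
      // Compile-time exhaustiveness check: fails if a block type is unhandled.
      const unreachable: never = block;
      return unreachable;
    }
  }
}
```

The `never` branch makes the compiler flag any block type added later without a routing decision, which is one way to keep the union and the sidecar tables in step.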
Real User Testing Plan
| Scenario | What to Measure | Risk if Untested |
|---|---|---|
| Fresh install + first ingest | Time-to-first-search, error quality | Bad first impression, confusing errors |
| 500+ sessions accumulated | Search latency, DB file size, `smriti status` accuracy | Performance cliff after months of use |
| Multi-project workspace | Project ID derivation accuracy, cross-project search | Wrong project attribution for sessions |
| Team sharing (2+ devs) | Sync conflicts, dedup accuracy, content hash stability | Duplicate or lost knowledge articles |
| Long-running session (4+ hrs) | Memory during ingest, block count accuracy, cost tracking | OOM or missed data at end of session |
| Rapid session creation | Watch daemon debouncing, no duplicate ingestion | Double-counting sessions |
| Agent switch mid-task | Cross-agent file tracking, unified timeline | Gaps in activity log |
| Secret in session | Detection rate, redaction completeness, share blocking | Leaked credentials in .smriti/ |
| Large JSONL file (50MB+) | Parse time, memory usage, incremental ingest | Crash or multi-minute ingest |
| Corrupt/truncated files | Error messages, graceful skip, no data loss | Silent data corruption |
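For the corrupt/truncated file scenario, "graceful skip" could look like the sketch below: parse a JSONL transcript line by line, keep every valid record, and report the lines that failed instead of aborting. `parseJsonlTolerant` and `ParseReport` are hypothetical names, and this is not the project's parser.

```typescript
interface ParseReport {
  records: unknown[];
  // 1-based line numbers of lines that could not be parsed, with the reason.
  skipped: { line: number; reason: string }[];
}

// Parse JSONL content, skipping blank lines and recording malformed ones.
function parseJsonlTolerant(content: string): ParseReport {
  const report: ParseReport = { records: [], skipped: [] };
  content.split("\n").forEach((raw, index) => {
    const line = raw.trim(); // also strips a trailing "\r" from CRLF files
    if (line === "") return;
    try {
      report.records.push(JSON.parse(line));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      report.skipped.push({ line: index + 1, reason });
    }
  });
  return report;
}
```

Returning the skipped line numbers gives the test plan something concrete to assert on: a truncated final line should appear in `skipped` while every earlier record is still ingested.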
Configuration Reference
| Env Var | Default | Phase | Description |
|---|---|---|---|
| `QMD_DB_PATH` | `~/.cache/qmd/index.sqlite` | - | Database path |
| `CLAUDE_LOGS_DIR` | `~/.claude/projects` | 1 | Claude Code logs |
| `CODEX_LOGS_DIR` | `~/.codex` | - | Codex CLI logs |
| `SMRITI_PROJECTS_ROOT` | `~/zero8.dev` | 1 | Projects root for ID derivation |
| `OLLAMA_HOST` | `http://127.0.0.1:11434` | - | Ollama endpoint |
| `QMD_MEMORY_MODEL` | `qwen3:8b-tuned` | - | Ollama model for synthesis |
| `SMRITI_CLASSIFY_THRESHOLD` | `0.5` | - | LLM classification trigger |
| `SMRITI_AUTHOR` | `$USER` | - | Git author for team sharing |
| `SMRITI_WATCH_DEBOUNCE_MS` | `2000` | 3 | Watch daemon debounce interval |
| `SMRITI_TELEMETRY` | `0` | 5 | Enable telemetry collection |
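As an illustration of how these variables and defaults might be resolved, here is a minimal sketch. `SmritiConfig`, `loadConfig`, and `numberOr` are hypothetical names. Treating `SMRITI_TELEMETRY=1` as "enabled", falling back to `"unknown"` when `USER` is unset, and leaving `~` unexpanded are all assumptions.

```typescript
interface SmritiConfig {
  dbPath: string;
  claudeLogsDir: string;
  codexLogsDir: string;
  projectsRoot: string;
  ollamaHost: string;
  memoryModel: string;
  classifyThreshold: number;
  author: string;
  watchDebounceMs: number;
  telemetryEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable, keeping the default for unset or invalid values.
function numberOr(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === "") return fallback;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Defaults are copied from the Configuration Reference table.
function loadConfig(env: Env): SmritiConfig {
  return {
    dbPath: env["QMD_DB_PATH"] ?? "~/.cache/qmd/index.sqlite",
    claudeLogsDir: env["CLAUDE_LOGS_DIR"] ?? "~/.claude/projects",
    codexLogsDir: env["CODEX_LOGS_DIR"] ?? "~/.codex",
    projectsRoot: env["SMRITI_PROJECTS_ROOT"] ?? "~/zero8.dev",
    ollamaHost: env["OLLAMA_HOST"] ?? "http://127.0.0.1:11434",
    memoryModel: env["QMD_MEMORY_MODEL"] ?? "qwen3:8b-tuned",
    classifyThreshold: numberOr(env["SMRITI_CLASSIFY_THRESHOLD"], 0.5),
    // Assumption: "unknown" is used when neither variable is set.
    author: env["SMRITI_AUTHOR"] ?? env["USER"] ?? "unknown",
    watchDebounceMs: numberOr(env["SMRITI_WATCH_DEBOUNCE_MS"], 2000),
    // Assumption: only the literal value "1" turns telemetry on.
    telemetryEnabled: env["SMRITI_TELEMETRY"] === "1",
  };
}
```

Passing the environment in as an argument (for example `loadConfig(process.env)`) keeps the function easy to test with a plain object.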
Current State
Phase 1 is complete:
- 13 structured block types defined in `src/ingest/types.ts`
- Block extraction engine in `src/ingest/blocks.ts`
- Enriched Claude parser in `src/ingest/claude.ts`
- 6 sidecar tables in `src/db.ts` with indexes and insert helpers
- 142 tests passing, 415 expect() calls across 9 test files