
Real User Testing & Performance Validation #11

@ashu17706

Description

What

A comprehensive testing and benchmarking plan that validates Smriti against real-world usage scenarios: large databases, concurrent access, cross-agent queries, and performance under load.

Why

Unit tests verify correctness in isolation, but real usage involves hundreds of sessions, thousands of messages, multiple agents writing simultaneously, and databases that grow over months. We need to validate that performance doesn't degrade and that structured data stays consistent at scale.

Tasks

Correctness Testing

  • Round-trip fidelity: ingest → search → recall → share produces accurate, complete results
  • Cross-agent dedup: same session referenced by multiple agents doesn't create duplicates
  • Sidecar consistency: every tool_call block has a matching `smriti_tool_usage` row
  • Category integrity: hierarchical categories maintain parent-child relationships after bulk operations
  • Share/sync round-trip: `smriti share` → `smriti sync` on another machine restores all metadata
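One way to make the cross-agent dedup check concrete is content-hash identity: if two agents reference the same transcript, hashing only the message content (not agent- or path-specific metadata) should collapse them into one row. This is a minimal sketch, not Smriti's actual schema or API — `SessionRef`, `contentHash`, and `dedupe` are illustrative names:

```typescript
import { createHash } from "node:crypto";

// Illustrative shape for a session referenced by some agent.
interface SessionRef {
  agent: string;
  sessionPath: string;
  messages: string[];
}

function contentHash(session: SessionRef): string {
  // Hash only the message content so that two agents pointing at the
  // same transcript collide on purpose.
  return createHash("sha256")
    .update(session.messages.join("\n"))
    .digest("hex");
}

function dedupe(refs: SessionRef[]): Map<string, SessionRef> {
  const seen = new Map<string, SessionRef>();
  for (const ref of refs) {
    const key = contentHash(ref);
    if (!seen.has(key)) seen.set(key, ref); // first writer wins
  }
  return seen;
}

const shared = ["user: fix the bug", "assistant: done"];
const refs: SessionRef[] = [
  { agent: "claude", sessionPath: "/a/session.jsonl", messages: shared },
  { agent: "cursor", sessionPath: "/b/session.jsonl", messages: shared },
];
console.log(dedupe(refs).size); // 2 refs, 1 unique session
```

A correctness test would assert the map size stays at 1 no matter how many agents re-reference the same content.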

Performance Benchmarks

  • Ingestion throughput: time to ingest 100/500/1000 sessions
  • Search latency: FTS query time at 1k/10k/50k messages (target: < 50ms at 10k)
  • Vector search latency: embedding search at 1k/10k vectors (target: < 200ms at 10k)
  • Sidecar query speed: analytics queries on sidecar tables at scale
  • Database size: measure SQLite file size at 1k/10k/50k messages
  • Memory usage: peak RSS during ingestion of large sessions (target: < 256MB)
  • Watch daemon overhead: CPU/memory when idle vs during active session
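The latency targets above imply a harness that reports percentiles, not single runs, since FTS timings are noisy. A minimal sketch, with a synthetic linear scan standing in for a real FTS query (the workload and function names are placeholders, not Smriti code):

```typescript
import { performance } from "node:perf_hooks";

// Run a query function N times and report p50/p95 latency in ms.
function bench(fn: () => void, runs = 100): { p50: number; p95: number } {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const t0 = performance.now();
    fn();
    samples.push(performance.now() - t0);
  }
  samples.sort((a, b) => a - b);
  const pick = (q: number) =>
    samples[Math.min(samples.length - 1, Math.floor(q * samples.length))];
  return { p50: pick(0.5), p95: pick(0.95) };
}

// Synthetic stand-in workload: scan 10k messages for a term.
const corpus = Array.from({ length: 10_000 }, (_, i) => `message ${i} about testing`);
const { p50, p95 } = bench(() => corpus.filter((m) => m.includes("5000")));
console.log(p95 >= p50); // p95 is never below the median
```

A benchmark test would swap the stand-in for the real search call and assert `p95 < 50` at the 10k-message fixture.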

Stress Testing

  • Large session files: JSONL files > 50MB (long coding sessions)
  • Many small sessions: 1000+ sessions with < 10 messages each
  • Concurrent ingestion: two agents writing to DB simultaneously
  • Corrupt data handling: malformed JSONL, truncated files, missing fields
  • Disk space: behavior when SQLite DB approaches filesystem limits
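The corrupt-data requirement means ingestion must be line-tolerant: one malformed JSONL line should be reported, not crash the whole ingest. A sketch of that behavior, assuming a hypothetical `ingestJsonl` helper (not Smriti's real ingest path):

```typescript
interface IngestResult {
  records: unknown[];
  errors: { line: number; message: string }[];
}

// Parse JSONL line by line; collect malformed lines as errors instead
// of throwing, so truncated files still yield their valid records.
function ingestJsonl(text: string): IngestResult {
  const records: unknown[] = [];
  const errors: { line: number; message: string }[] = [];
  text.split("\n").forEach((raw, i) => {
    if (raw.trim() === "") return; // skip blank lines and trailing newline
    try {
      records.push(JSON.parse(raw));
    } catch (e) {
      errors.push({ line: i + 1, message: `malformed JSON: ${(e as Error).message}` });
    }
  });
  return { records, errors };
}

// A truncated middle line should cost exactly one record, not the file.
const result = ingestJsonl('{"role":"user"}\n{truncated\n{"role":"assistant"}');
console.log(result.records.length, result.errors.length);
```

The stress tests would feed truncated and field-missing fixtures through this path and assert error messages name the offending line.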

Security Testing

  • Secret detection coverage: test against curated list of real secret patterns
  • Redaction completeness: no secrets survive ingestion → search → share pipeline
  • Path traversal: crafted file paths in tool calls don't escape expected directories
  • SQL injection: category names, project IDs with special characters
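For the path-traversal case, the standard check is to resolve the candidate path against an allowed root and verify it stays inside after normalization. A minimal sketch using Node's `path` module — the root and paths are illustrative, and the real guard would live wherever Smriti records tool-call file paths:

```typescript
import { resolve, sep } from "node:path";

// Reject file paths that escape the allowed root once ".." segments
// and symlink-free normalization are applied.
function isWithinRoot(root: string, candidate: string): boolean {
  const absRoot = resolve(root);
  const absPath = resolve(absRoot, candidate);
  return absPath === absRoot || absPath.startsWith(absRoot + sep);
}

console.log(isWithinRoot("/workspace", "src/index.ts"));     // true
console.log(isWithinRoot("/workspace", "../../etc/passwd")); // false
```

For the SQL-injection bullet, the analogous assertion is that category names and project IDs only ever reach SQLite through bound parameters, never string concatenation.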

Files

  • `test/benchmark.test.ts` — new: performance benchmarks
  • `test/stress.test.ts` — new: stress and edge-case tests
  • `test/security.test.ts` — new: security validation tests
  • `test/e2e.test.ts` — new: end-to-end round-trip tests
  • `test/fixtures/large/` — new: large synthetic test data
  • `scripts/generate-fixtures.ts` — new: test data generator
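The fixture generator might look something like the sketch below: emit N synthetic JSONL sessions of M messages each, sized to hit the 1k/10k/50k tiers. The message shape here is illustrative, not Smriti's real transcript schema:

```typescript
// Generate one synthetic session as a JSONL string with alternating
// user/assistant messages and monotonically increasing timestamps.
function generateSession(sessionId: string, messageCount: number): string {
  const lines: string[] = [];
  for (let i = 0; i < messageCount; i++) {
    lines.push(
      JSON.stringify({
        sessionId,
        role: i % 2 === 0 ? "user" : "assistant",
        content: `synthetic message ${i} for load testing`,
        ts: 1_700_000_000_000 + i * 1000,
      }),
    );
  }
  return lines.join("\n") + "\n";
}

const jsonl = generateSession("fixture-001", 10);
console.log(jsonl.trim().split("\n").length); // one line per message
```

Writing these to `test/fixtures/large/` keeps the benchmarks reproducible without committing real session data.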

Acceptance Criteria

  • All correctness tests pass on a clean install
  • Ingestion throughput: ≥ 50 sessions/second
  • FTS search: < 50ms at 10k messages
  • Vector search: < 200ms at 10k vectors
  • No memory leaks during 1-hour watch daemon run
  • Zero secrets survive the full pipeline in security tests
  • Corrupt/malformed input produces clear error messages, never crashes

Real User Testing Plan

| Scenario | What to Measure | Risk if Untested |
| --- | --- | --- |
| Fresh install + first ingest | Time-to-first-search, error messages | Bad first impression |
| 500+ sessions accumulated | Search latency, DB size, `smriti status` accuracy | Performance cliff |
| Multi-project workspace | Project ID derivation accuracy, cross-project search | Wrong project attribution |
| Team sharing (2+ developers) | Sync conflicts, dedup accuracy, content hash stability | Duplicate/lost knowledge |
| Long-running session (4+ hours) | Memory during ingest, block count accuracy, cost tracking | OOM or missed data |
| Rapid session creation | Watch daemon debouncing, no duplicate ingestion | Double-counting |
| Agent switch mid-task | Cross-agent file operation tracking, timeline accuracy | Gaps in activity log |
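The "Rapid session creation" scenario hinges on event coalescing: a burst of file-save events on the same session must produce one ingest, not several. A deterministic sketch of that behavior (class and method names are hypothetical, not the watch daemon's real API):

```typescript
// Coalesce filesystem events per path; the watch daemon would call
// flush() after its quiet period elapses, ingesting each path once.
class IngestDebouncer {
  private pending = new Set<string>();

  // Record an event; repeats within the same window collapse.
  event(path: string): void {
    this.pending.add(path);
  }

  // Return each pending path exactly once and reset the window.
  flush(): string[] {
    const paths = [...this.pending];
    this.pending.clear();
    return paths;
  }
}

const d = new IngestDebouncer();
d.event("/sessions/s1.jsonl");
d.event("/sessions/s1.jsonl"); // rapid re-save of the same file
d.event("/sessions/s2.jsonl");
console.log(d.flush().length); // two distinct files, each ingested once
```

A test for the double-counting risk would fire a burst of events and assert the ingest count equals the number of distinct session files.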

Testing

```
bun test test/benchmark.test.ts       # Performance benchmarks
bun test test/stress.test.ts          # Stress tests
bun test test/security.test.ts        # Security validation
bun test test/e2e.test.ts             # End-to-end round-trips
bun run scripts/generate-fixtures.ts  # Generate large test data
```

Metadata

Labels

`enhancement` (New feature or request), `phase-5` (Phase 5: Telemetry & validation)
