feat: similarity scoring overhaul (Phases 1-3) by miethe · Pull Request #102 · miethe/skillmeat

miethe · 2026-02-26T19:35:57Z

Summary

Phase 1: Replaced metadata scoring with bigram + BM25 text similarity methods; rebalanced composite weights; added text_score field to similarity API and breakdown tooltip
Phase 2: Added SimilarityCacheManager with FTS5 pre-filtering, fingerprint columns, cache invalidation on refresh/import, and cache-first similar endpoint with X-Cache headers; comprehensive test suite
Phase 3: Added SentenceTransformerEmbedder and ArtifactEmbedding ORM model; integrated semantic scoring into cache manager; web UI displays semantic scores and embedding indicator; full test coverage

Changes

skillmeat/core/scoring/ — new text_similarity.py, embedder.py; renamed HaikuEmbedder → AnthropicEmbedder; updated match_analyzer, semantic_scorer, service
skillmeat/cache/ — new SimilarityCacheManager, SimilarityCache/ArtifactEmbedding ORM models, 3 Alembic migrations
skillmeat/api/ — updated artifacts.py router (cache-first endpoint), artifact_cache_service.py, schemas/artifacts.py
skillmeat/web/ — updated similar-artifacts-tab.tsx, use-similar-artifacts.ts, lib/api.ts, types/similarity.ts, mini-artifact-card.tsx
tests/ — 4 new test files (test_text_similarity, test_match_analyzer, test_similarity_cache, test_embedder) + 844-2000+ new test lines
docs/ + CHANGELOG.md + README.md updated

Test plan

Run pytest tests/test_text_similarity.py tests/test_match_analyzer.py tests/test_similarity_cache.py tests/test_embedder.py
Verify X-Cache: HIT/MISS headers on /api/v1/artifacts/{id}/similar
Check similar tab shows semantic scores and embedding indicator in web UI
Verify cache invalidation fires on POST /cache/refresh

🤖 Generated with Claude Code

SSO-1.1: Create character bigram Jaccard for name similarity and pure-Python BM25-style description comparison. Zero new dependencies. Strips hyphens/underscores for name normalization, filters domain stop-words for descriptions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
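To illustrate the name-similarity half of SSO-1.1, here is a minimal sketch of character bigram Jaccard. It is not the code in text_similarity.py: the helper names `_normalize` and `_bigrams`, the lowercasing step, and the handling of empty or one-character names are assumptions.

```python
def _normalize(name: str) -> str:
    """Lowercase and strip hyphens/underscores (lowercasing is an assumption)."""
    return name.lower().replace("-", "").replace("_", "")


def _bigrams(text: str) -> set[str]:
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def bigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams, in [0.0, 1.0]."""
    na, nb = _normalize(a), _normalize(b)
    if not na or not nb:
        return 0.0  # assumed: an empty name matches nothing
    if na == nb:
        return 1.0  # covers one-character names, which have no bigrams
    ba, bb = _bigrams(na), _bigrams(nb)
    union = ba | bb
    if not union:
        return 0.0
    return len(ba & bb) / len(union)
```

With this normalization, `code-review` and `code_review` compare as identical names.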

SSO-1.2: Replace description length ratio with bm25_description_similarity and title token Jaccard with bigram_similarity. Rebalance sub-weights: tags=0.30, type=0.15, title=0.25, description=0.25, length=0.05. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
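The rebalanced sub-weights from SSO-1.2 sum to 1.0, so the metadata score can be read as a plain weighted sum. The sketch below assumes that reading; `combine_metadata_score` and the shape of `sub_scores` are hypothetical, not the signature of `_compute_metadata_score`.

```python
# Sub-weights as listed in SSO-1.2.
METADATA_SUB_WEIGHTS = {
    "tags": 0.30,
    "type": 0.15,
    "title": 0.25,
    "description": 0.25,
    "length": 0.05,
}


def combine_metadata_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of sub-scores; missing sub-scores count as 0.0 (assumed)."""
    return sum(
        weight * sub_scores.get(key, 0.0)
        for key, weight in METADATA_SUB_WEIGHTS.items()
    )
```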

SSO-1.3: Update weights to keyword=0.25, metadata=0.30, content=0.20, structure=0.15, semantic=0.10. Fallback redistribution preserves 1.0 sum via proportional formula. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
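One way to read the "proportional formula" in SSO-1.3: drop the unavailable components and divide each remaining weight by the remaining total, so the weights still sum to 1.0. The sketch below assumes that interpretation; `redistribute_weights` is a hypothetical name.

```python
# Composite weights as listed in SSO-1.3.
COMPOSITE_WEIGHTS = {
    "keyword": 0.25,
    "metadata": 0.30,
    "content": 0.20,
    "structure": 0.15,
    "semantic": 0.10,
}


def redistribute_weights(
    weights: dict[str, float], unavailable: set[str]
) -> dict[str, float]:
    """Rescale the available weights proportionally so they sum to 1.0."""
    remaining = {k: w for k, w in weights.items() if k not in unavailable}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no scoring components available")
    return {k: w / total for k, w in remaining.items()}
```

Under this reading, when semantic scoring is unavailable the keyword weight becomes 0.25 / 0.90, roughly 0.278.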

SSO-1.4: Add optional text_score to ScoreBreakdown dataclass, DTO schema, and router mapping. Falls back to metadata_score when text scoring hasn't populated the field directly. Backward compatible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
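The fallback in SSO-1.4 can be sketched as below. This is a cut-down stand-in for the real ScoreBreakdown, which carries more fields; `resolve_text_score` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreBreakdown:
    """Simplified stand-in; only the two fields relevant here."""
    metadata_score: float
    text_score: Optional[float] = None  # optional, so older callers still work


def resolve_text_score(breakdown: ScoreBreakdown) -> float:
    """Use text_score when populated, otherwise fall back to metadata_score."""
    if breakdown.text_score is not None:
        return breakdown.text_score
    return breakdown.metadata_score
```

The check is `is not None` rather than truthiness so that a legitimate text score of 0.0 is not replaced by the metadata score.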

SSO-1.5: Add text_score to SimilarityBreakdown TypeScript type and render "Text: XX%" row in MiniArtifactCard score breakdown tooltip. Hidden when null/undefined for backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SSO-1.6: 44 tests covering bigram_similarity, bm25_description_similarity, and rebalanced _compute_metadata_score. Validates thresholds: same-desc different-name >= 0.6, same-name different-desc >= 0.4. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… table
SSO-2.1: Add artifact_content_hash, artifact_structure_hash, artifact_file_count, artifact_total_size columns to CollectionArtifact for content-based scoring.
SSO-2.2: Create SimilarityCache ORM model with composite PK, FK cascade deletes, and score-descending index for efficient similarity lookups.
SSO-2.7: Add FTS5 virtual table (artifact_fts) with porter/ascii tokenizer for BM25 candidate pre-filtering. Includes insert/delete/update triggers and graceful fallback when FTS5 module unavailable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
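A rough sketch of the FTS5 piece of SSO-2.7 using the standard-library sqlite3 module. Only the table name artifact_fts and the porter/ascii tokenizer come from the commit message; the column names, the function names, and the standalone (trigger-free) table are assumptions made for illustration. The real migration also installs sync triggers, which are omitted here.

```python
import sqlite3


def create_artifact_fts(conn: sqlite3.Connection) -> bool:
    """Create the FTS5 table; return False if FTS5 is not compiled in."""
    try:
        conn.execute(
            "CREATE VIRTUAL TABLE IF NOT EXISTS artifact_fts USING fts5("
            "artifact_id UNINDEXED, name, description, "
            "tokenize='porter ascii')"
        )
        return True
    except sqlite3.OperationalError:
        # Graceful fallback: the caller scores all candidates instead.
        return False


def prefilter_candidates(
    conn: sqlite3.Connection, query: str, limit: int = 50
) -> list[str]:
    """Return up to `limit` artifact ids, best BM25 match first.

    `query` must be a valid FTS5 MATCH expression; raw user text may
    need escaping before it is passed in.
    """
    rows = conn.execute(
        "SELECT artifact_id FROM artifact_fts "
        "WHERE artifact_fts MATCH ? "
        "ORDER BY bm25(artifact_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

SQLite's `bm25()` returns lower values for better matches, so ascending order puts the strongest candidates first.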

SSO-2.3: Create SimilarityCacheManager with get_similar(), compute_and_store(), invalidate(), and rebuild_all() methods. FTS5 pre-filter reduces candidates to top-50 before full scoring with graceful fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SSO-2.4: Compute artifact_content_hash, artifact_structure_hash, artifact_file_count, artifact_total_size from filesystem during refresh_single_artifact_cache() and populate_collection_artifact_from_import(). Update _fingerprint_from_row() to read from CollectionArtifact columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
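A hedged sketch of how the four fingerprint values in SSO-2.4 might be derived from an artifact directory. The choice of SHA-256, the split between content and structure hashing, and the function name `compute_fingerprint` are all assumptions; only the four column names come from the commit message.

```python
import hashlib
from pathlib import Path


def compute_fingerprint(root: Path) -> dict:
    """Walk `root` and derive content/structure hashes plus size stats."""
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),  # deterministic order
    )
    content = hashlib.sha256()
    structure = hashlib.sha256()
    total_size = 0
    for path in files:
        data = path.read_bytes()
        # Structure hash covers relative paths only; content hash covers bytes.
        structure.update(path.relative_to(root).as_posix().encode("utf-8") + b"\0")
        content.update(data)
        total_size += len(data)
    return {
        "artifact_content_hash": content.hexdigest(),
        "artifact_structure_hash": structure.hexdigest(),
        "artifact_file_count": len(files),
        "artifact_total_size": total_size,
    }
```

With this split, two artifacts with the same file layout but different file contents share a structure hash and differ in content hash.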

SSO-2.5: After refresh_single_artifact_cache() syncs an artifact, invalidate and recompute its similarity cache. Non-fatal — cache rebuild failures are logged as warnings without blocking refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
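The non-fatal behaviour in SSO-2.5 amounts to wrapping the invalidate-and-recompute step so that failures are logged and swallowed. A minimal sketch, assuming `invalidate()` and `compute_and_store()` each take an artifact id (the method names come from SSO-2.3; the signatures and the wrapper name are assumptions):

```python
import logging

logger = logging.getLogger(__name__)


def rebuild_similarity_cache(manager, artifact_id: str) -> bool:
    """Invalidate and recompute; log a warning instead of raising on failure."""
    try:
        manager.invalidate(artifact_id)
        manager.compute_and_store(artifact_id)
        return True
    except Exception as exc:  # deliberately broad: refresh must not be blocked
        logger.warning(
            "Similarity cache rebuild failed for %s: %s", artifact_id, exc
        )
        return False
```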

SSO-2.6: Update GET /api/v1/artifacts/{id}/similar to read from SimilarityCacheManager first. Returns X-Cache: HIT/MISS and X-Cache-Age headers. Falls back to live computation on cache miss, then persists results for next request. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
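The cache-first flow in SSO-2.6 can be sketched without any web framework. Here the cache is a plain dict and `compute` is any callable; both are stand-ins for SimilarityCacheManager and the live scorer. Reporting X-Cache-Age in whole seconds, and omitting it on a MISS, are assumptions.

```python
import time
from typing import Callable


def get_similar_cache_first(
    artifact_id: str,
    cache: dict,
    compute: Callable[[str], list],
    now: Callable[[], float] = time.time,
) -> tuple[list, dict]:
    """Return (results, headers), serving from cache when possible."""
    entry = cache.get(artifact_id)
    if entry is not None:
        results, stored_at = entry
        age_seconds = int(now() - stored_at)
        return results, {"X-Cache": "HIT", "X-Cache-Age": str(age_seconds)}
    # Cache miss: compute live, then persist for the next request.
    results = compute(artifact_id)
    cache[artifact_id] = (results, now())
    return results, {"X-Cache": "MISS"}
```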

SSO-2.8: Extract X-Cache and X-Cache-Age headers from similar endpoint. Show subtle "cached 2m ago" indicator on cache HIT. Add apiRequestWithHeaders utility. Reduce stale time to 30s per interactive tier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SSO-2.9: 26 tests covering get_similar (miss, ordering, limit, min_score), compute_and_store (persist, replace, top-20 limit), invalidate (source, target, both), cache hit/miss integration, and content score with fingerprint columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

All 9 tasks (SSO-2.1 through SSO-2.9) completed:
- Schema: fingerprint columns, SimilarityCache model, FTS5 virtual table
- Cache: SimilarityCacheManager with FTS5 pre-filtering
- API: cache-first endpoint with X-Cache headers
- Frontend: cache age indicator
- Tests: 26 new tests passing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…er to AnthropicEmbedder SSO-3.1: Create local sentence-transformers embedder implementing EmbeddingProvider interface with lazy model loading and thread-safe async execution. Rename non-functional HaikuEmbedder to AnthropicEmbedder with is_available() returning False. Update all imports across codebase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
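"Lazy model loading with thread-safe async execution" in SSO-3.1 could look roughly like the sketch below. It is not the SentenceTransformerEmbedder implementation: `LazyEmbedder` and the `loader` callable are hypothetical, and the loaded model is treated as a plain callable so the example runs without sentence-transformers installed.

```python
import asyncio
import threading
from typing import Callable, Optional


class LazyEmbedder:
    """Loads the model on first use and runs it off the event loop."""

    def __init__(self, loader: Callable[[], Callable[[str], list]]):
        self._loader = loader  # hypothetical: returns a text -> vector callable
        self._model: Optional[Callable[[str], list]] = None
        self._lock = threading.Lock()

    def _get_model(self) -> Callable[[str], list]:
        # Double-checked locking so concurrent callers load the model once.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model

    def _embed_sync(self, text: str) -> list:
        return self._get_model()(text)

    async def embed(self, text: str) -> list:
        # Run the blocking call in a worker thread (Python 3.9+).
        return await asyncio.to_thread(self._embed_sync, text)
```

Constructing the embedder is cheap; the potentially slow model load only happens on the first `embed()` call.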

SSO-3.2: Add sentence-transformers>=2.7.0 as optional [semantic] dependency.
SSO-3.3: Add ArtifactEmbedding table with embedding BLOB, model_name, embedding_dim, and computed_at columns. Alembic migration with FK cascade delete to artifacts table.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ntic scoring SSO-3.4: Wire SentenceTransformerEmbedder into compute_and_store() flow. Add _get_or_compute_embedding() for cache-first embedding retrieval and _cosine_similarity() for vector comparison. Store embeddings as float32 BLOB in ArtifactEmbedding table. Full composite weights (semantic=0.10) used when embeddings available; fallback weights when not. Also fix FTS5 trigger bug: replace non-existent a.title column with NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
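For SSO-3.4, a standard-library sketch of storing a vector as a float32 BLOB and comparing two vectors by cosine similarity. The helper names `to_blob` and `from_blob`, the native byte order, and returning 0.0 for zero-length or mismatched vectors are assumptions; the real `_cosine_similarity()` may treat those edge cases differently.

```python
import math
from array import array


def to_blob(vector: list[float]) -> bytes:
    """Pack a vector as float32 values (4 bytes each, native byte order)."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> list[float]:
    """Unpack a float32 BLOB back into a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 for degenerate input (assumed)."""
    if not a or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Storing float32 rather than Python's native float64 halves the BLOB size at the cost of some precision.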

…hase 3 tests
SSO-3.5: Show real semantic percentages in similar-artifacts-tab when embeddings are available. Add subtle EmbeddingStatusIndicator with Zap/ZapOff icons and tooltip for embedding availability status.
SSO-3.6: Add comprehensive test_embedder.py with 29 tests covering SentenceTransformerEmbedder availability/embedding/lazy-loading, AnthropicEmbedder always-unavailable, cosine similarity edge cases, and cache manager embedding integration. All tests work without sentence-transformers installed via mocking.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

All 18 tasks across 3 phases completed:
- Phase 1: Fixed scoring algorithm (bigram similarity, rebalanced weights)
- Phase 2: Pre-computation cache (SimilarityCache, FTS5, cache headers)
- Phase 3: Optional embedding enhancement (sentence-transformers, ORM storage)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

miethe and others added 22 commits February 26, 2026 12:22

chore: mark Phase 1 similarity scoring overhaul complete

ba18864

All 6 tasks (SSO-1.1 through SSO-1.6) completed. 44 tests passing. Progress tracking and plan status updated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: mark similarity scoring overhaul PRD as complete

c4b56d1

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs: updating docs from similarity-scoring

ebd2135

miethe merged commit d212dba into feat/similar-artifacts Feb 26, 2026
3 of 8 checks passed

miethe deleted the feat/similarity-overhaul branch February 26, 2026 19:51

miethe merged 22 commits into feat/similar-artifacts from feat/similarity-overhaul
