🔍 Codetriever - 🐕‍🦺🐾

Semantic code search for every AI coding agent. Built in Rust. MCP-native.

Give your AI perfect memory of your entire codebase

🚀 What's This?

Codetriever is a semantic code search engine with MCP support. The goal is to give AI agents memory of codebases through the Model Context Protocol.

# These commands exist but may not work:
codetriever mcp  # MCP server (untested, may crash)

# These commands don't exist yet:
codetriever index /path/to/repo
codetriever search "database connection pooling logic"

🎯 The Problem We Solve

  • Context windows overflow - Even 200k tokens can't fit real codebases
  • AI agents forget - They can't see your whole project structure
  • Keyword search fails - "auth logic" should find authentication code regardless of naming
  • Cloud is risky - Your proprietary code shouldn't leave your machine

🔥 Why This Matters

Every AI coding tool needs semantic search. We're building the open protocol that powers them all through MCP. Not locked to Claude, Copilot, or Cursor - works with everything.

📊 Current Status

✅ What Actually Works

  • Indexing pipeline - Parse, chunk, embed, store your codebase
  • Tree-sitter parsing - Semantic understanding of 25+ languages
  • Smart chunking - Respects token limits, preserves context
  • Vector storage - Qdrant for embeddings, PostgreSQL for metadata
  • Database tracking - Knows what files have been indexed
  • REST API - Working indexing and search endpoints
  • Semantic search - Returns ranked results from vector database

🤷 What Might Work (Untested)

  • MCP server - Agenterra scaffolded it, never tested with Claude
  • Incremental updates - Code exists, not proven
  • CLI commands - Exist but may have bugs

🚧 Coming Soon

  • Additional API endpoints
    • /similar - Find similar code chunks
    • /context - Get surrounding code context
    • /usages - Find symbol usages
    • /status - System health and metrics
    • /stats - Quick statistics
    • /clean - Remove stale entries
    • /compact - Optimize database
  • CLI commands - Direct terminal access to all features
  • Similar code finder - Find code patterns across your codebase
  • Usage finder - Track where symbols are used

⚠️ Limitations

See LIMITATIONS.md for known issues, hardware requirements, and missing features.

🏗️ Architecture

Your Code → Tree-sitter Parser → Semantic Chunks → Vector Embeddings → Qdrant
     ↑                                                                    ↓
File Tracking ←────────────────────────────────────────────────────→ Search
(PostgreSQL)                                                        (MCP/API)

💻 System Requirements

Minimum

  • RAM: 16GB (4GB available for embeddings)
  • CPU: Any x64 or ARM64 processor
  • Disk: 2GB for models + space for index
  • OS: macOS, Linux, Windows (WSL2)

Recommended

  • RAM: 32GB+
  • CPU: Apple Silicon (M1/M2/M3) or modern x64 with AVX
  • GPU: NVIDIA with CUDA or Mac with Metal
  • Disk: SSD with 10GB+ free

Prerequisites

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install Just (command runner)
cargo install just
# or on macOS
brew install just

# Docker (for PostgreSQL and Qdrant)
# Install Docker Desktop from https://www.docker.com/products/docker-desktop

From Source

git clone https://github.com/clafollett/codetriever
cd codetriever

# Setup development environment
source stack.env
just dev-setup

# Build and install
cargo install --path crates/codetriever
cargo install --path crates/codetriever-api

Quick Start (What Actually Works)

# Initialize Docker services and database
just init

# Start API server
codetriever-api

# Index via API (requires file CONTENT, not filesystem paths - SaaS-ready!)
# Path should be repo-relative (e.g., "src/main.rs" not "/Users/bob/code/project/src/main.rs")
curl -X POST http://localhost:8080/index \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "my-project",
    "files": [
      {
        "path": "src/main.rs",
        "content": "fn main() { println!(\"Hello\"); }"
      }
    ]
  }'

# Search via API (works! returns semantic results)
curl -X POST http://localhost:8080/search \
  -H "Content-Type: application/json" \
  -d '{"query": "function that prints hello", "limit": 10}'

# MCP server (exists but untested, probably broken)
codetriever mcp
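Because the index endpoint takes repo-relative paths rather than absolute filesystem paths, a client has to convert them before building the request. A minimal sketch of that conversion follows. `to_repo_relative` is a hypothetical client-side helper, not part of Codetriever, and the forward-slash normalization is an assumption about what the API expects.

```rust
use std::path::Path;

/// Hypothetical client-side helper: convert an absolute file path into the
/// repo-relative, forward-slash form used in the `path` field of the request.
/// Returns None when `file` is not inside `repo_root`.
fn to_repo_relative(repo_root: &Path, file: &Path) -> Option<String> {
    let relative = file.strip_prefix(repo_root).ok()?;
    // Join components with '/' so the separator is the same on every platform.
    let parts: Vec<String> = relative
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect();
    Some(parts.join("/"))
}

fn main() {
    let root = Path::new("/Users/bob/code/project");
    let file = Path::new("/Users/bob/code/project/src/main.rs");
    match to_repo_relative(root, file) {
        Some(path) => println!("{path}"),
        None => println!("file is outside the repository"),
    }
}
```

Returning `None` for files outside the repository root lets the caller skip them instead of sending a machine-specific path to the server.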

Development

Key Commands

# Initial setup
just dev-setup        # Install dependencies and setup environment
source stack.env      # Load development environment

# Infrastructure (Docker services)
just init            # Initialize Docker services and database
just docker-up       # Start PostgreSQL and Qdrant
just docker-down     # Stop all containers  
just docker-reset    # Clean reset of Docker environment
just docker-logs     # View service logs

# Database
just db-setup        # Initialize database schema
just db-migrate      # Run migrations
just db-reset        # Drop and recreate database

# Development workflow
just test            # Run all tests
just test-unit       # Run unit tests only (fast)
just test-integration # Run integration tests
just fmt             # Format code
just lint            # Run clippy lints
just clippy-fix      # Fix clippy warnings
just check           # Run all quality checks (fmt + lint + test)
just watch           # Watch mode for development

# Building & Running
just build           # Build debug version
just build-release   # Build optimized release
just run [args]      # Run CLI with arguments
just api             # Run API server

# Utility
just clean           # Clean build artifacts
just docs            # Generate and open documentation
just stats           # Show project statistics
just update          # Update dependencies
just audit           # Security audit
just clean-test-data # Clean Qdrant test collections

Common Workflows

# Quick setup for new contributors
just quick-start     # Runs init + test

# Full CI pipeline locally
just ci              # Runs fmt + lint + test + build

# Fix all auto-fixable issues
just fix             # Runs fmt + clippy-fix

# Development mode with auto-reload
just dev             # Starts Docker and watches API

Test Commands

# Run all tests
just test

# Run unit tests only (faster)
just test-unit

# Run integration tests
just test-integration

# Run specific crate tests
cargo test -p codetriever-indexer
cargo test -p codetriever-meta-data

Testing Infrastructure

  • Unit Tests - Comprehensive coverage with mocks
  • Integration Tests - Full stack testing with real components
  • Token Counter Tests - Accuracy and performance validation
  • Byte Offset Tests - Proper position tracking

Key Design Decisions

  • Trait-based abstractions - VectorStorage, EmbeddingService, ContentParser, TokenCounter
  • Token-aware chunking - Respects model context limits
  • Deterministic chunk IDs - UUID v5 based on content for stability
  • Incremental indexing - Git-aware change detection
  • Modular architecture - Clean separation of concerns
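The first two design decisions can be sketched together. Only the trait name `TokenCounter` comes from this README. The method signature, the `HeuristicCounter` type, its four-characters-per-token rule, and the `chunk_lines` function are assumptions for illustration. The sketch also splits on line boundaries, whereas the real chunker works from Tree-sitter parse results.

```rust
/// Trait name taken from the README; the method signature is an assumption.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Illustrative heuristic: roughly one token per four characters, rounded up.
struct HeuristicCounter;

impl TokenCounter for HeuristicCounter {
    fn count(&self, text: &str) -> usize {
        (text.chars().count() + 3) / 4
    }
}

/// Greedily pack whole lines into chunks that stay within `max_tokens`.
/// A single line that exceeds the budget becomes its own oversized chunk.
fn chunk_lines(source: &str, counter: &dyn TokenCounter, max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    for line in source.lines() {
        current.push(line);
        // If adding this line broke the budget, close the chunk before it
        // and start a new chunk with the line.
        if current.len() > 1 && counter.count(&current.join("\n")) > max_tokens {
            current.pop();
            chunks.push(current.join("\n"));
            current = vec![line];
        }
    }
    if !current.is_empty() {
        chunks.push(current.join("\n"));
    }
    chunks
}

fn main() {
    let source = "fn a() {}\nfn b() {}\nfn c() {}";
    for (i, chunk) in chunk_lines(source, &HeuristicCounter, 5).iter().enumerate() {
        println!("chunk {i}: {chunk:?}");
    }
}
```

Because `chunk_lines` depends only on the trait, a more accurate counter (for example one backed by Tiktoken) could be swapped in without touching the chunking logic, which is the point of the trait-based design.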

Status

🚧 Alpha - Core functionality is working, API stabilizing.

Completed

  • Modular crate structure
  • Tree-sitter parsing for 25+ languages
  • Vector storage with Qdrant
  • PostgreSQL state management
  • Token counting abstractions (Tiktoken, Heuristic)
  • Smart chunking service
  • REST API with Axum
  • Incremental indexing (code exists, not yet proven)
  • Comprehensive test suite
  • Working semantic search API
  • File content indexing API

In Progress

  • Performance optimization
  • CLI improvements
  • Documentation
  • MCP server implementation

Planned

  • Git integration for history
  • Multiple embedding models
  • Web UI
  • Language-specific improvements

Architecture Documentation

See docs/architecture/current-architecture.md for detailed system design.

🤝 Contributing - We Need You!

Want to make AI coding better? 🦸

Quick Wins for First Contributors

  • Add CLI commands for search/similar/context
  • Upgrade to NEW Jina code models (released Sept 3, 2025! 0.5b/1.5b/GGUF versions)
  • Improve error messages
  • Add more language tests
  • Write documentation
  • Test MCP server integration

How to Contribute

  1. Fork and clone the repository
  2. Pick a TODO from the codebase (they're everywhere!)
  3. Write tests first - We use TDD (Red/Green/Refactor)
  4. Run checks - just test && just clippy-fix
  5. Submit a PR - We review fast!

See CONTRIBUTING.md for setup details.

First PR merged gets a shoutout in the README! 🎉

📖 The Origin Story

Week 1: Human architect brainstorms during dog walks, chatting with Claude mobile. "What if AI agents could semantically search codebases?"

Friday Aug 30, 2025: First commit at 2:36 PM EDT. Human designs, AI codes. Perfect pair programming.

Labor Day Weekend: Marathon coding session. Tree-sitter parsing, embeddings, vector storage. Human guides architecture, AI implements. No sleep, pure flow state.

Week 2: PostgreSQL state management, MCP server (via our Agenterra tool), comprehensive testing. Refactored everything twice because why not.

Today: Open sourcing as an alpha experiment. Indexing and search work, MCP untested but ready for contributors!

2 weeks. 1 human architect. 1 AI developer. Pure collaboration.

Philosophy

  • Simple > Clever - Boring tech that works
  • Fast > Perfect - Ship iterations, not perfection
  • Local > Cloud - Privacy and performance first
  • Open > Closed - MIT licensed, no vendor lock-in

License

MIT - Use it, fork it, sell it, build a company. We don't care. Just make AI coding better.

🙏 Credits

Built with:

  • Rust - The perfect systems language
  • Various Tree-sitter grammars - Parse all the code
  • Qdrant - Vector database that actually works
  • Jina AI - Using v2-base-code model
    • 🔥 NEW: jina-code-embeddings released Sept 3, 2025! 0.5b/1.5b models trained on code generation - we should upgrade!
  • Agenterra - Our MCP scaffolding tool that generated the server
  • MAOS - Multi-agent orchestration system used in development
  • Claude Code - My AI pair programmer who never sleeps

Special thanks to the MCP team at Anthropic for creating the protocol that makes this possible.


From dog walks to production in 14 days. This is what happens when humans and AI build together. 🚀
