Semantic code search for every AI coding agent. Built in Rust. MCP-native.
Give your AI perfect memory of your entire codebase
Codetriever is a semantic code search engine with MCP support. The goal is to give AI agents memory of codebases through the Model Context Protocol.
# These commands exist but may not work:
codetriever mcp # MCP server (untested, may crash)
# These commands don't exist yet:
codetriever index /path/to/repo
codetriever search "database connection pooling logic"

- Context windows overflow - Even 200k tokens can't fit real codebases
- AI agents forget - They can't see your whole project structure
- Keyword search fails - "auth logic" should find authentication code regardless of naming
- Cloud is risky - Your proprietary code shouldn't leave your machine
Every AI coding tool needs semantic search. We're building an open search layer that any of them can reach through MCP. It isn't locked to Claude, Copilot, or Cursor - any MCP-compatible client should be able to use it.
Working:
- Indexing pipeline - Parse, chunk, embed, store your codebase
- Tree-sitter parsing - Semantic understanding of 25+ languages
- Smart chunking - Respects token limits, preserves context
- Vector storage - Qdrant for embeddings, PostgreSQL for metadata
- Database tracking - Knows what files have been indexed
- REST API - Working indexing and search endpoints
- Semantic search - Returns ranked results from vector database

Untested:
- MCP server - Agenterra scaffolded it, never tested with Claude
- Incremental updates - Code exists, not proven
- CLI commands - Exist but may have bugs

Planned:
- Additional API endpoints
  - /similar - Find similar code chunks
  - /context - Get surrounding code context
  - /usages - Find symbol usages
  - /status - System health and metrics
  - /stats - Quick statistics
  - /clean - Remove stale entries
  - /compact - Optimize database
- CLI commands - Direct terminal access to all features
- Similar code finder - Find code patterns across your codebase
- Usage finder - Track where symbols are used
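The "Database tracking" item above comes down to remembering what was indexed and skipping files that have not changed. Here is a minimal sketch of that idea, assuming a content-hash comparison held in memory. The project itself tracks files in PostgreSQL and describes its change detection as Git-aware, so `IndexState`, `needs_index`, and `content_hash` are hypothetical names for illustration, not the project's API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash file content. `DefaultHasher` is enough for a sketch, but its output
/// is not guaranteed stable across Rust releases; a persistent index would
/// want a fixed algorithm instead.
pub fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// In-memory stand-in for the indexed-files table (hypothetical).
#[derive(Default)]
pub struct IndexState {
    seen: HashMap<String, u64>,
}

impl IndexState {
    /// Records the current hash for `path` and returns true when the file
    /// is new or its content differs from what was recorded before.
    pub fn needs_index(&mut self, path: &str, content: &str) -> bool {
        let hash = content_hash(content);
        match self.seen.insert(path.to_string(), hash) {
            Some(previous) => previous != hash,
            None => true,
        }
    }
}
```

Calling `needs_index` twice with the same path and content returns `true` the first time and `false` the second, which is the signal an indexer would use to skip re-embedding.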
See LIMITATIONS.md for known issues, hardware requirements, and missing features.
Your Code → Tree-sitter Parser → Semantic Chunks → Vector Embeddings → Qdrant
    ↓                                                                    ↓
File Tracking ←────────────────────────────────────────────────────→ Search
(PostgreSQL)                                                        (MCP/API)
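The search step at the end of this pipeline amounts to ranking stored chunk embeddings by how close they are to the query embedding. The sketch below illustrates that step, assuming cosine similarity over in-memory vectors. In the real system Qdrant performs this lookup, and `Chunk`, `cosine`, and `rank` are hypothetical names used only for illustration.

```rust
/// A stored chunk with its embedding (hypothetical shape, for illustration).
#[derive(Debug, Clone)]
pub struct Chunk {
    pub path: String,
    pub embedding: Vec<f32>,
}

/// Cosine similarity between two vectors. Returns 0.0 when the lengths
/// differ, when either vector is empty, or when either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Rank chunks by similarity to the query, best first, keeping at most `limit`.
pub fn rank<'a>(query: &[f32], chunks: &'a [Chunk], limit: usize) -> Vec<(&'a Chunk, f32)> {
    let mut scored: Vec<(&Chunk, f32)> = chunks
        .iter()
        .map(|c| (c, cosine(query, &c.embedding)))
        .collect();
    // Sort descending by score; total_cmp gives a total order over f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

This is why a query like "auth logic" can match authentication code regardless of naming: the ranking depends on vector proximity, not on shared keywords.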
Minimum:
- RAM: 16GB (4GB available for embeddings)
- CPU: Any x64 or ARM64 processor
- Disk: 2GB for models + space for index
- OS: macOS, Linux, Windows (WSL2)

Recommended:
- RAM: 32GB+
- CPU: Apple Silicon (M1/M2/M3) or modern x64 with AVX
- GPU: NVIDIA with CUDA or Mac with Metal
- Disk: SSD with 10GB+ free
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Install Just (command runner)
cargo install just
# or on macOS
brew install just
# Docker (for PostgreSQL and Qdrant)
# Install Docker Desktop from https://www.docker.com/products/docker-desktop

git clone https://github.com/clafollett/codetriever
cd codetriever
# Setup development environment
source stack.env
just dev-setup
# Build and install
cargo install --path crates/codetriever
cargo install --path crates/codetriever-api

# Initialize Docker services and database
just init
# Start API server
codetriever-api
# Index via API (requires file CONTENT, not filesystem paths - SaaS-ready!)
# Path should be repo-relative (e.g., "src/main.rs" not "/Users/bob/code/project/src/main.rs")
curl -X POST http://localhost:8080/index \
-H "Content-Type: application/json" \
-d '{
"project_id": "my-project",
"files": [
{
"path": "src/main.rs",
"content": "fn main() { println!(\"Hello\"); }"
}
]
}'
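Because the index request above takes file content and repo-relative paths, a client has to strip the local checkout prefix before sending anything. The sketch below shows one way to do that with the standard library only. `repo_relative`, `json_escape`, and `index_body` are hypothetical helpers, not part of Codetriever, and the hand-rolled JSON escaping is there only to avoid dependencies (a real client would more likely use a JSON library).

```rust
use std::path::Path;

/// Convert a file path into a repo-relative path with forward slashes.
/// Returns None when `file` is not inside `repo_root` or equals it.
pub fn repo_relative(repo_root: &Path, file: &Path) -> Option<String> {
    let rel = file.strip_prefix(repo_root).ok()?;
    let parts: Vec<String> = rel
        .components()
        .map(|c| c.as_os_str().to_string_lossy().into_owned())
        .collect();
    if parts.is_empty() {
        None
    } else {
        Some(parts.join("/"))
    }
}

/// Minimal escaping for text placed inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for ch in s.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a single-file request body in the shape used by the curl example.
pub fn index_body(project_id: &str, path: &str, content: &str) -> String {
    format!(
        "{{\"project_id\":\"{}\",\"files\":[{{\"path\":\"{}\",\"content\":\"{}\"}}]}}",
        json_escape(project_id),
        json_escape(path),
        json_escape(content)
    )
}
```

With a root of `/Users/bob/code/project`, the file `/Users/bob/code/project/src/main.rs` becomes `src/main.rs`, which matches the path form the comment above asks for.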
# Search via API (works! returns semantic results)
curl -X POST http://localhost:8080/search \
-H "Content-Type: application/json" \
-d '{"query": "function that prints hello", "limit": 10}'
# MCP server (exists but untested, probably broken)
codetriever mcp

# Initial setup
just dev-setup # Install dependencies and setup environment
source stack.env # Load development environment
# Infrastructure (Docker services)
just init # Initialize Docker services and database
just docker-up # Start PostgreSQL and Qdrant
just docker-down # Stop all containers
just docker-reset # Clean reset of Docker environment
just docker-logs # View service logs
# Database
just db-setup # Initialize database schema
just db-migrate # Run migrations
just db-reset # Drop and recreate database
# Development workflow
just test # Run all tests
just test-unit # Run unit tests only (fast)
just test-integration # Run integration tests
just fmt # Format code
just lint # Run clippy lints
just clippy-fix # Fix clippy warnings
just check # Run all quality checks (fmt + lint + test)
just watch # Watch mode for development
# Building & Running
just build # Build debug version
just build-release # Build optimized release
just run [args] # Run CLI with arguments
just api # Run API server
# Utility
just clean # Clean build artifacts
just docs # Generate and open documentation
just stats # Show project statistics
just update # Update dependencies
just audit # Security audit
just clean-test-data # Clean Qdrant test collections

# Quick setup for new contributors
just quick-start # Runs init + test
# Full CI pipeline locally
just ci # Runs fmt + lint + test + build
# Fix all auto-fixable issues
just fix # Runs fmt + clippy-fix
# Development mode with auto-reload
just dev # Starts Docker and watches API

# Run all tests
just test
# Run unit tests only (faster)
just test-unit
# Run integration tests
just test-integration
# Run specific crate tests
cargo test -p codetriever-indexer
cargo test -p codetriever-meta-data

- Unit Tests - Comprehensive coverage with mocks
- Integration Tests - Full stack testing with real components
- Token Counter Tests - Accuracy and performance validation
- Byte Offset Tests - Proper position tracking
- Trait-based abstractions - VectorStorage, EmbeddingService, ContentParser, TokenCounter
- Token-aware chunking - Respects model context limits
- Deterministic chunk IDs - UUID v5 based on content for stability
- Incremental indexing - Git-aware change detection
- Modular architecture - Clean separation of concerns
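To make the trait-based and token-aware points above concrete, here is a small sketch of a chunker that depends only on a `TokenCounter` trait. The trait name comes from the list above, but its single-method signature, the `HeuristicCounter` implementation (roughly one token per four bytes), and `chunk_lines` are assumptions for illustration. The sketch also splits on line boundaries, whereas the project chunks along Tree-sitter syntax nodes.

```rust
/// Counts tokens in a piece of text. The method signature is an assumption.
pub trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Rough heuristic: about one token per four bytes, rounded up.
pub struct HeuristicCounter;

impl TokenCounter for HeuristicCounter {
    fn count(&self, text: &str) -> usize {
        (text.len() + 3) / 4
    }
}

/// Greedily pack whole lines into chunks that stay within `max_tokens`.
/// A single line that exceeds the budget becomes its own oversized chunk.
pub fn chunk_lines(source: &str, counter: &dyn TokenCounter, max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in source.lines() {
        // Try adding the line to the chunk being built.
        let mut candidate = current.clone();
        candidate.push_str(line);
        candidate.push('\n');
        if !current.is_empty() && counter.count(&candidate) > max_tokens {
            // Over budget: close the current chunk and start a new one.
            chunks.push(std::mem::take(&mut current));
            current.push_str(line);
            current.push('\n');
        } else {
            current = candidate;
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Because `chunk_lines` takes `&dyn TokenCounter`, swapping the heuristic for an exact tokenizer would not change the chunking code, which is the practical benefit of keeping these pieces behind traits.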
🚧 Alpha - Core functionality is working, API stabilizing.
- Modular crate structure
- Tree-sitter parsing for 25+ languages
- Vector storage with Qdrant
- PostgreSQL state management
- Token counting abstractions (Tiktoken, Heuristic)
- Smart chunking service
- REST API with Axum
- Incremental indexing
- Comprehensive test suite
- Working semantic search API
- File content indexing API
- Performance optimization
- CLI improvements
- Documentation
- MCP server implementation
- Git integration for history
- Multiple embedding models
- Web UI
- Language-specific improvements
See docs/architecture/current-architecture.md for detailed system design.
Want to make AI coding better? 🦸
- Add CLI commands for search/similar/context
- Upgrade to NEW Jina code models (released Sept 3, 2025! 0.5b/1.5b/GGUF versions)
- Improve error messages
- Add more language tests
- Write documentation
- Test MCP server integration
- Fork and clone the repository
- Pick a TODO from the codebase (they're everywhere!)
- Write tests first - We use TDD (Red/Green/Refactor)
- Run checks - just test && just clippy-fix
- Submit a PR - We review fast!
See CONTRIBUTING.md for setup details.
First PR merged gets a shoutout in the README!
Week 1: Human architect brainstorms during dog walks, chatting with Claude mobile. "What if AI agents could semantically search codebases?"
Friday Aug 30, 2025: First commit at 2:36 PM EDT. Human designs, AI codes. Perfect pair programming.
Labor Day Weekend: Marathon coding session. Tree-sitter parsing, embeddings, vector storage. Human guides architecture, AI implements. No sleep, pure flow state.
Week 2: PostgreSQL state management, MCP server (via our Agenterra tool), comprehensive testing. Refactored everything twice because why not.
Today: Open sourcing as an alpha experiment. Indexing and search work, MCP untested but ready for contributors!
2 weeks. 1 human architect. 1 AI developer. Pure collaboration.
- Simple > Clever - Boring tech that works
- Fast > Perfect - Ship iterations, not perfection
- Local > Cloud - Privacy and performance first
- Open > Closed - MIT licensed, no vendor lock-in
MIT - Use it, fork it, sell it, build a company. We don't care. Just make AI coding better.
Built with:
- Rust - The perfect systems language
- Various Tree-sitter grammars - Parse all the code
- Qdrant - Vector database that actually works
- Jina AI - Using v2-base-code model
  - 🔥 NEW: jina-code-embeddings released Sept 3, 2025! 0.5b/1.5b models trained on code generation - we should upgrade!
- Agenterra - Our MCP scaffolding tool that generated the server
- MAOS - Multi-agent orchestration system used in development
- Claude Code - My AI pair programmer who never sleeps
Special thanks to the MCP team at Anthropic for creating the protocol that makes this possible.
From dog walks to production in 14 days. This is what happens when humans and AI build together.