Code Indexing Guide
Agent Brain's AST-aware code indexing is what sets it apart from generic RAG solutions. This guide explains how code is processed, what metadata is extracted, and how to get the best results from code-aware search.
- Why AST-Aware Indexing?
- Supported Languages
- The Indexing Pipeline
- Chunk Metadata
- Code-Specific Queries
- Best Practices
- Troubleshooting
Generic RAG systems split code like any other text. This creates problems:
```python
# Text-based chunking might split here ---v
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
    # ---^ And continue the function in another chunk
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user
```

Problems:
- Function split mid-body loses semantic coherence
- Docstrings separated from their functions
- Queries for "authenticate_user" may not find the complete implementation
- Symbol metadata (name, signature) not available
Agent Brain uses tree-sitter parsers to understand code structure:
```python
# AST parser identifies function boundaries
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user
# ^^^ Entire function stays in one chunk ^^^
```

Advantages:
- Complete functions/classes in single chunks
- Rich metadata extraction (name, kind, line numbers)
- Better search relevance
- Enables structural queries
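
To make the contrast concrete, here is a minimal sketch that compares fixed-size line chunking with splitting at symbol boundaries. It uses Python's standard `ast` module as a stand-in for tree-sitter so that it runs without extra dependencies. The helper names `naive_chunks` and `symbol_chunks` are hypothetical and not part of Agent Brain.

```python
import ast

SOURCE = '''\
def authenticate_user(username, password):
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user


def logout(session):
    session.clear()
'''


def naive_chunks(source: str, lines_per_chunk: int) -> list[str]:
    """Split text every N lines, ignoring code structure."""
    lines = source.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def symbol_chunks(source: str) -> list[str]:
    """Emit one chunk per top-level function or class (hypothetical helper)."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno and end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks


text_based = naive_chunks(SOURCE, lines_per_chunk=4)
ast_based = symbol_chunks(SOURCE)
```

With four-line text chunks, the first chunk ends before `return user`; the symbol-based split keeps each function whole.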
Agent Brain supports AST-aware chunking for 11 programming languages:
| Language | Extensions | Symbol Types Extracted |
|---|---|---|
| Python | .py, .pyw, .pyi | functions, classes, methods |
| TypeScript | .ts, .tsx | functions, classes, methods, arrow functions |
| JavaScript | .js, .jsx, .mjs, .cjs | functions, classes, methods, arrow functions |
| Java | .java | classes, methods, interfaces |
| Go | .go | functions, methods, types |
| Rust | .rs | functions, impl blocks, structs, traits |
| C | .c, .h | functions |
| C++ | .cpp, .cc, .hpp | functions, classes, methods |
| C# | .cs, .csx | classes, methods, interfaces, properties, records |
| Kotlin | .kt, .kts | functions, classes |
| Swift | .swift | functions, classes |
Languages are detected automatically via:
- File Extension (primary): `.py` -> Python, `.ts` -> TypeScript
- Content Analysis (fallback): Pattern matching for language-specific syntax
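
The two detection steps can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `detect_language` is a hypothetical function, the extension table is only a subset, and the regular expressions in `CONTENT_PATTERNS` are invented for the example.

```python
import re
from pathlib import Path
from typing import Optional

# Subset of the extension table; the real mapping covers more extensions.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "javascript",
    ".go": "go",
}

# Hypothetical fallback patterns, for illustration only.
CONTENT_PATTERNS = {
    "python": re.compile(r"^\s*(def|class)\s+\w+.*:\s*$", re.MULTILINE),
    "go": re.compile(r"^\s*package\s+\w+\s*$", re.MULTILINE),
}


def detect_language(path: str, content: str = "") -> Optional[str]:
    """Return a language name, or None if the file does not look like code."""
    # Primary: file extension lookup (case-insensitive).
    language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
    if language:
        return language
    # Fallback: the first content pattern that matches wins.
    for candidate, pattern in CONTENT_PATTERNS.items():
        if pattern.search(content):
            return candidate
    return None
```

A file with a known extension never reaches the content check, which keeps the common case cheap.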
```python
# From document_loader.py
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "javascript",
    ".java": "java",
    ".go": "go",
    ".rs": "rust",
    ".cs": "csharp",
    # ... and more
}
```

```text
Source Files --> Document Loader --> Language Detection
                        |                         |
                        v                         v
                 LoadedDocument  +----------------+----------------+
                                 |                |                |
                                 v                v                v
                            Code Files                    Documentation Files
                                 |                                 |
                                 v                                 v
                            CodeChunker                   ContextAwareChunker
                           (tree-sitter)                 (sentence/paragraph)
                                 |                                 |
                                 v                                 v
                            CodeChunk[]                       TextChunk[]
                                 |                                 |
                                 +----------------+----------------+
                                                  |
                                                  v
                                         EmbeddingGenerator
                                                  |
                                                  v
                               Vector Store + BM25 Index + Graph Index
```
The DocumentLoader identifies code files and extracts initial metadata:
```python
loaded_doc = LoadedDocument(
    text=file_content,
    source=file_path,
    file_name="auth.py",
    metadata={
        "source_type": "code",
        "language": "python",
        "file_size": 2048,
    }
)
```

The CodeChunker uses LlamaIndex's CodeSplitter with tree-sitter parsing:
```python
# Configuration
code_chunker = CodeChunker(
    language="python",
    chunk_lines=40,            # Target chunk size in lines
    chunk_lines_overlap=15,    # Overlap between chunks
    max_chars=1500,            # Maximum characters per chunk
    generate_summaries=False,  # Optional LLM summaries
)
```

Chunking Strategy:
- Parse code into AST using tree-sitter
- Identify top-level symbols (functions, classes)
- Split at symbol boundaries
- Preserve context with configurable overlap
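
When a single symbol is longer than the target size it has to be split further, and the overlap keeps shared context between neighbouring chunks. Below is a minimal sketch of that windowing arithmetic, assuming line-based windows that advance by `chunk_lines - chunk_lines_overlap`. `line_windows` is a hypothetical helper; the actual splitting is done by CodeSplitter and may differ in detail.

```python
def line_windows(total_lines: int, chunk_lines: int = 40,
                 chunk_lines_overlap: int = 15) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) line ranges covering the input."""
    if chunk_lines <= chunk_lines_overlap:
        raise ValueError("chunk_lines must be larger than chunk_lines_overlap")
    step = chunk_lines - chunk_lines_overlap
    windows = []
    start = 1
    while start <= total_lines:
        end = min(start + chunk_lines - 1, total_lines)
        windows.append((start, end))
        if end == total_lines:
            break  # the last window already reaches the end of the input
        start += step
    return windows


# A 100-line function with chunk_lines=40 and chunk_lines_overlap=15
windows = line_windows(100)
```

For a 100-line function this yields the ranges (1, 40), (26, 65), (51, 90), and (76, 100), so each chunk shares 15 lines with its predecessor.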
For each chunk, the system extracts symbol metadata:
```python
# Tree-sitter query for Python
query = """
(function_definition
  name: (identifier) @name) @symbol
(class_definition
  name: (identifier) @name) @symbol
"""

# Extracted symbols
symbols = [
    {"name": "authenticate_user", "kind": "function_definition",
     "start_line": 10, "end_line": 25},
    {"name": "User", "kind": "class_definition",
     "start_line": 1, "end_line": 8},
]
```

Each code chunk receives rich metadata:
```python
chunk = CodeChunk(
    chunk_id="chunk_abc123",
    text="def authenticate_user(...): ...",
    source="/path/to/auth.py",
    chunk_index=0,
    total_chunks=5,
    token_count=150,
    metadata=ChunkMetadata(
        source_type="code",
        language="python",
        symbol_name="authenticate_user",
        symbol_kind="function_definition",
        start_line=10,
        end_line=25,
        docstring="Authenticate a user against the database.",
    )
)
```

| Field | Description | Example |
|---|---|---|
| chunk_id | Unique identifier | chunk_abc123 |
| source | File path | /project/src/auth.py |
| file_name | File name | auth.py |
| chunk_index | Position in document | 0 |
| total_chunks | Chunks from this file | 5 |
| source_type | Content type | "code" or "doc" |
| created_at | Indexing timestamp | 2024-01-15T10:30:00 |

| Field | Description | Example |
|---|---|---|
| language | Programming language | "python" |
| symbol_name | Function/class name | "authenticate_user" |
| symbol_kind | Symbol type | "function_definition" |
| start_line | Starting line number (1-based) | 10 |
| end_line | Ending line number | 25 |
| docstring | Extracted documentation | "Authenticate a user..." |
| parameters | Function parameters | ["username", "password"] |
| return_type | Return type annotation | "User" |
| decorators | Applied decorators | ["@login_required"] |
| imports | Import statements | ["jwt", "bcrypt"] |
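
To show where such fields can come from, here is a sketch that derives them for Python source with the standard `ast` module (Python 3.9+ for `ast.unparse`). It is a stand-in for the tree-sitter based extraction, and `extract_metadata` is a hypothetical function, not Agent Brain's API.

```python
import ast

SOURCE = '''\
import jwt
import bcrypt


@login_required
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    return db.get_user(username)
'''


def extract_metadata(source: str) -> list[dict]:
    """Collect per-function metadata comparable to the fields above."""
    tree = ast.parse(source)

    # Module-level imports are shared by every symbol in the file.
    imports = []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.append(node.module)

    results = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            results.append({
                "symbol_name": node.name,
                "symbol_kind": "function_definition",
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node),
                "parameters": [arg.arg for arg in node.args.args],
                "return_type": ast.unparse(node.returns) if node.returns else None,
                "decorators": ["@" + ast.unparse(d) for d in node.decorator_list],
                "imports": imports,
            })
    return results


metadata = extract_metadata(SOURCE)[0]
```

The source is only parsed, never executed, so undefined names such as `db` or `login_required` are not a problem.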
C# files receive special treatment for XML documentation:
```csharp
/// <summary>
/// Authenticates a user against the database.
/// </summary>
/// <param name="username">The username to authenticate.</param>
/// <returns>The authenticated user.</returns>
public User AuthenticateUser(string username, string password)
{
    // Implementation
}
```

Extracted metadata:

```json
{
    "docstring": "Authenticates a user against the database. The username to authenticate. The authenticated user.",
    "symbol_name": "AuthenticateUser",
    "symbol_kind": "method_declaration"
}
```

Search only code or only documentation:
```bash
# Code only
agent-brain query "database connection" --source-types code
# Documentation only
agent-brain query "installation guide" --source-types doc
# Both (default)
agent-brain query "authentication"
```

Search specific programming languages:
```bash
# Python only
agent-brain query "error handling" --languages python
# Multiple languages
agent-brain query "API client" --languages python,typescript
# Combined filters
agent-brain query "dependency injection" --source-types code --languages java,kotlin
```

Leverage symbol metadata for precise results:
```bash
# Find function definitions
agent-brain query "function authenticate" --mode bm25
# Find class implementations
agent-brain query "class UserController" --mode hybrid
# Find imports
agent-brain query "import jwt" --mode bm25 --source-types code
```

Target specific directories or files:
```bash
# Search in specific directory
agent-brain query "config" --file-paths "src/config/**"
# Multiple patterns
agent-brain query "tests" --file-paths "tests/**/*.py,spec/**/*.ts"
```

Always use `--include-code` when indexing codebases:

```bash
agent-brain index /path/to/project --include-code
```

Without this flag, only documentation files are indexed.

| Query Type | Recommended Mode | Example |
|---|---|---|
| Function name | bm25 | "authenticate_user" |
| Class with description | hybrid | "authentication class" |
| Concept explanation | vector | "how does caching work" |
| Dependencies | graph | "what imports jwt" |

When you know the target language, filter to reduce noise:
```bash
# More precise results
agent-brain query "router setup" --languages typescript
# vs. searching all languages (more noise)
agent-brain query "router setup"
```

Function and class names work best with BM25:
```bash
# BM25 for exact symbol names
agent-brain query "VectorStoreManager" --mode bm25
# Hybrid for described functionality
agent-brain query "manages vector storage" --mode hybrid
```

Enable LLM summaries for improved concept matching:

```bash
agent-brain index /project --include-code --generate-summaries
```

Trade-off: Adds ~50% to indexing time but improves vector search relevance.
Adjust chunk parameters for different code styles:
```bash
# Larger chunks for verbose languages (Java, C#)
agent-brain index /project --include-code --chunk-size 800 --overlap 100
# Smaller chunks for concise languages (Python, Go)
agent-brain index /project --include-code --chunk-size 400 --overlap 50
```

Symptoms: Queries return empty results despite indexed code.
Solutions:
- Verify code was indexed: `agent-brain status` should show `total_code_chunks > 0`
- Check language filter: Ensure your language is indexed
- Lower threshold: Try `--threshold 0.3`
- Try BM25 mode for exact terms: `--mode bm25`

Symptoms: Functions appear split across multiple chunks.
Possible Causes:
- Very long functions exceed `max_chars`
- Tree-sitter parser not available for language
Solutions:
- Increase `max_chars`: Default is 1500, try 3000 for long functions
- Verify language support: Check `LanguageDetector.get_supported_languages()`

Symptoms: Chunk metadata shows incorrect symbol name.
Explanation: When a chunk spans multiple symbols, the system assigns the most specific (deepest nested) symbol that overlaps with the chunk.
Solutions:
- This is expected behavior for boundary chunks
- Use file path filtering for precision
- Enable GraphRAG for relationship-based queries
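
The selection rule described in the explanation can be sketched as below. The `Symbol` dataclass, the `pick_symbol` function, and the tie-break (earliest start line among equally deep symbols) are assumptions made for illustration; the guide only states that the deepest overlapping symbol is chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symbol:
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # inclusive
    depth: int       # 0 for top-level, 1 for one level of nesting, ...


def pick_symbol(symbols: list[Symbol], chunk_start: int,
                chunk_end: int) -> Optional[Symbol]:
    """Return the deepest symbol whose line range overlaps the chunk."""
    overlapping = [
        s for s in symbols
        if s.start_line <= chunk_end and s.end_line >= chunk_start
    ]
    if not overlapping:
        return None
    # Deepest nesting wins; ties go to the symbol that starts first (assumed).
    return min(overlapping, key=lambda s: (-s.depth, s.start_line))


symbols = [
    Symbol("UserController", 1, 40, depth=0),
    Symbol("login", 5, 20, depth=1),
    Symbol("logout", 22, 30, depth=1),
]

# A boundary chunk covering lines 18-25 touches both login and logout.
boundary = pick_symbol(symbols, 18, 25)
```

Here the boundary chunk is labelled `login` even though it also contains the start of `logout`, which is the kind of mismatch this section describes.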
Symptoms: Files indexed with wrong language or as documentation.
Solutions:
- Check file extension matches expected pattern
- Rename files to use standard extensions
- Verify content patterns match language (for fallback detection)
Symptoms: Indexing takes much longer than expected.
Causes:
- LLM summary generation enabled
- Large files with complex AST
- Many small files (overhead per file)
Solutions:
- Disable summaries: `--no-generate-summaries`
- Exclude generated files: Use `.gitignore` patterns
- Index in batches: Split large codebases

Symptoms: Out of memory errors during code indexing.
Solutions:
- Reduce batch size: Modify `chroma_batch_size` in indexing service
- Exclude large binary files
- Index fewer languages at once

Agent Brain uses language-specific tree-sitter queries for symbol extraction. The queries are defined in chunking.py:
```python
# Python query
query_str = """
(function_definition
  name: (identifier) @name) @symbol
(class_definition
  name: (identifier) @name) @symbol
"""

# TypeScript query
query_str = """
(function_declaration
  name: (identifier) @name) @symbol
(class_declaration
  name: (type_identifier) @name) @symbol
(variable_declarator
  name: (identifier) @name
  value: (arrow_function)) @symbol
"""
```

Code metadata feeds directly into GraphRAG:
- Import relationships: `auth_module --[imports]--> jwt`
- Containment: `UserController --[contains]--> login`
- Definition locations: `authenticate --[defined_in]--> auth.py`

See GraphRAG Integration Guide for details.
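
As a rough illustration of how such relationships can be derived from code structure, the sketch below builds (subject, relation, object) triples for a Python file with the standard `ast` module. `extract_triples` is a hypothetical function, and naming the module node after the file stem is an assumption; the actual graph extraction is covered in the GraphRAG guide.

```python
import ast
from pathlib import Path

SOURCE = '''\
import jwt


class UserController:
    def login(self):
        pass


def authenticate(token):
    return token
'''


def extract_triples(source: str, file_path: str) -> list[tuple[str, str, str]]:
    """Derive imports, contains, and defined_in triples from top-level nodes."""
    path = Path(file_path)
    module = path.stem  # assumed naming scheme for the module node
    triples = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                triples.append((module, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            triples.append((module, "imports", node.module))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            triples.append((node.name, "defined_in", path.name))
        elif isinstance(node, ast.ClassDef):
            triples.append((node.name, "defined_in", path.name))
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    triples.append((node.name, "contains", child.name))
    return triples


triples = extract_triples(SOURCE, "src/auth.py")
```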
To add a new language:
- Add extension mapping in `LanguageDetector.EXTENSION_TO_LANGUAGE`
- Add content patterns in `LanguageDetector.CONTENT_PATTERNS`
- Add tree-sitter query in `CodeChunker._get_symbols()`
- Update `CodeChunker._setup_language()` for parser initialization

- GraphRAG Integration Guide - How code metadata feeds knowledge graphs
- API Reference - Code indexing API endpoints
- Configuration Reference - Chunking configuration options