Code Indexing Guide

Code Indexing Deep Dive

Agent Brain's AST-aware code indexing is what sets it apart from generic RAG solutions. This guide explains how code is processed, what metadata is extracted, and how to get the best results from code-aware search.

Why AST-Aware Indexing?

The Problem with Text-Based Chunking

Generic RAG systems split code like any other text. This creates problems:

# Text-based chunking might split here ---v
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
# ---^ And continue the function in another chunk
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user

Problems:

A function split mid-body loses its semantic coherence
Docstrings are separated from the functions they describe
Queries for "authenticate_user" may not return the complete implementation
Symbol metadata (name, signature) is not available
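To make the failure mode concrete, here is a minimal sketch of a fixed-size line chunker. `chunk_by_lines` is a hypothetical helper written for this illustration, not part of Agent Brain; it only shows how a size-based split can cut a function in two.

```python
SOURCE = '''def authenticate_user(username, password):
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user
'''


def chunk_by_lines(text: str, lines_per_chunk: int) -> list[str]:
    """Split text into consecutive chunks of at most lines_per_chunk lines."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


chunks = chunk_by_lines(SOURCE, lines_per_chunk=5)
# The signature and docstring land in the first chunk; the password check
# and the return statement land in the second, detached from the signature.
for number, chunk in enumerate(chunks):
    print(f"--- chunk {number} ---")
    print(chunk)
```

The second chunk contains no `def` line, so a search that matches it cannot tell which function the code belongs to.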

AST-Aware Solution

Agent Brain uses tree-sitter parsers to understand code structure:

# AST parser identifies function boundaries
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    user = db.get_user(username)
    if not user:
        raise AuthenticationError("User not found")
    if not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid password")
    return user
# ^^^ Entire function stays in one chunk ^^^

Advantages:

Complete functions/classes in single chunks
Rich metadata extraction (name, kind, line numbers)
Better search relevance
Enables structural queries
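The idea of chunking at symbol boundaries can be sketched with Python's built-in `ast` module. This is an assumption made to keep the example dependency-free: Agent Brain itself uses tree-sitter, and `chunk_by_top_level_symbols` is a hypothetical name.

```python
import ast

SOURCE = '''import hashlib


def hash_password(password: str) -> str:
    """Return a hex digest for the password."""
    return hashlib.sha256(password.encode()).hexdigest()


class User:
    """A registered user."""

    def __init__(self, name: str) -> None:
        self.name = name
'''


def chunk_by_top_level_symbols(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never splitting a body."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment slices the exact text covered by the node.
            chunks.append(ast.get_source_segment(source, node))
    return chunks


chunks = chunk_by_top_level_symbols(SOURCE)
for chunk in chunks:
    print(chunk)
    print("-" * 20)
```

Each chunk starts at a `def` or `class` line and ends at the last line of that body, regardless of how long the body is.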

Supported Languages

Agent Brain supports AST-aware chunking for 11 programming languages:

Language	Extensions	Symbol Types Extracted
Python	.py, .pyw, .pyi	functions, classes, methods
TypeScript	.ts, .tsx	functions, classes, methods, arrow functions
JavaScript	.js, .jsx, .mjs, .cjs	functions, classes, methods, arrow functions
Java	.java	classes, methods, interfaces
Go	.go	functions, methods, types
Rust	.rs	functions, impl blocks, structs, traits
C	.c, .h	functions
C++	.cpp, .cc, .hpp	functions, classes, methods
C#	.cs, .csx	classes, methods, interfaces, properties, records
Kotlin	.kt, .kts	functions, classes
Swift	.swift	functions, classes

Language Detection

Languages are detected automatically via:

File Extension (primary): .py -> Python, .ts -> TypeScript
Content Analysis (fallback): Pattern matching for language-specific syntax

# From document_loader.py
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "javascript",
    ".java": "java",
    ".go": "go",
    ".rs": "rust",
    ".cs": "csharp",
    # ... and more
}
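The two-stage detection described above might look roughly like the following sketch. The extension map is a subset of the one shown; `CONTENT_PATTERNS` and `detect_language` are hypothetical, and the real fallback rules may differ.

```python
import re
from pathlib import Path
from typing import Optional

# Subset of the extension map shown above.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".js": "javascript",
    ".java": "java",
    ".go": "go",
    ".rs": "rust",
    ".cs": "csharp",
}

# Hypothetical fallback patterns, chosen only for illustration.
CONTENT_PATTERNS = [
    ("python", re.compile(r"^\s*(def|class)\s+\w+.*:\s*$", re.MULTILINE)),
    ("go", re.compile(r"^package\s+\w+\s*$", re.MULTILINE)),
    ("rust", re.compile(r"^\s*(pub\s+)?fn\s+\w+", re.MULTILINE)),
]


def detect_language(path: str, content: str = "") -> Optional[str]:
    """Try the file extension first, then fall back to content patterns."""
    language = EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())
    if language:
        return language
    for candidate, pattern in CONTENT_PATTERNS:
        if pattern.search(content):
            return candidate
    return None


print(detect_language("src/auth.py"))
print(detect_language("script", "def main():\n    pass\n"))
```

Checking the extension first keeps the common case cheap; content patterns only run for files whose extension is missing or unknown.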

The Indexing Pipeline

Overview

Source Files --> Document Loader --> Language Detection
       |                                    |
       v                                    v
  LoadedDocument          +----------------+----------------+
       |                  |                                 |
       v                  v                                 v
  Code Files        Documentation Files
       |                  |
       v                  v
  CodeChunker        ContextAwareChunker
  (tree-sitter)      (sentence/paragraph)
       |                  |
       v                  v
  CodeChunk[]         TextChunk[]
       |                  |
       +--------+---------+
                |
                v
        EmbeddingGenerator
                |
                v
        Vector Store + BM25 Index + Graph Index

Step 1: Document Loading

The DocumentLoader identifies code files and extracts initial metadata:

loaded_doc = LoadedDocument(
    text=file_content,
    source=file_path,
    file_name="auth.py",
    metadata={
        "source_type": "code",
        "language": "python",
        "file_size": 2048,
    }
)

Step 2: Code Chunking

The CodeChunker uses LlamaIndex's CodeSplitter with tree-sitter parsing:

# Configuration
code_chunker = CodeChunker(
    language="python",
    chunk_lines=40,           # Target chunk size in lines
    chunk_lines_overlap=15,   # Overlap between chunks
    max_chars=1500,           # Maximum characters per chunk
    generate_summaries=False, # Optional LLM summaries
)

Chunking Strategy:

Parse code into AST using tree-sitter
Identify top-level symbols (functions, classes)
Split at symbol boundaries
Preserve context with configurable overlap
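The four steps above can be sketched as a greedy packer that only ends a chunk between top-level statements. This uses Python's `ast` module in place of tree-sitter, and `chunk_at_symbol_boundaries` and `start_line` are hypothetical helpers; the real CodeChunker also enforces `max_chars`, which this sketch omits.

```python
import ast


def start_line(node: ast.stmt) -> int:
    """First line of a statement, including decorators above a def or class."""
    decorators = getattr(node, "decorator_list", [])
    return min([node.lineno] + [d.lineno for d in decorators])


def chunk_at_symbol_boundaries(source: str, chunk_lines: int = 40, overlap: int = 0) -> list[str]:
    """Pack whole top-level statements into chunks of about chunk_lines lines."""
    lines = source.splitlines()
    # 1-based, inclusive (start, end) spans for every top-level statement.
    spans = [(start_line(node), node.end_lineno) for node in ast.parse(source).body]

    groups: list[tuple[int, int]] = []
    for start, end in spans:
        if groups and end - groups[-1][0] + 1 <= chunk_lines:
            groups[-1] = (groups[-1][0], end)  # Still within budget: extend.
        else:
            groups.append((start, end))  # Start a new chunk at this symbol.

    # Prefix each chunk with up to `overlap` preceding lines for context.
    return ["\n".join(lines[max(1, start - overlap) - 1:end]) for start, end in groups]


SOURCE = "\n".join([
    "def a():",
    "    return 1",
    "",
    "def b():",
    "    return 2",
    "",
    "def c():",
    "    return 3",
])
chunks = chunk_at_symbol_boundaries(SOURCE, chunk_lines=5)
print(chunks)
```

With a five-line budget, `a` and `b` share a chunk and `c` starts a new one. A single symbol longer than the budget still becomes one oversized chunk here; a production chunker would have to split it further.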

Step 3: Symbol Extraction

For each chunk, the system extracts symbol metadata:

# Tree-sitter query for Python
query = """
    (function_definition
      name: (identifier) @name) @symbol
    (class_definition
      name: (identifier) @name) @symbol
"""

# Extracted symbols
symbols = [
    {"name": "authenticate_user", "kind": "function_definition",
     "start_line": 10, "end_line": 25},
    {"name": "User", "kind": "class_definition",
     "start_line": 1, "end_line": 8},
]
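For readers without tree-sitter installed, the same shape of output can be reproduced with Python's `ast` module. This is a stand-in, not the query engine Agent Brain uses; `extract_symbols` and `KIND_BY_NODE` are hypothetical names, and the kind strings are mapped by hand to the tree-sitter node names used in this guide.

```python
import ast

# Map Python ast node types to the tree-sitter node names used in this guide.
KIND_BY_NODE = {
    ast.FunctionDef: "function_definition",
    ast.AsyncFunctionDef: "function_definition",
    ast.ClassDef: "class_definition",
}


def extract_symbols(source: str) -> list[dict]:
    """Collect name, kind, and 1-based line range for every function and class."""
    symbols = []
    for node in ast.walk(ast.parse(source)):
        kind = KIND_BY_NODE.get(type(node))
        if kind:
            symbols.append({
                "name": node.name,
                "kind": kind,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return sorted(symbols, key=lambda symbol: symbol["start_line"])


SOURCE = (
    "class User:\n"
    "    def check(self):\n"
    "        return True\n"
    "\n"
    "\n"
    "def authenticate_user(name):\n"
    "    return User()\n"
)
symbols = extract_symbols(SOURCE)
for symbol in symbols:
    print(symbol)
```

Because `ast.walk` visits nested nodes, methods such as `check` are reported alongside their enclosing class.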

Step 4: Metadata Enrichment

Each code chunk receives rich metadata:

chunk = CodeChunk(
    chunk_id="chunk_abc123",
    text="def authenticate_user(...): ...",
    source="/path/to/auth.py",
    chunk_index=0,
    total_chunks=5,
    token_count=150,
    metadata=ChunkMetadata(
        source_type="code",
        language="python",
        symbol_name="authenticate_user",
        symbol_kind="function_definition",
        start_line=10,
        end_line=25,
        docstring="Authenticate a user against the database.",
    )
)

Chunk Metadata

Universal Metadata (All Chunks)

Field	Description	Example
`chunk_id`	Unique identifier	`chunk_abc123`
`source`	File path	`/project/src/auth.py`
`file_name`	File name	`auth.py`
`chunk_index`	Position in document	`0`
`total_chunks`	Chunks from this file	`5`
`source_type`	Content type	`"code"` or `"doc"`
`created_at`	Indexing timestamp	`2024-01-15T10:30:00`

Code-Specific Metadata

Field	Description	Example
`language`	Programming language	`"python"`
`symbol_name`	Function/class name	`"authenticate_user"`
`symbol_kind`	Symbol type	`"function_definition"`
`start_line`	Starting line number (1-based)	`10`
`end_line`	Ending line number	`25`
`docstring`	Extracted documentation	`"Authenticate a user..."`
`parameters`	Function parameters	`["username", "password"]`
`return_type`	Return type annotation	`"User"`
`decorators`	Applied decorators	`["@login_required"]`
`imports`	Import statements	`["jwt", "bcrypt"]`
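As an illustration of where several of these fields could come from, the sketch below pulls the docstring, parameters, return type, and decorators out of a Python function with the built-in `ast` module. `describe_function` is a hypothetical helper, and Agent Brain's own extraction runs on tree-sitter rather than `ast`.

```python
import ast

SOURCE = '''
@login_required
def authenticate_user(username: str, password: str) -> User:
    """Authenticate a user against the database."""
    return db.get_user(username)
'''


def describe_function(source: str, name: str) -> dict:
    """Return docstring, parameters, return type, and decorators for one function."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return {
                "symbol_name": node.name,
                "docstring": ast.get_docstring(node),
                "parameters": [arg.arg for arg in node.args.args],
                # ast.unparse (Python 3.9+) turns an annotation node back into text.
                "return_type": ast.unparse(node.returns) if node.returns else None,
                "decorators": ["@" + ast.unparse(d) for d in node.decorator_list],
            }
    raise LookupError(f"function {name!r} not found")


info = describe_function(SOURCE, "authenticate_user")
print(info)
```

The source is only parsed, never executed, so undefined names such as `login_required` and `User` do not cause errors.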

C# Special Handling

C# files receive special treatment for XML documentation:

/// <summary>
/// Authenticates a user against the database.
/// </summary>
/// <param name="username">The username to authenticate.</param>
/// <returns>The authenticated user.</returns>
public User AuthenticateUser(string username, string password)
{
    // Implementation
}

Extracted metadata:

{
    "docstring": "Authenticates a user against the database. The username to authenticate. The authenticated user.",
    "symbol_name": "AuthenticateUser",
    "symbol_kind": "method_declaration"
}
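One way to arrive at a flattened docstring like the one above is to strip the `///` markers and the XML tags and then collapse whitespace. The sketch below is an assumption about the general approach, not the actual implementation, and `flatten_xml_doc` is a hypothetical name.

```python
import re

COMMENT = '''
/// <summary>
/// Authenticates a user against the database.
/// </summary>
/// <param name="username">The username to authenticate.</param>
/// <returns>The authenticated user.</returns>
'''


def flatten_xml_doc(comment_block: str) -> str:
    """Strip `///` prefixes and XML tags, joining the remaining text with spaces."""
    # Keep only documentation lines and drop their leading `///` marker.
    body = "\n".join(
        line.strip()[3:]
        for line in comment_block.splitlines()
        if line.strip().startswith("///")
    )
    # Replace every XML tag with a space, then collapse runs of whitespace.
    text = re.sub(r"<[^>]+>", " ", body)
    return re.sub(r"\s+", " ", text).strip()


docstring = flatten_xml_doc(COMMENT)
print(docstring)
```

This simple approach discards attribute values such as parameter names and does not decode XML entities, which is acceptable for search text but loses some structure.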

Code-Specific Queries

Filtering by Source Type

Search only code or only documentation:

# Code only
agent-brain query "database connection" --source-types code

# Documentation only
agent-brain query "installation guide" --source-types doc

# Both (default)
agent-brain query "authentication"

Filtering by Language

Search specific programming languages:

# Python only
agent-brain query "error handling" --languages python

# Multiple languages
agent-brain query "API client" --languages python,typescript

# Combined filters
agent-brain query "dependency injection" --source-types code --languages java,kotlin
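Conceptually, these flags act as predicates on chunk metadata. The sketch below shows that idea on plain dictionaries; `matches_filters` is a hypothetical helper, and the real CLI may apply filters inside the vector store rather than in Python.

```python
def matches_filters(metadata: dict, source_types=None, languages=None) -> bool:
    """Return True when the chunk metadata passes every filter that is set."""
    if source_types and metadata.get("source_type") not in source_types:
        return False
    if languages and metadata.get("language") not in languages:
        return False
    return True


chunks = [
    {"source_type": "code", "language": "java", "source": "Injector.java"},
    {"source_type": "code", "language": "python", "source": "inject.py"},
    {"source_type": "doc", "source": "README.md"},
]

# Equivalent in spirit to: --source-types code --languages java,kotlin
selected = [
    chunk["source"]
    for chunk in chunks
    if matches_filters(chunk, source_types={"code"}, languages={"java", "kotlin"})
]
print(selected)
```

A filter that is not set places no restriction, which mirrors the default of searching both code and documentation.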

Symbol-Aware Queries

Leverage symbol metadata for precise results:

# Find function definitions
agent-brain query "function authenticate" --mode bm25

# Find class implementations
agent-brain query "class UserController" --mode hybrid

# Find imports
agent-brain query "import jwt" --mode bm25 --source-types code

File Path Filtering

Target specific directories or files:

# Search in specific directory
agent-brain query "config" --file-paths "src/config/**"

# Multiple patterns
agent-brain query "tests" --file-paths "tests/**/*.py,spec/**/*.ts"
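A comma-separated pattern list can be approximated with `fnmatch` from the standard library. This is only a sketch: `matches_any_pattern` is a hypothetical helper, and `fnmatch` semantics are an assumption that may differ from the glob rules the CLI actually applies.

```python
from fnmatch import fnmatchcase


def matches_any_pattern(path: str, patterns: str) -> bool:
    """Check a path against a comma-separated list of glob patterns."""
    return any(
        fnmatchcase(path, pattern.strip())
        for pattern in patterns.split(",")
        if pattern.strip()
    )


PATTERNS = "tests/**/*.py,spec/**/*.ts"
print(matches_any_pattern("tests/unit/test_auth.py", PATTERNS))
print(matches_any_pattern("src/auth.py", PATTERNS))
```

In `fnmatch`, `*` also matches path separators, so `src/config/**` covers nested directories. Unlike shell-style recursive globs, however, `tests/**/*.py` does not match a file directly under `tests/`, because the pattern requires a second `/`.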

Best Practices

1. Include Code During Indexing

Always use --include-code when indexing codebases:

agent-brain index /path/to/project --include-code

Without this flag, only documentation files are indexed.

2. Choose the Right Search Mode

Query Type	Recommended Mode	Example
Function name	`bm25`	`"authenticate_user"`
Class with description	`hybrid`	`"authentication class"`
Concept explanation	`vector`	`"how does caching work"`
Dependencies	`graph`	`"what imports jwt"`

3. Use Language Filters for Precision

When you know the target language, filter to reduce noise:

# More precise results
agent-brain query "router setup" --languages typescript

# vs. searching all languages (more noise)
agent-brain query "router setup"

4. Leverage BM25 for Exact Matches

Function and class names work best with BM25:

# BM25 for exact symbol names
agent-brain query "VectorStoreManager" --mode bm25

# Hybrid for described functionality
agent-brain query "manages vector storage" --mode hybrid
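To see why keyword scoring suits exact symbol names, here is a toy Okapi BM25 scorer. It is a generic textbook formulation, not Agent Brain's index; `tokenize` and `bm25_scores` are hypothetical helpers, and the tokenizer deliberately keeps identifiers such as `VectorStoreManager` as single tokens, which is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters, keeping identifiers whole."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in documents]
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            containing = sum(1 for other in tokenized if term in other)
            idf = math.log((len(tokenized) - containing + 0.5) / (containing + 0.5) + 1)
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


DOCUMENTS = [
    "class VectorStoreManager: manages collections in the vector store",
    "def manage_vectors(store): helper for storing vectors",
    "class BM25Index: keyword index manager",
]
scores = bm25_scores("VectorStoreManager", DOCUMENTS)
print(scores)
```

Only the chunk that contains the exact identifier receives a non-zero score. Related wording such as `manage_vectors` contributes nothing, which is the precision that makes BM25 a good fit for symbol lookups.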

5. Generate Summaries for Better Semantic Search

Enable LLM summaries for improved concept matching:

agent-brain index /project --include-code --generate-summaries

Trade-off: Adds ~50% to indexing time but improves vector search relevance.

6. Tune Chunk Sizes for Your Codebase

Adjust chunk parameters for different code styles:

# Larger chunks for verbose languages (Java, C#)
agent-brain index /project --chunk-size 800 --overlap 100

# Smaller chunks for concise languages (Python, Go)
agent-brain index /project --chunk-size 400 --overlap 50

Troubleshooting

"No results" for Code Queries

Symptoms: Queries return empty results despite indexed code.

Solutions:

Verify code was indexed: agent-brain status should show total_code_chunks > 0
Check the language filter: Ensure files in the language you pass to --languages were actually indexed
Lower threshold: Try --threshold 0.3
Try BM25 mode for exact terms: --mode bm25

Incomplete Function Chunks

Symptoms: Functions appear split across multiple chunks.

Possible Causes:

Very long functions exceed max_chars
No tree-sitter parser is available for the language

Solutions:

Increase max_chars: Default is 1500, try 3000 for long functions
Verify language support: Check LanguageDetector.get_supported_languages()

Wrong Symbol Assigned to Chunk

Symptoms: Chunk metadata shows incorrect symbol name.

Explanation: When a chunk spans multiple symbols, the system assigns the most specific (deepest nested) symbol that overlaps with the chunk.

Solutions:

This is expected behavior for boundary chunks
Use file path filtering for precision
Enable GraphRAG for relationship-based queries
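The "deepest nested overlapping symbol" rule from the explanation above can be sketched as follows. `pick_symbol` is a hypothetical helper, and the tie-breaking between sibling symbols (first one wins) is an assumption of this sketch.

```python
def pick_symbol(chunk_start: int, chunk_end: int, symbols: list[dict]):
    """Pick the most deeply nested symbol whose line range overlaps the chunk."""
    overlapping = [
        symbol for symbol in symbols
        if symbol["start_line"] <= chunk_end and symbol["end_line"] >= chunk_start
    ]
    if not overlapping:
        return None

    def depth(symbol: dict) -> int:
        # Depth = number of other symbols whose range encloses this one.
        return sum(
            1 for other in symbols
            if other is not symbol
            and other["start_line"] <= symbol["start_line"]
            and other["end_line"] >= symbol["end_line"]
        )

    # max() returns the first of several equally deep candidates.
    return max(overlapping, key=depth)


SYMBOLS = [
    {"name": "UserService", "start_line": 1, "end_line": 40},
    {"name": "login", "start_line": 5, "end_line": 20},
    {"name": "logout", "start_line": 22, "end_line": 40},
]

print(pick_symbol(10, 15, SYMBOLS)["name"])
print(pick_symbol(18, 25, SYMBOLS)["name"])
```

A chunk covering lines 18 to 25 overlaps both `login` and `logout` but is labeled with only one of them, which is how a boundary chunk ends up with a symbol name that looks wrong.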

Language Detection Failures

Symptoms: Files indexed with wrong language or as documentation.

Solutions:

Check file extension matches expected pattern
Rename files to use standard extensions
Verify content patterns match language (for fallback detection)

Slow Code Indexing

Symptoms: Indexing takes much longer than expected.

Causes:

LLM summary generation enabled
Large files with complex AST
Many small files (overhead per file)

Solutions:

Disable summaries: --no-generate-summaries
Exclude generated files: Use .gitignore patterns
Index in batches: Split large codebases

Memory Issues During Indexing

Symptoms: Out of memory errors during code indexing.

Solutions:

Reduce batch size: Lower chroma_batch_size in the indexing service
Exclude large binary files
Index fewer languages at once

Advanced Topics

Custom Tree-Sitter Queries

Agent Brain uses language-specific tree-sitter queries for symbol extraction. The queries are defined in chunking.py:

# Python query
query_str = """
    (function_definition
      name: (identifier) @name) @symbol
    (class_definition
      name: (identifier) @name) @symbol
"""

# TypeScript query
query_str = """
    (function_declaration
      name: (identifier) @name) @symbol
    (class_declaration
      name: (type_identifier) @name) @symbol
    (variable_declarator
      name: (identifier) @name
      value: (arrow_function)) @symbol
"""

