RAGTAG is an experimental retrieval framework that tests whether explicit structural information improves LLM reliability over text-only retrieval, especially in correctness-sensitive tasks like code understanding and generation.
Rather than asking a model to infer structure implicitly from raw code, RAGTAG makes structure explicit and evaluates how that changes downstream behavior.
Large language models operate under strict context and attention limits. This creates a tradeoff: you can include more raw code, or you can include higher-level information that biases the model toward the right parts of the codebase.
In principle, an LLM can infer structure from code alone—but doing so requires repeatedly rediscovering relationships that are already known at retrieval time. RAGTAG shifts this work earlier in the pipeline by explicitly encoding dependency structure in a compact, text-based form. The result is a stronger prior at generation time, with less reliance on inference-time pattern discovery.
The hypothesis is simple: trading prompt space for explicit structure improves reliability, particularly for cross-file reasoning, dependency awareness, and precise code localization.
This project is early-stage and exploratory. Feedback, discussion, and PRs are welcome.
RAGTAG parses a repository and builds a dependency graph:
- Nodes represent functions, classes, or modules
- Edges capture dependency relationships (e.g. calls, imports)
- Each node stores text attributes such as source code, docstrings, and comments
Initial graph construction lives in py2graph/make_graph.py. The design is intentionally extensible and can be adapted to finer-grained AST-level graphs or coarser file-level graphs.
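For illustration, here is a minimal sketch of this node/edge scheme, assuming Python's built-in `ast` module and `networkx`. This is not the actual `make_graph.py` implementation, and `build_graph` is a hypothetical helper:

```python
import ast

import networkx as nx


def build_graph(source: str, module_name: str) -> nx.DiGraph:
    """Build a toy dependency graph for a single module's source."""
    tree = ast.parse(source)
    graph = nx.DiGraph()
    graph.add_node(module_name, kind="module")

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            qualname = f"{module_name}.{node.name}"
            # Each node carries its text attributes: source segment and docstring.
            graph.add_node(
                qualname,
                kind=type(node).__name__,
                text=ast.get_source_segment(source, node),
                doc=ast.get_docstring(node),
            )
            graph.add_edge(module_name, qualname, kind="contains")
            # Record call edges by callee name; resolving these names to
            # actual definitions would be a later pass.
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    graph.add_edge(qualname, child.func.id, kind="calls")
    return graph
```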
The codebase graph is serialized into a flat, text-only representation that explicitly encodes structural relationships (e.g. containment, dependency, and call edges). This serialization is injected into the prompt of a tool-calling LLM alongside retrieved code snippets.
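The exact text encoding is an implementation detail; as a sketch, a line-oriented format like the hypothetical `serialize_graph` below keeps the representation compact and easy to truncate under a context budget:

```python
import networkx as nx


def serialize_graph(graph: nx.DiGraph) -> str:
    """Flatten a dependency graph into a line-per-fact text form."""
    lines = []
    for node, attrs in graph.nodes(data=True):
        lines.append(f"NODE {node} kind={attrs.get('kind', '?')}")
    for src, dst, attrs in graph.edges(data=True):
        lines.append(f"EDGE {src} -[{attrs.get('kind', 'dep')}]-> {dst}")
    return "\n".join(lines)


# Example output for a two-function module:
#   NODE app kind=module
#   NODE app.main kind=FunctionDef
#   EDGE app -[contains]-> app.main
#   EDGE app.main -[calls]-> parse_args
```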
We evaluate whether supplying explicit structural priors—rather than relying on the model to infer them implicitly—improves code-generation performance, especially on tasks that require correct localization and multi-file reasoning.
Flat text descriptions are simple but expensive in context space. As a next step, RAGTAG explores whether soft tokens / prefix tokens can act as a compression mechanism for these precomputed structural descriptions.
The idea is to embed structural summaries and pass them to the model via learned prefix or soft tokens, reducing prompt length while preserving structural bias. Performance will be compared against text-based structure under fixed context budgets.
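As a rough sketch of that direction (nothing here is implemented yet; the class name, dimensions, and `k` are all hypothetical), a learned projection could map an embedded structural summary into `k` soft tokens that are prepended to the prompt embeddings:

```python
import torch
import torch.nn as nn


class StructuralPrefix(nn.Module):
    """Project a structure-summary embedding into k soft prefix tokens."""

    def __init__(self, summary_dim: int, model_dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.model_dim = model_dim
        self.proj = nn.Linear(summary_dim, k * model_dim)

    def forward(self, summary: torch.Tensor) -> torch.Tensor:
        # summary: (batch, summary_dim) -> prefix: (batch, k, model_dim)
        return self.proj(summary).view(-1, self.k, self.model_dim)


# prefix = StructuralPrefix(summary_dim=768, model_dim=4096)(summary_embedding)
# inputs_embeds = torch.cat([prefix, prompt_embeds], dim=1)
# Structure now costs k embedding slots instead of hundreds of text tokens.
```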