A modular Python tool designed for auditing large codebases, detecting duplication, identifying structural similarities, clustering files by semantic meaning, and generating comprehensive reports. Built as a training project to automate preliminary code reviews—especially useful when manual inspection is impractical due to scale.
Main Goal: Help filter out redundant, duplicated, or suspicious code to streamline manual review.
This tool aims to support a complete pipeline for static analysis of code repositories. It combines traditional metrics, AST-based similarity, exact and intra-file duplication detection, embedding-based clustering, and LLM-powered semantic analysis. Some features are not implemented yet or may work incorrectly.
It was created to assist in reviewing large, complex codebases where manual analysis is time-consuming or infeasible.
Currently, it is available only via a CLI interface.
To run this module, first install it with the setup.py in the root of the project:

```bash
pip install -e .
```

After installation, you can use it via the CLI with the following command:

```bash
python src/codebase_review/cli.py run-all path/to/project/to/review
```

This runs all review tools available in the project. It also creates a directory for result files, by default repo_audit_output in the launch directory, where the output files are stored.
You can use different commands to run the review tools separately, or run all of them at once. Some parts don't work correctly yet :(. I hope to fix them as soon as possible.
| Command | Description |
|---|---|
| `inventory` | Scans the given directory and generates a CSV file with a file inventory (path, size, extension, etc.). |
| `metrics` | Collects code metrics (e.g. lines, complexity) from the inventory. Optionally runs lightweight semantic analysis using Ollama (if specified). |
| `duplicates` | Finds exact duplicate files based on content hash. |
| `astsim` | Detects structurally similar code using AST comparison. Threshold controls similarity sensitivity (0.0–1.0). Sampling limits performance impact. |
| `intrafile` | Detects duplicated code blocks within single files using jscpd. |
| `quarantine` | Identifies potentially problematic files (e.g., auto-generated, large, or suspicious) and logs them. |
| `embed` | Generates semantic embeddings for source files using a SentenceTransformer model. |
| `cluster` | Clusters files based on their embeddings to find functionally similar components. |
| `semantic` | Runs semantic analysis using an Ollama-compatible LLM to classify or describe files. |
| `report` | Generates a comprehensive Markdown report summarizing findings from all previous steps. |
| `run-all` | Runs the complete pipeline: inventory → metrics → duplicates → AST sim → intrafile → quarantine → embeddings/clustering (if available) → semantic (if model given) → report. |
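For example, individual steps might be invoked like this (the project path is a placeholder, and it is assumed each subcommand takes the target path the same way `run-all` does):

```bash
# Build the file inventory only
python src/codebase_review/cli.py inventory path/to/project

# Run AST similarity with a stricter threshold on a smaller sample
python src/codebase_review/cli.py astsim path/to/project --threshold 0.7 --sample 200
```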
| Option | Description |
|---|---|
| `--output` | Path to the output directory. Default is `repo_audit_output` in the launch directory. |
| `--ollama-model` | Model name for semantic analysis. Can be used with the `metrics`, `semantic` & `run-all` commands. Default is `Qwen3-4b`. |
| `--limit` | Limit on the number of files to process. Can be used with the `metrics` & `run-all` commands. Default is `0` (no limit). |
| `--batch-size` | Batch size for LLM processing. Can be used with the `semantic` command. Default is `8`. |
| `--threshold` | Threshold for similarity detection. Can be used with the `astsim` command. Default is `0.5`. |
| `--sample` | Sample size for similarity detection. Can be used with the `astsim` command. Default is `500`. |
| `--model` | Model name for embedding generation. Can be used with the `embed` command. Default is `all-MiniLM-L6-v2`. |
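Putting the options together, a full run with a custom output directory, model, and file limit might look like this (the project path and option values are illustrative):

```bash
python src/codebase_review/cli.py run-all path/to/project \
  --output ./my_audit \
  --ollama-model Qwen3-4b \
  --limit 200
```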
This module also depends on external tools, `ollama` and `jscpd`, which must be installed separately before use.
Without `ollama` or `jscpd` installed on the system, some review tools will not work fully.
`jscpd` is used for duplicate detection inside single files and can be installed via npm with the `npm install -g jscpd` command.
`ollama` is used for semantic analysis and can be installed following the official repository instructions, using any preferred method.
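A quick way to verify that both tools are on your `PATH` before running the pipeline (a minimal shell sketch; the install hint mirrors the note above):

```bash
# Check that the external dependencies are available
for tool in jscpd ollama; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "Missing dependency: $tool (jscpd: npm install -g jscpd)"
  fi
done
```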
The output structure after running the `run-all` command should look like this:

```
./repo_audit_output/
├── ast_similarity.csv
├── clusters.csv
├── duplicates.csv
├── embeddings.npz
├── intrafile_duplication/
│   └── jscpd-report.json
├── inventory.csv
├── metrics.csv
├── ollama_analysis.jsonl
├── quarantine.csv
└── report.md
```
Each file contains information about a specific review step and is named after the corresponding review tool.
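For a quick look at the results after a run (file names as in the tree above; adjust the paths if you used a custom `--output` directory):

```bash
# Peek at the main findings
head repo_audit_output/duplicates.csv
head repo_audit_output/quarantine.csv

# The Markdown report aggregates everything
less repo_audit_output/report.md
```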
- Some modules may not work correctly yet. Work in progress.
- `ollama` and `jscpd` are required for full functionality but are not managed by pip.
- Improve logging control. Save logs to a separate log file.
- Refactor code to use `.yaml` config files for settings and variables instead of the `config.py` file.
- Add dependency validation on startup for external tools (`ollama`, `jscpd`).
- Improve code quality and documentation.
- Integrate LLM-based Markdown summary generation.
- Stop the Ollama model automatically after completion.
AI-assisted tools were used during development for code explanation, review, minor refactoring, and grammar correction. However, all logic and architecture were authored manually.