A modular Python tool designed for auditing large codebases, detecting duplication, identifying structural similarities, clustering files by semantic meaning, and generating comprehensive reports. Built as a training project to automate preliminary code reviews—especially useful when manual inspection is impractical due to scale.
Main Goal: Help filter out redundant, duplicated, or suspicious code to streamline manual review.
This tool aims to support a complete pipeline for static analysis of code repositories. It combines traditional metrics, AST-based similarity, exact and intra-file duplication detection, embedding-based clustering, and LLM-powered semantic analysis. Some features are not implemented yet or may work incorrectly.
It was created to assist in reviewing large, complex codebases where manual analysis is time-consuming or infeasible.
Currently, it is available only via a CLI interface.
To run this module, first install it with the setup.py in the root of the project:

```bash
pip install -e .
```

After installation, you can use it via the CLI with the following command:

```bash
python src/codebase_review/cli.py run-all path/to/project/to/review
```

This runs all review tools available in the project. It also creates a directory for result files, by default repo_audit_output in the launch directory, where the output files are stored.
You can use different commands to run the review tools separately, or run all of them at once. Some parts don't work correctly yet :(. I hope to fix them as soon as possible.
| Command | Description |
|---|---|
| `inventory` | Scans the given directory and generates a CSV file with a file inventory (path, size, extension, etc.). |
| `metrics` | Collects code metrics (e.g. lines, complexity) from the inventory. Optionally runs lightweight semantic analysis using Ollama (if specified). |
| `duplicates` | Finds exact duplicate files based on content hash. |
| `astsim` | Detects structurally similar code using AST comparison. Threshold controls similarity sensitivity (0.0–1.0). Sampling limits performance impact. |
| `intrafile` | Detects duplicated code blocks within single files using jscpd. |
| `quarantine` | Identifies potentially problematic files (e.g., auto-generated, large, or suspicious) and logs them. |
| `embed` | Generates semantic embeddings for source files using a SentenceTransformer model. |
| `cluster` | Clusters files based on their embeddings to find functionally similar components. |
| `semantic` | Runs semantic analysis using an Ollama-compatible LLM to classify or describe files. |
| `report` | Generates a comprehensive Markdown report summarizing findings from all previous steps. |
| `run-all` | Runs the complete pipeline: inventory → metrics → duplicates → AST sim → intrafile → quarantine → embeddings/clustering (if available) → semantic (if model given) → report. |
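For example, individual steps might be invoked like this (the project path is a placeholder, and it is assumed each subcommand takes the target path the same way `run-all` does):

```bash
# Build the file inventory only
python src/codebase_review/cli.py inventory path/to/project

# Run AST similarity with a stricter threshold on a smaller sample
python src/codebase_review/cli.py astsim path/to/project --threshold 0.7 --sample 200
```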
| Option | Description |
|---|---|
| `--output` | Path to the output directory. Default is `repo_audit_output` in the launch directory. |
| `--ollama-model` | Model name for semantic analysis. Can be used with the `metrics`, `semantic` & `run-all` commands. Default is `Qwen3-4b`. |
| `--limit` | Limit on the number of files to process. Can be used with the `metrics` & `run-all` commands. Default is `0` (no limit). |
| `--batch-size` | Batch size for LLM processing. Can be used with the `semantic` command. Default is `8`. |
| `--threshold` | Threshold for similarity detection. Can be used with the `astsim` command. Default is `0.5`. |
| `--sample` | Sample size for similarity detection. Can be used with the `astsim` command. Default is `500`. |
| `--model` | Model name for embedding generation. Can be used with the `embed` command. Default is `all-MiniLM-L6-v2`. |
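Putting the options together, a full run with a custom output directory, model, and file limit might look like this (the project path and option values are illustrative):

```bash
python src/codebase_review/cli.py run-all path/to/project \
  --output ./my_audit \
  --ollama-model Qwen3-4b \
  --limit 200
```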
This module also depends on external tools, `ollama` and `jscpd`, which must be installed separately before use.
Without `ollama` or `jscpd` installed on the system, some review tools will not work fully.
`jscpd` is used for duplicate detection inside single files and can be installed via npm with the `npm install -g jscpd` command.
`ollama` is used for semantic analysis and can be installed following the official repository instructions, using any preferred method.
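A quick way to verify that both tools are on your `PATH` before running the pipeline (a minimal shell sketch; the install hint mirrors the note above):

```bash
# Check that the external dependencies are available
for tool in jscpd ollama; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "Missing dependency: $tool (jscpd: npm install -g jscpd)"
  fi
done
```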
The output structure after running the `run-all` command should look like this:

```
./repo_audit_output/
├── ast_similarity.csv
├── clusters.csv
├── duplicates.csv
├── embeddings.npz
├── intrafile_duplication/
│   └── jscpd-report.json
├── inventory.csv
├── metrics.csv
├── ollama_analysis.jsonl
├── quarantine.csv
└── report.md
```
Each file contains information about a specific review step and is named after the corresponding review tool.
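For a quick look at the results after a run (file names as in the tree above; adjust the paths if you used a custom `--output` directory):

```bash
# Peek at the main findings
head repo_audit_output/duplicates.csv
head repo_audit_output/quarantine.csv

# The Markdown report aggregates everything
less repo_audit_output/report.md
```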
- Some modules may not work correctly yet. Work in progress.
- `ollama` and `jscpd` are required for full functionality but are not managed by pip.
- Improve logging control. Save logs to a separate log file.
- Refactor code to use `.yaml` config files for settings and variables instead of the `config.py` file.
- Add dependency validation on startup for external tools (`ollama`, `jscpd`).
- Improve code quality and documentation.
- Integrate LLM-based Markdown summary generation.
- Stop the Ollama model automatically after completion.
AI-assisted tools were used during development for code explanation, review, minor refactoring, and grammar correction. However, all logic and architecture were authored manually.