Codebase_review

Short Description

A modular Python tool designed for auditing large codebases, detecting duplication, identifying structural similarities, clustering files by semantic meaning, and generating comprehensive reports. Built as a training project to automate preliminary code reviews—especially useful when manual inspection is impractical due to scale.

Main Goal: Help filter out redundant, duplicated, or suspicious code to streamline manual review.

Overview

This tool aims to support a complete pipeline for static analysis of code repositories. It combines traditional metrics, AST-based similarity, exact and intra-file duplication detection, embedding-based clustering, and LLM-powered semantic analysis. (Some features are not implemented yet or may work incorrectly.)

It was created to assist in reviewing large, complex codebases where manual analysis is time-consuming or infeasible.


Installation

Currently, the tool is available only via a CLI interface.

To install it, run the following from the project root (where setup.py is located):

pip install -e .

Usage

After installation, you can use it via the CLI with the following command:

python src/codebase_review/cli.py run-all path/to/project/to/review

This will run all the review tools available in the project and create an output directory (repo_audit_output by default, in the launch directory) where the result files are stored.

Commands

You can run the review tools separately with their own commands or run all of them at once. Some of them don't work correctly yet :( I hope to fix them as soon as possible.

The available commands are listed below:

Command      Description
-----------  -----------
inventory    Scans the given directory and generates a CSV file with the file inventory (path, size, extension, etc.).
metrics      Collects code metrics (e.g., lines, complexity) from the inventory. Optionally runs lightweight semantic analysis using Ollama (if specified).
duplicates   Finds exact duplicate files based on content hashes.
astsim       Detects structurally similar code using AST comparison. The threshold controls similarity sensitivity (0.0–1.0); sampling limits the performance impact.
intrafile    Detects duplicated code blocks within single files using jscpd.
quarantine   Identifies potentially problematic files (e.g., auto-generated, large, or suspicious) and logs them.
embed        Generates semantic embeddings for source files using a SentenceTransformer model.
cluster      Clusters files based on their embeddings to find functionally similar components.
semantic     Runs semantic analysis using an Ollama-compatible LLM to classify or describe files.
report       Generates a comprehensive Markdown report summarizing the findings from all previous steps.
run-all      Runs the complete pipeline: inventory → metrics → duplicates → AST sim → intrafile → quarantine → embeddings/clustering (if available) → semantic (if a model is given) → report.
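
For example, to run only the AST similarity check with a stricter threshold and then rebuild the report, an invocation could look like this (illustrative; it assumes each subcommand takes the project path the same way run-all does):

python src/codebase_review/cli.py astsim path/to/project --threshold 0.8
python src/codebase_review/cli.py report path/to/project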

Available optional arguments are listed below:

Option          Description
--------------  -----------
--output        Path to the output directory. Default: repo_audit_output in the launch directory.
--ollama-model  Model name for semantic analysis. Can be used with the metrics, semantic, and run-all commands. Default: "Qwen3-4b".
--limit         Maximum number of files to process. Can be used with the metrics and run-all commands. Default: 0 (no limit).
--batch-size    Batch size for LLM processing. Can be used with the semantic command. Default: 8.
--threshold     Threshold for similarity detection. Can be used with the astsim command. Default: 0.5.
--sample        Sample size for similarity detection. Can be used with the astsim command. Default: 500.
--model         Model name for embedding generation. Can be used with the embed command. Default: all-MiniLM-L6-v2.
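
Putting the options together, a full run with a custom output directory and a file limit could look like this (illustrative; option names are taken from the table above, and the output path is just an example):

python src/codebase_review/cli.py run-all path/to/project --output ./audit_results --ollama-model Qwen3-4b --limit 200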

Important Dependencies

Currently, this module depends on two external tools, ollama and jscpd, which must be installed separately before use. Without them installed on the system, some review tools will not work. jscpd is used for duplicate detection within single files and can be installed via npm with the command npm install -g jscpd. ollama is used for semantic analysis and can be installed by following the official repository instructions, using any preferred method.
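
A quick way to install and verify both tools before a run (standard npm and ollama commands; the qwen3:4b tag is just an example of a model to pull):

npm install -g jscpd
jscpd --version
ollama pull qwen3:4b
ollama list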

Output structure

After running the run-all command, the output structure should look like this:

./repo_audit_output/
├── ast_similarity.csv
├── clusters.csv
├── duplicates.csv
├── embeddings.npz
├── intrafile_duplication/
│   └── jscpd-report.json
├── inventory.csv
├── metrics.csv
├── ollama_analysis.jsonl
├── quarantine.csv
└── report.md

Each file contains the results of one specific review step; files are named after the corresponding review tool.
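
As a sketch of how these outputs could be consumed downstream (assuming pandas and numpy are installed; the file names follow the tree above, but column names are not guaranteed):

from pathlib import Path

import numpy as np
import pandas as pd

out = Path("repo_audit_output")

# Exact-duplicate groups found by content hash
duplicates = pd.read_csv(out / "duplicates.csv")
print(duplicates.head())

# Embeddings are stored as a compressed NumPy archive; list its arrays
# (allow_pickle=True in case file paths are stored as object arrays)
with np.load(out / "embeddings.npz", allow_pickle=True) as npz:
    print("arrays in embeddings.npz:", list(npz.keys()))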


Known Limitations

  • Some modules may not work correctly yet. Work in progress.
  • ollama and jscpd are required for full functionality but are not managed by pip.

Future Plans (Suggested)

  • Improve logging control and save logs to a separate log file.
  • Refactor the code to use .yaml config files for settings and variables instead of the config.py file.
  • Add dependency validation on startup for the external tools (ollama, jscpd); see the sketch after this list.
  • Improve code quality and documentation.
  • Integrate LLM-md summary generation.
  • Stop the ollama model automatically after completion.
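
A minimal sketch of what such startup validation could look like (hypothetical code, not part of the current codebase):

import shutil
import sys

# Hypothetical startup check: confirm each external binary is on PATH.
REQUIRED_TOOLS = {
    "jscpd": "install via: npm install -g jscpd",
    "ollama": "install via the official ollama repository instructions",
}

def validate_dependencies() -> None:
    missing = {tool: hint for tool, hint in REQUIRED_TOOLS.items()
               if shutil.which(tool) is None}
    for tool, hint in missing.items():
        print(f"Missing external tool '{tool}' ({hint})", file=sys.stderr)
    if missing:
        sys.exit(1)

validate_dependencies()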

Important Notes

AI-assisted tools were used during development for code explanation, review, minor refactoring, and grammar correction. However, all logic and architecture were authored manually.
