🤖 Embeddings-based-code-search-engine

This project implements a code search engine using sentence embeddings and vector similarity search. It allows users to search for relevant code snippets given a natural language query.

The project is built in Python using:

  • Sentence Transformers for embeddings.
  • Qdrant as a vector database.
  • Hugging Face pretrained model all-MiniLM-L6-v2 for encoding code and text.
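
For reference, here is a minimal sketch of how these pieces fit together: the model encodes a natural language query and a code snippet into vectors, and cosine similarity scores their relevance. The query and snippet strings below are illustrative, not taken from the dataset.

```python
from sentence_transformers import SentenceTransformer, util

# Load the pretrained encoder used throughout the project
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "read a json file into a dictionary"  # illustrative query
snippet = "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)"

# Encode both the query and the code snippet into fixed-size vectors
query_vec, snippet_vec = model.encode([query, snippet])

# Cosine similarity is the relevance score used for ranking
print(util.cos_sim(query_vec, snippet_vec))
```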

👣 Project Structure:

Embeddings-based-code-search-engine/
│
├── evaluation_functions/             # Evaluation metrics implementations
│   ├── __init__.py                  
│   ├── evaluation_function.py        # Recall@10, MRR@10, nDCG@10
│   └── evaluation_function_heads.py  # Functions' heads only
│
├── training/                         # Training and fine-tuning scripts
│   ├── __init__.py                   
│   ├── loss_logging.py               # Wrapper around MultipleNegativesRankingLoss 
│   └── train.py                      # Main training loop and fine-tuning logic
│
├── utils/                            # Utility functions used throughout the project
│   ├── __init__.py                   
│   ├── function_name_extraction.py   # Function name extractor
│   ├── plotting.py                   # Plotting function
│   └── search_function.py            # Search query function
│
├── search-engine.ipynb               # Jupyter notebook demonstrating project usage and examples
│
├── training_loss_analysis.png        # Visualization plot for training loss during fine-tuning
│
├── tuned_model/                      # Fine-tuned model 
│
├── .gitignore                        
├── requirements.txt                  # List of Python package dependencies and versions
└── README.md                         # Project overview, setup instructions, and documentation

💿 Features:

  1. Embeddings-based search engine
    • Loads a collection of code snippets and natural language queries
    • Encodes both using a pretrained SentenceTransformer
    • Stores embeddings in Qdrant
    • Supports semantic search over the collection via a simple API (see the indexing/search sketch after this list)
  2. Evaluation metrics
    • Uses the CoSQA dataset (Code Search and Question Answering)
    • Implements standard ranking metrics:
      • Recall@10
      • MRR@10 (Mean Reciprocal Rank)
      • NDCG@10 (Normalized Discounted Cumulative Gain)
    • Computes metrics to evaluate search quality on the dataset
  3. Fine-tuning
    • Fine-tunes the embedding model on the CoSQA training split
    • Uses a contrastive loss suitable for code search tasks
    • Visualizes training loss over epochs
    • Re-evaluates the search engine with the fine-tuned model to show performance improvement
  4. Bonus
    • Examines how the metrics change when the model searches over function names instead of whole function bodies (a function-name extraction sketch also follows this list)
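
The sketch below illustrates the core of feature 1: encoding snippets, storing them in an in-memory Qdrant collection, and running a semantic search. The snippet texts and the collection name are made up for the example; the actual project indexes the CoSQA corpus.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # in-memory Qdrant, as used in the notebook

snippets = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
]

# Create a collection whose vector size matches the model's embedding dimension
client.create_collection(
    collection_name="code_snippets",
    vectors_config=VectorParams(
        size=model.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)

# Encode every snippet and upsert it together with its source text as payload
client.upsert(
    collection_name="code_snippets",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"code": text})
        for i, (text, vec) in enumerate(zip(snippets, model.encode(snippets)))
    ],
)

# Semantic search: embed the query and retrieve the top-k closest snippets
hits = client.search(
    collection_name="code_snippets",
    query_vector=model.encode("read a text file").tolist(),
    limit=2,
)
for hit in hits:
    print(f"{hit.score:.3f}  {hit.payload['code']!r}")
```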

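For the bonus experiment, function names have to be pulled out of the snippets (the role of utils/function_name_extraction.py). One simple way to do this is with Python's ast module; this is a sketch, not necessarily the exact approach used in the repository.

```python
import ast

def extract_function_name(code: str):
    """Return the name of the first function defined in a snippet, or None."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            return node.name
    return None

print(extract_function_name("def binary_search(arr, target):\n    pass"))  # binary_search
```
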
⚙️ Installation:

  • Clone the repository

git clone git@github.com:FirstOne96/Embeddings-based-code-search-engine.git
cd Embeddings-based-code-search-engine

  • Create a virtual environment

python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows

  • Install requirements

pip install -r requirements.txt

  • Launch Jupyter Lab

jupyter lab


🕹 Usage:

The main logic of the project is contained in the provided Jupyter notebook and Python scripts.
Open the notebook in Jupyter Lab and run the cells step by step:

  1. Initialize the model and vector DB
    • Loads all-MiniLM-L6-v2 and sets up Qdrant for embedding storage.
  2. Index the dataset
    • Embeds code snippets and inserts them into the Qdrant collection.
  3. Search with a query
    • Input a natural language query and retrieve the top-k most relevant code snippets.
  4. Evaluate
    • Run the evaluation cells to compute Recall@10, MRR@10, and NDCG@10 on CoSQA (a metric sketch follows this list).
  5. Fine-tune
    • Train the model on the CoSQA train set using contrastive loss.
    • Visualize loss and observe metric improvements on the test set (a fine-tuning sketch also follows this list).
  6. Bonus analysis
    • Run the bonus cells to compare retrieval based on function names vs. function bodies.
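
As a rough illustration of what the evaluation cells compute, the functions below implement the three metrics assuming the usual CoSQA setup where each query has a single relevant snippet. This is a sketch, not the exact code in evaluation_functions/evaluation_function.py.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    """1.0 if the relevant snippet is among the top-k results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the relevant snippet within the top-k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    """With a single relevant item the ideal DCG is 1, so nDCG is 1 / log2(rank + 1)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Metrics are averaged over all queries, e.g. (toy ranking data):
results = {"q1": ([3, 7, 1], 7), "q2": ([2, 5, 9], 9)}  # query -> (ranked ids, relevant id)
print(sum(ndcg_at_k(r, rel) for r, rel in results.values()) / len(results))
```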

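And a minimal sketch of the fine-tuning step using sentence-transformers' MultipleNegativesRankingLoss (the project wraps this loss in training/loss_logging.py to record loss values); the pairs and hyperparameters here are toy stand-ins for the CoSQA training split.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# (natural language query, matching code snippet) pairs -- toy stand-ins for CoSQA
train_examples = [
    InputExample(texts=["add two numbers", "def add(a, b):\n    return a + b"]),
    InputExample(texts=["read a file", "def read(p):\n    return open(p).read()"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other snippet in a batch serves as a negative for a query
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("tuned_model")
```
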
⚒ Results:

| Model                | Recall@10 | MRR@10 | NDCG@10 |
|----------------------|-----------|--------|---------|
| Pretrained (MiniLM)  | 98.08%    | 85.44% | 88.61%  |
| Fine-tuned (CoSQA)   | 100%      | 90.91% | 93.23%  |
| Function names only  | 81.79%    | 61.36% | 66.32%  |

Fine-tuning improves retrieval quality significantly compared to using pretrained embeddings only.

Training Loss Curve:

(see training_loss_analysis.png in the repository root)

🔍 Notes:

  • Fine-tuning is intentionally lightweight; it is only meant to demonstrate the improvement over the pretrained model.
  • Qdrant runs in-memory in this implementation, but it can also be run as a standalone server or through Qdrant Cloud.
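
For example, switching away from the in-memory instance only requires changing how the client is constructed (the URLs and API key below are placeholders, not configured in this project):

```python
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")  # in-memory, as used in the notebook

# Alternatives (placeholders):
# client = QdrantClient(url="http://localhost:6333")                                  # local Qdrant server
# client = QdrantClient(url="https://<your-cluster>.qdrant.io", api_key="<api-key>")  # Qdrant Cloud
```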

📞 Contact:

Andrii Kozlov - andrijkozlov96@gmail.com | https://t.me/AndrewKozz | https://www.linkedin.com/in/andrii-kozlov96
Project Link: https://github.com/FirstOne96/Embeddings-based-code-search-engine
