This project implements a code search engine using sentence embeddings and vector similarity search. It allows users to search for relevant code snippets given a natural language query.
The project is built in Python using:
- Sentence Transformers for embeddings.
- Qdrant as a vector database.
- Hugging Face pretrained model `all-MiniLM-L6-v2` for encoding code and text.
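To show how these pieces fit together, here is a minimal, hedged sketch (not the project's actual code) that loads the model, stores snippet embeddings in an in-memory Qdrant collection, and answers a natural language query; the collection name, payload field, and example snippets are illustrative.

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # in-memory instance, as used in this project

snippets = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    return open(path).read()",
]

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
client.create_collection(
    collection_name="code",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="code",
    points=[
        PointStruct(id=i, vector=model.encode(s).tolist(), payload={"code": s})
        for i, s in enumerate(snippets)
    ],
)

# Natural language query -> top-k most similar code snippets
hits = client.search(
    collection_name="code",
    query_vector=model.encode("sum two numbers").tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload["code"])
```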
```
Embeddings-based-code-search-engine/
│
├── evaluation_functions/               # Evaluation metrics implementations
│   ├── __init__.py
│   ├── evaluation_function.py          # Recall@10, MRR@10, nDCG@10
│   └── evaluation_function_heads.py    # Functions' heads only
│
├── training/                           # Training and fine-tuning scripts
│   ├── __init__.py
│   ├── loss_logging.py                 # Wrapper around MultipleNegativesRankingLoss
│   └── train.py                        # Main training loop and fine-tuning logic
│
├── utils/                              # Utility functions used throughout the project
│   ├── __init__.py
│   ├── function_name_extraction.py     # Function name extractor
│   ├── plotting.py                     # Plotting function
│   └── search_function.py              # Search query function
│
├── search-engine.ipynb                 # Jupyter notebook demonstrating project usage and examples
│
├── training_loss_analysis.png          # Training loss visualization from fine-tuning
│
├── tuned_model/                        # Fine-tuned model
│
├── .gitignore
├── requirements.txt                    # Python package dependencies and versions
└── README.md                           # Project overview, setup instructions, and documentation
```
- Embeddings-based search engine
  - Loads a collection of code snippets and natural language queries
  - Encodes both using a pretrained SentenceTransformer
  - Stores embeddings in Qdrant
  - Supports semantic search over the collection via a simple API
- Evaluation metrics
  - Uses the CoSQA dataset (Code Search Question Answering)
  - Implements standard ranking metrics:
    - Recall@10
    - MRR@10 (Mean Reciprocal Rank)
    - NDCG@10 (Normalized Discounted Cumulative Gain)
  - Computes these metrics to evaluate search quality on the dataset (see the sketch after this list)
- Fine-tuning
  - Fine-tunes the embedding model on the CoSQA training split
  - Uses a contrastive loss suitable for code search tasks
  - Visualizes training loss over epochs
  - Re-evaluates the search engine with the fine-tuned model to show the performance improvement
- Bonus
  - How do the metrics change when the model is applied to function names instead of whole function bodies?
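The ranking metrics live in `evaluation_functions/evaluation_function.py`. As a hedged illustration (not the project's exact code), the sketch below shows how Recall@10, MRR@10, and nDCG@10 can be computed for CoSQA-style queries that each have a single relevant snippet; the function names here are illustrative.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    # 1 if the relevant snippet appears in the top-k results, else 0
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k=10):
    # Reciprocal of the rank of the relevant snippet, 0 if it is not in the top-k
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    # With a single relevant item the ideal DCG is 1, so nDCG reduces to 1 / log2(rank + 1)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Averaging each function over all queries yields Recall@10, MRR@10, and nDCG@10.
```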
```bash
git clone git@github.com:FirstOne96/Embeddings-based-code-search-engine.git
cd Embeddings-based-code-search-engine

python -m venv .venv
source .venv/bin/activate    # Linux/macOS
.venv\Scripts\activate       # Windows

pip install -r requirements.txt
jupyter lab
```
The main logic of the project is contained in the provided Jupyter notebook and Python scripts.
Open the notebook in Jupyter Lab and run the cells step by step:
- Initialize the model and vector DB
  - Loads `all-MiniLM-L6-v2` and sets up Qdrant for embedding storage.
- Index the dataset
  - Embeds code snippets and inserts them into the Qdrant collection.
- Search with a query
  - Input a natural language query and retrieve the top-k most relevant code snippets.
- Evaluate
  - Run the evaluation cells to compute `Recall@10`, `MRR@10`, and `NDCG@10` on CoSQA.
- Fine-tune
  - Train the model on the CoSQA train set using contrastive loss (see the sketch after these steps).
  - Visualize the loss and observe metric improvements on the test set.
- Bonus analysis
  - Run the bonus cells to compare retrieval based on function names vs. function bodies (a function-name extraction sketch follows below).
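The actual fine-tuning logic is in `training/train.py`, with a loss-logging wrapper in `training/loss_logging.py`. The sketch below is only an illustrative outline of contrastive fine-tuning with `MultipleNegativesRankingLoss` in sentence-transformers; the example pairs, batch size, and epoch count are placeholders, not the project's settings.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each positive (query, code) pair from the CoSQA train split becomes one example;
# in-batch negatives are supplied implicitly by MultipleNegativesRankingLoss.
train_examples = [
    InputExample(texts=["sum two numbers", "def add(a, b):\n    return a + b"]),
    InputExample(texts=["read a file", "def read_file(path):\n    return open(path).read()"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="tuned_model",
)
```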
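For the bonus comparison, `utils/function_name_extraction.py` provides the function-name extractor. One hedged way such an extractor could look, using Python's `ast` module (illustrative only, not necessarily the project's implementation):

```python
import ast

def extract_function_name(code):
    """Return the name of the first function defined in a code snippet, or None."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            return node.name
    return None

print(extract_function_name("def add(a, b):\n    return a + b"))  # -> "add"
```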
| Model | Recall@10 | MRR@10 | NDCG@10 |
|---|---|---|---|
| Pretrained (MiniLM) | 98.08% | 85.44% | 88.61% |
| Fine-tuned (CoSQA) | 100% | 90.91% | 93.23% |
| Function names only | 81.79% | 61.36% | 66.32% |
Fine-tuning significantly improves retrieval quality compared to using the pretrained embeddings alone.
- Fine-tuning is intentionally lightweight, just enough to demonstrate the improvement.
- Qdrant runs in-memory in this implementation, but it can also be run using Qdrant's cloud service (see the sketch below).
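A minimal sketch of the two client configurations, assuming the standard `qdrant-client` API; the cloud URL and API key are placeholders:

```python
from qdrant_client import QdrantClient

# In-memory instance, as used in this project (data is lost when the process exits)
client = QdrantClient(":memory:")

# Qdrant Cloud or any remote deployment; the URL and API key below are placeholders
# client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")
```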
Andrii Kozlov - andrijkozlov96@gmail.com | https://t.me/AndrewKozz | https://www.linkedin.com/in/andrii-kozlov96
Project Link: https://github.com/FirstOne96/Embeddings-based-code-search-engine