This project implements a code search engine using sentence embeddings and vector similarity search. It allows users to search for relevant code snippets given a natural language query.
The project is built in Python using:
- Sentence Transformers for embeddings.
- Qdrant as a vector database.
- Hugging Face pretrained model `all-MiniLM-L6-v2` for encoding code and text.
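To show how these pieces fit together, here is a minimal, hedged sketch (not the project's actual code) that loads the model, stores snippet embeddings in an in-memory Qdrant collection, and answers a natural language query; the collection name, payload field, and example snippets are illustrative.

```python
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # in-memory instance, as used in this project

snippets = [
    "def add(a, b):\n    return a + b",
    "def read_file(path):\n    return open(path).read()",
]

# all-MiniLM-L6-v2 produces 384-dimensional embeddings
client.create_collection(
    collection_name="code",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="code",
    points=[
        PointStruct(id=i, vector=model.encode(s).tolist(), payload={"code": s})
        for i, s in enumerate(snippets)
    ],
)

# Natural language query -> top-k most similar code snippets
hits = client.search(
    collection_name="code",
    query_vector=model.encode("sum two numbers").tolist(),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload["code"])
```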
```
Embeddings-based-code-search-engine/
│
├── evaluation_functions/               # Evaluation metrics implementations
│   ├── __init__.py
│   ├── evaluation_function.py          # Recall@10, MRR@10, nDCG@10
│   └── evaluation_function_heads.py    # Functions' heads only
│
├── training/                           # Training and fine-tuning scripts
│   ├── __init__.py
│   ├── loss_logging.py                 # Wrapper around MultipleNegativesRankingLoss
│   └── train.py                        # Main training loop and fine-tuning logic
│
├── utils/                              # Utility functions used throughout the project
│   ├── __init__.py
│   ├── function_name_extraction.py     # Function name extractor
│   ├── plotting.py                     # Plotting function
│   └── search_function.py              # Search query function
│
├── search-engine.ipynb                 # Jupyter notebook demonstrating project usage and examples
│
├── training_loss_analysis.png          # Training loss visualization from fine-tuning
│
├── tuned_model/                        # Fine-tuned model
│
├── .gitignore
├── requirements.txt                    # Python package dependencies and versions
└── README.md                           # Project overview, setup instructions, and documentation
```
- Embeddings-based search engine
  - Loads a collection of code snippets and natural language queries
  - Encodes both using a pretrained SentenceTransformer
  - Stores embeddings in Qdrant
  - Supports semantic search over the collection via a simple API
- Evaluation metrics
  - Uses the CoSQA dataset (Code Search Question Answering)
  - Implements standard ranking metrics:
    - Recall@10
    - MRR@10 (Mean Reciprocal Rank)
    - NDCG@10 (Normalized Discounted Cumulative Gain)
  - Computes these metrics to evaluate search quality on the dataset (see the sketch after this list)
- Fine-tuning
  - Fine-tunes the embedding model on the CoSQA training split
  - Uses a contrastive loss suitable for code search tasks
  - Visualizes training loss over epochs
  - Re-evaluates the search engine with the fine-tuned model to show the performance improvement
- Bonus
  - How do the metrics change when the model is applied to function names instead of whole function bodies?
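The ranking metrics live in `evaluation_functions/evaluation_function.py`. As a hedged illustration (not the project's exact code), the sketch below shows how Recall@10, MRR@10, and nDCG@10 can be computed for CoSQA-style queries that each have a single relevant snippet; the function names here are illustrative.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k=10):
    # 1 if the relevant snippet appears in the top-k results, else 0
    return int(relevant_id in ranked_ids[:k])

def mrr_at_k(ranked_ids, relevant_id, k=10):
    # Reciprocal of the rank of the relevant snippet, 0 if it is not in the top-k
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=10):
    # With a single relevant item the ideal DCG is 1, so nDCG reduces to 1 / log2(rank + 1)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Averaging each function over all queries yields Recall@10, MRR@10, and nDCG@10.
```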
```bash
git clone git@github.com:FirstOne96/Embeddings-based-code-search-engine.git
cd Embeddings-based-code-search-engine

python -m venv .venv
source .venv/bin/activate    # Linux/macOS
.venv\Scripts\activate       # Windows

pip install -r requirements.txt
jupyter lab
```
The main logic of the project is contained in the provided Jupyter notebook and Python scripts.
Open the notebook in Jupyter Lab and run the cells step by step:
- Initialize the model and vector DB
  - Loads `all-MiniLM-L6-v2` and sets up Qdrant for embedding storage.
- Index the dataset
  - Embeds code snippets and inserts them into the Qdrant collection.
- Search with a query
  - Input a natural language query and retrieve the top-k most relevant code snippets.
- Evaluate
  - Run the evaluation cells to compute `Recall@10`, `MRR@10`, and `NDCG@10` on CoSQA.
- Fine-tune
  - Train the model on the CoSQA train set using contrastive loss (see the sketch after these steps).
  - Visualize the loss and observe metric improvements on the test set.
- Bonus analysis
  - Run the bonus cells to compare retrieval based on function names vs. function bodies (a function-name extraction sketch follows below).
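The actual fine-tuning logic is in `training/train.py`, with a loss-logging wrapper in `training/loss_logging.py`. The sketch below is only an illustrative outline of contrastive fine-tuning with `MultipleNegativesRankingLoss` in sentence-transformers; the example pairs, batch size, and epoch count are placeholders, not the project's settings.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each positive (query, code) pair from the CoSQA train split becomes one example;
# in-batch negatives are supplied implicitly by MultipleNegativesRankingLoss.
train_examples = [
    InputExample(texts=["sum two numbers", "def add(a, b):\n    return a + b"]),
    InputExample(texts=["read a file", "def read_file(path):\n    return open(path).read()"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="tuned_model",
)
```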
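For the bonus comparison, `utils/function_name_extraction.py` provides the function-name extractor. One hedged way such an extractor could look, using Python's `ast` module (illustrative only, not necessarily the project's implementation):

```python
import ast

def extract_function_name(code):
    """Return the name of the first function defined in a code snippet, or None."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            return node.name
    return None

print(extract_function_name("def add(a, b):\n    return a + b"))  # -> "add"
```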
| Model | Recall@10 | MRR@10 | NDCG@10 |
|---|---|---|---|
| Pretrained (MiniLM) | 98.08% | 85.44% | 88.61% |
| Fine-tuned (CoSQA) | 100% | 90.91% | 93.23% |
| Function names only | 81.79% | 61.36% | 66.32% |
Fine-tuning significantly improves retrieval quality compared to using the pretrained embeddings alone.
- Fine-tuning is intentionally lightweight, just enough to demonstrate the improvement.
- Qdrant runs in-memory in this implementation, but it can also be run using Qdrant's cloud service (see the sketch below).
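A minimal sketch of the two client configurations, assuming the standard `qdrant-client` API; the cloud URL and API key are placeholders:

```python
from qdrant_client import QdrantClient

# In-memory instance, as used in this project (data is lost when the process exits)
client = QdrantClient(":memory:")

# Qdrant Cloud or any remote deployment; the URL and API key below are placeholders
# client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_API_KEY")
```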
Andrii Kozlov - andrijkozlov96@gmail.com | https://t.me/AndrewKozz | https://www.linkedin.com/in/andrii-kozlov96
Project Link: https://github.com/FirstOne96/Embeddings-based-code-search-engine