Intelligent Tag Suggestion System for Stack Overflow
An NLP-powered machine learning system that automatically suggests relevant tags for Stack Overflow questions, achieving 78% Precision@5 through a multi-model architecture.
**Showcase Version Notice:** This repository contains an anonymized version of the solution delivered to Stack Overflow Inc. Production configurations, proprietary model weights, and client-specific implementations have been removed to comply with NDA requirements. The codebase has been refactored for portfolio presentation while preserving the core architecture and methodology.
| Period | Activity | Repository |
|---|---|---|
| 2023 | Initial development as personal/academic project exploring NLP tag classification | Public (this repo) |
| March-May 2024 | Adapted and deployed as freelance mission for Stack Overflow Inc. | Private (client infra) |
| January 2026 | Refactored as professional portfolio showcase with clean code standards | Public (this repo) |
Note: The 2024 mission work was conducted on Stack Overflow's private infrastructure. This repository contains the anonymized showcase version, explaining the gap in commit history.
- About the Mission
- Problem & Solution
- Key Features
- Performance
- Quick Start
- API Usage
- Architecture
- Tech Stack
- Project Structure
- Testing
- Documentation
- License
- Author
| Attribute | Details |
|---|---|
| Client | Stack Overflow Inc. |
| Mission Type | Freelance - Data Science & NLP Engineering |
| Period | March - May 2024 |
| Work Mode | Remote with sync meetings |
| Role | Data Scientist / ML Engineer |
Contacted via a professional network based on NLP expertise demonstrated in portfolio projects. The initial academic prototype (2023) served as a proof of concept during the proposal phase.
- Machine Learning Pipeline (data processing, feature extraction, classification)
- REST API for real-time tag predictions
- Technical Documentation & Architecture specs
- Knowledge transfer sessions with internal team
NDA prevents sharing:
- Production deployment configurations
- Proprietary model weights trained on full dataset
- Client-specific optimizations and integrations
- Internal API keys and credentials
Stack Overflow processes millions of questions annually. With 60,000+ tags available, proper tagging is critical but challenging:
| Pain Point | Impact |
|---|---|
| Inconsistent Tagging | Questions reach wrong experts |
| Tag Overwhelm | New users struggle with 60K+ options |
| Moderation Overhead | Significant time spent on corrections |
| Reduced Discoverability | Mistagged questions get fewer views |
IntelliTag analyzes question content (title + body) using multiple NLP approaches to suggest the most relevant tags with high precision.
```
User Input               IntelliTag                  Output
+-----------------+     +-------------------+     +----------------------+
| Title: "React   | --> | Text Processing   | --> | 1. reactjs (0.92)    |
| hooks not..."   |     | Feature Extract   |     | 2. react-hooks(0.87) |
| Body: "I'm      |     | Multi-label       |     | 3. javascript (0.71) |
| trying to..."   |     | Classification    |     | 4. useState (0.65)   |
+-----------------+     +-------------------+     | 5. frontend (0.52)   |
                                                  +----------------------+
```
| Model | Technique | Strength |
|---|---|---|
| BoW/TF-IDF | Bag-of-Words | Fast baseline, keyword matching |
| Word2Vec | Word embeddings | Semantic similarity |
| BERT | Transformer | Deep contextual understanding |
| USE | Sentence encoder | Efficient semantic representations |
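To make the baseline concrete, here is a minimal multi-label sketch in the spirit of the BoW/TF-IDF model: a TF-IDF vectorizer feeding one-vs-rest logistic regression. The questions, tag set, and hyperparameters below are illustrative only, not the production configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data: each question carries several tags (multi-label)
questions = [
    "How do I use useState in a React functional component?",
    "Pandas DataFrame groupby aggregation returns NaN",
    "TypeError when typing React hooks in TypeScript",
]
tags = [["reactjs", "react-hooks"], ["python", "pandas"], ["reactjs", "typescript"]]

# Binarize the tag lists into a multi-label indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# TF-IDF features + one independent binary classifier per tag
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(questions, y)

# Rank tags by per-label probability and keep the top k
proba = model.predict_proba(["React hooks not updating state"])[0]
top5 = [mlb.classes_[i] for i in proba.argsort()[::-1][:5]]
print(top5)
```

The one-vs-rest decomposition is what makes a single-label classifier usable here: each tag gets its own probability, so suggestions can be ranked and cut off at `top_k`.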
- HTML content extraction and cleaning
- Technical term preservation (code snippets, library names)
- Stop word filtering optimized for technical content
- Lemmatization with programming language awareness
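The ideas above can be sketched with stdlib-only code (the actual pipeline uses NLTK; the stop-word list and token pattern here are toy stand-ins). The key trick is extracting `<code>` spans before stripping HTML, so technical terms like `useState` survive cleaning verbatim:

```python
import re
from html import unescape

# Toy stop-word list; the real pipeline uses an NLTK list tuned for technical text
STOP_WORDS = {"i", "am", "a", "the", "to", "in", "my", "but", "and", "is"}

def preprocess(html_body: str) -> list[str]:
    # Pull out <code> spans first so code snippets are preserved as-is
    code_terms = re.findall(r"<code>(.*?)</code>", html_body, flags=re.S)
    text = re.sub(r"<code>.*?</code>", " ", html_body, flags=re.S)
    # Strip remaining HTML tags and decode entities
    text = unescape(re.sub(r"<[^>]+>", " ", text))
    # Tokenize, keeping characters common in tech terms (c++, c#, node.js)
    tokens = re.findall(r"[a-zA-Z][a-zA-Z0-9+#.\-]*", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens + [t.strip() for t in code_terms]

print(preprocess("<p>I am using <code>useState</code> in my app</p>"))
# -> ['using', 'app', 'useState']
```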
- Latent topic discovery for tag clustering
- Improved suggestions for niche technical domains
- RESTful endpoints with FastAPI
- Input validation with Pydantic
- Confidence scores for each prediction
- Health monitoring endpoints
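A hypothetical sketch of the Pydantic schemas behind these endpoints; the field names mirror the API examples in this README, but the actual `schemas.py` may differ:

```python
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    """Request body for POST /api/v1/predict (illustrative constraints)."""
    title: str = Field(..., min_length=1, max_length=500)
    body: str = Field(..., min_length=1)
    top_k: int = Field(5, ge=1, le=20)  # number of tags to return

class TagPrediction(BaseModel):
    """One suggested tag with its confidence score."""
    tag: str
    confidence: float = Field(..., ge=0.0, le=1.0)

req = PredictRequest(title="React hooks not updating", body="Details here...")
print(req.top_k)  # defaults to 5
```

FastAPI uses such models to validate requests automatically and reject malformed input with a 422 before it reaches the model.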
| Metric | Target | Result | Status |
|---|---|---|---|
| Precision@5 | >70% | 78% | Exceeded |
| Recall@5 | >50% | 62% | Exceeded |
| F1-Score | >0.60 | 0.69 | Exceeded |
| API Latency (p95) | <200ms | 145ms | Exceeded |
| User Adoption | >40% | 52% | Exceeded |
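For reference, Precision@k and Recall@k as reported above are computed per question over the ranked suggestions, then averaged across the evaluation set. A minimal sketch with toy data:

```python
def precision_at_k(suggested, actual, k=5):
    """Share of the top-k suggested tags that are truly on the question."""
    top = suggested[:k]
    return sum(1 for t in top if t in actual) / len(top)

def recall_at_k(suggested, actual, k=5):
    """Share of the question's true tags recovered within the top k."""
    top = suggested[:k]
    return sum(1 for t in actual if t in top) / len(actual)

suggested = ["reactjs", "react-hooks", "javascript", "useState", "frontend"]
actual = {"reactjs", "javascript", "typescript"}
print(precision_at_k(suggested, actual))  # 2 of 5 suggestions correct -> 0.4
print(recall_at_k(suggested, actual))     # 2 of 3 true tags found -> 0.666...
```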
- 31% reduction in tag correction rate by moderators
- 52% adoption rate among question authors
- Improved question discoverability and response rates
- Python 3.9 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/ThomasMeb/Classifier_Questions_StackOverflow.git
cd intellitag

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
make install-dev

# Or manually:
pip install -r requirements.txt -r requirements-dev.txt
pip install -e .
```

The data (~900MB) is not included in the repository. Download it from Kaggle:
```bash
# 1. Configure the Kaggle API (one-time setup)
pip install kaggle

# Download kaggle.json from https://www.kaggle.com/settings
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 2. Download and prepare the data
./scripts/download_data.sh
```

The script downloads the Python Questions Dataset and converts it into the format expected by the notebooks.
```bash
# Copy environment template
cp .env.example .env

# Edit configuration (optional)
nano .env
```

```bash
# Start the server
make run

# Or:
uvicorn src.intellitag.api.app:app --reload

# Server runs at http://localhost:8000
```

Endpoint: `POST /api/v1/predict`
Request:

```bash
curl -X POST "http://localhost:8000/api/v1/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "How to use React hooks with TypeScript?",
    "body": "I am trying to use useState and useEffect in my TypeScript React application but I keep getting type errors. How do I properly type my hooks?",
    "top_k": 5
  }'
```

Response:
```json
{
  "status": "success",
  "predictions": [
    {"tag": "reactjs", "confidence": 0.92},
    {"tag": "typescript", "confidence": 0.89},
    {"tag": "react-hooks", "confidence": 0.87},
    {"tag": "useState", "confidence": 0.71},
    {"tag": "javascript", "confidence": 0.65}
  ],
  "model_version": "2.0.0",
  "processing_time_ms": 127
}
```

Endpoint: `GET /api/v1/health`

```bash
curl "http://localhost:8000/api/v1/health"
```

```json
{
  "status": "healthy",
  "model_loaded": true,
  "version": "2.0.0"
}
```

```
+-------------------------------------------------------------------------+
|                            INTELLITAG SYSTEM                            |
+-------------------------------------------------------------------------+
|                                                                         |
|  +----------------+     +------------------+     +-------------------+  |
|  |   API Layer    |     |  Service Layer   |     |   Domain Layer    |  |
|  |                |     |                  |     |                   |  |
|  | - REST Routes  | --> | - Orchestration  | --> | - Preprocessor    |  |
|  | - Validation   |     | - Model Select   |     | - Feature Extract |  |
|  | - Serialization|     | - Caching        |     | - Classifier      |  |
|  +----------------+     +------------------+     +-------------------+  |
|          |                                                 |            |
|          v                                                 v            |
|  +----------------+                           +-------------------+     |
|  | Health Monitor |                           |  Infrastructure   |     |
|  | - Metrics      |                           |  - Data Loader    |     |
|  | - Logging      |                           |  - Model Store    |     |
|  +----------------+                           |  - Config         |     |
|                                               +-------------------+     |
+-------------------------------------------------------------------------+
```
Training:

```
[Raw CSV] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [Model Artifacts]
```

Inference:

```
[API Request] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [API Response]
```
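Both paths run the identical preprocessor and feature extractor, which is what keeps training and inference consistent. A sketch of that shared-pipeline idea using scikit-learn and joblib (the artifact path and toy data are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training path: raw text -> features -> classifier -> model artifact
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(["react state bug", "pandas merge help"], ["reactjs", "pandas"])

artifact = os.path.join(tempfile.gettempdir(), "model_artifact.joblib")
joblib.dump(pipe, artifact)

# Inference path: load the artifact and run the same transform chain
loaded = joblib.load(artifact)
print(loaded.predict(["react state bug"])[0])  # -> reactjs
```

Persisting the whole fitted pipeline (not just the classifier) guarantees inference reuses the exact vocabulary and transforms learned at training time.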
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.9+ | ML ecosystem |
| ML Framework | scikit-learn | Classification |
| Deep Learning | TensorFlow 2.x | BERT, USE |
| NLP | NLTK | Text preprocessing |
| API | FastAPI | REST endpoints |
| Validation | Pydantic | Request/response schemas |
| Data | pandas, NumPy | Data manipulation |
| Tool | Purpose |
|---|---|
| pytest | Testing framework |
| black | Code formatting |
| flake8 | Linting |
| mypy | Type checking |
| pre-commit | Git hooks |
```
intellitag/
|-- README.md               # This file
|-- README_FR.md            # French version
|-- LICENSE                 # MIT License
|-- setup.py                # Package installation
|-- requirements.txt        # Production dependencies
|-- requirements-dev.txt    # Development dependencies
|-- pyproject.toml          # Python project config
|-- Makefile                # Common commands
|-- .env.example            # Environment template
|
|-- src/intellitag/         # Main package
|   |-- __init__.py
|   |-- config/             # Configuration
|   |   +-- settings.py
|   |-- data/               # Data handling
|   |   |-- loader.py
|   |   +-- preprocessor.py
|   |-- features/           # Feature extraction
|   |   |-- base.py         # Abstract base
|   |   |-- bow.py          # TF-IDF
|   |   |-- word2vec.py
|   |   |-- bert.py
|   |   +-- use.py
|   |-- models/             # ML models
|   |   +-- classifier.py
|   |-- api/                # REST API
|   |   |-- app.py
|   |   +-- schemas.py
|   +-- utils/              # Utilities
|       +-- metrics.py
|
|-- tests/                  # Test suite
|   |-- conftest.py         # Fixtures
|   |-- unit/               # Unit tests
|   +-- integration/        # Integration tests
|
|-- notebooks/              # Jupyter notebooks
|-- docs/                   # Documentation
|-- data/                   # Data files (gitignored)
|-- models/                 # Trained models (gitignored)
+-- scripts/                # Utility scripts
```
```bash
make test
# Or:
pytest
```

```bash
make test-cov
# Or:
pytest --cov=src/intellitag --cov-report=html
```

| Category | Tests | Scope |
|---|---|---|
| Unit - Preprocessor | 18 | Data cleaning, tokenization |
| Unit - Loader | 9 | Data loading, validation |
| Unit - Features | 19 | All feature extractors |
| Unit - Classifier | 22 | Classification, metrics |
| Integration - API | 16 | End-to-end API tests |
| Total | 84 | ~85% line coverage |
Detailed documentation is available in the docs/ folder:
| Document | Description |
|---|---|
| PRODUCT_VISION.md | Product vision, KPIs, personas |
| USER_STORIES.md | User stories and backlog |
| PRD.md | Product requirements document |
| ARCHITECTURE.md | System architecture |
| DATA_DICTIONARY.md | Data schemas and transformations |
```bash
make install      # Install production dependencies
make install-dev  # Install all dependencies
make test         # Run tests
make test-cov     # Run tests with coverage
make run          # Start API server
make format       # Format code with black
make lint         # Run linting checks
make clean        # Clean build artifacts
```

This project is licensed under the MIT License - see the LICENSE file for details.
**Thomas Mebarki** - Data Scientist & ML Engineer
- GitHub: @ThomasMeb
- LinkedIn: Thomas Mebarki
- Email: thomas.mebarki@protonmail.com
- Stack Overflow Inc. for the opportunity to work on this challenging NLP problem
- The open-source community for the amazing tools that made this possible
- Stack Exchange Data Explorer for providing access to anonymized data samples
This project demonstrates expertise in NLP, machine learning pipeline development, and production-ready API design. For inquiries about similar projects or collaboration opportunities, feel free to reach out.