Intelligent Tag Suggestion System for Stack Overflow
An NLP-powered machine learning system that automatically suggests relevant tags for Stack Overflow questions, achieving 78% Precision@5 through a multi-model architecture.
**Showcase Version Notice:** This repository contains an anonymized version of the solution delivered to Stack Overflow Inc. Production configurations, proprietary model weights, and client-specific implementations have been removed to comply with NDA requirements. The codebase has been refactored for portfolio presentation while preserving the core architecture and methodology.
| Period | Activity | Repository |
|---|---|---|
| 2023 | Initial development as personal/academic project exploring NLP tag classification | Public (this repo) |
| March-May 2024 | Adapted and deployed as freelance mission for Stack Overflow Inc. | Private (client infra) |
| January 2026 | Refactored as professional portfolio showcase with clean code standards | Public (this repo) |
Note: The 2024 mission work was conducted on Stack Overflow's private infrastructure. This repository contains the anonymized showcase version, explaining the gap in commit history.
- About the Mission
- Problem & Solution
- Key Features
- Performance
- Quick Start
- API Usage
- Architecture
- Tech Stack
- Project Structure
- Testing
- Documentation
- License
- Author
| Attribute | Details |
|---|---|
| Client | Stack Overflow Inc. |
| Mission Type | Freelance - Data Science & NLP Engineering |
| Period | March - May 2024 |
| Work Mode | Remote with sync meetings |
| Role | Data Scientist / ML Engineer |
Contacted via a professional network based on NLP expertise demonstrated in portfolio projects. The initial academic prototype (2023) served as a proof of concept during the proposal phase.
- Machine Learning Pipeline (data processing, feature extraction, classification)
- REST API for real-time tag predictions
- Technical Documentation & Architecture specs
- Knowledge transfer sessions with internal team
NDA prevents sharing:
- Production deployment configurations
- Proprietary model weights trained on full dataset
- Client-specific optimizations and integrations
- Internal API keys and credentials
Stack Overflow processes millions of questions annually. With 60,000+ tags available, proper tagging is critical but challenging:
| Pain Point | Impact |
|---|---|
| Inconsistent Tagging | Questions reach wrong experts |
| Tag Overwhelm | New users struggle with 60K+ options |
| Moderation Overhead | Significant time spent on corrections |
| Reduced Discoverability | Mistagged questions get fewer views |
IntelliTag analyzes question content (title + body) using multiple NLP approaches to suggest the most relevant tags with high precision.
```
User Input               IntelliTag                  Output
+-----------------+     +-------------------+     +----------------------+
| Title: "React   | --> | Text Processing   | --> | 1. reactjs (0.92)    |
| hooks not..."   |     | Feature Extract   |     | 2. react-hooks(0.87) |
| Body: "I'm      |     | Multi-label       |     | 3. javascript (0.71) |
| trying to..."   |     | Classification    |     | 4. useState (0.65)   |
+-----------------+     +-------------------+     | 5. frontend (0.52)   |
                                                  +----------------------+
```
| Model | Technique | Strength |
|---|---|---|
| BoW/TF-IDF | Bag-of-Words | Fast baseline, keyword matching |
| Word2Vec | Word embeddings | Semantic similarity |
| BERT | Transformer | Deep contextual understanding |
| USE | Sentence encoder | Efficient semantic representations |
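To make the baseline concrete, here is a minimal multi-label sketch in the spirit of the BoW/TF-IDF model: a TF-IDF vectorizer feeding one-vs-rest logistic regression. The questions, tag set, and hyperparameters below are illustrative only, not the production configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy training data: each question carries several tags (multi-label)
questions = [
    "How do I use useState in a React functional component?",
    "Pandas DataFrame groupby aggregation returns NaN",
    "TypeError when typing React hooks in TypeScript",
]
tags = [["reactjs", "react-hooks"], ["python", "pandas"], ["reactjs", "typescript"]]

# Binarize the tag lists into a multi-label indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

# TF-IDF features + one independent binary classifier per tag
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(questions, y)

# Rank tags by per-label probability and keep the top k
proba = model.predict_proba(["React hooks not updating state"])[0]
top5 = [mlb.classes_[i] for i in proba.argsort()[::-1][:5]]
print(top5)
```

The one-vs-rest decomposition is what makes a single-label classifier usable here: each tag gets its own probability, so suggestions can be ranked and cut off at `top_k`.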
- HTML content extraction and cleaning
- Technical term preservation (code snippets, library names)
- Stop word filtering optimized for technical content
- Lemmatization with programming language awareness
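The ideas above can be sketched with stdlib-only code (the actual pipeline uses NLTK; the stop-word list and token pattern here are toy stand-ins). The key trick is extracting `<code>` spans before stripping HTML, so technical terms like `useState` survive cleaning verbatim:

```python
import re
from html import unescape

# Toy stop-word list; the real pipeline uses an NLTK list tuned for technical text
STOP_WORDS = {"i", "am", "a", "the", "to", "in", "my", "but", "and", "is"}

def preprocess(html_body: str) -> list[str]:
    # Pull out <code> spans first so code snippets are preserved as-is
    code_terms = re.findall(r"<code>(.*?)</code>", html_body, flags=re.S)
    text = re.sub(r"<code>.*?</code>", " ", html_body, flags=re.S)
    # Strip remaining HTML tags and decode entities
    text = unescape(re.sub(r"<[^>]+>", " ", text))
    # Tokenize, keeping characters common in tech terms (c++, c#, node.js)
    tokens = re.findall(r"[a-zA-Z][a-zA-Z0-9+#.\-]*", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens + [t.strip() for t in code_terms]

print(preprocess("<p>I am using <code>useState</code> in my app</p>"))
# -> ['using', 'app', 'useState']
```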
- Latent topic discovery for tag clustering
- Improved suggestions for niche technical domains
- RESTful endpoints with FastAPI
- Input validation with Pydantic
- Confidence scores for each prediction
- Health monitoring endpoints
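A hypothetical sketch of the Pydantic schemas behind these endpoints; the field names mirror the API examples in this README, but the actual `schemas.py` may differ:

```python
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    """Request body for POST /api/v1/predict (illustrative constraints)."""
    title: str = Field(..., min_length=1, max_length=500)
    body: str = Field(..., min_length=1)
    top_k: int = Field(5, ge=1, le=20)  # number of tags to return

class TagPrediction(BaseModel):
    """One suggested tag with its confidence score."""
    tag: str
    confidence: float = Field(..., ge=0.0, le=1.0)

req = PredictRequest(title="React hooks not updating", body="Details here...")
print(req.top_k)  # defaults to 5
```

FastAPI uses such models to validate requests automatically and reject malformed input with a 422 before it reaches the model.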
| Metric | Target | Result | Status |
|---|---|---|---|
| Precision@5 | >70% | 78% | Exceeded |
| Recall@5 | >50% | 62% | Exceeded |
| F1-Score | >0.60 | 0.69 | Exceeded |
| API Latency (p95) | <200ms | 145ms | Exceeded |
| User Adoption | >40% | 52% | Exceeded |
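For reference, Precision@k and Recall@k as reported above are computed per question over the ranked suggestions, then averaged across the evaluation set. A minimal sketch with toy data:

```python
def precision_at_k(suggested, actual, k=5):
    """Share of the top-k suggested tags that are truly on the question."""
    top = suggested[:k]
    return sum(1 for t in top if t in actual) / len(top)

def recall_at_k(suggested, actual, k=5):
    """Share of the question's true tags recovered within the top k."""
    top = suggested[:k]
    return sum(1 for t in actual if t in top) / len(actual)

suggested = ["reactjs", "react-hooks", "javascript", "useState", "frontend"]
actual = {"reactjs", "javascript", "typescript"}
print(precision_at_k(suggested, actual))  # 2 of 5 suggestions correct -> 0.4
print(recall_at_k(suggested, actual))     # 2 of 3 true tags found -> 0.666...
```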
- 31% reduction in tag correction rate by moderators
- 52% adoption rate among question authors
- Improved question discoverability and response rates
- Python 3.9 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/ThomasMeb/Classifier_Questions_StackOverflow.git
cd intellitag

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
make install-dev

# Or manually:
pip install -r requirements.txt -r requirements-dev.txt
pip install -e .
```

The data (~900MB) is not included in the repository. Download it from Kaggle:
```bash
# 1. Configure the Kaggle API (one-time setup)
pip install kaggle

# Download kaggle.json from https://www.kaggle.com/settings
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 2. Download and prepare the data
./scripts/download_data.sh
```

The script downloads the Python Questions Dataset and converts it into the format expected by the notebooks.
```bash
# Copy environment template
cp .env.example .env

# Edit configuration (optional)
nano .env
```

```bash
# Start the server
make run

# Or:
uvicorn src.intellitag.api.app:app --reload

# Server runs at http://localhost:8000
```

Endpoint: `POST /api/v1/predict`
Request:

```bash
curl -X POST "http://localhost:8000/api/v1/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "How to use React hooks with TypeScript?",
    "body": "I am trying to use useState and useEffect in my TypeScript React application but I keep getting type errors. How do I properly type my hooks?",
    "top_k": 5
  }'
```

Response:
```json
{
  "status": "success",
  "predictions": [
    {"tag": "reactjs", "confidence": 0.92},
    {"tag": "typescript", "confidence": 0.89},
    {"tag": "react-hooks", "confidence": 0.87},
    {"tag": "useState", "confidence": 0.71},
    {"tag": "javascript", "confidence": 0.65}
  ],
  "model_version": "2.0.0",
  "processing_time_ms": 127
}
```

Endpoint: `GET /api/v1/health`

```bash
curl "http://localhost:8000/api/v1/health"
```

```json
{
  "status": "healthy",
  "model_loaded": true,
  "version": "2.0.0"
}
```

```
+-------------------------------------------------------------------------+
|                            INTELLITAG SYSTEM                            |
+-------------------------------------------------------------------------+
|                                                                         |
|  +----------------+     +------------------+     +-------------------+  |
|  |   API Layer    |     |  Service Layer   |     |   Domain Layer    |  |
|  |                |     |                  |     |                   |  |
|  | - REST Routes  | --> | - Orchestration  | --> | - Preprocessor    |  |
|  | - Validation   |     | - Model Select   |     | - Feature Extract |  |
|  | - Serialization|     | - Caching        |     | - Classifier      |  |
|  +----------------+     +------------------+     +-------------------+  |
|          |                                                 |            |
|          v                                                 v            |
|  +----------------+                           +-------------------+     |
|  | Health Monitor |                           |  Infrastructure   |     |
|  | - Metrics      |                           |  - Data Loader    |     |
|  | - Logging      |                           |  - Model Store    |     |
|  +----------------+                           |  - Config         |     |
|                                               +-------------------+     |
+-------------------------------------------------------------------------+
```
Training:

```
[Raw CSV] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [Model Artifacts]
```

Inference:

```
[API Request] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [API Response]
```
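Both paths run the identical preprocessor and feature extractor, which is what keeps training and inference consistent. A sketch of that shared-pipeline idea using scikit-learn and joblib (the artifact path and toy data are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training path: raw text -> features -> classifier -> model artifact
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(["react state bug", "pandas merge help"], ["reactjs", "pandas"])

artifact = os.path.join(tempfile.gettempdir(), "model_artifact.joblib")
joblib.dump(pipe, artifact)

# Inference path: load the artifact and run the same transform chain
loaded = joblib.load(artifact)
print(loaded.predict(["react state bug"])[0])  # -> reactjs
```

Persisting the whole fitted pipeline (not just the classifier) guarantees inference reuses the exact vocabulary and transforms learned at training time.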
| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.9+ | ML ecosystem |
| ML Framework | scikit-learn | Classification |
| Deep Learning | TensorFlow 2.x | BERT, USE |
| NLP | NLTK | Text preprocessing |
| API | FastAPI | REST endpoints |
| Validation | Pydantic | Request/response schemas |
| Data | pandas, NumPy | Data manipulation |
| Tool | Purpose |
|---|---|
| pytest | Testing framework |
| black | Code formatting |
| flake8 | Linting |
| mypy | Type checking |
| pre-commit | Git hooks |
```
intellitag/
|-- README.md               # This file
|-- README_FR.md            # French version
|-- LICENSE                 # MIT License
|-- setup.py                # Package installation
|-- requirements.txt        # Production dependencies
|-- requirements-dev.txt    # Development dependencies
|-- pyproject.toml          # Python project config
|-- Makefile                # Common commands
|-- .env.example            # Environment template
|
|-- src/intellitag/         # Main package
|   |-- __init__.py
|   |-- config/             # Configuration
|   |   +-- settings.py
|   |-- data/               # Data handling
|   |   |-- loader.py
|   |   +-- preprocessor.py
|   |-- features/           # Feature extraction
|   |   |-- base.py         # Abstract base
|   |   |-- bow.py          # TF-IDF
|   |   |-- word2vec.py
|   |   |-- bert.py
|   |   +-- use.py
|   |-- models/             # ML models
|   |   +-- classifier.py
|   |-- api/                # REST API
|   |   |-- app.py
|   |   +-- schemas.py
|   +-- utils/              # Utilities
|       +-- metrics.py
|
|-- tests/                  # Test suite
|   |-- conftest.py         # Fixtures
|   |-- unit/               # Unit tests
|   +-- integration/        # Integration tests
|
|-- notebooks/              # Jupyter notebooks
|-- docs/                   # Documentation
|-- data/                   # Data files (gitignored)
|-- models/                 # Trained models (gitignored)
+-- scripts/                # Utility scripts
```
```bash
make test
# Or:
pytest
```

```bash
make test-cov
# Or:
pytest --cov=src/intellitag --cov-report=html
```

| Category | Tests | Scope |
|---|---|---|
| Unit - Preprocessor | 18 | Data cleaning, tokenization |
| Unit - Loader | 9 | Data loading, validation |
| Unit - Features | 19 | All feature extractors |
| Unit - Classifier | 22 | Classification, metrics |
| Integration - API | 16 | End-to-end API tests |
| Total | 84 | ~85% line coverage |
Detailed documentation is available in the docs/ folder:
| Document | Description |
|---|---|
| PRODUCT_VISION.md | Product vision, KPIs, personas |
| USER_STORIES.md | User stories and backlog |
| PRD.md | Product requirements document |
| ARCHITECTURE.md | System architecture |
| DATA_DICTIONARY.md | Data schemas and transformations |
```bash
make install      # Install production dependencies
make install-dev  # Install all dependencies
make test         # Run tests
make test-cov     # Run tests with coverage
make run          # Start API server
make format       # Format code with black
make lint         # Run linting checks
make clean        # Clean build artifacts
```

This project is licensed under the MIT License - see the LICENSE file for details.
**Thomas Mebarki** - Data Scientist & ML Engineer
- GitHub: @ThomasMeb
- LinkedIn: Thomas Mebarki
- Email: thomas.mebarki@protonmail.com
- Stack Overflow Inc. for the opportunity to work on this challenging NLP problem
- The open-source community for the amazing tools that made this possible
- Stack Exchange Data Explorer for providing access to anonymized data samples
This project demonstrates expertise in NLP, machine learning pipeline development, and production-ready API design. For inquiries about similar projects or collaboration opportunities, feel free to reach out.