Intelligent Tag Suggestion System for Stack Overflow - NLP & ML powered | 78% Precision@5 | FastAPI + Streamlit

ThomasMeb/Classifier_Questions_StackOverflow

IntelliTag

Intelligent Tag Suggestion System for Stack Overflow

An NLP-powered machine learning system that automatically suggests relevant tags for Stack Overflow questions, achieving 78% Precision@5 through a multi-model architecture.

Showcase Version Notice: This repository contains an anonymized version of the solution delivered to Stack Overflow Inc. Production configurations, proprietary model weights, and client-specific implementations have been removed to comply with NDA requirements. The codebase has been refactored for portfolio presentation while preserving the core architecture and methodology.


Project Timeline

| Period | Activity | Repository |
|---|---|---|
| 2023 | Initial development as personal/academic project exploring NLP tag classification | Public (this repo) |
| March-May 2024 | Adapted and deployed as freelance mission for Stack Overflow Inc. | Private (client infra) |
| January 2026 | Refactored as professional portfolio showcase with clean code standards | Public (this repo) |

Note: The 2024 mission work was conducted on Stack Overflow's private infrastructure. This repository contains only the anonymized showcase version, which explains the gap in its commit history.



About the Mission

Mission Context

| Attribute | Details |
|---|---|
| Client | Stack Overflow Inc. |
| Mission Type | Freelance - Data Science & NLP Engineering |
| Period | March - May 2024 |
| Work Mode | Remote with sync meetings |
| Role | Data Scientist / ML Engineer |

How It Started

I was contacted through my professional network on the strength of the NLP expertise demonstrated in my portfolio projects. The initial academic prototype (2023) served as the proof of concept during the proposal phase.

Deliverables

  • Machine Learning Pipeline (data processing, feature extraction, classification)
  • REST API for real-time tag predictions
  • Technical Documentation & Architecture specs
  • Knowledge transfer sessions with internal team

Why Anonymized?

NDA prevents sharing:

  • Production deployment configurations
  • Proprietary model weights trained on full dataset
  • Client-specific optimizations and integrations
  • Internal API keys and credentials

Problem & Solution

The Challenge

Stack Overflow processes millions of questions annually. With 60,000+ tags available, proper tagging is critical but challenging:

| Pain Point | Impact |
|---|---|
| Inconsistent Tagging | Questions reach wrong experts |
| Tag Overwhelm | New users struggle with 60K+ options |
| Moderation Overhead | Significant time spent on corrections |
| Reduced Discoverability | Mistagged questions get fewer views |

The Solution: IntelliTag

IntelliTag analyzes question content (title + body) using multiple NLP approaches to suggest the most relevant tags with high precision.

User Input                    IntelliTag                      Output
+-----------------+     +-------------------+     +----------------------+
| Title: "React   | --> | Text Processing   | --> | 1. reactjs    (0.92) |
|  hooks not..."  |     | Feature Extract   |     | 2. react-hooks(0.87) |
| Body: "I'm      |     | Multi-label       |     | 3. javascript (0.71) |
|  trying to..."  |     | Classification    |     | 4. useState   (0.65) |
+-----------------+     +-------------------+     | 5. frontend   (0.52) |
                                                  +----------------------+

Key Features

Multi-Model Architecture

| Model | Technique | Strength |
|---|---|---|
| BoW/TF-IDF | Bag-of-Words | Fast baseline, keyword matching |
| Word2Vec | Word embeddings | Semantic similarity |
| BERT | Transformer | Deep contextual understanding |
| USE | Sentence encoder | Efficient semantic representations |
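
To illustrate how several feature extractors can sit behind one interface, here is a minimal sketch of the BoW/TF-IDF baseline written with the standard library only. The class names `FeatureExtractor` and `TfidfExtractor`, the tokenizer, and the smoothed IDF formula are assumptions made for this example; the actual modules under `src/intellitag/features/` may be organized differently.

```python
import math
import re
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, List


class FeatureExtractor(ABC):
    """Hypothetical common interface shared by all extractors."""

    @abstractmethod
    def fit(self, texts: List[str]) -> "FeatureExtractor":
        ...

    @abstractmethod
    def transform(self, texts: List[str]) -> List[Dict[str, float]]:
        ...


class TfidfExtractor(FeatureExtractor):
    """Toy TF-IDF producing sparse {term: weight} vectors."""

    def __init__(self) -> None:
        self.idf: Dict[str, float] = {}

    @staticmethod
    def _tokens(text: str) -> List[str]:
        # Keep '+' and '#' so that tokens such as "c++" or "c#" survive.
        return re.findall(r"[a-z0-9+#]+", text.lower())

    def fit(self, texts: List[str]) -> "TfidfExtractor":
        n_docs = len(texts)
        doc_freq: Counter = Counter()
        for text in texts:
            doc_freq.update(set(self._tokens(text)))
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, texts: List[str]) -> List[Dict[str, float]]:
        vectors = []
        for text in texts:
            counts = Counter(self._tokens(text))
            total = sum(counts.values()) or 1
            # Terms never seen during fit() are dropped.
            vectors.append(
                {t: (c / total) * self.idf[t] for t, c in counts.items() if t in self.idf}
            )
        return vectors


corpus = ["react hooks state", "python pandas dataframe", "react typescript types"]
extractor = TfidfExtractor().fit(corpus)
vec = extractor.transform(["react hooks"])[0]
```

In this toy corpus "hooks" appears in fewer documents than "react", so it receives the larger weight. An embedding-based extractor could implement the same two methods and be swapped in without touching the classifier.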

Intelligent Preprocessing

  • HTML content extraction and cleaning
  • Technical term preservation (code snippets, library names)
  • Stop word filtering optimized for technical content
  • Lemmatization with programming language awareness
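
The first three preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's `preprocessor.py`: the stop-word list, the token pattern, and the function names `strip_html` and `preprocess` are all hypothetical, and lemmatization is left out because it needs an NLP library such as NLTK.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "i", "am", "to", "in", "my", "is", "and", "but", "with"}

# Assumed token pattern: keeps dotted or hyphenated names ("node.js",
# "scikit-learn") and trailing '+' or '#' ("c++", "c#") as single tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*[+#]*")


class _TextCollector(HTMLParser):
    """Collects text nodes, including the contents of <code> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(html_text: str) -> str:
    """Drop tags and decode entities, keeping the text content."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.parts)


def preprocess(html_text: str) -> List[str]:
    """Return lowercase tokens with stop words removed."""
    text = strip_html(html_text).lower()
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOP_WORDS]


tokens = preprocess("<p>I am using <code>useState</code> in Node.js and C++.</p>")
```

A general-purpose tokenizer would typically split "Node.js" and "C++" into fragments, which is why technical content tends to need its own token pattern.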

Topic Modeling (LDA)

  • Latent topic discovery for tag clustering
  • Improved suggestions for niche technical domains

Production-Ready API

  • RESTful endpoints with FastAPI
  • Input validation with Pydantic
  • Confidence scores for each prediction
  • Health monitoring endpoints

Performance

Key Metrics Achieved

+------------------+--------+--------+-----------+
|     Metric       | Target | Result |   Status  |
+------------------+--------+--------+-----------+
| Precision@5      |  >70%  |  78%   |  Exceeded |
| Recall@5         |  >50%  |  62%   |  Exceeded |
| F1-Score         | >0.60  |  0.69  |  Exceeded |
| API Latency (p95)|<200ms  | 145ms  |  Exceeded |
| User Adoption    |  >40%  |  52%   |  Exceeded |
+------------------+--------+--------+-----------+
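
For readers unfamiliar with ranking metrics, the sketch below shows one common way to compute Precision@k, Recall@k, and F1 for a single question. The README does not state the exact definitions used in the evaluation, so the formulas here (hits divided by the number of suggestions, hits divided by the number of true tags) are assumptions, as are the function names.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], true_tags: Set[str], k: int = 5) -> float:
    """Share of the top-k suggested tags that are correct."""
    top_k = list(ranked[:k])
    if not top_k:
        return 0.0
    hits = sum(1 for tag in top_k if tag in true_tags)
    return hits / len(top_k)


def recall_at_k(ranked: Sequence[str], true_tags: Set[str], k: int = 5) -> float:
    """Share of the true tags that appear among the top-k suggestions."""
    if not true_tags:
        return 0.0
    hits = sum(1 for tag in ranked[:k] if tag in true_tags)
    return hits / len(true_tags)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


suggested = ["reactjs", "typescript", "react-hooks", "usestate", "javascript"]
actual = {"reactjs", "typescript", "react-hooks"}

p = precision_at_k(suggested, actual)  # 3 correct out of 5 suggestions
r = recall_at_k(suggested, actual)     # 3 of the 3 true tags were found
f1 = f1_score(p, r)
```

As a sanity check on the table, the harmonic mean of 0.78 and 0.62 is about 0.69, which matches the reported F1-Score.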

Business Impact

  • 31% reduction in tag correction rate by moderators
  • 52% adoption rate among question authors
  • Improved question discoverability and response rates

Quick Start

Prerequisites

  • Python 3.9 or higher
  • pip package manager

Installation

# Clone the repository
git clone https://github.com/ThomasMeb/Classifier_Questions_StackOverflow.git
cd Classifier_Questions_StackOverflow

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
make install-dev
# Or manually:
pip install -r requirements.txt -r requirements-dev.txt
pip install -e .

Data Setup

The data is not included in the repository (~900MB). Download it from Kaggle:

# 1. Configure the Kaggle API (one-time setup)
pip install kaggle
# Download kaggle.json from https://www.kaggle.com/settings
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 2. Download and prepare the data
./scripts/download_data.sh

The script downloads the Python Questions Dataset and converts it to the format expected by the notebooks.

Configuration

# Copy environment template
cp .env.example .env

# Edit configuration (optional)
nano .env
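
As a rough idea of how such settings might be consumed, the sketch below reads configuration from environment variables with typed defaults. The variable names (`INTELLITAG_MODEL_TYPE`, `INTELLITAG_TOP_K`, `INTELLITAG_MODEL_DIR`) and the `Settings` fields are hypothetical; check `.env.example` for the keys the project actually uses. It also assumes the values from `.env` have already been exported into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names are illustrative."""

    model_type: str = "tfidf"
    top_k: int = 5
    model_dir: str = "models"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    top_k = int(env.get("INTELLITAG_TOP_K", Settings.top_k))
    if top_k < 1:
        raise ValueError("INTELLITAG_TOP_K must be a positive integer")
    return Settings(
        model_type=env.get("INTELLITAG_MODEL_TYPE", Settings.model_type),
        top_k=top_k,
        model_dir=env.get("INTELLITAG_MODEL_DIR", Settings.model_dir),
    )


defaults = load_settings({})
custom = load_settings({"INTELLITAG_TOP_K": "3", "INTELLITAG_MODEL_TYPE": "bert"})
```

Passing the mapping explicitly keeps the function easy to test without touching the real environment.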

Running the API

# Start the server
make run
# Or:
uvicorn src.intellitag.api.app:app --reload

# Server runs at http://localhost:8000

API Usage

Predict Tags

Endpoint: POST /api/v1/predict

Request:

curl -X POST "http://localhost:8000/api/v1/predict" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "How to use React hooks with TypeScript?",
    "body": "I am trying to use useState and useEffect in my TypeScript React application but I keep getting type errors. How do I properly type my hooks?",
    "top_k": 5
  }'

Response:

{
  "status": "success",
  "predictions": [
    {"tag": "reactjs", "confidence": 0.92},
    {"tag": "typescript", "confidence": 0.89},
    {"tag": "react-hooks", "confidence": 0.87},
    {"tag": "useState", "confidence": 0.71},
    {"tag": "javascript", "confidence": 0.65}
  ],
  "model_version": "2.0.0",
  "processing_time_ms": 127
}
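
The same call can be made from Python. The sketch below uses only the standard library and mirrors the request and response shapes shown above; the helper names (`build_request`, `parse_predictions`, `predict_tags`) and the `min_confidence` filter are assumptions added for illustration.

```python
import json
import urllib.request
from typing import List

# URL taken from the curl example; adjust it if the server runs elsewhere.
API_URL = "http://localhost:8000/api/v1/predict"


def build_request(title: str, body: str, top_k: int = 5) -> urllib.request.Request:
    """Create the POST request with a JSON payload."""
    payload = json.dumps({"title": title, "body": body, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(raw: bytes, min_confidence: float = 0.0) -> List[str]:
    """Extract tag names from a response, optionally filtering by confidence."""
    doc = json.loads(raw)
    if doc.get("status") != "success":
        raise RuntimeError(f"prediction failed: {doc!r}")
    return [p["tag"] for p in doc["predictions"] if p["confidence"] >= min_confidence]


def predict_tags(title: str, body: str, top_k: int = 5, timeout: float = 10.0) -> List[str]:
    """Send the request to a running server and return the suggested tags."""
    request = build_request(title, body, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_predictions(response.read())
```

With the server running locally, `predict_tags("How to use React hooks with TypeScript?", "...")` would return the list of suggested tag names.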

Health Check

Endpoint: GET /api/v1/health

curl "http://localhost:8000/api/v1/health"

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "version": "2.0.0"
}

Architecture

System Overview

+-------------------------------------------------------------------------+
|                           INTELLITAG SYSTEM                              |
+-------------------------------------------------------------------------+
|                                                                          |
|  +----------------+     +------------------+     +-------------------+   |
|  |   API Layer    |     |  Service Layer   |     |   Domain Layer    |   |
|  |                |     |                  |     |                   |   |
|  | - REST Routes  | --> | - Orchestration  | --> | - Preprocessor    |   |
|  | - Validation   |     | - Model Select   |     | - Feature Extract |   |
|  | - Serialization|     | - Caching        |     | - Classifier      |   |
|  +----------------+     +------------------+     +-------------------+   |
|          |                                               |               |
|          v                                               v               |
|  +----------------+                             +-------------------+    |
|  | Health Monitor |                             | Infrastructure    |    |
|  | - Metrics      |                             | - Data Loader     |    |
|  | - Logging      |                             | - Model Store     |    |
|  +----------------+                             | - Config          |    |
|                                                 +-------------------+    |
+-------------------------------------------------------------------------+

Data Flow

Training:
[Raw CSV] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [Model Artifacts]

Inference:
[API Request] --> [Preprocessor] --> [Feature Extractor] --> [Classifier] --> [API Response]
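
The inference flow above can be sketched as three pluggable stages chained by a small orchestrator. Everything in this example is a stand-in chosen for illustration: the `TagPipeline` class, the toy stages, and the keyword weights are hypothetical and only show how the stages compose, not how the real models score tags.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Features = Dict[str, float]
Scores = Dict[str, float]


@dataclass
class TagPipeline:
    """Chains preprocessing, feature extraction, and classification."""

    preprocess: Callable[[str], Tokens]
    extract: Callable[[Tokens], Features]
    classify: Callable[[Features], Scores]

    def predict(self, title: str, body: str, top_k: int = 5) -> List[Tuple[str, float]]:
        tokens = self.preprocess(f"{title} {body}")
        features = self.extract(tokens)
        scores = self.classify(features)
        # Highest-scoring tags first, truncated to top_k.
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top_k]


def toy_preprocess(text: str) -> Tokens:
    return text.lower().split()


def toy_extract(tokens: Tokens) -> Features:
    # Plain term counts as a stand-in for TF-IDF or embeddings.
    features: Features = {}
    for token in tokens:
        features[token] = features.get(token, 0.0) + 1.0
    return features


# Made-up keyword weights standing in for a trained multi-label classifier.
TOY_WEIGHTS = {
    "reactjs": {"react": 1.0, "hooks": 0.5},
    "python": {"python": 1.0, "pandas": 1.0},
}


def toy_classify(features: Features) -> Scores:
    return {
        tag: sum(weight * features.get(term, 0.0) for term, weight in weights.items())
        for tag, weights in TOY_WEIGHTS.items()
    }


pipeline = TagPipeline(toy_preprocess, toy_extract, toy_classify)
top = pipeline.predict("React hooks", "state not updating", top_k=1)
```

Because each stage is just a callable, a BERT-based extractor or a different classifier can replace a toy stage without changing the orchestration code.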

Tech Stack

Core Technologies

| Component | Technology | Purpose |
|---|---|---|
| Language | Python 3.9+ | ML ecosystem |
| ML Framework | scikit-learn | Classification |
| Deep Learning | TensorFlow 2.x | BERT, USE |
| NLP | NLTK | Text preprocessing |
| API | FastAPI | REST endpoints |
| Validation | Pydantic | Request/response schemas |
| Data | pandas, NumPy | Data manipulation |

Development Tools

| Tool | Purpose |
|---|---|
| pytest | Testing framework |
| black | Code formatting |
| flake8 | Linting |
| mypy | Type checking |
| pre-commit | Git hooks |

Project Structure

intellitag/
|-- README.md                 # This file
|-- README_FR.md              # French version
|-- LICENSE                   # MIT License
|-- setup.py                  # Package installation
|-- requirements.txt          # Production dependencies
|-- requirements-dev.txt      # Development dependencies
|-- pyproject.toml            # Python project config
|-- Makefile                  # Common commands
|-- .env.example              # Environment template
|
|-- src/intellitag/           # Main package
|   |-- __init__.py
|   |-- config/               # Configuration
|   |   +-- settings.py
|   |-- data/                 # Data handling
|   |   |-- loader.py
|   |   +-- preprocessor.py
|   |-- features/             # Feature extraction
|   |   |-- base.py           # Abstract base
|   |   |-- bow.py            # TF-IDF
|   |   |-- word2vec.py
|   |   |-- bert.py
|   |   +-- use.py
|   |-- models/               # ML models
|   |   +-- classifier.py
|   |-- api/                  # REST API
|   |   |-- app.py
|   |   +-- schemas.py
|   +-- utils/                # Utilities
|       +-- metrics.py
|
|-- tests/                    # Test suite
|   |-- conftest.py           # Fixtures
|   |-- unit/                 # Unit tests
|   +-- integration/          # Integration tests
|
|-- notebooks/                # Jupyter notebooks
|-- docs/                     # Documentation
|-- data/                     # Data files (gitignored)
|-- models/                   # Trained models (gitignored)
+-- scripts/                  # Utility scripts

Testing

Run All Tests

make test
# Or:
pytest

Run with Coverage

make test-cov
# Or:
pytest --cov=src/intellitag --cov-report=html

Test Suite Summary

| Category | Tests | Coverage |
|---|---|---|
| Unit - Preprocessor | 18 | Data cleaning, tokenization |
| Unit - Loader | 9 | Data loading, validation |
| Unit - Features | 19 | All feature extractors |
| Unit - Classifier | 22 | Classification, metrics |
| Integration - API | 16 | End-to-end API tests |
| Total | 84 | ~85% |

Documentation

Detailed documentation is available in the docs/ folder:

| Document | Description |
|---|---|
| PRODUCT_VISION.md | Product vision, KPIs, personas |
| USER_STORIES.md | User stories and backlog |
| PRD.md | Product requirements document |
| ARCHITECTURE.md | System architecture |
| DATA_DICTIONARY.md | Data schemas and transformations |

Available Commands

make install      # Install production dependencies
make install-dev  # Install all dependencies
make test         # Run tests
make test-cov     # Run tests with coverage
make run          # Start API server
make format       # Format code with black
make lint         # Run linting checks
make clean        # Clean build artifacts

License

This project is licensed under the MIT License - see the LICENSE file for details.


Author

Thomas Mebarki, Data Scientist & ML Engineer


Acknowledgments

  • Stack Overflow Inc. for the opportunity to work on this challenging NLP problem
  • The open-source community for the amazing tools that made this possible
  • Stack Exchange Data Explorer for providing access to anonymized data samples

This project demonstrates expertise in NLP, machine learning pipeline development, and production-ready API design. For inquiries about similar projects or collaboration opportunities, feel free to reach out.