This module is part of DevLink AI, a final-year research project that analyzes project–developer compatibility by aggregating multi-source signals (GitHub, Stack Overflow, social signals, and personality cues) to match developers with projects.
My contribution was the development of the Stack Overflow Knowledge Depth Analysis Module.
Most existing systems evaluate developers by activity volume or reputation, and often fail to measure how conceptually deep a developer's knowledge is. This project addresses that gap by:
- Classifying Stack Overflow questions into Basic, Intermediate, or Advanced levels.
- Scoring developers based on the difficulty and impact of their contributions using a log-scaled XP (experience point) algorithm.
Problem
Traditional Stack Overflow metrics (reputation, counts, upvotes) emphasize activity, not the depth of technical knowledge demonstrated in content.
Solution
Automatically classify the difficulty level of posts (Basic / Intermediate / Advanced) and compute a knowledge depth score that jointly considers content complexity and contribution type (question vs. answer).
Novelty
- Hybrid modeling: combine rule-based conceptual features (36 macro categories) with statistical text features (TF-IDF).
- Log-scaled scoring: emphasizes quality over quantity, rewarding advanced, high-signal contributions.
- Beyond reputation: aims to capture demonstrated expertise rather than engagement-only proxies.
Data Preprocessing
- Handle missing/duplicate records.
- Clean HTML/links/punctuation; tokenize; lemmatize with POS tagging.
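The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the function names `clean_post` and `drop_missing_and_duplicates` and the regular expressions are assumptions. Lemmatization with POS tagging is left out here because it needs an NLP library such as NLTK or spaCy.

```python
import html
import re

# Hypothetical patterns for illustration; the real pipeline may differ.
TAG_RE = re.compile(r"<[^>]+>")           # HTML tags such as <p> or <code>
URL_RE = re.compile(r"https?://\S+")      # bare links
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")  # punctuation and other symbols


def drop_missing_and_duplicates(posts):
    """Remove empty/None posts and exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for post in posts:
        if post is None or not post.strip():
            continue
        if post in seen:
            continue
        seen.add(post)
        kept.append(post)
    return kept


def clean_post(raw):
    """Strip tags and links, decode entities, lowercase, and tokenize."""
    text = TAG_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    text = html.unescape(text)
    text = NON_WORD_RE.sub(" ", text.lower())
    return text.split()


raw_posts = [
    "<p>How do I use <code>std::vector</code>? See https://example.com/docs</p>",
    None,
    "<p>How do I use <code>std::vector</code>? See https://example.com/docs</p>",
]
tokens = [clean_post(p) for p in drop_missing_and_duplicates(raw_posts)]
```

Tags are stripped before entities are decoded so that escaped code such as `&lt;vector&gt;` is not mistaken for markup.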
Feature Extraction
- Conceptual (rule-based): 36 macro features spanning Basic, Intermediate, Advanced knowledge areas (e.g., Syntax, Data Structures, System Design).
- Statistical: TF-IDF vectorization of post text.
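As a rough sketch of the rule-based side, each macro category can be treated as a keyword list whose matches are counted per post. Only three of the 36 categories are shown, and the keyword lists, the `MACRO_KEYWORDS` name, and the `rule_features` function are hypothetical; the project's real rules are not reproduced here.

```python
import re

# Hypothetical keyword lists for three of the macro categories named above.
MACRO_KEYWORDS = {
    "Syntax": ("syntax error", "semicolon", "indentation"),
    "Data Structures": ("linked list", "hash map", "binary tree"),
    "System Design": ("load balancer", "sharding", "microservice"),
}


def rule_features(text):
    """Count whole-phrase keyword hits per macro category."""
    lowered = text.lower()
    features = {}
    for category, keywords in MACRO_KEYWORDS.items():
        hits = 0
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            hits += len(re.findall(pattern, lowered))
        features[category] = hits
    return features


example = rule_features(
    "Why does sharding behind a load balancer break my hash map cache?"
)
```

Because every feature maps to a named category, a prediction can be traced back to the concepts that triggered it, which is the main appeal of keeping a rule-based component next to TF-IDF.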
Classification
- Train an SVM using TF-IDF + rule-based features.
- Benchmark against a TF-IDF–only baseline.
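One way to set up the hybrid model and its baseline with scikit-learn is sketched below. The toy posts, the two-column `rule_features` helper, and the choice of a linear kernel are all assumptions made for illustration; the source only states that an SVM is trained on TF-IDF plus rule-based features and compared against a TF-IDF-only baseline.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Made-up toy posts, not project data.
texts = [
    "how to print a string in python",
    "what does this syntax error mean",
    "reverse a linked list in place",
    "hash map versus binary tree lookup cost",
    "sharding strategy for a write heavy service",
    "design a load balancer for microservice traffic",
]
labels = ["Basic", "Basic", "Intermediate", "Intermediate", "Advanced", "Advanced"]

# Hypothetical stand-in for the 36 rule-based macro features.
RULE_KEYWORDS = (("linked list", "hash map", "binary tree"),
                 ("sharding", "load balancer", "microservice"))


def rule_features(text):
    """Return one keyword-hit count per hypothetical rule group."""
    return [sum(text.count(k) for k in group) for group in RULE_KEYWORDS]


tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)
X_rules = csr_matrix(np.array([rule_features(t) for t in texts], dtype=float))

# Hybrid matrix: TF-IDF columns followed by rule-based columns.
X_hybrid = hstack([X_tfidf, X_rules]).tocsr()

hybrid_clf = SVC(kernel="linear").fit(X_hybrid, labels)
baseline_clf = SVC(kernel="linear").fit(X_tfidf, labels)


def predict_level(text):
    """Classify a new post with the hybrid model."""
    row = hstack([tfidf.transform([text]),
                  csr_matrix(np.array([rule_features(text)], dtype=float))]).tocsr()
    return hybrid_clf.predict(row)[0]
```

A fair comparison would evaluate both classifiers on the same held-out split; the six training posts here are far too few for any accuracy figure to be meaningful.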
Knowledge Depth Scoring
- Log-scaled weighted score combining predicted difficulty and post type (Q/A) to produce a per-user knowledge depth metric.
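The exact scoring formula is not given here, so the following is only a sketch of one plausible reading: each post earns XP equal to a difficulty weight times a post-type weight, and the per-user score is the natural log of one plus the total XP. The weight values and the `knowledge_depth_score` name are hypothetical.

```python
import math

# Hypothetical weights; the project's actual values are not stated.
DIFFICULTY_WEIGHT = {"Basic": 1.0, "Intermediate": 3.0, "Advanced": 6.0}
POST_TYPE_WEIGHT = {"question": 1.0, "answer": 1.5}


def knowledge_depth_score(posts):
    """Return log(1 + total XP) for a list of (difficulty, post_type) pairs."""
    xp = sum(DIFFICULTY_WEIGHT[level] * POST_TYPE_WEIGHT[kind]
             for level, kind in posts)
    return math.log1p(xp)


many_basic = knowledge_depth_score([("Basic", "question")] * 5)
one_advanced = knowledge_depth_score([("Advanced", "answer")])
```

Under these assumed weights, a single advanced answer (9 XP) outscores five basic questions (5 XP), and the logarithm gives diminishing returns as post volume grows.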
Tech Stack
- Python
- Scikit-learn
- Pandas / NumPy
- OpenAI GPT
- Inspired by Bloom’s Taxonomy for CS education
- Grounded in prior work on difficulty classification and developer modeling
- Designed for explainability, transparency, and real-world applicability