Stack Overflow Knowledge Depth Analysis

🔎 Overview

This module is part of DevLink AI, our final-year research project, which matches developers with projects by aggregating multi-source signals (GitHub, Stack Overflow, social signals, and personality cues) to assess project–developer compatibility.

My contribution was the development of the Stack Overflow Knowledge Depth Analysis Module.

🧠 Problem Statement

Most existing systems evaluate developers based on activity volume or reputation. However, they often fail to measure how conceptually deep a developer's knowledge is. This project addresses that gap by:

  • Classifying Stack Overflow questions into Basic, Intermediate, or Advanced levels.
  • Scoring developers based on the difficulty and impact of their contributions using a log-scaled XP algorithm.

❗ Problem – Solution – Novelty

Problem
Traditional Stack Overflow metrics (reputation, counts, upvotes) emphasize activity, not the depth of technical knowledge demonstrated in content.

Solution
Automatically classify the difficulty level of posts (Basic / Intermediate / Advanced) and compute a knowledge depth score that jointly considers content complexity and contribution type (question vs. answer).

Novelty

  • Hybrid modeling: combine rule-based conceptual features (36 macro categories) with statistical text features (TF-IDF); see the sketch after this list.
  • Log-scaled scoring: emphasizes quality over quantity, rewarding advanced, high-signal contributions.
  • Beyond reputation: captures true expertise rather than engagement-only proxies.
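A minimal sketch of the hybrid feature idea: keyword-count features per macro category stacked next to TF-IDF features, with an SVM trained on the combined matrix. The macro categories, keyword lists, and vectorizer settings below are illustrative assumptions, not the 36 categories or the exact configuration used in the module.

```python
# Sketch: combine rule-based conceptual features with TF-IDF text features,
# then train an SVM on the stacked matrix. Keyword lists are illustrative only.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical macro categories; the real module defines 36 of them.
MACRO_CATEGORIES = {
    "syntax": ["syntax", "compile error", "indentation"],
    "data_structures": ["hashmap", "linked list", "binary tree"],
    "system_design": ["scalability", "load balancer", "sharding"],
}

def rule_based_features(texts):
    """Count keyword hits per macro category for each post."""
    rows = []
    for text in texts:
        lowered = text.lower()
        rows.append([sum(lowered.count(kw) for kw in kws)
                     for kws in MACRO_CATEGORIES.values()])
    return csr_matrix(np.array(rows, dtype=float))

def build_features(texts, vectorizer=None):
    """Stack TF-IDF features with the rule-based conceptual features."""
    if vectorizer is None:
        vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
        tfidf = vectorizer.fit_transform(texts)
    else:
        tfidf = vectorizer.transform(texts)
    return hstack([tfidf, rule_based_features(texts)]), vectorizer

# Usage (assuming a DataFrame `posts` with cleaned "text" and a "level" label):
# X, vec = build_features(posts["text"])
# clf = LinearSVC().fit(X, posts["level"])  # Basic / Intermediate / Advanced
```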

🛠️ Approach

  1. Data Preprocessing

    • Handle missing/duplicate records.
    • Clean HTML/links/punctuation; tokenize; lemmatize with POS tagging.
  2. Feature Extraction

    • Conceptual (rule-based): 36 macro features spanning Basic, Intermediate, and Advanced knowledge areas (e.g., Syntax, Data Structures, System Design).
    • Statistical: TF-IDF vectorization of post text.
  3. Classification

    • Train an SVM using TF-IDF + rule-based features.
    • Benchmark against a TF-IDF–only baseline.
  4. Knowledge Depth Scoring

    • Log-scaled weighted score combining predicted difficulty and post type (Q/A) to produce a per-user knowledge depth metric (see the sketch below).
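A minimal sketch of the log-scaled scoring idea, assuming each user's posts arrive as (predicted difficulty, post type) pairs. The weights and the per-bucket log compression are illustrative assumptions, not the module's actual algorithm; the point is that volume gives diminishing returns while difficulty and contribution type keep their weight.

```python
# Sketch: log-scaled knowledge depth score per user.
# Difficulty and post-type weights are illustrative placeholders.
import math
from collections import Counter

DIFFICULTY_WEIGHTS = {"Basic": 1.0, "Intermediate": 2.0, "Advanced": 4.0}
POST_TYPE_WEIGHTS = {"question": 1.0, "answer": 1.5}  # answers weighted higher

def knowledge_depth_score(posts):
    """posts: iterable of (difficulty, post_type) pairs for one user.

    Counts are log-compressed per (difficulty, type) bucket, so piling up
    Basic posts yields diminishing returns while Advanced contributions
    retain their weight -- quality over quantity.
    """
    buckets = Counter(posts)
    return sum(
        DIFFICULTY_WEIGHTS[d] * POST_TYPE_WEIGHTS[t] * math.log1p(count)
        for (d, t), count in buckets.items()
    )

# Example: 3 Advanced answers score higher than 30 Basic questions.
print(knowledge_depth_score([("Advanced", "answer")] * 3))   # ~8.3
print(knowledge_depth_score([("Basic", "question")] * 30))   # ~3.4
```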

📊 Technologies Used

  • Python
  • Scikit-learn
  • Pandas / NumPy
  • OpenAI GPT

📚 Research Foundation

  • Inspired by Bloom’s Taxonomy for CS education
  • Grounded in prior work on difficulty classification and developer modeling
  • Designed for explainability, transparency, and real-world applicability

About

This is my individual contribution to our final-year research project.
