A comprehensive data analysis project examining the relationship between college affordability and student economic mobility outcomes.
This project analyzes the causal relationship between college affordability (measured by the "affordability gap") and student outcomes (graduation rates and earnings 10 years post-enrollment) across U.S. 4-year institutions.
- How does college affordability impact graduation rates?
- What is the effect of affordability on post-graduation earnings?
- Which institutional factors moderate the affordability-outcome relationship?
- Are there differential effects for high vs. low Pell-eligible student populations?
Optimized Model Performance:
- R² Score: ~0.77 (explains ~77% of earnings variance)
- RMSE: ~$5,500 (prediction error)
- Dataset: 4,800+ U.S. 4-year institutions
- Features: 32 institutional, demographic, and resource variables
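For readers less familiar with the two metrics listed above, here is a minimal sketch of how R² and RMSE are conventionally computed from actual and predicted earnings. The function names and the toy numbers are illustrative assumptions, not values or code from this project.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target (dollars here)."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy earnings values (hypothetical, for illustration only)
actual = [40000, 50000, 60000, 70000]
predicted = [42000, 48000, 63000, 67000]

print(f"RMSE: ${rmse(actual, predicted):,.0f}")
print(f"R^2:  {r_squared(actual, predicted):.3f}")
```

An RMSE of about $5,500 therefore means the typical prediction error is on the order of a few thousand dollars of median earnings, while R² expresses how much of the spread across institutions the model accounts for.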
Key Findings:
- Affordability gap shows significant predictive power for earnings outcomes
- Selectivity (admit rates, test scores) is the strongest predictor
- Interaction effects between affordability and selectivity matter
- Different patterns observed for high vs. low Pell institutions
- Python 3.11+
- Node.js 25+ (for frontend)
- pip and npm
# Clone/navigate to project
cd /path/to/datathon
# Setup Python environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Setup frontend (if using)
cd fronted
npm install
cd ..

1. LLM Chatbot (Quick Exploration)
cd llm
python3 college_chatbot_web.py
# Visit http://localhost:8000

2. Data Preparation (if starting fresh)
cd src
python3 01_data_preparation.py

3. ML Model Training (if needed)
cd src
python3 earnings_mobility_rf_analysis.py

4. Load and Use Trained Models
cd src
python3 load_and_predict.py

5. Frontend Application
cd fronted
npm run dev
# Visit http://localhost:3000

datathon/
├── api/                  # Flask backend API for ML models
├── llm/                  # LLM chatbot and analysis tools
├── src/                  # Data preparation and ML training
├── outputs/              # Results, models, and figures
├── fronted/              # Next.js web application
├── tasks/                # Project documentation and task lists
├── requirements.txt      # Python dependencies
└── PROJECT_STRUCTURE.md  # Detailed structure documentation
See PROJECT_STRUCTURE.md for complete directory documentation.
Flask REST API for serving ML model predictions and institution data.
Features:
- RESTful endpoints for predictions
- Institution search and retrieval
- Model metadata and health monitoring
- CORS enabled for frontend integration
Quick Start:
cd api
./start.sh
# Visit http://localhost:5000
Key Endpoints:
- GET /health - Health check
- GET /models - List available models
- POST /predict - Make earnings predictions
- GET /institutions?name=harvard - Search institutions
API Documentation: see api/README.md
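As a rough illustration of how a client might talk to the endpoints listed above, the sketch below builds requests with the Python standard library. The JSON payload shape (`{"features": {...}}`), the feature name, and the helper names are assumptions made for this example; the actual request schema is defined in api/README.md.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"  # address from the Quick Start

def build_search_url(name, base_url=BASE_URL):
    """Build the URL for GET /institutions with a URL-encoded name filter."""
    query = urllib.parse.urlencode({"name": name})
    return f"{base_url}/institutions?{query}"

def build_predict_request(features, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /predict.

    The {"features": ...} wrapper is an assumed payload shape.
    """
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_json(request_or_url, timeout=10):
    """Send a request (or plain URL) and decode the JSON response body."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Build requests without sending them, so this runs even when the API is down
request = build_predict_request({"affordability_gap": 1200.0})  # hypothetical feature
print(request.get_method(), request.full_url)
print(build_search_url("harvard"))
```

With the API running, `call_json(f"{BASE_URL}/health")` or `call_json(request)` would send the corresponding call and return the parsed response.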
Interactive tools for exploring college data through natural language queries.
Features:
- Web-based chatbot interface
- CLI chatbot
- Interactive college search and comparison
- Quick analysis scripts
Quick Start:
cd llm
python3 college_chatbot_web.py

Scripts for data cleaning, feature engineering, and model training.
Key Scripts:
- 01_data_preparation.py - Data loading, cleaning, merging
- earnings_mobility_rf_analysis.py - Random Forest model training
- load_and_predict.py - Model inference
Outputs:
- /outputs/data/ - Cleaned datasets
- /outputs/rf_analysis/ - Trained models and results
Pre-trained Random Forest models for predicting earnings outcomes.
Available Models:
- R1a: Core model (full features + treatment)
- R1b: Baseline (demographics only)
- R1c: Interaction model (affordability × selectivity)
- R1d: Subgroup models (High/Low Pell)
Model Specs:
- Algorithm: Random Forest Regressor
- Features: 32 variables (treatment, confounders, interactions)
- Hyperparameters: 500 trees, max_depth=25, min_samples_split=5
- Performance: R²=0.77, RMSE=$5,500
Model Documentation: see outputs/rf_analysis/models/README.md
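The hyperparameters listed under Model Specs map directly onto scikit-learn's `RandomForestRegressor`. The sketch below shows one way to express that configuration; `random_state` and `n_jobs` are assumptions added for reproducibility and speed, since the README does not state them, and the training script may set things up differently.

```python
# Hyperparameters as listed under Model Specs
RF_PARAMS = {
    "n_estimators": 500,       # 500 trees
    "max_depth": 25,
    "min_samples_split": 5,
}

def build_model(random_state=42, n_jobs=-1):
    """Return an unfitted RandomForestRegressor with the documented settings.

    random_state and n_jobs are assumed values, not taken from the project.
    """
    # Imported lazily so the settings can be inspected without scikit-learn
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(
        random_state=random_state, n_jobs=n_jobs, **RF_PARAMS
    )
```

Calling `build_model().fit(X, y)` on a feature matrix and an earnings vector would then train a forest with the same tree count and depth limits as the pre-trained models.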
Next.js web application for data visualization and exploration.
Tech Stack:
- Next.js 15+
- React 19+
- Tailwind CSS
- TypeScript
Status: Ready for development (node_modules setup resolved)
- Affordability Gap Data (Affordability_latest_02-17-25 1.csv)
  - Net price after grants
  - Minimum wage gap calculations
  - Institution identifiers
- College Scorecard Data (College Results View 2021 Data.csv)
  - Graduation rates
  - Earnings outcomes (10-year median)
  - Institutional characteristics
  - Student demographics
  - Selectivity metrics
Merged Dataset: /outputs/data/merged_clean.csv and /fronted/data/merged_clean.csv
- Load and standardize institution identifiers
- Filter to 4-year institutions
- Merge affordability and scorecard data
- Handle missing data (listwise deletion for critical variables)
- Create treatment variable (affordability quartiles)
- Engineer features (selectivity, demographics, resources)
- Median imputation for remaining missing confounders
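Three of the steps above (listwise deletion, quartile-based treatment, and median imputation) can be sketched with the standard library alone. This is a simplified illustration rather than the code in 01_data_preparation.py; the column names and sample rows are hypothetical. With pandas, `pd.qcut` would be the usual tool for the quartile step.

```python
import bisect
import statistics

def drop_missing(rows, critical_columns):
    """Listwise deletion: keep rows where every critical column is present."""
    return [r for r in rows if all(r.get(c) is not None for c in critical_columns)]

def add_quartile(rows, column, new_column):
    """Label each row 1-4 according to the quartile of `column` it falls in."""
    cuts = statistics.quantiles([r[column] for r in rows], n=4)  # three cut points
    for r in rows:
        r[new_column] = bisect.bisect_left(cuts, r[column]) + 1
    return rows

def impute_median(rows, column):
    """Replace missing values in `column` with the median of observed values."""
    fill = statistics.median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = fill
    return rows

# Hypothetical records; column names are illustrative, not the project's schema
rows = [
    {"unit_id": i, "affordability_gap": 1000.0 * i,
     "earnings_10yr": 40000 + 1000 * i,
     "pct_pell": 10 * i if i != 3 else None}
    for i in range(1, 9)
]
# One record with a missing outcome, which listwise deletion should remove
rows.append({"unit_id": 9, "affordability_gap": 5000.0,
             "earnings_10yr": None, "pct_pell": 50})

clean = drop_missing(rows, ["affordability_gap", "earnings_10yr"])
clean = add_quartile(clean, "affordability_gap", "gap_quartile")
clean = impute_median(clean, "pct_pell")

for r in clean:
    print(r["unit_id"], r["gap_quartile"], r["pct_pell"])
```

Dropping rows only for critical variables and imputing the rest keeps more institutions in the sample than deleting every incomplete row would.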
- Random Forest Regression (non-parametric, handles non-linearity)
- Feature Engineering:
  - Treatment: Affordability gap (continuous + quartiles)
  - Outcomes: Graduation rates, 10-year median earnings
  - Confounders: 32 variables across 5 categories
- Model Variants:
  - Core, baseline, interaction, and subgroup models
- Optimization:
  - Hyperparameter tuning (trees, depth, splits)
  - Missing data indicators (SAT/ACT flags)
  - Cross-validation
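The cross-validation step listed above can be illustrated with a small k-fold splitter. This is a sketch of the general technique only: the README does not state the number of folds or the shuffling strategy, so `k=5` and the fixed seed are assumptions. In practice, scikit-learn's `KFold` or `cross_val_score` would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

# Each institution lands in exactly one test fold
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold_number}: train={len(train_idx)} test={sorted(test_idx)}")
```

Averaging a metric such as R² over the held-out folds gives a less optimistic estimate than scoring the model on the data it was trained on.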
- Selectivity: Admit rate, SAT/ACT scores, flags for missing test scores
- Institution: Sector, size, state, region, control type
- Demographics: % Pell, % URM, % by race/ethnicity, % women
- Resources: Instructional expenditure, endowment (log + binary)
- MSI Indicators: HBCU, HSI, AANAPISI, etc.
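Several of the derived features mentioned in this README (missing test-score flags, log and binary endowment, and the affordability × selectivity interaction) could be built along the following lines. The column names are hypothetical, and reading the "binary" endowment feature as "has a positive endowment" is an assumption about the original design.

```python
import math

def engineer_features(row):
    """Return a copy of one institution's record with derived features added."""
    out = dict(row)
    # Flag missing scores so the model can tell "not reported" from a real value
    out["sat_missing"] = 1 if row.get("sat_avg") is None else 0
    # Endowment as a binary indicator plus a log transform (log1p handles zero)
    endowment = row.get("endowment") or 0.0
    out["has_endowment"] = 1 if endowment > 0 else 0
    out["log_endowment"] = math.log1p(endowment)
    # Interaction term: affordability gap x admit rate (a selectivity measure)
    out["gap_x_admit_rate"] = row["affordability_gap"] * row["admit_rate"]
    return out

# Hypothetical institution record
example = engineer_features(
    {"sat_avg": None, "endowment": 0, "affordability_gap": 2000.0, "admit_rate": 0.5}
)
print(example)
```

Tree-based models can discover interactions on their own, so an explicit interaction column mainly makes the effect easier to inspect in feature-importance output.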
- treatment_distribution.png - Affordability gap by quartile
- graduation_rate_distribution.png - 6-year grad rates
- earnings_distribution.png - 10-year median earnings

- SUMMARY_REPORT.md - Comprehensive findings
- PERFORMANCE_IMPROVEMENTS.md - Optimization results
- model_summary.csv - Performance metrics
- feature_importance_*.csv - Feature rankings

- analysis_log.md - Detailed decision log
- task_2.0_stop_and_think_review.md - Task checkpoint review
pandas>=2.0.0
numpy>=1.24.0
scikit-learn==1.6.1
matplotlib>=3.7.0
seaborn>=0.12.0
joblib>=1.2.0

- Python: 3.11+
- OS: Tested on WSL2 (Ubuntu on Windows)
- Memory: Models require ~2GB RAM for training
- Disk: ~500MB for all outputs
Main Task List: tasks/tasks-affordability-mobility-causal-analysis.md
Completed:
- Task 0: Project Setup
- Task 1.0: Data Preparation & Cleaning
- Task 2.0: Feature Engineering & Treatment Definition
In Progress:
- Task 3.0: Causal Analysis & Model Refinement
- Frontend Integration
Problem: EACCES: permission denied when running npm install
Solution:
cd fronted
rm -rf node_modules package-lock.json
npm cache clean --force
npm install
Or move project to native WSL filesystem:
cp -r /mnt/c/path/to/datathon ~/datathon
cd ~/datathon/fronted
npm install

Problem: Models saved with scikit-learn 1.6.1, but older version installed
Solution:
pip install --upgrade scikit-learn==1.6.1

Problem: Privacy-suppressed binned earnings data
Solution: Switched to granular earnings column (Median Earnings of Dependent Students Working and Not Enrolled 10 Years After Entry)
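For the scikit-learn version mismatch described above, a quick check before loading the saved models can make the problem obvious. The sketch below uses `importlib.metadata` from the standard library; the helper names are made up for this example and are not part of the project.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_pin(package, expected):
    """Return (matches, installed_version) for an exactly pinned package."""
    found = installed_version(package)
    return found == expected, found

# The models were saved with scikit-learn 1.6.1 (see requirements.txt)
ok, found = check_pin("scikit-learn", "1.6.1")
if ok:
    print("scikit-learn 1.6.1 is installed")
else:
    print(f"scikit-learn {found} found, expected 1.6.1; "
          "run: pip install --upgrade scikit-learn==1.6.1")
```

Running a check like this at the top of a loading script fails fast instead of surfacing a harder-to-read unpickling warning or error later.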
- Create Flask backend API for model serving
- Integrate ML models into frontend application
- Add interactive visualizations (Plotly, D3.js)
- Implement causal inference methods (matching, IV)
- Add more sophisticated LLM integration (Ollama, GPT)
- Create Docker deployment configuration
- Add unit and integration tests
- Implement real-time data updates
- Build personalized recommendation system
| Document | Purpose |
|---|---|
| PROJECT_STRUCTURE.md | Complete project organization |
| api/README.md | Backend API documentation |
| llm/README.md | LLM and chatbot documentation |
| outputs/rf_analysis/models/README.md | Model loading and usage |
| outputs/rf_analysis/SUMMARY_REPORT.md | Analysis findings |
| outputs/logs/analysis_log.md | Detailed decision log |
| tasks/tasks-affordability-mobility-causal-analysis.md | Task list |
This is a DATATHON project. For questions or contributions:
- Review the task list in /tasks/
- Check analysis logs in /outputs/logs/
- Refer to specific module READMEs
- Follow existing code style and documentation practices
[Add license information if applicable]
[Add team member information if applicable]
For questions about this project, refer to:
- Task documentation in /tasks/
- Analysis logs in /outputs/logs/
- Module-specific READMEs
Last Updated: November 2025
Project Status: Active Development
Version: 2.0 (Post-Refactoring)