A database-driven machine learning project that predicts whether a quarterback will throw at least one touchdown in an NFL game. The system combines player statistics, game context, and historical performance data to deliver transparent predictions.
Predict whether an NFL quarterback will throw a touchdown pass in a game using past performance, player profile data, and game context.
- Database-backed storage using SQLite
- Real-time predictions powered by a trained model
- Historical prediction tracking with confidence scoring
- Command-line workflow for data ingestion and modeling
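The prediction target described above reduces to a binary label per game. A minimal sketch of how that label can be derived from a game log (the field name `passing_tds` is an illustrative assumption, not necessarily the project's actual column name):

```python
def touchdown_label(passing_tds: int) -> int:
    """Binary target: 1 if the QB threw at least one TD pass in the game, else 0."""
    return 1 if passing_tds >= 1 else 0

# Example game logs: (player, passing TDs thrown in that game)
games = [("QB_A", 0), ("QB_B", 2), ("QB_C", 1)]
labels = [touchdown_label(tds) for _, tds in games]
print(labels)  # [0, 1, 1]
```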
```
machine-learning-nfl-touchdowns/
|-- data/
|   |-- raw/                    # Original CSV files
|   `-- processed/              # Cleaned and engineered datasets
|-- src/
|   |-- database.py             # Database management
|   |-- data_loader.py          # Load CSV data into the database
|   |-- data_validator.py       # Data quality validation
|   |-- preprocess.py           # Database-driven preprocessing
|   |-- train_model.py          # Model training
|   `-- explain_shap.py         # Model explainability
|-- models/
|   |-- qb_td_model.keras       # Trained Keras model artifact
|   |-- feature_scaler.pkl      # StandardScaler used during training
|   `-- training_metrics.json   # Cross-validation & evaluation metrics
|-- notebooks/
|   `-- eda.ipynb               # Exploratory data analysis
|-- main.py                     # Main orchestration script
|-- requirements.txt            # Python dependencies
`-- README.md                   # Project documentation
```
| Component | Technology | Purpose |
|---|---|---|
| Database | SQLite | Data storage and management |
| Data Processing | pandas, numpy | Data manipulation and analysis |
| Machine Learning | scikit-learn, TensorFlow (Keras) | Model training and prediction |
| API & Orchestration | FastAPI, Uvicorn | Optional service layer |
| Validation | Custom validation framework | Data quality assurance |
| Orchestration | Python scripts | Workflow automation |
Install dependencies and run the full workflow:

```bash
pip install -r requirements.txt
python main.py --workflow --train-model
```

This command loads data into the database, validates data quality, preprocesses features, and trains the TensorFlow touchdown model. Append `--generate-shap` to export a SHAP summary plot.

Run reproducible tasks with make: `make backend-test`, `make frontend-test`, `make seed`, and `make docker-up`.
- Refresh data assets: `python main.py --workflow --train-model --generate-shap`
- Confirm artifacts exist: `models/qb_td_model.keras`, `models/feature_scaler.pkl`, and `models/training_metrics.json`
- Run automated tests: `make backend-test` and `make frontend-test`
- Provision environment variables (see `enhanced-nfl-platform/backend/app/core/config.py` for defaults)
- Set `DATABASE_URL`, `MODEL_PATH` (default `/app/models`), `EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2`, and secrets in your deployment target
- Launch the stack with Docker Compose: `cd enhanced-nfl-platform && docker-compose up --build`
- Verify services:
  - API health: `http://localhost:8000/health`
  - Frontend: `http://localhost:3000`
  - Review logs: `docker-compose logs -f`
```bash
# Set up the database
python main.py --setup

# Validate data quality
python main.py --validate

# Preprocess data only
python main.py --preprocess

# Train or retrain the TensorFlow model (requires processed data)
python main.py --train-model

# Force retraining even if a model already exists
python main.py --train-model --force-train

# Generate a SHAP summary (requires trained model and processed data)
python main.py --generate-shap

# Launch the app only
python main.py --app

# Check project status
python main.py --status

# Force a full workflow reload
python main.py --workflow --force-reload
```

Core tables:

- `basic_stats`: Player demographics and physical information
- `game_logs`: Game-by-game performance records
- `qb_stats`: Quarterback-specific game statistics
- `career_stats`: Season-level career statistics
- `qb_career_passing`: Career passing statistics
- `predictions`: Model prediction history
Key relationships:
- Players linked by `player_id`
- Game logs linked to quarterback stats by `game_log_id`
- Career stats linked to passing stats by `career_id`
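The relationships above can be sketched with a minimal in-memory SQLite schema. Any column beyond the listed keys (e.g. `name`, `game_date`, `passing_tds`) is an illustrative assumption, not the project's exact schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE basic_stats (
    player_id TEXT PRIMARY KEY,
    name      TEXT
);
CREATE TABLE game_logs (
    game_log_id INTEGER PRIMARY KEY,
    player_id   TEXT REFERENCES basic_stats(player_id),
    game_date   TEXT
);
CREATE TABLE qb_stats (
    game_log_id INTEGER REFERENCES game_logs(game_log_id),
    passing_tds INTEGER
);
""")
conn.execute("INSERT INTO basic_stats VALUES ('player_id_123', 'Sample QB')")
conn.execute("INSERT INTO game_logs VALUES (1, 'player_id_123', '2024-01-15')")
conn.execute("INSERT INTO qb_stats VALUES (1, 2)")

# Join game-level QB stats back to the player via the shared keys
row = conn.execute("""
    SELECT b.name, g.game_date, q.passing_tds
    FROM qb_stats q
    JOIN game_logs g USING (game_log_id)
    JOIN basic_stats b USING (player_id)
""").fetchone()
print(row)  # ('Sample QB', '2024-01-15', 2)
```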
| Metric | Value |
|---|---|
| Accuracy | 88% |
| F1 Score | 85% |
| ROC-AUC | 91% |
Run `python main.py --train-model` to refresh metrics; the detailed cross-validation and test results are persisted in `models/training_metrics.json` after each training session.
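A minimal sketch of reading the persisted metrics file, assuming `training_metrics.json` holds a flat mapping of metric names to values (the exact keys inside are an assumption):

```python
import json
from pathlib import Path

def load_metrics(path="models/training_metrics.json"):
    """Read persisted training metrics; return an empty dict if the file is absent."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

metrics = load_metrics()
for name, value in sorted(metrics.items()):
    print(f"{name}: {value}")
```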
```python
from src.data_loader import NFLDataLoader

loader = NFLDataLoader()
loader.load_all_data()
```

```python
from src.data_validator import NFLDataValidator

validator = NFLDataValidator()
results = validator.validate_all_data()
```

```bash
# Generate a SHAP beeswarm plot after training the TensorFlow model
python src/explain_shap.py
```

The script loads the trained Keras model (`models/qb_td_model.keras`), applies the stored scaler, and writes a summary plot to `models/shap_summary.png` highlighting the strongest drivers of a touchdown prediction.
```python
from src.database import NFLDatabase

db = NFLDatabase()
db.connect()

# Get quarterback data for prediction
qb_data = db.get_qb_data_for_prediction("player_id_123")

# Save prediction
db.save_prediction(
    player_id="player_id_123",
    game_date="2024-01-15",
    opponent="KC",
    prediction=1,
    confidence=0.85,
    features_used='{"age": 28, "passing_yards": 275}'
)
```

The `NFLDataValidator` module includes checks for:
- Data completeness and missing values
- Consistency across related tables
- Reasonable value ranges for statistics
- Duplicate detection and cleanup
- Valid game dates
- Quarterback-specific data quality rules
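One of the range checks above can be sketched as follows. The bounds and the `age` field are illustrative assumptions, not the validator's actual rules:

```python
def check_value_range(records, field, lo, hi):
    """Return indices of records whose `field` is missing or outside [lo, hi]."""
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(field)
        if value is None or not (lo <= value <= hi):
            bad.append(i)
    return bad

players = [
    {"age": 28},  # plausible
    {"age": 12},  # too young for the NFL
    {"age": 61},  # too old to be an active player
]
outliers = check_value_range(players, "age", 18, 50)
print(outliers)  # [1, 2]
```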
- Python 3.8 or newer
- Use a virtual environment (`python -m venv .venv`) for isolation
- Format code with `black` or `ruff format`
- Run linting with `ruff` where available
- Ensure the SQLite database file `nfl_data.db` is accessible and not locked
- Verify that required CSV files are present in `data/raw/`
- Re-run `python main.py --workflow --force-reload` after major data changes
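A quick diagnostic for the first two troubleshooting items above, using only the standard library (the default path is the `nfl_data.db` file mentioned above; the short timeout is a choice so a lock fails fast instead of blocking):

```python
import sqlite3
from pathlib import Path

def check_database(path="nfl_data.db"):
    """Return a short status string for the SQLite database file."""
    if not Path(path).exists():
        return "missing"
    try:
        conn = sqlite3.connect(path, timeout=1)
        conn.execute("SELECT 1")
        conn.close()
        return "ok"
    except sqlite3.OperationalError:
        return "locked or unreadable"

print(check_database())
```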
This project is released under the MIT License. See the LICENSE file for details.
The project includes scripts for data ingestion, validation, model training, interpretability, and interactive exploration. Extend the workflow by adding new data sources, refining feature engineering, or experimenting with alternative models.