An open-source tool that analyzes historical Spark job runs to recommend optimal resource configurations for future executions.

**Multiple Data Collection Methods**
- Parse Spark event logs
- Query Spark History Server API
- Integrate with cloud provider APIs (AWS EMR, Databricks, GCP Dataproc)

**Intelligent Recommendations**
- Similarity-based matching with historical jobs
- ML-powered predictions for runtime and resource needs
- Rule-based optimization for common anti-patterns

**Cost Optimization**
- Balance performance vs. cost trade-offs
- Support for spot/preemptible instances
- Multi-cloud cost comparison
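
To make the performance/cost trade-off concrete, here is a minimal sketch of the underlying arithmetic; the executor counts, hourly prices, and runtimes are made-up placeholders, not real cloud quotes:

```python
# Illustrative only: compares the estimated cost of two hypothetical
# cluster configurations. All numbers are placeholder values.

def estimated_cost(num_executors: int, price_per_hour: float, runtime_hours: float) -> float:
    """Cost of a run = executor count x hourly price x runtime."""
    return num_executors * price_per_hour * runtime_hours

# A bigger spot-backed cluster may finish faster AND bill less overall.
on_demand = estimated_cost(num_executors=10, price_per_hour=0.50, runtime_hours=2.0)  # $10.00
spot      = estimated_cost(num_executors=20, price_per_hour=0.15, runtime_hours=1.2)  # $3.60

print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```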

**Easy Integration**
- REST API for programmatic access
- CLI for manual operations
- Python SDK for custom workflows
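
As an illustration of what an SDK-driven workflow might look like, here is a hedged sketch; the `spark_optimizer.Client` class and its `recommend` method are hypothetical names chosen for illustration, not the project's documented API:

```python
# Hypothetical SDK usage -- class and method names are illustrative,
# not the project's documented API.
from spark_optimizer import Client

client = Client(database_url="sqlite:///spark_optimizer.db")

# Ask for a configuration based on historical runs of similar jobs.
rec = client.recommend(input_size="10GB", job_type="etl")

print(rec.executor_instances, rec.executor_memory, rec.executor_cores)
```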

```bash
# Clone the repository
git clone https://github.com/gridatek/spark-optimizer.git
cd spark-optimizer

# Install dependencies
pip install -e .

# Optional: install with cloud provider support
pip install -e ".[aws]"   # For AWS EMR integration
pip install -e ".[gcp]"   # For GCP Dataproc integration

# Set up the database
python scripts/setup_db.py
```
```bash
# Collect data from Spark event logs
spark-optimizer collect --event-log-dir /path/to/spark/logs

# Collect data from the Spark History Server
spark-optimizer collect-from-history-server --history-server-url http://localhost:18080

# Collect data from AWS EMR
pip install -e ".[aws]"
spark-optimizer collect-from-emr --region us-west-2

# Collect data from Databricks (uses core dependencies)
spark-optimizer collect-from-databricks --workspace-url https://dbc-xxx.cloud.databricks.com

# Collect data from GCP Dataproc
pip install -e ".[gcp]"
spark-optimizer collect-from-dataproc --project my-project --region us-central1

# Get recommendations for a new job
spark-optimizer recommend --input-size 10GB --job-type etl

# Start the API server
spark-optimizer serve --port 8080
```

The Spark Resource Optimizer is designed with a modular, layered architecture that separates concerns and allows for easy extension and maintenance.

```
┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                     │
│           (CLI, REST API Clients, Web Dashboard)            │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                          API Layer                          │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐         │
│  │  REST API   │  │     CLI     │  │  WebSocket   │         │
│  │   Routes    │  │  Commands   │  │   (Future)   │         │
│  └─────────────┘  └─────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Business Logic Layer                     │
│  ┌──────────────────────┐  ┌─────────────────────┐          │
│  │     Recommender      │  │      Analyzer       │          │
│  │ - Similarity         │  │ - Job Analysis      │          │
│  │ - ML-based           │  │ - Similarity        │          │
│  │ - Rule-based         │  │ - Features          │          │
│  └──────────────────────┘  └─────────────────────┘          │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Data Access Layer                      │
│  ┌──────────────────────────────────────────────┐           │
│  │              Repository Pattern              │           │
│  │ - SparkApplicationRepository                 │           │
│  │ - JobRecommendationRepository                │           │
│  └──────────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                        │
│  ┌─────────────┐  ┌─────────────┐  ┌──────────────┐         │
│  │   SQLite    │  │ PostgreSQL  │  │    MySQL     │         │
│  │  (Default)  │  │ (Optional)  │  │  (Optional)  │         │
│  └─────────────┘  └─────────────┘  └──────────────┘         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                    Data Collection Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Event Log   │  │   History    │  │   Metrics    │       │
│  │  Collector   │  │    Server    │  │  Collector   │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Data Sources                         │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │ Spark Event  │  │    Spark     │  │  Cloud APIs  │       │
│  │     Logs     │  │   History    │  │              │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
└─────────────────────────────────────────────────────────────┘
```

**Data Collection Layer**

Purpose: Gather Spark job metrics from various sources

Components:
- `BaseCollector`: Abstract interface for all collectors
- `EventLogCollector`: Parse Spark event log files
- `HistoryServerCollector`: Query the Spark History Server API
- `MetricsCollector`: Integrate with monitoring systems
Key Features:
- Pluggable collector architecture
- Batch processing support
- Error handling and retry logic
- Data normalization
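
To illustrate the pluggable collector architecture, a custom collector could subclass the abstract base roughly as follows; the `BaseCollector` interface shown here is an assumed stand-in for illustration, not the project's exact signature:

```python
# Sketch of a custom collector plugging into the collector architecture.
# The BaseCollector shown here is an assumed stand-in; check the project
# source for the real abstract base and its method signatures.
import json
from typing import Iterator


class BaseCollector:
    """Stand-in for the project's abstract collector interface."""

    def collect(self) -> Iterator[dict]:
        raise NotImplementedError


class JsonLinesCollector(BaseCollector):
    """Reads pre-exported job metrics from a JSON-lines file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def collect(self) -> Iterator[dict]:
        with open(self.path) as fh:
            for line in fh:
                record = json.loads(line)
                # Normalize to a common schema (field names here are
                # hypothetical) before handing records off to storage.
                yield {
                    "app_id": record["appId"],
                    "duration_ms": record["durationMs"],
                    "executor_count": record["numExecutors"],
                }
```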

**Storage Layer**

Purpose: Persist job data and recommendations

Components:
- `Database`: Connection management and session handling
- Models: SQLAlchemy ORM models
  - `SparkApplication`: Job metadata and metrics
  - `SparkStage`: Stage-level details
  - `JobRecommendation`: Historical recommendations
- `Repository`: Data access abstraction
Key Features:
- Database-agnostic design (SQLAlchemy)
- Transaction management
- Query optimization
- Migration support (Alembic)
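
A minimal sketch of what the SQLAlchemy-backed storage might look like; the column names and schema below are illustrative assumptions, not the project's actual models:

```python
# Rough shape of the ORM layer -- column names are illustrative guesses,
# not the project's actual schema.
from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class SparkApplication(Base):
    __tablename__ = "spark_applications"

    id = Column(Integer, primary_key=True)
    app_id = Column(String, unique=True, nullable=False)
    duration_ms = Column(Integer)
    executor_count = Column(Integer)
    input_bytes = Column(Float)


# Because the models depend only on SQLAlchemy, swapping the default
# SQLite for PostgreSQL or MySQL is just a different connection URL.
engine = create_engine("sqlite:///spark_optimizer.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    large_jobs = (
        session.query(SparkApplication)
        .filter(SparkApplication.executor_count > 10)
        .all()
    )
```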

**Analysis Layer**

Purpose: Analyze job characteristics and extract insights

Components:
- `JobAnalyzer`: Performance analysis and bottleneck detection
- `JobSimilarityCalculator`: Calculate job similarity scores
- `FeatureExtractor`: Extract ML features from job data
Key Features:
- Resource efficiency metrics
- Bottleneck identification (CPU, memory, I/O)
- Issue detection (data skew, spills, failures)
- Similarity-based job matching
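
One common way to implement similarity-based job matching is cosine similarity over normalized feature vectors; the sketch below illustrates the general idea only and is not necessarily the exact formula the project uses:

```python
# Cosine similarity over a normalized feature vector -- a common way to
# score job similarity. Illustrative only; not necessarily the project's
# exact formula or feature set.
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical features: (scaled log input size, shuffle ratio, stage count).
past_job = [1.0, 0.30, 0.5]
new_job = [0.9, 0.35, 0.5]

print(f"similarity: {cosine_similarity(past_job, new_job):.3f}")
```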

**Recommendation Layer**

Purpose: Generate optimal resource configurations

Components:
- `BaseRecommender`: Abstract recommender interface
- `SimilarityRecommender`: History-based recommendations
- `MLRecommender`: ML model predictions
- `RuleBasedRecommender`: Heuristic-based suggestions
Key Features:
- Multiple recommendation strategies
- Confidence scoring
- Cost-performance trade-offs
- Feedback loop integration
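
A minimal sketch of how multiple recommendation strategies with confidence scoring could be combined; the `Recommendation` shape and the selection rule are illustrative assumptions, not the project's real classes:

```python
# Combining several recommendation strategies by confidence score.
# The dataclass fields and selection rule are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Recommendation:
    executor_count: int
    executor_memory_gb: int
    confidence: float  # 0.0 - 1.0


def pick_best(candidates: list[Recommendation]) -> Recommendation:
    """Choose the candidate with the highest confidence score."""
    return max(candidates, key=lambda r: r.confidence)


candidates = [
    Recommendation(executor_count=8, executor_memory_gb=16, confidence=0.82),   # similarity-based
    Recommendation(executor_count=10, executor_memory_gb=8, confidence=0.65),   # ML model
    Recommendation(executor_count=6, executor_memory_gb=16, confidence=0.40),   # rule-based fallback
]

print(pick_best(candidates))
```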

**API Layer**

Purpose: Expose functionality to clients

Components:
- REST API (Flask)
- CLI interface (Click)
- WebSocket support (future)

Endpoints:
- `/recommend`: Get resource recommendations
- `/jobs`: List and query historical jobs
- `/analyze`: Analyze specific jobs
- `/feedback`: Submit recommendation feedback
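
For example, once the server is running (`spark-optimizer serve --port 8080`), the `/recommend` endpoint can be called over HTTP; the request and response fields below are assumptions for illustration, not a documented schema:

```python
# Query the /recommend endpoint with requests. The JSON fields used here
# are illustrative assumptions, not the documented request/response schema.
import requests

resp = requests.post(
    "http://localhost:8080/recommend",
    json={"input_size": "10GB", "job_type": "etl"},  # hypothetical fields
    timeout=30,
)
resp.raise_for_status()

recommendation = resp.json()
print(recommendation)  # e.g. {"executor_count": 8, "executor_memory": "16g", ...}
```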

**Collection Flow**

```
Event Logs → Collector → Parser → Normalizer → Repository → Database
```

**Recommendation Flow**

```
User Request → API → Recommender → Analyzer → Repository → Database
                          ↓
             Recommendation ← Model/Rules ← Historical Data
```

**Analysis Flow**

```
Job ID → Repository → Job Data → Analyzer → Insights
                                     ↓
                            Feature Extraction
                                     ↓
                            Similarity Matching
```
For more detailed architecture information, see `docs/architecture.md`.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research or production systems, please cite:
```bibtex
@software{spark_resource_optimizer,
  title  = {Spark Resource Optimizer},
  author = {Gridatek},
  year   = {2024},
  url    = {https://github.com/gridatek/spark-optimizer}
}
```