Automated parameter tuning for LLM inference engines (SGLang, vLLM) that maximizes performance while respecting SLOs and hardware constraints.
Quantization and parameter tuning can unlock 60%+ performance gains. LLM inference engines like SGLang and vLLM ship with conservative defaults that work everywhere but are optimized for nowhere.
Benchmarks were run on an NVIDIA RTX 4090 (24GB) with typical production workloads (mixed prefill/decode).
See detailed benchmarks: Baseline Benchmarks
| What You Get | Manual Tuning | Autotuner |
|---|---|---|
| Time to optimal config | Hours to Days | Minutes |
| Parameter combinations tested | ~10 (limited by patience) | 50-100+ (automated) |
| Performance gain | Unknown (untested) | 60%+ throughput (quantization + tuning) |
| Reproducibility | Low (manual errors) | High (versioned configs) |
| Cross-hardware portability | Manual rework | Re-run task (one command) |
- Task: A tuning job containing model config, parameter ranges, SLOs, and optimization strategy
- Experiment: Individual trial with specific parameter values; multiple experiments per task
- ARQ Worker: Background processor that deploys models, runs benchmarks, and scores results
- Multiple Deployment Modes: Docker, Local (direct GPU), OME (Kubernetes)
- Web UI: React frontend with real-time monitoring
- Agent Assistant: LLM-powered assistant for task management and troubleshooting
- Optimization Strategies: Grid search, Bayesian optimization
- SLO-Aware Scoring: Exponential penalties for constraint violations
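
As a rough illustration of SLO-aware scoring (a sketch, not the project's exact formula): raw throughput can be discounted by a factor that decays exponentially with the relative size of the SLO violation, so configs that barely miss an SLO stay competitive while flagrant violators are effectively disqualified:

```python
import math

def slo_penalized_score(throughput: float, latency_ms: float,
                        slo_ms: float, k: float = 4.0) -> float:
    """Illustrative scoring sketch (not the project's actual formula):
    raw throughput, discounted exponentially once latency exceeds the SLO.
    Within the SLO, no penalty applies."""
    violation = max(0.0, (latency_ms - slo_ms) / slo_ms)  # relative overshoot
    return throughput * math.exp(-k * violation)

# A config that meets the SLO keeps its full score...
print(slo_penalized_score(throughput=1200, latency_ms=450, slo_ms=500))  # 1200.0
# ...while a 50% overshoot is penalized heavily.
print(slo_penalized_score(throughput=1500, latency_ms=750, slo_ms=500))  # ~203
```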
→ Get started in 5 minutes with Docker
```bash
# Install
pip install -r requirements.txt && pip install genai-bench

# Run
python src/run_autotuner.py examples/docker_task.yaml --mode docker
```

To run the web UI:

```bash
# Start backend + worker
./scripts/start_dev.sh

# Start frontend (separate terminal)
cd frontend && npm run dev
```

Access the UI at http://localhost:5173.
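
The task file drives everything: it names the model, the engine, the parameter ranges to explore, and the SLOs to score against. A minimal sketch of what such a file might contain is shown below; the field names and values are illustrative assumptions, not the project's actual schema (see examples/docker_task.yaml for that):

```yaml
# Illustrative sketch only -- field names are assumptions;
# consult examples/docker_task.yaml for the real schema.
model: meta-llama/Llama-3.1-8B-Instruct  # hypothetical model
engine: sglang                           # inference engine under test
strategy: bayesian                       # or: grid
parameters:
  tp_size: [1, 2]                        # ranges the tuner explores
  mem_fraction_static: [0.7, 0.8, 0.9]
slos:
  ttft_ms: 500                           # time-to-first-token ceiling
  e2e_p99_ms: 2000                       # end-to-end p99 latency ceiling
```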
- ROADMAP.md - Product roadmap with completed milestones and future plans
- Installation Guide - Complete installation guide
- Quick Start - Quick start tutorial
- Docker Mode - Docker deployment guide
- Kubernetes/OME - Kubernetes/OME setup
- SLO Scoring - SLO-aware scoring with exponential penalties
- Parallel Execution - Parallel experiment execution
- WebSocket Implementation - Real-time updates via WebSocket
- Quantization Parameters - Quantization configuration
- Parameter Presets - Parameter preset system
- Bayesian Optimization - Bayesian optimization strategy
- GPU Tracking - Intelligent GPU scheduling and tracking
- Troubleshooting - Common issues and solutions
See DEVELOPMENT for development guidelines and project architecture.