LLM Autotuner (for inference)

Automated parameter tuning for LLM inference engines (SGLang, vLLM) that maximizes performance while respecting SLOs and hardware constraints.

Why Autotuner?

LLM inference engines like SGLang and vLLM ship with conservative defaults that work everywhere but are optimized for nowhere. Quantization and parameter tuning can unlock performance gains of 60% or more.

Performance Impact: Real-World Data

[Charts: throughput comparison and latency comparison]

Tested on an NVIDIA RTX 4090 (24 GB) with typical production workloads (mixed prefill/decode).

See detailed benchmarks: Baseline Benchmarks

| What You Get | Manual Tuning | Autotuner |
| --- | --- | --- |
| Time to optimal config | Hours to days | Minutes |
| Parameter combinations tested | ~10 (limited by patience) | 50-100+ (automated) |
| Performance gain | Unknown (untested) | 60%+ throughput (quantization + tuning) |
| Reproducibility | Low (manual errors) | High (versioned configs) |
| Cross-hardware portability | Manual rework | Re-run task (one command) |

How to Use

CLI Mode

[CLI flow diagram]

Web UI Mode

[Web UI flow diagram]

Agent Mode

[Agent flow diagram]

Core Concepts

  • Task: A tuning job containing model config, parameter ranges, SLOs, and optimization strategy (sketched as plain data structures after this list)
  • Experiment: Individual trial with specific parameter values; multiple experiments per task
  • ARQ Worker: Background processor that deploys models, runs benchmarks, and scores results
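
These pieces map naturally onto plain data structures. The sketch below is a minimal illustration only; names like param_space and slos are assumptions, not the repository's actual schema.

from dataclasses import dataclass, field

# Illustrative sketch -- field names are assumptions, not Autotuner's schema.

@dataclass
class Task:
    """A tuning job: what to tune, over which ranges, under which SLOs."""
    model: str                    # e.g. a Hugging Face model ID
    param_space: dict             # parameter name -> candidate values
    slos: dict                    # e.g. {"ttft_ms": 500, "tpot_ms": 50}
    strategy: str = "grid"        # "grid" or "bayesian"

@dataclass
class Experiment:
    """One trial: a single point in the task's parameter space."""
    params: dict                  # concrete parameter values for this trial
    score: float = 0.0            # filled in by the ARQ worker after benchmarking

@dataclass
class TaskRun:
    """A task plus the experiments spawned from it."""
    task: Task
    experiments: list = field(default_factory=list)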

Features

  • Multiple Deployment Modes: Docker, Local (direct GPU), OME (Kubernetes)
  • Web UI: React frontend with real-time monitoring
  • Agent Assistant: LLM-powered assistant for task management and troubleshooting
  • Optimization Strategies: Grid search, Bayesian optimization
  • SLO-Aware Scoring: Exponential penalties for constraint violations (see the sketch after this list)
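
To make the last two bullets concrete, below is a minimal sketch of grid search combined with exponentially penalized SLO scoring. The penalty shape and all names (param_space, measure, score) are illustrative assumptions, not the repository's actual scoring code.

import itertools
import math

# Hypothetical parameter grid and SLOs -- illustrative names only.
param_space = {
    "tp_size": [1, 2],
    "mem_fraction": [0.85, 0.90, 0.95],
}
slos = {"ttft_ms": 500.0, "tpot_ms": 50.0}

def measure(params):
    """Stand-in for a real benchmark run (deploy the engine, run genai-bench)."""
    return {
        "throughput": 100.0 * params["mem_fraction"] + 20.0 * params["tp_size"],
        "ttft_ms": 100.0 + 400.0 * params["tp_size"],
        "tpot_ms": 45.0,
    }

def score(metrics, slos):
    """Throughput, scaled down exponentially for each violated SLO."""
    s = metrics["throughput"]
    for name, limit in slos.items():
        excess = max(0.0, metrics[name] / limit - 1.0)  # relative overshoot
        s *= math.exp(-excess)                          # exponential penalty
    return s

# Grid search: enumerate every combination and keep the best-scoring one.
candidates = (dict(zip(param_space, combo))
              for combo in itertools.product(*param_space.values()))
best = max(candidates, key=lambda p: score(measure(p), slos))
print("best params:", best)

In this toy setup the tp_size=2 configs have higher raw throughput but blow the TTFT budget, so the exponential penalty steers the search back to a config that meets the SLO.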

Quick Start

Get started in 5 minutes with Docker:

# Install
pip install -r requirements.txt && pip install genai-bench

# Run
python src/run_autotuner.py examples/docker_task.yaml --mode docker
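
The task file can also be generated programmatically. The following sketch assumes PyYAML is available; the field names are placeholders, so check examples/docker_task.yaml for the actual schema.

import yaml  # pip install pyyaml

# Placeholder task spec -- see examples/docker_task.yaml for the real schema.
task = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "parameters": {"mem_fraction": [0.85, 0.90, 0.95]},
    "slos": {"ttft_ms": 500},
    "strategy": "grid",
}

with open("my_task.yaml", "w") as f:
    yaml.safe_dump(task, f, sort_keys=False)

# Then: python src/run_autotuner.py my_task.yaml --mode docker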

Web UI

# Start backend + worker
./scripts/start_dev.sh

# Start frontend (separate terminal)
cd frontend && npm run dev

Access at http://localhost:5173

Documentation

Full Documentation

Project Overview

  • ROADMAP.md - Product roadmap with completed milestones and future plans

Setup & Deployment

Features & Configuration

Operations & Troubleshooting

Contributing

See DEVELOPMENT for development guidelines and project architecture.
