@maryamtahhan maryamtahhan commented Jan 23, 2026

Summary

This PR adds configurable saturation detection and optimisation parameters for testing CPU-based deployments/SUTs (vllm-cpu) with the sweep profile. CPU deployments saturate at much lower concurrency than GPU deployments (e.g., 16 concurrent requests vs 512), so sweep tests keep measuring past saturation and produce misleading "knee bend" artifacts in performance graphs. The new parameters let users detect saturation early, stop the sweep once it is reached, and exclude anomalous throughput measurements from results.

Details

  • Added three new configurable parameters to SweepProfile:
    • exclude_throughput_target (default: false) - Stops constant-rate tests before reaching throughput level
    • exclude_throughput_result (default: false) - Excludes throughput benchmark from saved results
    • saturation_threshold (default: 0.98) - Efficiency threshold for detecting saturation (achieved/target rate)
  • Implemented saturation detection logic in SweepProfile.next_strategy() that stops sweep when efficiency drops below threshold
  • Added parameter propagation through:
    • Settings class for environment variable configuration (GUIDELLM__*)
    • CLI parameters in GenerativeBenchmarkEntrypoint
    • Profile resolution in SweepProfile.resolve_args()
  • Modified rate interpolation logic to support both GPU mode (include throughput target) and CPU mode (exclude throughput target)
  • Added comprehensive documentation in docs/getting-started/benchmark.md
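The difference between the two interpolation modes can be sketched as follows. This is a minimal illustration, not the PR's implementation: `plan_constant_rates` is a hypothetical helper, and both the even spacing and the assumption that `sweep_size` counts the synchronous and throughput runs are mine.

```python
def plan_constant_rates(sync_rate, throughput_rate, sweep_size,
                        exclude_throughput_target=False):
    """Return evenly spaced constant-rate targets (req/s) for a sweep.

    Hypothetical helper. Assumes sweep_size counts every benchmark in the
    sweep, including the synchronous and throughput runs, so that
    sweep_size - 2 constant-rate tests remain to be planned.
    """
    n = sweep_size - 2
    if n <= 0:
        return []
    span = throughput_rate - sync_rate
    if exclude_throughput_target:
        # CPU mode: all n targets fall strictly between the two measured
        # rates, so the last target stays below the throughput rate.
        step = span / (n + 1)
    else:
        # GPU mode: the last target lands on the measured throughput rate.
        step = span / n
    return [sync_rate + step * i for i in range(1, n + 1)]


gpu_rates = plan_constant_rates(1.0, 5.0, sweep_size=6)
cpu_rates = plan_constant_rates(1.0, 5.0, sweep_size=6,
                                exclude_throughput_target=True)
print(gpu_rates)  # last target equals the throughput rate
print(cpu_rates)  # last target stays below the throughput rate
```

In this sketch, excluding the throughput target only changes the spacing so that no constant-rate test is scheduled at the rate the throughput run itself measured.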

Test Plan

Verified with vLLM-CPU deployment
Test saturation detection works correctly:

guidellm benchmark \
  --target "http://localhost:8000" \
  --profile sweep \
  --exclude-throughput-target true \
  --exclude-throughput-result true \
  --saturation-threshold 0.98 \
  --data "prompt_tokens=256,output_tokens=128" \
  --max-seconds 180

Test that defaults work for GPU deployments

guidellm benchmark \
  --target "http://localhost:8000" \
  --profile sweep \
  --data "prompt_tokens=256,output_tokens=128"
  • ✅ All parameters default to false/0.98 (GPU-friendly behavior)
  • ✅ Full sweep completes with throughput benchmark included

Verify environment variable configuration

export GUIDELLM__EXCLUDE_THROUGHPUT_TARGET=true
export GUIDELLM__EXCLUDE_THROUGHPUT_RESULT=true
export GUIDELLM__SATURATION_THRESHOLD=0.98

guidellm benchmark --target "http://localhost:8000" --profile sweep
  • ✅ Parameters correctly inherited from environment variables
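The mapping from environment variables to the three parameters can be illustrated with a standard-library sketch. The PR wires these values through the project's Settings class; `read_sweep_env` below is a hypothetical stand-in that only shows the naming convention and the documented defaults, and its boolean parsing rules are an assumption.

```python
import os


def read_sweep_env(environ=None):
    """Read the three sweep parameters from GUIDELLM__* variables.

    Hypothetical helper for illustration; not the project's Settings class.
    """
    env = os.environ if environ is None else environ

    def as_bool(value):
        # Assumed parsing rule: a few common truthy spellings.
        return value.strip().lower() in {"1", "true", "yes", "on"}

    return {
        "exclude_throughput_target": as_bool(
            env.get("GUIDELLM__EXCLUDE_THROUGHPUT_TARGET", "false")),
        "exclude_throughput_result": as_bool(
            env.get("GUIDELLM__EXCLUDE_THROUGHPUT_RESULT", "false")),
        "saturation_threshold": float(
            env.get("GUIDELLM__SATURATION_THRESHOLD", "0.98")),
    }


print(read_sweep_env({"GUIDELLM__EXCLUDE_THROUGHPUT_TARGET": "true"}))
```

Unset variables fall back to the defaults listed in the PR description (false, false, 0.98).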

Run unit tests

pytest tests/unit/data/test_builders.py tests/unit/test_settings.py -v
  • ✅ All 54 tests passed

Run pre-commit checks

pre-commit run --all-files
  • ✅ All checks passed (linter, formatter, whitespace, EOF)

Related Issues

N/A


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

maryamtahhan and others added 3 commits January 22, 2026 11:30
CPU deployments saturate at much lower concurrency rates than GPU
deployments (e.g., 8 vs 512), causing sweep tests to continue measuring
beyond saturation. This creates misleading performance graphs with
"knee bend" artifacts and wasted benchmark time.

This commit adds three configurable parameters to handle saturation:

1. exclude_throughput_target (default: false)
   - Stops constant-rate tests before reaching throughput level
   - Prevents generating tests at rates the system cannot sustain
   - Eliminates "elbow" artifacts in graphs

2. exclude_throughput_result (default: false)
   - Excludes throughput benchmark from saved results
   - Removes anomalous burst-capacity data points that create visual
     artifacts (TTFT spikes from ~70ms to 244ms, inter-token latency
     anomalies) in performance graphs

3. saturation_threshold (default: 0.98)
   - Automatically stops sweep when achieved rate < target × threshold
   - Detects saturation (e.g., system achieves 2.63 req/s when
     targeting 2.68 req/s = 98% efficiency)
   - Saves time by skipping unnecessary over-saturated tests
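The stopping rule in item 3 can be sketched as below. This is an illustration under stated assumptions, not the code in SweepProfile.next_strategy(): `is_saturated`, `run_sweep`, and the `measure` callback are hypothetical names, and the comparison is taken literally as achieved rate < target × threshold.

```python
def efficiency(achieved_rate, target_rate):
    """Fraction of the target rate that the system actually sustained."""
    return achieved_rate / target_rate


def is_saturated(achieved_rate, target_rate, saturation_threshold=0.98):
    """True when the achieved rate falls below target * threshold."""
    return achieved_rate < target_rate * saturation_threshold


def run_sweep(targets, measure, saturation_threshold=0.98):
    """Run constant-rate tests in order and stop after the first saturated one.

    `measure` is a hypothetical callback that returns the achieved rate
    (req/s) for a given target rate.
    """
    results = []
    for target in targets:
        achieved = measure(target)
        results.append((target, achieved))
        if is_saturated(achieved, target, saturation_threshold):
            break  # higher targets would only be more over-saturated
    return results


# Rounded figures from the example above.
print(round(efficiency(2.63, 2.68), 4))
print(is_saturated(2.63, 2.68))

# Toy system that cannot exceed 2.0 req/s.
print(run_sweep([1.0, 2.0, 3.0, 4.0], measure=lambda t: min(t, 2.0)))
```

With the rounded figures from the example, 2.63 / 2.68 is about 0.981, just above 0.98, so a strict less-than comparison would not fire on those exact numbers; the unrounded measurements may fall on the other side of the threshold.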

Parameters are configurable via CLI flags (--exclude-throughput-target)
or environment variables (GUIDELLM__EXCLUDE_THROUGHPUT_TARGET).

Defaults remain GPU-friendly (all disabled). CPU deployments should
enable both exclusion flags and tune saturation threshold as needed.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add important note explaining why max_concurrency should not be set
when running sweep profile tests. When max_concurrency is artificially
limited, the throughput test underestimates server capacity, causing
constant-rate tests to run at rates far below actual capacity. This
prevents proper saturation detection and can produce misleading results
where TTFT decreases instead of increases.
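One way to see why a concurrency cap underestimates capacity is Little's Law: with at most N requests in flight and a mean request latency of W seconds, throughput cannot exceed N / W. The toy model below is a sketch under that assumption. `capped_throughput` is a hypothetical helper, the numbers are illustrative rather than measurements from this PR, and latency is treated as fixed, which real servers do not guarantee.

```python
def capped_throughput(true_capacity, max_concurrency, mean_latency_s):
    """Throughput (req/s) a throughput test would observe under a cap.

    Toy model: Little's Law bounds throughput by max_concurrency divided by
    mean latency, and the server cannot exceed its true capacity either.
    """
    little_bound = max_concurrency / mean_latency_s
    return min(true_capacity, little_bound)


# Illustrative server: 4.0 req/s capacity, 8 s mean request latency.
print(capped_throughput(4.0, max_concurrency=8, mean_latency_s=8.0))
print(capped_throughput(4.0, max_concurrency=64, mean_latency_s=8.0))
```

In this model a cap of 8 makes the throughput test report 1.0 req/s, so every interpolated constant-rate target sits far below the real 4.0 req/s capacity and the sweep never approaches saturation.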

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

Expand the "How It Works" section to provide:
- Clear explanation of test execution order
- Detailed rationale for each parameter (why + effect)
- Explanation of how all three parameters work together
- Real-world example of throughput test outliers (23+ sec TTFT)

This helps users understand why all three parameters are recommended
for CPU deployments and how they complement each other to produce
clean, efficient benchmarks.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
@maryamtahhan maryamtahhan changed the title from "Cpu sweep saturation" to "cpu - sweep saturation detection" Jan 23, 2026
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>

maryamtahhan commented Jan 23, 2026

This changes the graphs from:

Screenshot 2026-01-23 at 10 13 58 Screenshot 2026-01-23 at 10 14 12

@maryamtahhan

To:
Screenshot 2026-01-22 at 12 50 21
Screenshot 2026-01-22 at 12 50 28

@maryamtahhan

The sweep also stops gracefully when saturation is detected:

Screenshot 2026-01-23 at 10 17 14
