Benchmark results for LLMs running on NVIDIA DGX Spark GB10.
| Date | Model | Engine | Config | Gen t/s @ 16K | Notes |
|---|---|---|---|---|---|
| 2026-01-24 | GLM-4.7-Flash AWQ | vLLM | TRITON_MLA | 26.0 | Baseline |
| 2026-01-24 | GLM-4.7-Flash AWQ | vLLM | FLASHINFER+FP8 | 18.8 | 128K ctx |
- Engine: vLLM (scitrera/dgx-spark-vllm:0.14.0-t5)
- Quantization: AWQ 4-bit
- Backends Tested: TRITON_MLA, FLASHINFER+FP8
- Results: full per-run summaries live in results/ (see the repository layout below)
- Learnings: FP8 vs MLA trade-offs, summarized in the comparison below and in docs/learnings/
| Backend | Gen t/s (baseline) | Gen t/s @ 8K | Gen t/s @ 16K | Max Context |
|---|---|---|---|---|
| TRITON_MLA | 41.6 | 33.1 | 26.0 | 32K |
| FLASHINFER+FP8 | 40.2 | 27.3 | 18.8 | 128K |
Verdict: TRITON_MLA is faster at every tested depth but is limited to 32K context in these runs; use FLASHINFER+FP8 only when more than 32K of context is required (a launch sketch for both configurations follows).
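For context, the two configurations map onto vLLM's attention-backend selection and KV-cache dtype (FP8 is read here as the FP8 KV cache, which is an assumption). The sketch below is illustrative only: it assumes stock vLLM conventions (`vllm serve`, `VLLM_ATTENTION_BACKEND`, `--kv-cache-dtype fp8`) apply inside the scitrera/dgx-spark-vllm image, and the model path is a placeholder.

```bash
# Illustrative launch sketch, not the exact commands used for these runs.
# Assumes stock vLLM flags/env vars work inside the scitrera/dgx-spark-vllm image.

# TRITON_MLA backend (faster generation; 32K context ceiling in these runs)
VLLM_ATTENTION_BACKEND=TRITON_MLA \
  vllm serve <path-to-glm-4.7-flash-awq> \
    --quantization awq \
    --max-model-len 32768

# FLASHINFER backend + FP8 KV cache (slower, but reaches 128K context)
VLLM_ATTENTION_BACKEND=FLASHINFER \
  vllm serve <path-to-glm-4.7-flash-awq> \
    --quantization awq \
    --kv-cache-dtype fp8 \
    --max-model-len 131072
```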
Benchmarks are collected with llama-benchy against the server's OpenAI-compatible endpoint:

```bash
uvx llama-benchy \
--base-url http://localhost:8000/v1 \
--model <model-name> \
--tokenizer <hf-tokenizer> \
--pp 2048 --tg 32 \
--depth 0 4096 8192 16384 24576 \
--runs 3 \
--enable-prefix-caching \
--latency-mode generation
```
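To keep raw output organized the way the repository layout below expects (logs/<model>/), a thin wrapper along these lines can help. The script, its arguments, and the log-naming scheme are illustrative assumptions, not tooling shipped with this repo.

```bash
#!/usr/bin/env bash
# Illustrative wrapper (not part of this repo): run the llama-benchy sweep
# above and capture the raw output under logs/<model>/.
set -euo pipefail

MODEL="$1"        # served model name, i.e. the value passed to --model
TOKENIZER="$2"    # HF tokenizer used for token accounting
LOG_DIR="logs/${MODEL}"
STAMP="$(date +%Y-%m-%d_%H%M)"
mkdir -p "${LOG_DIR}"

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model "${MODEL}" \
  --tokenizer "${TOKENIZER}" \
  --pp 2048 --tg 32 \
  --depth 0 4096 8192 16384 24576 \
  --runs 3 \
  --enable-prefix-caching \
  --latency-mode generation \
  | tee "${LOG_DIR}/${STAMP}.log"
```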
Repository layout:

```
llm-benchmarks/
├── README.md        # This file - index + history
├── CLAUDE.md        # AI agent instructions
├── docs/
│   ├── guides/      # Usage documentation
│   └── learnings/   # Benchmark insights and comparisons
├── templates/
│   └── result.md    # Template for new benchmarks
├── logs/
│   └── <model>/     # Raw benchmark output
└── results/
    └── <model>/     # Human-readable summaries
```
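Each write-up is meant to start from templates/result.md and end up under results/<model>/, alongside the matching raw log in logs/<model>/. A hypothetical example of that step (directory and file names are placeholders):

```bash
# Hypothetical workflow: start a new human-readable summary from the template.
MODEL="glm-4.7-flash-awq"    # placeholder directory name
mkdir -p "results/${MODEL}"
cp templates/result.md "results/${MODEL}/2026-01-24-triton-mla-vs-fp8.md"
# Fill in the summary using the matching raw output in logs/${MODEL}/.
```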
- System: NVIDIA DGX Spark
- GPU: GB10 (Blackwell SM120, 128GB unified memory)
- Inference Engines: vLLM, llama.cpp
License: MIT