LLM Benchmarks

Benchmark results for LLMs running on NVIDIA DGX Spark GB10.

Quick Reference

Date        Model              Engine  Config          Gen t/s @ 16K  Notes
2026-01-24  GLM-4.7-Flash AWQ  vLLM    TRITON_MLA      26.0           Baseline
2026-01-24  GLM-4.7-Flash AWQ  vLLM    FLASHINFER+FP8  18.8           128K ctx

Models Tested

GLM-4.7-Flash AWQ

  • Engine: vLLM (scitrera/dgx-spark-vllm:0.14.0-t5)
  • Quantization: AWQ 4-bit
  • Backends Tested: TRITON_MLA, FLASHINFER+FP8
  • Results: see results/ for the full run summaries
  • Learnings: FP8 vs MLA Trade-offs (see docs/learnings/)

Performance Summary

Backend         Baseline t/s  8K t/s  16K t/s  Max Context
TRITON_MLA      41.6          33.1    26.0     32K
FLASHINFER+FP8  40.2          27.3    18.8     128K

Verdict: TRITON_MLA is faster at every depth tested. Use FLASHINFER+FP8 only when more than 32K of context is required.
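
The exact launch flags are not reproduced in this index. As a rough sketch, the two configurations above map onto vLLM's attention-backend selection and KV-cache settings; the environment variable and flags below are standard vLLM options, not values copied from this repository's logs, and the model name is a placeholder. The scitrera container image may wrap these options differently.

# Sketch only: assumes stock vLLM conventions
# Baseline config (TRITON_MLA attention, 32K context)
VLLM_ATTENTION_BACKEND=TRITON_MLA vllm serve <model-name> \
  --quantization awq \
  --max-model-len 32768

# Long-context config (FLASHINFER attention + FP8 KV cache, 128K context)
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model-name> \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 131072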

Benchmark Tool

Benchmarks are run with llama-benchy against each engine's OpenAI-compatible endpoint.

Quick Start

uvx llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model <model-name> \
  --tokenizer <hf-tokenizer> \
  --pp 2048 --tg 32 \
  --depth 0 4096 8192 16384 24576 \
  --runs 3 \
  --enable-prefix-caching \
  --latency-mode generation
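
Before a run, it can help to confirm the server is up and read back the exact served model name to pass as --model. This assumes the endpoint exposes the standard OpenAI-compatible /v1/models route (vLLM does).

# Sanity check: list the models served at the endpoint
curl -s http://localhost:8000/v1/models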

Repository Structure

llm-benchmarks/
├── README.md          # This file - index + history
├── CLAUDE.md          # AI agent instructions
├── docs/
│   ├── guides/        # Usage documentation
│   └── learnings/     # Benchmark insights and comparisons
├── templates/
│   └── result.md      # Template for new benchmarks
├── logs/
│   └── <model>/       # Raw benchmark output
└── results/
    └── <model>/       # Human-readable summaries
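
A new benchmark run might be recorded as follows. This is only an illustrative workflow; <model>, <date>, and <config> are placeholders, not existing paths in the repository.

# Illustrative workflow for recording a run
mkdir -p logs/<model> results/<model>

# Capture the raw llama-benchy output
uvx llama-benchy [flags as in Quick Start] | tee logs/<model>/<date>-<config>.log

# Start a human-readable summary from the template
cp templates/result.md results/<model>/<date>-<config>.md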

Hardware

  • System: NVIDIA DGX Spark
  • GPU: GB10 (Blackwell SM120, 128GB unified memory)
  • Inference Engines: vLLM, llama.cpp

License

MIT
