An interactive LLM sampling demonstration using llama.cpp with a browser-based visualization.
Note: The probability distributions shown in the online demo were generated using the Gemma 3 1B model with Q4 quantization.
The interactive visualization supports the following token sampling techniques:
- Temperature - Scales token probabilities with a power function (equivalent to dividing the logits by the temperature), with optional entropy-based dynamic temperature
- Top-K - Keeps only the K highest-probability tokens
- Top-P (Nucleus) - Keeps the most probable tokens until their cumulative probability reaches P
- Min-P - Discards tokens whose probability falls below P times the probability of the most likely token
These samplers can be combined and reordered to explore different sampling strategies.
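To make the transformations concrete, here is a minimal Python sketch of how each sampler reshapes a normalized token probability distribution. This is an illustration only, not the demo's actual implementation, and it omits the entropy-based dynamic temperature variant.

```python
# Illustrative sketches of the samplers (not the demo's code).
import numpy as np

def apply_temperature(probs: np.ndarray, temperature: float) -> np.ndarray:
    """Raise probabilities to the power 1/T (same as dividing logits by T) and renormalize.
    Dynamic temperature would additionally adjust T from the distribution's entropy (not shown)."""
    scaled = probs ** (1.0 / temperature)
    return scaled / scaled.sum()

def apply_top_k(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest-probability tokens, zero out the rest, renormalize."""
    cutoff = np.sort(probs)[-k]           # k-th largest probability
    kept = np.where(probs >= cutoff, probs, 0.0)
    return kept / kept.sum()

def apply_top_p(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the most probable tokens until their cumulative probability reaches p (nucleus sampling)."""
    order = np.argsort(probs)[::-1]       # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix reaching p
    kept = np.zeros_like(probs)
    kept[order[:n_keep]] = probs[order[:n_keep]]
    return kept / kept.sum()

def apply_min_p(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Drop tokens whose probability is below min_p times the top token's probability."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# Samplers can be chained in any order, e.g. top-k, then top-p, then temperature.
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
for step in (lambda q: apply_top_k(q, 3),
             lambda q: apply_top_p(q, 0.9),
             lambda q: apply_temperature(q, 0.8)):
    probs = step(probs)
print(probs)
```

Reordering the steps in the chain above is exactly what the visualization lets you explore interactively.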
Requirements:
- Python 3.11+
- Docker (for the llama.cpp server)
Quick start:

```bash
# 1. Install dependencies
make setup

# 2. Download a model (any model in GGUF format will work)
make download-model MODEL=DravenBlack/gemma-3-1b-it-q4_k_m-GGUF

# 3. Start the llama.cpp server (in one terminal) with any GGUF model you have downloaded
make server MODEL=gemma-3-1b-it-q4_k_m.gguf

# 4. In another terminal, generate completions from the prompts
make generate-completions

# 5. View results in your browser
make http-server
# Open http://localhost:8000
```

| Command | Description |
|---|---|
| `make setup` | Install Python dependencies |
| `make server` | Start the llama.cpp server with the default model |
| `make http-server` | Start an HTTP server to view results |
| `make generate-completions` | Generate completions from the prompts in `prompts.json` |
| `make download-model` | Download a GGUF model |
| `make help` | Show all available commands |
Start the server with a specific model:

```bash
make server MODEL=gemma-3-1b-it-q4_k_m.gguf
```

Download a new model:

```bash
make download-model MODEL=DravenBlack/gemma-3-1b-it-q4_k_m-GGUF
```

Edit `prompts.json` with your prompts:

```json
["Your prompt here", "Another prompt here"]
```

Results are saved to `web/probs.json`.
