# llama.cpp-config-scripts

Professional-grade management scripts for running a llama.cpp server locally. These scripts provide reliable server management with comprehensive error handling, health checking, and monitoring capabilities.
## Table of Contents

- Overview
- Features
- Quick Start
- Installation
- Scripts Reference
- Configuration
- Usage Examples
- Troubleshooting
- Project Structure
- Contributing
- License
## Overview

This repository contains battle-tested Bash scripts for managing local llama.cpp LLM servers. Perfect for:
- Development: Quick local LLM server for testing
- Production: Reliable server management with health monitoring
- CI/CD: Automated testing with local LLMs
- Multi-agent Systems: Backend for AI agent frameworks
Supported Platforms:
- ✅ macOS (Intel & Apple Silicon with Metal acceleration)
- ✅ Linux (Ubuntu, Debian, RHEL, etc.)
- ✅ Windows (via WSL or Git Bash)
## Features

- 🚀 One-Command Startup: Start llama-server with sensible defaults
- 🔄 Model Switching: Interactive tool to switch between models and quantizations
- 🏥 Health Monitoring: Comprehensive health checks and status reporting
- 🔄 Graceful Management: Safe start, stop, and restart operations
- 🎨 Beautiful Output: Color-coded, user-friendly terminal output
- ⚙️ Flexible Configuration: Environment variables for easy customization
- 🔍 Detailed Diagnostics: Verbose mode for troubleshooting
- 📊 Resource Monitoring: Track CPU, memory, and system usage
- 🛡️ Error Handling: Robust error detection and recovery (see the sketch after this list)
- 🍎 Metal Acceleration: Automatic GPU support on Apple Silicon
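In Bash, "robust error detection" usually comes down to a defensive preamble like the one below (an illustrative idiom, not a quote from these scripts):

```bash
# Fail fast on errors, unset variables, and broken pipelines; report where it happened
set -euo pipefail
trap 'echo "Error on line $LINENO" >&2' ERR
```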
## Quick Start

```bash
# 1. Clone the repository
git clone <repository-url>
cd llama.cpp-config-scripts
# 2. Install llama.cpp (if not already installed)
brew install llama.cpp # macOS
# or build from source (see Installation section)
# 3. (Optional) Choose your model
./scripts/switch_model.sh # Interactive model selection
# 4. Start the server
./scripts/start_llama_server.sh
# 5. Check server health
./scripts/check_llama_server.sh
# 6. When done, stop the server
./scripts/stop_llama_server.sh
```

That's it! The server will:
- Download the model automatically (first run)
- Start on http://127.0.0.1:8080
- Enable GPU acceleration (if available)
- Create logs in `logs/llama-server.log`
## Installation

Required:
- Bash 4.0+
- curl
- lsof (usually pre-installed)
llama.cpp Installation (choose one method):

**Homebrew (macOS):**

```bash
brew install llama.cpp
```

**Build from source:**

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
sudo make install   # Optional: install system-wide
```

**Python package (llama-cpp-python):**

```bash
pip install llama-cpp-python[server]
```

Verify the installation:

```bash
llama-server --version
```

## Scripts Reference

### scripts/start_llama_server.sh

Purpose: Start llama-server with comprehensive validation
Features:
- Automatic model download from HuggingFace
- Port conflict detection and resolution (see the sketch after this list)
- System resource checking
- Configuration validation
- Health check verification
- Log file management (auto-rotation)
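As a rough illustration, the port-conflict check can be pictured like this (a minimal sketch assuming `lsof` is available; the actual script may do more):

```bash
# Minimal sketch: refuse to start if something already listens on the target port
PORT="${LLAMA_PORT:-8080}"
if lsof -iTCP:"$PORT" -sTCP:LISTEN -t >/dev/null 2>&1; then
    echo "Port $PORT is already in use by:" >&2
    lsof -iTCP:"$PORT" -sTCP:LISTEN >&2
    exit 1
fi
```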
Usage:

```bash
# Default configuration (UD-Q4_K_XL quantization)
./scripts/start_llama_server.sh
# Use higher quality Q8_0 model
MODEL_QUANTIZATION=Q8_0 ./scripts/start_llama_server.sh
# Custom port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
# Custom context size
LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh
# Combine multiple settings
LLAMA_PORT=8081 LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh
```

### scripts/stop_llama_server.sh

Purpose: Gracefully stop llama-server
Features:
- Graceful shutdown (SIGTERM); see the sketch after this list
- Force stop option (SIGKILL) if needed
- Port cleanup verification
- PID file management
- Zombie process detection
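The stop sequence can be pictured roughly as follows (a sketch that assumes the PID file at `logs/llama-server.pid`; the real script may differ in details):

```bash
# Sketch: SIGTERM first, escalate to SIGKILL if the process lingers
PID_FILE="logs/llama-server.pid"
if [ -f "$PID_FILE" ]; then
    pid="$(cat "$PID_FILE")"
    kill -TERM "$pid" 2>/dev/null || true
    for _ in $(seq 1 10); do
        kill -0 "$pid" 2>/dev/null || break    # loop ends once the process is gone
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "llama-server ($pid) did not stop gracefully, sending SIGKILL" >&2
        kill -KILL "$pid"
    fi
    rm -f "$PID_FILE"
fi
```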
Usage:

```bash
./scripts/stop_llama_server.sh
```

### scripts/check_llama_server.sh

Purpose: Comprehensive health checks and diagnostics
Features:
- Process status verification
- Network port checking
- HTTP connectivity testing
- API health endpoint validation
- Model availability checking
- Inference endpoint testing
- System resource reporting
- Log file analysis
Usage:

```bash
# Standard check
./scripts/check_llama_server.sh
# Verbose mode with detailed diagnostics
./scripts/check_llama_server.sh --verbose
# Help
./scripts/check_llama_server.sh --help
```

Exit Codes:
- `0`: Server is healthy
- `1`: Server has issues or is not running
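Because health is reported through the exit code, the script composes cleanly with shell conditionals:

```bash
# Gate a follow-up step on server health
if ./scripts/check_llama_server.sh; then
    echo "Server healthy, continuing"
else
    echo "Server unhealthy or not running" >&2
    exit 1
fi
```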
### scripts/restart_llama_server.sh

Purpose: Safely restart the server
Features:
- Safe stop-then-start sequence
- Health check waiting
- Automatic verification
- Error recovery
Usage:

```bash
./scripts/restart_llama_server.sh
```

### scripts/switch_model.sh

Purpose: Interactive model and quantization switcher
Features:
- Interactive model selection from curated list
- Quantization level selection with RAM requirements
- Save configuration for future use
- List available models and quantizations
- Show current configuration
Usage:

```bash
# Interactive mode (default)
./scripts/switch_model.sh
# List available models
./scripts/switch_model.sh --list
# List available quantizations
./scripts/switch_model.sh --list-quant
# Show current configuration
./scripts/switch_model.sh --show
# Direct model set
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M
# Get help
./scripts/switch_model.sh --help
```

Available Models:
- `devstral-small`: Devstral-Small-2-24B-Instruct (default)
- `qwen-coder-32b`: Qwen2.5-Coder-32B-Instruct
- `llama-3.2-3b`: Llama-3.2-3B-Instruct
- `deepseek-coder`: DeepSeek-Coder-V2-Lite-Instruct
Available Quantizations:
- `Q8_0`: ~24GB RAM, highest quality
- `Q6_K`: ~18GB RAM, very good quality
- `Q5_K_M`: ~15GB RAM, good quality
- `Q4_K_M`: ~12GB RAM, decent quality
- `UD-Q4_K_XL`: ~12GB RAM, optimized (default)
- `Q3_K_M`: ~9GB RAM, lower quality
- `Q2_K`: ~6GB RAM, lowest quality
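To pick a quantization that fits, check total RAM first:

```bash
# macOS: total RAM in GB
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'

# Linux: total and available memory in GB
free -g
```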
## Configuration

Configure server behavior using environment variables:
| Variable | Default | Description |
|---|---|---|
| `MODEL_QUANTIZATION` | `UD-Q4_K_XL` | Model quantization level |
| `LLAMA_MODEL` | `unsloth/Devstral-Small-2-24B...` | Full HuggingFace model path |
| `LLAMA_HOST` | `127.0.0.1` | Server bind address |
| `LLAMA_PORT` | `8080` | Server port |
| `LLAMA_CTX_SIZE` | `16384` | Context window size (tokens) |
| `LLAMA_GPU_LAYERS` | `99` | GPU layers (`-1` or `99` = all) |
| `LLAMA_THREADS` | `-1` | CPU threads (`-1` = auto) |
| `LLAMA_BATCH_SIZE` | `512` | Batch processing size |
| `LLAMA_PARALLEL` | `4` | Parallel request slots |
| `LLAMA_LOG_LEVEL` | `info` | Logging verbosity |
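For context, these variables end up as llama-server flags. A simplified, illustrative mapping is shown below (the actual script adds validation and logging, and exact flag names can vary between llama.cpp versions):

```bash
# Illustrative mapping from environment variables to llama-server flags
llama-server \
  -hf "${LLAMA_MODEL:-unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:${MODEL_QUANTIZATION:-UD-Q4_K_XL}}" \
  --host "${LLAMA_HOST:-127.0.0.1}" \
  --port "${LLAMA_PORT:-8080}" \
  --ctx-size "${LLAMA_CTX_SIZE:-16384}" \
  --n-gpu-layers "${LLAMA_GPU_LAYERS:-99}" \
  --threads "${LLAMA_THREADS:--1}" \
  --batch-size "${LLAMA_BATCH_SIZE:-512}" \
  --parallel "${LLAMA_PARALLEL:-4}"
```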
**Option 1: Set Quantization Level (Recommended)**

```bash
export MODEL_QUANTIZATION="UD-Q4_K_XL" # Smaller, faster (~12GB)
# or
export MODEL_QUANTIZATION="Q8_0" # Higher quality (~24GB)Option 2: Set Full Model Path
export LLAMA_MODEL="unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0"
```

Create a `.env` file in the project root:

```bash
# .env
MODEL_QUANTIZATION=UD-Q4_K_XL
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8080
LLAMA_CTX_SIZE=16384
LLAMA_GPU_LAYERS=99
LLAMA_THREADS=-1
```

Then source it before running scripts:

```bash
source .env && ./scripts/start_llama_server.sh
```
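If you would rather not type `source .env` every time, a small wrapper can auto-load it (a convenience sketch, not something the scripts require):

```bash
# Auto-export everything defined in .env, if the file exists
[ -f .env ] && set -a && . ./.env && set +a
./scripts/start_llama_server.sh
```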
## Usage Examples

### Basic Server Management

```bash
# Start server
./scripts/start_llama_server.sh
# Check if healthy
./scripts/check_llama_server.sh
# Restart server
./scripts/restart_llama_server.sh
# Stop server
./scripts/stop_llama_server.sh
```

### Switching Models

```bash
# Interactive model switcher
./scripts/switch_model.sh
# Quick model switch
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M
# List available options
./scripts/switch_model.sh --list
./scripts/switch_model.sh --list-quant
# Apply changes (restart required)
./scripts/restart_llama_server.sh
```

### Custom Configuration

```bash
# Use higher quality model with more context
MODEL_QUANTIZATION=Q8_0 LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh
# Run on different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
# CPU-only mode (no GPU)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh
# Use different model directly
LLAMA_MODEL="Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q6_K" ./scripts/start_llama_server.sh# Quick health check
curl http://127.0.0.1:8080/health
# List available models
curl http://127.0.0.1:8080/v1/models
# Test inference
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "devstral",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
  }'
```
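The chat endpoint is OpenAI-compatible, so streaming should work the same way; for example (`-N` disables curl buffering so tokens appear as they arrive):

```bash
# Stream tokens as server-sent events
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```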
### CI/CD Integration

```bash
#!/bin/bash
# ci-test.sh
# Start server
./scripts/start_llama_server.sh
# Wait for healthy status
max_wait=60
waited=0
while [ $waited -lt $max_wait ]; do
if ./scripts/check_llama_server.sh; then
echo "Server ready!"
break
fi
echo "Waiting for server..."
sleep 5
waited=$((waited + 5))
done
# Run your tests
pytest tests/
# Cleanup
./scripts/stop_llama_server.sh
```
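For CI it also helps to guarantee cleanup and to fail the job if the server never becomes healthy. One way to do that, sketched with `trap` (illustrative, not part of the shipped scripts):

```bash
#!/bin/bash
# ci-test-strict.sh (illustrative variant)
set -euo pipefail

# Always stop the server, even if a later step fails
trap './scripts/stop_llama_server.sh' EXIT

./scripts/start_llama_server.sh

# Fail fast if the server never reports healthy
healthy=false
for _ in $(seq 1 12); do
    if ./scripts/check_llama_server.sh; then
        healthy=true
        break
    fi
    sleep 5
done
"$healthy" || { echo "Server did not become healthy in time" >&2; exit 1; }

pytest tests/
```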
### Running as a systemd Service (Linux)

Create `/etc/systemd/system/llama-server.service`:

```ini
[Unit]
Description=llama.cpp LLM Server
After=network.target
[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/llama.cpp-config-scripts
Environment="MODEL_QUANTIZATION=UD-Q4_K_XL"
Environment="LLAMA_PORT=8080"
ExecStart=/path/to/llama.cpp-config-scripts/scripts/start_llama_server.sh
ExecStop=/path/to/llama.cpp-config-scripts/scripts/stop_llama_server.sh
Restart=on-failure
RestartSec=30
[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
```

## Troubleshooting

**Problem: Port already in use**

```bash
# Solution: Check what's using the port
lsof -i :8080
# Or use a different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
```

**Problem: Process dies immediately**

```bash
# Solution: Check logs
tail -f logs/llama-server.log
# Check system resources
./scripts/check_llama_server.sh --verbose
```

**Problem: Slow inference**

```bash
# Try smaller context size
LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh
# Use smaller quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh
# Increase threads (use physical cores count)
LLAMA_THREADS=8 ./scripts/start_llama_server.sh
```

**Problem: Out of memory**

```bash
# Use smaller model quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh
# Reduce context size
LLAMA_CTX_SIZE=4096 ./scripts/start_llama_server.sh
# Disable GPU (use CPU only)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh
```

**Problem: Can't connect to server**

```bash
# Check if server is actually running
./scripts/check_llama_server.sh
# Test direct connection
curl http://127.0.0.1:8080/health
# Check firewall (macOS)
sudo pfctl -s rules | grep 8080
# Check firewall (Linux)
sudo iptables -L | grep 8080
```
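Also note that the default bind address is 127.0.0.1, so the server is only reachable from the local machine. To allow remote access (mind the security implications), bind to all interfaces and confirm what the server is listening on:

```bash
# Bind to all interfaces (exposes the server to your network)
LLAMA_HOST=0.0.0.0 ./scripts/start_llama_server.sh

# Confirm the listening address and port
lsof -iTCP:8080 -sTCP:LISTEN
```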
| Error | Cause | Solution |
|---|---|---|
| "llama-server not found" | llama.cpp not installed | Install with brew install llama.cpp |
| "Port already in use" | Another process on port | Use lsof -i :8080 to identify |
| "Model download failed" | Network/HuggingFace issue | Check internet connection, try again |
| "Health check timeout" | Model still loading | Wait 1-2 minutes, model is loading |
| "Out of memory" | Insufficient RAM | Use smaller model or reduce context |
## Project Structure

```
llama.cpp-config-scripts/
├── scripts/                     # Shell scripts for server management
│   ├── start_llama_server.sh    # Start server with validation
│   ├── stop_llama_server.sh     # Stop server gracefully
│   ├── check_llama_server.sh    # Health check and diagnostics
│   ├── restart_llama_server.sh  # Restart server safely
│   └── switch_model.sh          # Interactive model switcher
├── docs/                        # Documentation and configuration
│   ├── README.md                # Documentation index
│   ├── config.toml              # Example configuration
│   ├── install.txt              # Installation notes
│   └── server.sh                # Basic server command
├── logs/                        # Generated at runtime
│   ├── llama-server.log         # Server output log
│   └── llama-server.pid         # Process ID file
├── .llama_config                # Model configuration (generated)
├── env.example                  # Example environment configuration
├── .gitignore                   # Git ignore rules
├── LICENSE                      # MIT License
├── README.md                    # This file
├── CONTRIBUTING.md              # Contribution guidelines
└── CHANGELOG.md                 # Version history
```
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.
Quick contribution guidelines:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- llama.cpp - The amazing LLM inference engine
- Unsloth - For the Devstral model
- All contributors who have helped improve these scripts
## Support

- Documentation: Check this README and the docs/ folder
- Issues: Open an issue on GitHub
- Logs: Check `logs/llama-server.log` for detailed information
Made with ❤️ for the LLM community
Last Updated: January 2026 | Version: 1.0.0