
llama.cpp Configuration Scripts


Professional-grade management scripts for running a llama.cpp server locally. They provide reliable server management with comprehensive error handling, health checks, and monitoring.

📋 Table of Contents

  • 🎯 Overview
  • ✨ Features
  • 🚀 Quick Start
  • 📦 Installation
  • 📖 Scripts Reference
  • ⚙️ Configuration
  • 📚 Usage Examples
  • 🔧 Troubleshooting
  • 📁 Project Structure
  • 🤝 Contributing
  • 📄 License
  • 🙏 Acknowledgments
  • 📞 Support

🎯 Overview

This repository contains battle-tested Bash scripts for managing local llama.cpp LLM servers. Perfect for:

  • Development: Quick local LLM server for testing
  • Production: Reliable server management with health monitoring
  • CI/CD: Automated testing with local LLMs
  • Multi-agent Systems: Backend for AI agent frameworks

Supported Platforms:

  • ✅ macOS (Intel & Apple Silicon with Metal acceleration)
  • ✅ Linux (Ubuntu, Debian, RHEL, etc.)
  • ✅ Windows (via WSL or Git Bash)

✨ Features

  • 🚀 One-Command Startup: Start llama-server with sensible defaults
  • 🔄 Model Switching: Interactive tool to switch between models and quantizations
  • 🏥 Health Monitoring: Comprehensive health checks and status reporting
  • 🔄 Graceful Management: Safe start, stop, and restart operations
  • 🎨 Beautiful Output: Color-coded, user-friendly terminal output
  • ⚙️ Flexible Configuration: Environment variables for easy customization
  • 🔍 Detailed Diagnostics: Verbose mode for troubleshooting
  • 📊 Resource Monitoring: Track CPU, memory, and system usage
  • 🛡️ Error Handling: Robust error detection and recovery
  • 🍎 Metal Acceleration: Automatic GPU support on Apple Silicon

🚀 Quick Start

# 1. Clone the repository
git clone <repository-url>
cd llama.cpp-config-scripts

# 2. Install llama.cpp (if not already installed)
brew install llama.cpp  # macOS
# or build from source (see Installation section)

# 3. (Optional) Choose your model
./scripts/switch_model.sh  # Interactive model selection

# 4. Start the server
./scripts/start_llama_server.sh

# 5. Check server health
./scripts/check_llama_server.sh

# 6. When done, stop the server
./scripts/stop_llama_server.sh

That's it! The server will:

  • Download the model automatically (first run)
  • Start on http://127.0.0.1:8080
  • Enable GPU acceleration (if available)
  • Create logs in logs/llama-server.log

📦 Installation

Prerequisites

Required:

  • Bash 4.0+
  • curl
  • lsof (usually pre-installed)

llama.cpp Installation:

Choose one method:

Method 1: Homebrew (macOS, easiest)

brew install llama.cpp

Method 2: Build from Source (All Platforms)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
sudo cmake --install build  # Optional: install system-wide

Method 3: Python Package (Limited functionality)

pip install 'llama-cpp-python[server]'  # quote the extras so zsh does not expand the brackets

Verify Installation

llama-server --version
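
If the version check fails, confirm the binary is actually on your PATH:

command -v llama-server || echo "llama-server not found on PATH"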

📖 Scripts Reference

Core Management Scripts

start_llama_server.sh

Purpose: Start llama-server with comprehensive validation

Features:

  • Automatic model download from HuggingFace
  • Port conflict detection and resolution
  • System resource checking
  • Configuration validation
  • Health check verification
  • Log file management (auto-rotation)

Usage:

# Default configuration (UD-Q4_K_XL quantization)
./scripts/start_llama_server.sh

# Use higher quality Q8_0 model
MODEL_QUANTIZATION=Q8_0 ./scripts/start_llama_server.sh

# Custom port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh

# Custom context size
LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh

# Combine multiple settings
LLAMA_PORT=8081 LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh
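
The port-conflict detection listed under Features amounts to a pre-flight check before launching the server. A minimal sketch of the idea (not the script's actual code), assuming lsof is available:

# Hypothetical pre-flight check: refuse to start if something already listens on the port
PORT="${LLAMA_PORT:-8080}"
if lsof -iTCP:"$PORT" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "Port $PORT is already in use; stop the other process or set LLAMA_PORT" >&2
    exit 1
fi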

stop_llama_server.sh

Purpose: Gracefully stop llama-server

Features:

  • Graceful shutdown (SIGTERM)
  • Force stop option (SIGKILL) if needed
  • Port cleanup verification
  • PID file management
  • Zombie process detection

Usage:

./scripts/stop_llama_server.sh
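
For reference, the graceful-then-forceful shutdown described above generally follows the pattern below; this is a sketch assuming the PID is kept in logs/llama-server.pid (as listed under Project Structure), not the script's literal code:

# Hypothetical shutdown sequence: SIGTERM first, SIGKILL only if the process lingers
PID_FILE="logs/llama-server.pid"
if [ -f "$PID_FILE" ]; then
    pid="$(cat "$PID_FILE")"
    kill -TERM "$pid" 2>/dev/null
    for _ in 1 2 3 4 5 6 7 8 9 10; do
        kill -0 "$pid" 2>/dev/null || break    # process has exited
        sleep 1
    done
    kill -0 "$pid" 2>/dev/null && kill -KILL "$pid"    # force stop as a last resort
    rm -f "$PID_FILE"
fi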

check_llama_server.sh

Purpose: Comprehensive health and diagnostics

Features:

  • Process status verification
  • Network port checking
  • HTTP connectivity testing
  • API health endpoint validation
  • Model availability checking
  • Inference endpoint testing
  • System resource reporting
  • Log file analysis

Usage:

# Standard check
./scripts/check_llama_server.sh

# Verbose mode with detailed diagnostics
./scripts/check_llama_server.sh --verbose

# Help
./scripts/check_llama_server.sh --help

Exit Codes:

  • 0: Server is healthy
  • 1: Server has issues or is not running
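
Because the result is reported through the exit code, the check drops straight into shell conditionals:

./scripts/check_llama_server.sh && echo "Server healthy" || echo "Server needs attention"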

restart_llama_server.sh

Purpose: Safely restart the server

Features:

  • Safe stop-then-start sequence
  • Health check waiting
  • Automatic verification
  • Error recovery

Usage:

./scripts/restart_llama_server.sh

Utility Scripts

switch_model.sh

Purpose: Interactive model and quantization switcher

Features:

  • Interactive model selection from curated list
  • Quantization level selection with RAM requirements
  • Save configuration for future use
  • List available models and quantizations
  • Show current configuration

Usage:

# Interactive mode (default)
./scripts/switch_model.sh

# List available models
./scripts/switch_model.sh --list

# List available quantizations
./scripts/switch_model.sh --list-quant

# Show current configuration
./scripts/switch_model.sh --show

# Direct model set
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M

# Get help
./scripts/switch_model.sh --help

Available Models:

  • devstral-small - Devstral-Small-2-24B-Instruct (default)
  • qwen-coder-32b - Qwen2.5-Coder-32B-Instruct
  • llama-3.2-3b - Llama-3.2-3B-Instruct
  • deepseek-coder - DeepSeek-Coder-V2-Lite-Instruct

Available Quantizations:

  • Q8_0 - ~24GB RAM, highest quality
  • Q6_K - ~18GB RAM, very good quality
  • Q5_K_M - ~15GB RAM, good quality
  • Q4_K_M - ~12GB RAM, decent quality
  • UD-Q4_K_XL - ~12GB RAM, optimized (default)
  • Q3_K_M - ~9GB RAM, lower quality
  • Q2_K - ~6GB RAM, lowest quality

⚙️ Configuration

Environment Variables

Configure server behavior using environment variables:

Variable            Default                          Description
MODEL_QUANTIZATION  UD-Q4_K_XL                       Model quantization level
LLAMA_MODEL         unsloth/Devstral-Small-2-24B...  Full HuggingFace model path
LLAMA_HOST          127.0.0.1                        Server bind address
LLAMA_PORT          8080                             Server port
LLAMA_CTX_SIZE      16384                            Context window size (tokens)
LLAMA_GPU_LAYERS    99                               GPU layers (-1 or 99 = all)
LLAMA_THREADS       -1                               CPU threads (-1 = auto)
LLAMA_BATCH_SIZE    512                              Batch processing size
LLAMA_PARALLEL      4                                Parallel request slots
LLAMA_LOG_LEVEL     info                             Logging verbosity

Model Quantization Options

Option 1: Set Quantization Level (Recommended)

export MODEL_QUANTIZATION="UD-Q4_K_XL"  # Smaller, faster (~12GB)
# or
export MODEL_QUANTIZATION="Q8_0"        # Higher quality (~24GB)

Option 2: Set Full Model Path

export LLAMA_MODEL="unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0"
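
For context, this repo:quantization form corresponds to llama-server's Hugging Face download flag. A rough manual equivalent of such a download-and-serve invocation (assuming a recent llama-server build with -hf support; the other flags mirror the defaults documented above):

llama-server -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0 \
  --host 127.0.0.1 --port 8080 -c 16384 -ngl 99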

Configuration Files

Create a .env file in the project root:

# .env
MODEL_QUANTIZATION=UD-Q4_K_XL
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8080
LLAMA_CTX_SIZE=16384
LLAMA_GPU_LAYERS=99
LLAMA_THREADS=-1

Then source it before running scripts:

source .env && ./scripts/start_llama_server.sh
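
Note that plain NAME=value assignments are not exported to child processes when sourced this way; if the scripts do not pick up your settings, auto-export while sourcing:

set -a               # export every variable assigned from here on
source .env
set +a
./scripts/start_llama_server.sh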

📚 Usage Examples

Basic Server Management

# Start server
./scripts/start_llama_server.sh

# Check if healthy
./scripts/check_llama_server.sh

# Restart server
./scripts/restart_llama_server.sh

# Stop server
./scripts/stop_llama_server.sh

Model Switching

# Interactive model switcher
./scripts/switch_model.sh

# Quick model switch
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M

# List available options
./scripts/switch_model.sh --list
./scripts/switch_model.sh --list-quant

# Apply changes (restart required)
./scripts/restart_llama_server.sh

Custom Configuration

# Use higher quality model with more context
MODEL_QUANTIZATION=Q8_0 LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh

# Run on different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh

# CPU-only mode (no GPU)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh

# Use different model directly
LLAMA_MODEL="Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q6_K" ./scripts/start_llama_server.sh

Testing the Server

# Quick health check
curl http://127.0.0.1:8080/health

# List available models
curl http://127.0.0.1:8080/v1/models

# Test inference
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
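
The response follows the OpenAI-compatible schema, so the generated text can be extracted with jq (assuming jq is installed):

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "devstral", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'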

CI/CD Integration

#!/bin/bash
# ci-test.sh

# Start server
./scripts/start_llama_server.sh

# Wait for healthy status; fail the job if the server never comes up
max_wait=60
waited=0
until ./scripts/check_llama_server.sh; do
    if [ "$waited" -ge "$max_wait" ]; then
        echo "Server failed to become healthy within ${max_wait}s"
        ./scripts/stop_llama_server.sh
        exit 1
    fi
    echo "Waiting for server..."
    sleep 5
    waited=$((waited + 5))
done
echo "Server ready!"

# Run your tests
pytest tests/

# Cleanup
./scripts/stop_llama_server.sh

Systemd Service (Linux)

Create /etc/systemd/system/llama-server.service:

[Unit]
Description=llama.cpp LLM Server
After=network.target

[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/llama.cpp-config-scripts
Environment="MODEL_QUANTIZATION=UD-Q4_K_XL"
Environment="LLAMA_PORT=8080"
ExecStart=/path/to/llama.cpp-config-scripts/scripts/start_llama_server.sh
ExecStop=/path/to/llama.cpp-config-scripts/scripts/stop_llama_server.sh
Restart=on-failure
RestartSec=30

[Install]
WantedBy=multi-user.target

Reload systemd, then enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
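
Note: if start_llama_server.sh backgrounds llama-server and then exits (it writes logs/llama-server.pid), Type=simple will make systemd consider the service stopped as soon as the script returns. In that case Type=forking with a PIDFile= is likely the better fit:

[Service]
Type=forking
PIDFile=/path/to/llama.cpp-config-scripts/logs/llama-server.pid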

🔧 Troubleshooting

Server Won't Start

Problem: Port already in use

# Solution: Check what's using the port
lsof -i :8080

# Or use a different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh

Problem: Process dies immediately

# Solution: Check logs
tail -f logs/llama-server.log

# Check system resources
./scripts/check_llama_server.sh --verbose

Performance Issues

Problem: Slow inference

# Try smaller context size
LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh

# Use smaller quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh

# Increase threads (use physical cores count)
LLAMA_THREADS=8 ./scripts/start_llama_server.sh

Problem: Out of memory

# Use smaller model quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh

# Reduce context size
LLAMA_CTX_SIZE=4096 ./scripts/start_llama_server.sh

# Disable GPU (use CPU only)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh
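
To see how much memory is actually available before choosing a quantization:

# macOS
vm_stat

# Linux
free -h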

Connection Issues

Problem: Can't connect to server

# Check if server is actually running
./scripts/check_llama_server.sh

# Test direct connection
curl http://127.0.0.1:8080/health

# Check firewall (macOS)
sudo pfctl -s rules | grep 8080

# Check firewall (Linux; -n keeps ports numeric so 8080 matches)
sudo iptables -L -n | grep 8080

Common Error Messages

Error                       Cause                        Solution
"llama-server not found"    llama.cpp not installed      Install with brew install llama.cpp
"Port already in use"       Another process on the port  Use lsof -i :8080 to identify it
"Model download failed"     Network/HuggingFace issue    Check the internet connection and retry
"Health check timeout"      Model still loading          Wait 1-2 minutes for the model to finish loading
"Out of memory"             Insufficient RAM             Use a smaller quantization or reduce the context size

📁 Project Structure

llama.cpp-config-scripts/
├── scripts/                    # Shell scripts for server management
│   ├── start_llama_server.sh   # Start server with validation
│   ├── stop_llama_server.sh    # Stop server gracefully
│   ├── check_llama_server.sh   # Health check and diagnostics
│   ├── restart_llama_server.sh # Restart server safely
│   └── switch_model.sh         # Interactive model switcher
├── docs/                       # Documentation and configuration
│   ├── README.md               # Documentation index
│   ├── config.toml             # Example configuration
│   ├── install.txt             # Installation notes
│   └── server.sh               # Basic server command
├── logs/                       # Generated at runtime
│   ├── llama-server.log        # Server output log
│   └── llama-server.pid        # Process ID file
├── .llama_config               # Model configuration (generated)
├── env.example                 # Example environment configuration
├── .gitignore                  # Git ignore rules
├── LICENSE                     # MIT License
├── README.md                   # This file
├── CONTRIBUTING.md             # Contribution guidelines
└── CHANGELOG.md                # Version history

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

Quick contribution guidelines:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • llama.cpp - The amazing LLM inference engine
  • Unsloth - For the Devstral GGUF quantizations
  • All contributors who have helped improve these scripts

📞 Support

  • Documentation: Check this README and docs/ folder
  • Issues: Open an issue on GitHub
  • Logs: Check logs/llama-server.log for detailed information

Made with ❤️ for the LLM community

Last Updated: January 2026 | Version: 1.0.0
