# llama.cpp-config-scripts

Professional-grade management scripts for running a llama.cpp server locally. These scripts provide reliable server management with comprehensive error handling, health checking, and monitoring capabilities.
## Table of Contents

- Overview
- Features
- Quick Start
- Installation
- Scripts Reference
- Configuration
- Usage Examples
- Troubleshooting
- Project Structure
- Contributing
- License
## Overview

This repository contains battle-tested Bash scripts for managing local llama.cpp LLM servers. Perfect for:
- Development: Quick local LLM server for testing
- Production: Reliable server management with health monitoring
- CI/CD: Automated testing with local LLMs
- Multi-agent Systems: Backend for AI agent frameworks
Supported Platforms:
- ✅ macOS (Intel & Apple Silicon with Metal acceleration)
- ✅ Linux (Ubuntu, Debian, RHEL, etc.)
- ✅ Windows (via WSL or Git Bash)
## Features

- 🚀 One-Command Startup: Start llama-server with sensible defaults
- 🔄 Model Switching: Interactive tool to switch between models and quantizations
- 🏥 Health Monitoring: Comprehensive health checks and status reporting
- 🔄 Graceful Management: Safe start, stop, and restart operations
- 🎨 Beautiful Output: Color-coded, user-friendly terminal output
- ⚙️ Flexible Configuration: Environment variables for easy customization
- 🔍 Detailed Diagnostics: Verbose mode for troubleshooting
- 📊 Resource Monitoring: Track CPU, memory, and system usage
- 🛡️ Error Handling: Robust error detection and recovery (see the sketch after this list)
- 🍎 Metal Acceleration: Automatic GPU support on Apple Silicon
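In Bash, "robust error detection" usually comes down to a defensive preamble like the one below (an illustrative idiom, not a quote from these scripts):

```bash
# Fail fast on errors, unset variables, and broken pipelines; report where it happened
set -euo pipefail
trap 'echo "Error on line $LINENO" >&2' ERR
```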
## Quick Start

```bash
# 1. Clone the repository
git clone <repository-url>
cd llama.cpp-config-scripts
# 2. Install llama.cpp (if not already installed)
brew install llama.cpp # macOS
# or build from source (see Installation section)
# 3. (Optional) Choose your model
./scripts/switch_model.sh # Interactive model selection
# 4. Start the server
./scripts/start_llama_server.sh
# 5. Check server health
./scripts/check_llama_server.sh
# 6. When done, stop the server
./scripts/stop_llama_server.sh
```

That's it! The server will:
- Download the model automatically (first run)
- Start on http://127.0.0.1:8080
- Enable GPU acceleration (if available)
- Create logs in `logs/llama-server.log`
## Installation

Required:
- Bash 4.0+
- curl
- lsof (usually pre-installed)
llama.cpp Installation (choose one method):

**Homebrew (macOS):**

```bash
brew install llama.cpp
```

**Build from source:**

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
sudo make install   # Optional: install system-wide
```

**Python package (llama-cpp-python):**

```bash
pip install llama-cpp-python[server]
```

Verify the installation:

```bash
llama-server --version
```

## Scripts Reference

### scripts/start_llama_server.sh

Purpose: Start llama-server with comprehensive validation
Features:
- Automatic model download from HuggingFace
- Port conflict detection and resolution (see the sketch after this list)
- System resource checking
- Configuration validation
- Health check verification
- Log file management (auto-rotation)
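As a rough illustration, the port-conflict check can be pictured like this (a minimal sketch assuming `lsof` is available; the actual script may do more):

```bash
# Minimal sketch: refuse to start if something already listens on the target port
PORT="${LLAMA_PORT:-8080}"
if lsof -iTCP:"$PORT" -sTCP:LISTEN -t >/dev/null 2>&1; then
    echo "Port $PORT is already in use by:" >&2
    lsof -iTCP:"$PORT" -sTCP:LISTEN >&2
    exit 1
fi
```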
Usage:

```bash
# Default configuration (UD-Q4_K_XL quantization)
./scripts/start_llama_server.sh
# Use higher quality Q8_0 model
MODEL_QUANTIZATION=Q8_0 ./scripts/start_llama_server.sh
# Custom port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
# Custom context size
LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh
# Combine multiple settings
LLAMA_PORT=8081 LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh
```

### scripts/stop_llama_server.sh

Purpose: Gracefully stop llama-server
Features:
- Graceful shutdown (SIGTERM); see the sketch after this list
- Force stop option (SIGKILL) if needed
- Port cleanup verification
- PID file management
- Zombie process detection
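The stop sequence can be pictured roughly as follows (a sketch that assumes the PID file at `logs/llama-server.pid`; the real script may differ in details):

```bash
# Sketch: SIGTERM first, escalate to SIGKILL if the process lingers
PID_FILE="logs/llama-server.pid"
if [ -f "$PID_FILE" ]; then
    pid="$(cat "$PID_FILE")"
    kill -TERM "$pid" 2>/dev/null || true
    for _ in $(seq 1 10); do
        kill -0 "$pid" 2>/dev/null || break    # loop ends once the process is gone
        sleep 1
    done
    if kill -0 "$pid" 2>/dev/null; then
        echo "llama-server ($pid) did not stop gracefully, sending SIGKILL" >&2
        kill -KILL "$pid"
    fi
    rm -f "$PID_FILE"
fi
```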
Usage:

```bash
./scripts/stop_llama_server.sh
```

### scripts/check_llama_server.sh

Purpose: Comprehensive health checks and diagnostics
Features:
- Process status verification
- Network port checking
- HTTP connectivity testing
- API health endpoint validation
- Model availability checking
- Inference endpoint testing
- System resource reporting
- Log file analysis
Usage:

```bash
# Standard check
./scripts/check_llama_server.sh
# Verbose mode with detailed diagnostics
./scripts/check_llama_server.sh --verbose
# Help
./scripts/check_llama_server.sh --help
```

Exit Codes:
- `0`: Server is healthy
- `1`: Server has issues or is not running
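Because health is reported through the exit code, the script composes cleanly with shell conditionals:

```bash
# Gate a follow-up step on server health
if ./scripts/check_llama_server.sh; then
    echo "Server healthy, continuing"
else
    echo "Server unhealthy or not running" >&2
    exit 1
fi
```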
### scripts/restart_llama_server.sh

Purpose: Safely restart the server
Features:
- Safe stop-then-start sequence
- Health check waiting
- Automatic verification
- Error recovery
Usage:

```bash
./scripts/restart_llama_server.sh
```

### scripts/switch_model.sh

Purpose: Interactive model and quantization switcher
Features:
- Interactive model selection from curated list
- Quantization level selection with RAM requirements
- Save configuration for future use
- List available models and quantizations
- Show current configuration
Usage:

```bash
# Interactive mode (default)
./scripts/switch_model.sh
# List available models
./scripts/switch_model.sh --list
# List available quantizations
./scripts/switch_model.sh --list-quant
# Show current configuration
./scripts/switch_model.sh --show
# Direct model set
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M
# Get help
./scripts/switch_model.sh --help
```

Available Models:
- `devstral-small`: Devstral-Small-2-24B-Instruct (default)
- `qwen-coder-32b`: Qwen2.5-Coder-32B-Instruct
- `llama-3.2-3b`: Llama-3.2-3B-Instruct
- `deepseek-coder`: DeepSeek-Coder-V2-Lite-Instruct
Available Quantizations:
- `Q8_0`: ~24GB RAM, highest quality
- `Q6_K`: ~18GB RAM, very good quality
- `Q5_K_M`: ~15GB RAM, good quality
- `Q4_K_M`: ~12GB RAM, decent quality
- `UD-Q4_K_XL`: ~12GB RAM, optimized (default)
- `Q3_K_M`: ~9GB RAM, lower quality
- `Q2_K`: ~6GB RAM, lowest quality
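To pick a quantization that fits, check total RAM first:

```bash
# macOS: total RAM in GB
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1024/1024/1024}'

# Linux: total and available memory in GB
free -g
```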
## Configuration

Configure server behavior using environment variables:
| Variable | Default | Description |
|---|---|---|
| `MODEL_QUANTIZATION` | `UD-Q4_K_XL` | Model quantization level |
| `LLAMA_MODEL` | `unsloth/Devstral-Small-2-24B...` | Full HuggingFace model path |
| `LLAMA_HOST` | `127.0.0.1` | Server bind address |
| `LLAMA_PORT` | `8080` | Server port |
| `LLAMA_CTX_SIZE` | `16384` | Context window size (tokens) |
| `LLAMA_GPU_LAYERS` | `99` | GPU layers (`-1` or `99` = all) |
| `LLAMA_THREADS` | `-1` | CPU threads (`-1` = auto) |
| `LLAMA_BATCH_SIZE` | `512` | Batch processing size |
| `LLAMA_PARALLEL` | `4` | Parallel request slots |
| `LLAMA_LOG_LEVEL` | `info` | Logging verbosity |
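For context, these variables end up as llama-server flags. A simplified, illustrative mapping is shown below (the actual script adds validation and logging, and exact flag names can vary between llama.cpp versions):

```bash
# Illustrative mapping from environment variables to llama-server flags
llama-server \
  -hf "${LLAMA_MODEL:-unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:${MODEL_QUANTIZATION:-UD-Q4_K_XL}}" \
  --host "${LLAMA_HOST:-127.0.0.1}" \
  --port "${LLAMA_PORT:-8080}" \
  --ctx-size "${LLAMA_CTX_SIZE:-16384}" \
  --n-gpu-layers "${LLAMA_GPU_LAYERS:-99}" \
  --threads "${LLAMA_THREADS:--1}" \
  --batch-size "${LLAMA_BATCH_SIZE:-512}" \
  --parallel "${LLAMA_PARALLEL:-4}"
```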
**Option 1: Set Quantization Level (Recommended)**

```bash
export MODEL_QUANTIZATION="UD-Q4_K_XL" # Smaller, faster (~12GB)
# or
export MODEL_QUANTIZATION="Q8_0" # Higher quality (~24GB)Option 2: Set Full Model Path
export LLAMA_MODEL="unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q8_0"
```

Create a `.env` file in the project root:

```bash
# .env
MODEL_QUANTIZATION=UD-Q4_K_XL
LLAMA_HOST=127.0.0.1
LLAMA_PORT=8080
LLAMA_CTX_SIZE=16384
LLAMA_GPU_LAYERS=99
LLAMA_THREADS=-1
```

Then source it before running scripts:

```bash
source .env && ./scripts/start_llama_server.sh
```
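If you would rather not type `source .env` every time, a small wrapper can auto-load it (a convenience sketch, not something the scripts require):

```bash
# Auto-export everything defined in .env, if the file exists
[ -f .env ] && set -a && . ./.env && set +a
./scripts/start_llama_server.sh
```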
## Usage Examples

### Basic Server Management

```bash
# Start server
./scripts/start_llama_server.sh
# Check if healthy
./scripts/check_llama_server.sh
# Restart server
./scripts/restart_llama_server.sh
# Stop server
./scripts/stop_llama_server.sh
```

### Switching Models

```bash
# Interactive model switcher
./scripts/switch_model.sh
# Quick model switch
./scripts/switch_model.sh devstral-small Q8_0
./scripts/switch_model.sh qwen-coder-32b Q6_K
./scripts/switch_model.sh llama-3.2-3b Q4_K_M
# List available options
./scripts/switch_model.sh --list
./scripts/switch_model.sh --list-quant
# Apply changes (restart required)
./scripts/restart_llama_server.sh
```

### Custom Configuration

```bash
# Use higher quality model with more context
MODEL_QUANTIZATION=Q8_0 LLAMA_CTX_SIZE=32768 ./scripts/start_llama_server.sh
# Run on different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
# CPU-only mode (no GPU)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh
# Use different model directly
LLAMA_MODEL="Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q6_K" ./scripts/start_llama_server.sh# Quick health check
curl http://127.0.0.1:8080/health
# List available models
curl http://127.0.0.1:8080/v1/models
# Test inference
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "devstral",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
  }'
```
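The chat endpoint is OpenAI-compatible, so streaming should work the same way; for example (`-N` disables curl buffering so tokens appear as they arrive):

```bash
# Stream tokens as server-sent events
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devstral",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```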
### CI/CD Integration

```bash
#!/bin/bash
# ci-test.sh
# Start server
./scripts/start_llama_server.sh
# Wait for healthy status
max_wait=60
waited=0
while [ $waited -lt $max_wait ]; do
if ./scripts/check_llama_server.sh; then
echo "Server ready!"
break
fi
echo "Waiting for server..."
sleep 5
waited=$((waited + 5))
done
# Run your tests
pytest tests/
# Cleanup
./scripts/stop_llama_server.sh
```
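For CI it also helps to guarantee cleanup and to fail the job if the server never becomes healthy. One way to do that, sketched with `trap` (illustrative, not part of the shipped scripts):

```bash
#!/bin/bash
# ci-test-strict.sh (illustrative variant)
set -euo pipefail

# Always stop the server, even if a later step fails
trap './scripts/stop_llama_server.sh' EXIT

./scripts/start_llama_server.sh

# Fail fast if the server never reports healthy
healthy=false
for _ in $(seq 1 12); do
    if ./scripts/check_llama_server.sh; then
        healthy=true
        break
    fi
    sleep 5
done
"$healthy" || { echo "Server did not become healthy in time" >&2; exit 1; }

pytest tests/
```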
### Running as a systemd Service (Linux)

Create `/etc/systemd/system/llama-server.service`:

```ini
[Unit]
Description=llama.cpp LLM Server
After=network.target
[Service]
Type=simple
User=youruser
WorkingDirectory=/path/to/llama.cpp-config-scripts
Environment="MODEL_QUANTIZATION=UD-Q4_K_XL"
Environment="LLAMA_PORT=8080"
ExecStart=/path/to/llama.cpp-config-scripts/scripts/start_llama_server.sh
ExecStop=/path/to/llama.cpp-config-scripts/scripts/stop_llama_server.sh
Restart=on-failure
RestartSec=30
[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl enable llama-server
sudo systemctl start llama-server
sudo systemctl status llama-server
```

## Troubleshooting

**Problem: Port already in use**

```bash
# Solution: Check what's using the port
lsof -i :8080
# Or use a different port
LLAMA_PORT=8081 ./scripts/start_llama_server.sh
```

**Problem: Process dies immediately**

```bash
# Solution: Check logs
tail -f logs/llama-server.log
# Check system resources
./scripts/check_llama_server.sh --verbose
```

**Problem: Slow inference**

```bash
# Try smaller context size
LLAMA_CTX_SIZE=8192 ./scripts/start_llama_server.sh
# Use smaller quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh
# Increase threads (use physical cores count)
LLAMA_THREADS=8 ./scripts/start_llama_server.sh
```

**Problem: Out of memory**

```bash
# Use smaller model quantization
MODEL_QUANTIZATION=UD-Q4_K_XL ./scripts/start_llama_server.sh
# Reduce context size
LLAMA_CTX_SIZE=4096 ./scripts/start_llama_server.sh
# Disable GPU (use CPU only)
LLAMA_GPU_LAYERS=0 ./scripts/start_llama_server.sh
```

**Problem: Can't connect to server**

```bash
# Check if server is actually running
./scripts/check_llama_server.sh
# Test direct connection
curl http://127.0.0.1:8080/health
# Check firewall (macOS)
sudo pfctl -s rules | grep 8080
# Check firewall (Linux)
sudo iptables -L | grep 8080
```
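Also note that the default bind address is 127.0.0.1, so the server is only reachable from the local machine. To allow remote access (mind the security implications), bind to all interfaces and confirm what the server is listening on:

```bash
# Bind to all interfaces (exposes the server to your network)
LLAMA_HOST=0.0.0.0 ./scripts/start_llama_server.sh

# Confirm the listening address and port
lsof -iTCP:8080 -sTCP:LISTEN
```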
| Error | Cause | Solution |
|---|---|---|
| "llama-server not found" | llama.cpp not installed | Install with brew install llama.cpp |
| "Port already in use" | Another process on port | Use lsof -i :8080 to identify |
| "Model download failed" | Network/HuggingFace issue | Check internet connection, try again |
| "Health check timeout" | Model still loading | Wait 1-2 minutes, model is loading |
| "Out of memory" | Insufficient RAM | Use smaller model or reduce context |
## Project Structure

```
llama.cpp-config-scripts/
├── scripts/                     # Shell scripts for server management
│   ├── start_llama_server.sh    # Start server with validation
│   ├── stop_llama_server.sh     # Stop server gracefully
│   ├── check_llama_server.sh    # Health check and diagnostics
│   ├── restart_llama_server.sh  # Restart server safely
│   └── switch_model.sh          # Interactive model switcher
├── docs/                        # Documentation and configuration
│   ├── README.md                # Documentation index
│   ├── config.toml              # Example configuration
│   ├── install.txt              # Installation notes
│   └── server.sh                # Basic server command
├── logs/                        # Generated at runtime
│   ├── llama-server.log         # Server output log
│   └── llama-server.pid         # Process ID file
├── .llama_config                # Model configuration (generated)
├── env.example                  # Example environment configuration
├── .gitignore                   # Git ignore rules
├── LICENSE                      # MIT License
├── README.md                    # This file
├── CONTRIBUTING.md              # Contribution guidelines
└── CHANGELOG.md                 # Version history
```
## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.
Quick contribution guidelines:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- llama.cpp - The amazing LLM inference engine
- Unsloth - For the Devstral model
- All contributors who have helped improve these scripts
## Support

- Documentation: Check this README and the docs/ folder
- Issues: Open an issue on GitHub
- Logs: Check `logs/llama-server.log` for detailed information
Made with ❤️ for the LLM community
Last Updated: January 2026 | Version: 1.0.0