Advanced Multi-Backend LLM System with Intelligent Memory Management
Shepherd is a production-grade C++ LLM inference system supporting both local models (llama.cpp, TensorRT-LLM) and cloud APIs (OpenAI, Anthropic, Gemini, Ollama). It features KV cache eviction for indefinite conversations, retrieval-augmented generation (RAG), and comprehensive tool/function calling.
Local model:
./shepherd -m /path/to/model.gguf
Cloud provider:
# Add a provider
shepherd provider add sonnet anthropic --model claude-sonnet-4 --api-key sk-ant-...
# Use it
./shepherd --provider sonnet
Build from source: See BUILD.md for prerequisites and installation.
$ ./shepherd --provider mylocal
Shepherd v2.1.0
Provider: mylocal (llamacpp)
Model: qwen3-30b-a3b
Context: 40960 tokens
Tools: 18 available
> What files are in the current directory?
* list_directory(path=".")
The current directory contains:
- main.cpp: Application entry point
- README.md: Project documentation
- Makefile: Build configuration
...
> /provider use sonnet
Switched to provider: sonnet
> Explain this code
...
Providers define backends you can switch between at runtime.
# List configured providers
shepherd provider list
# Add providers (action-first for creating new)
shepherd provider add local --type llamacpp --model /models/qwen-72b.gguf
shepherd provider add sonnet --type anthropic --model claude-sonnet-4 --api-key sk-ant-...
shepherd provider add gpt --type openai --model gpt-4o --api-key sk-...
# View/modify providers (name-first pattern)
shepherd provider sonnet show # Show details
shepherd provider sonnet set model claude-sonnet-4-20250514 # Modify setting
shepherd provider sonnet use # Switch to this provider
shepherd provider sonnet # Show help for this provider
# In interactive mode
> /provider local use
> /provider next
Shepherd includes built-in tools across several categories:
| Category | Tools |
|---|---|
| Filesystem | read, write, list_directory, delete_file, file_exists |
| Command | shell (execute commands with timeout) |
| HTTP | http_get, http_post, http_put, http_delete |
| JSON | json_parse, json_validate, json_extract |
| Memory | search_memory, set_fact, get_fact, store_memory |
| MCP | list_mcp_resources, read_mcp_resource, plus dynamic server:tool |
# List tools
shepherd tools list
# Enable/disable specific tools
shepherd tools enable shell
shepherd tools disable shell
# Disable all tools
./shepherd --notools
Configuration is stored at ~/.config/shepherd/config.json:
{
"streaming": true,
"thinking": false,
"tui": true,
"max_db_size": "10G",
"memory_database": "~/.local/share/shepherd/memory.db"
}
# View configuration
shepherd config show
# Set values (key-first shortcut)
shepherd config streaming true # Set streaming to true
shepherd config max_db_size 20G # Set max_db_size
# Or use explicit set
shepherd config set streaming true
# View single value
shepherd config streaming # Shows current streaming value
# In interactive mode
> /config show
> /config streaming true
Configure Model Context Protocol servers for external tool integration:
# List MCP servers
shepherd mcp list
# Add an MCP server (action-first for creating new)
shepherd mcp add mydb python /path/to/mcp_server.py -e DB_HOST=localhost
# View/modify servers (name-first pattern)
shepherd mcp mydb show # Show server details
shepherd mcp mydb test # Test connection
shepherd mcp mydb remove # Remove server
shepherd mcp mydb # Show help for this server
SMCP passes credentials to MCP servers via stdin, never in environment variables or CLI args:
# Add SMCP server with credentials
shepherd smcp add database smcp-postgres --cred DB_URL=postgresql://user:pass@host/db
Credentials are sent via the SMCP protocol handshake, never exposed in /proc, ps, or config files.
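SMCP servers share the same add/remove/list subcommands as MCP servers (see the command reference below), so a quick check that the server registered is:
# Confirm the SMCP server was added
shepherd smcp list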
Load configuration from Azure Key Vault using Managed Identity:
./shepherd --config msi --kv my-vault-name
Store a secret named shepherd-config containing the unified JSON config. The VM's managed identity needs the "Key Vault Secrets User" role.
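For example, the Azure CLI can upload an existing local config as that secret (assuming az is installed and logged in; the vault name and file path are illustrative):
# Store the unified config as the shepherd-config secret
az keyvault secret set --vault-name my-vault-name --name shepherd-config \
  --file ~/.config/shepherd/config.json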
SHEPHERD_INTERACTIVE=1 # Force interactive mode (useful in scripts/pipes)
NO_COLOR=1 # Disable colored output
Shepherd can run as a server for remote access or persistent sessions.
Exposes an OpenAI-compatible REST API for remote access to your local Shepherd instance.
./shepherd --server --port 8000
Use cases:
- Access your home server's GPU from your laptop
- Use OpenAI-compatible tools with local models
- Integration with any OpenAI client library
Endpoints:
- POST /v1/chat/completions - Chat completions (streaming supported)
- GET /v1/models - List available models
- GET /health - Health check
Authentication:
Generate API keys for clients to authenticate against the server (OpenAI-compatible Authorization: Bearer header). See docs/api_server.md for details.
./shepherd --server --auth-mode json
shepherd apikey add mykey # Generates sk-shep-...
For full documentation, see docs/api_server.md.
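As a quick smoke test of the endpoints above (illustrative; substitute a key generated with shepherd apikey add and a model name reported by GET /v1/models):
# OpenAI-compatible chat completion against a local Shepherd server
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-shep-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-30b-a3b", "messages": [{"role": "user", "content": "Hello"}]}'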
Runs a persistent AI session with server-side tool execution and multi-client access.
./shepherd --cliserver --port 8000
Use cases:
- 24/7 AI assistant with full tool access
- Query databases without exposing credentials to clients
- Multiple clients see the same session via SSE streaming
Connect a client:
./shepherd --backend cli --api-base http://server:8000
For full documentation, see docs/cli_server.md.
Shepherd's architecture allows any backend with any frontend, and servers can be chained together.
Key principle: With API backends, each incoming connection creates a new backend connection - no session contention, fully scalable.
Hide your Azure OpenAI credentials while adding tools and your own API keys:
# Shepherd connects to Azure OpenAI (credentials stay on server)
# Clients connect to Shepherd with your API keys
./shepherd --backend openai \
--api-base https://mycompany.openai.azure.com/v1 \
--api-key $AZURE_KEY \
--server --port 8000 --auth-mode json --server-tools
# Generate keys for your clients
shepherd apikey add client1
shepherd apikey add client2
Clients get:
- Access to Azure OpenAI without knowing the Azure credentials
- Server-side tools (filesystem, shell, MCP servers)
- Your access control via Shepherd API keys
Use vLLM's multi-user capabilities with a persistent CLI session:
# vLLM server running on port 5000 (handles multiple users efficiently)
# Shepherd CLI server on top for persistent session + tools
./shepherd --backend openai \
--api-base http://localhost:5000/v1 \
--cliserver --port 8000
Now you have:
- vLLM's PagedAttention for efficient multi-conversation handling
- Shepherd's persistent session (all clients see same conversation)
- Server-side tools executing locally
# Level 1: llamacpp backend
./shepherd --backend llamacpp -m /models/qwen-72b.gguf --server --port 5000
# Level 2: API server proxy (adds tools + API keys)
./shepherd --backend openai --api-base http://localhost:5000/v1 \
--server --port 6000 --auth-mode json --server-tools
# Level 3: CLI server for persistent session
./shepherd --backend openai --api-base http://localhost:6000/v1 \
--cliserver --port 7000
| Backend | Type | Models | Context | Tools |
|---|---|---|---|---|
| llama.cpp | Local | Llama, Qwen, Mistral, Gemma, etc. | 8K-256K | ✓ |
| TensorRT-LLM | Local | Same (NVIDIA optimized) | 2K-256K | ✓ |
| OpenAI | Cloud | GPT-5, GPT-4o, GPT-4 Turbo | 128K-200K | ✓ |
| Anthropic | Cloud | Claude Opus 4.5, Sonnet 4, Haiku | 200K | ✓ |
| Gemini | Cloud | Gemini 3, 2.5 Pro/Flash | 32K-2M | ✓ |
| Azure OpenAI | Cloud | GPT models via deployment | 128K-200K | ✓ |
| Ollama | Local/Cloud | Any Ollama model | 8K-128K | ✓ |
Evicted messages are automatically archived to a SQLite database with FTS5 full-text search:
> Remember that the project deadline is March 15
* set_fact(key="project_deadline", value="March 15")
# Later, or in a new session...
> What's the project deadline?
* get_fact(key="project_deadline")
The project deadline is March 15.
Search archived conversations:
> Search my memory for discussions about authentication
* search_memory(query="authentication")
When multiple providers are configured, Shepherd creates ask_* tools for cross-model consultation:
# Using local model, ask Claude for code review
> ask_sonnet to read main.cpp and suggest improvements
* ask_sonnet(prompt="read main.cpp and suggest improvements")
→ Sonnet calls read(path="main.cpp")
→ Sonnet analyzes and responds
Claude's analysis appears in your local model's context.
Key feature: The ask_* tools have full tool access - the consulted model can read files, run commands, search memory, etc. You can chain consultations: ask Sonnet to ask GPT to analyze something.
The current provider is excluded (you don't ask yourself). Switch providers and the tools update automatically.
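A chained consultation might look like this (illustrative; the exact tool calls depend on which models and providers you have configured):
> ask_sonnet to get a second opinion from GPT on the error handling in session.cpp
* ask_sonnet(prompt="get a second opinion from GPT on the error handling in session.cpp")
  → Sonnet calls read(path="session.cpp")
  → Sonnet calls ask_gpt(prompt="review the error handling in this code...")
  → GPT responds, and Sonnet reports both assessments back into your local model's context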
Shepherd supports automatic eviction for indefinite conversations with any backend:
- Local backends: Evicts when GPU KV cache fills
- API backends: Evicts when API returns context full error, then retries
- Manual limit: Use --context-size N to set a limit smaller than the backend's maximum
# Force eviction at 32K tokens even if backend supports more
./shepherd --provider azure --context-size 32768
Eviction behavior:
- Oldest messages first (LRU), protecting system prompt and current context
- Automatic archival to RAG database before eviction
- Seamless continuation - conversation keeps going
For local backend implementation details, see docs/llamacpp.md.
Shepherd includes a cron-like scheduler that injects prompts into the session automatically. Works with CLI, TUI, and CLI server modes.
# Add a scheduled task (action-first for creating new)
shepherd sched add morning-news "0 9 * * *" "Get me the top 5 tech news headlines"
# List scheduled tasks
shepherd sched list
# View/modify schedules (name-first pattern)
shepherd sched morning-news show # Show schedule details
shepherd sched morning-news disable # Disable schedule
shepherd sched morning-news enable # Enable schedule
shepherd sched morning-news remove # Remove schedule
24/7 Operation: Run a CLI server and schedules execute automatically, even with no clients connected:
./shepherd --cliserver --port 8000
# Scheduled prompts run in the session:
# - "Check server disk usage" every hour
# - "Summarize overnight logs" at 6am
# - "Generate daily report" at 5pmClients connect to see results from scheduled tasks in the conversation history.
| Command | Description |
|---|---|
| shepherd provider <add\|list\|show\|remove\|use> | Manage providers |
| shepherd config <show\|set> | View/modify configuration |
| shepherd tools <list\|enable\|disable> | Manage tools |
| shepherd mcp <add\|remove\|list> | Manage MCP servers |
| shepherd smcp <add\|remove\|list> | Manage SMCP servers |
| shepherd sched <list\|add\|remove\|enable\|disable> | Scheduled tasks |
| shepherd apikey <add\|list\|remove> | API key management |
| shepherd edit-system | Edit system prompt in $EDITOR |
| Flag | Description |
|---|---|
| -p, --provider NAME | Use specific provider |
| -m, --model PATH | Model name or file |
| --backend TYPE | Backend: llamacpp, openai, anthropic, etc. |
| --context-size N | Context window size (0 = model default) |
| --server | Start API server mode |
| --cliserver | Start CLI server mode |
| --port N | Server port (default: 8000) |
| --notools | Disable all tools |
| --nostream | Disable streaming output |
| --tui / --no-tui | Enable/disable TUI mode |
| --config msi --kv VAULT | Load config from Azure Key Vault |
Run shepherd --help for the complete list.
Minimum:
- GPU: NVIDIA GTX 1080 Ti (11GB VRAM) or better
- RAM: 32GB system RAM
- Storage: SATA SSD (500GB)
Recommended:
- GPU: 2x NVIDIA RTX 3090 (48GB VRAM)
- RAM: 128GB system RAM
- Storage: NVMe SSD (1TB+)
Cloud instances:
- AWS: g5.12xlarge (4x A10G)
- GCP: a2-highgpu-4g (4x A100)
- Azure: Standard_NC24ads_A100_v4
| Backend | Prompt Speed | Generation Speed | Latency |
|---|---|---|---|
| TensorRT-LLM | 8000 tok/s | 45 tok/s | ~50ms |
| llama.cpp (CUDA) | 1200 tok/s | 25 tok/s | ~80ms |
| llama.cpp (CPU) | 150 tok/s | 8 tok/s | ~200ms |
| Configuration | VRAM | System RAM | Context |
|---|---|---|---|
| Q4_K_M + 64K ctx | 38GB | 8GB | 65536 |
| Q4_K_M + 128K ctx | 42GB | 12GB | 131072 |
| Q8_0 + 64K ctx | 72GB | 16GB | 65536 |
Reduce context size:
./shepherd --context-size 65536
Or use a more aggressive quantization (Q4_K_M instead of Q8_0).
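For example (hypothetical model path):
# Q4_K_M quant of the same model, with a smaller context window
./shepherd -m /models/qwen-72b-Q4_K_M.gguf --context-size 65536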
Increase GPU layers or switch backends:
./shepherd --gpu-layers 48
If you see repetitive or nonsensical output, the KV cache may be corrupted. Restart Shepherd to clear the cache.
For debug builds, use -d=3 for verbose KV cache logging.
┌─────────────────────────────────────────────────────────────┐
│ Frontend │
│ CLI, TUI, API Server, CLI Server │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────v──────────────────────────────────┐
│ Session + Provider │
│ Message routing, provider switching, tool execution │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────v────┐ ┌──────v─────┐ ┌─────v─────────┐
│ LlamaCpp │ │ TensorRT │ │ API Backends │
│ Backend │ │ Backend │ │ (5 types) │
└────────────┘ └────────────┘ └───────────────┘
For detailed architecture, see docs/architecture.md.
shepherd/
├── main.cpp # Application entry point
├── frontend.cpp/h # Frontend abstraction
├── backend.cpp/h # Backend base class
├── session.cpp/h # Session management
├── provider.cpp/h # Provider management
├── config.cpp/h # Configuration
├── rag.cpp/h # RAG system
├── server.cpp/h # HTTP server base
│
├── backends/
│ ├── llamacpp.cpp/h # llama.cpp backend
│ ├── tensorrt.cpp/h # TensorRT-LLM backend
│ ├── openai.cpp/h # OpenAI API
│ ├── anthropic.cpp/h # Anthropic Claude
│ ├── gemini.cpp/h # Google Gemini
│ ├── ollama.cpp/h # Ollama
│ ├── api.cpp/h # Base for API backends
│ └── factory.cpp/h # Backend factory
│
├── frontends/
│ ├── cli.cpp/h # CLI frontend
│ ├── tui.cpp/h # TUI frontend
│ ├── api_server.cpp/h # API server
│ └── cli_server.cpp/h # CLI server
│
├── tools/ # Tool implementations
├── mcp/ # MCP client/server
└── Makefile # Build system
- Adding backends: See docs/backends.md
- Adding tools: See docs/tools.md (if it exists) or tools/tool.h
Contributions welcome! Areas of interest:
- Additional backend integrations
- New tool implementations
- Performance optimizations
- Documentation improvements
# Build with tests enabled
echo "TESTS=ON" >> ~/.shepherd_opts
make
# Run tests
cd build && make test_unit test_tools
./tests/test_unit
./tests/test_tools
See docs/testing.md for the full test plan.
PolyForm Shield License 1.0.0
- ✅ Use for any purpose (personal, commercial, internal)
- ✅ Modify and create derivative works
- ✅ Distribute copies
- ❌ Sell Shepherd as a standalone product
- ❌ Offer Shepherd as a paid service (SaaS)
- ❌ Create competing products
See LICENSE for full text.
- llama.cpp: Georgi Gerganov and contributors
- TensorRT-LLM: NVIDIA Corporation
- Model Context Protocol: Anthropic
- SQLite: D. Richard Hipp