Shepherd

Advanced Multi-Backend LLM System with Intelligent Memory Management

Shepherd is a production-grade C++ LLM inference system supporting both local models (llama.cpp, TensorRT-LLM) and cloud APIs (OpenAI, Anthropic, Gemini, Ollama). It features KV cache eviction for indefinite conversations, retrieval-augmented generation (RAG), and comprehensive tool/function calling.


Quick Start

Local model:

./shepherd -m /path/to/model.gguf

Cloud provider:

# Add a provider
shepherd provider add sonnet anthropic --model claude-sonnet-4 --api-key sk-ant-...

# Use it
./shepherd --provider sonnet

Build from source: See BUILD.md for prerequisites and installation.


Usage

Interactive Mode

$ ./shepherd --provider mylocal

Shepherd v2.1.0
Provider: mylocal (llamacpp)
Model: qwen3-30b-a3b
Context: 40960 tokens
Tools: 18 available

> What files are in the current directory?
* list_directory(path=".")

The current directory contains:
- main.cpp: Application entry point
- README.md: Project documentation
- Makefile: Build configuration
...

> /provider use sonnet
Switched to provider: sonnet

> Explain this code
...

Providers

Providers define backends you can switch between at runtime.

# List configured providers
shepherd provider list

# Add providers (action-first for creating new)
shepherd provider add local --type llamacpp --model /models/qwen-72b.gguf
shepherd provider add sonnet --type anthropic --model claude-sonnet-4 --api-key sk-ant-...
shepherd provider add gpt --type openai --model gpt-4o --api-key sk-...

# View/modify providers (name-first pattern)
shepherd provider sonnet show        # Show details
shepherd provider sonnet set model claude-sonnet-4-20250514  # Modify setting
shepherd provider sonnet use         # Switch to this provider
shepherd provider sonnet             # Show help for this provider

# In interactive mode
> /provider local use
> /provider next

Tools

Shepherd includes built-in tools across several categories:

  • Filesystem: read, write, list_directory, delete_file, file_exists
  • Command: shell (execute commands with timeout)
  • HTTP: http_get, http_post, http_put, http_delete
  • JSON: json_parse, json_validate, json_extract
  • Memory: search_memory, set_fact, get_fact, store_memory
  • MCP: list_mcp_resources, read_mcp_resource, plus dynamic server:tool

# List tools
shepherd tools list

# Enable/disable specific tools
shepherd tools enable shell
shepherd tools disable shell

# Disable all tools
./shepherd --notools

Configuration

Config File

Configuration is stored at ~/.config/shepherd/config.json:

{
    "streaming": true,
    "thinking": false,
    "tui": true,
    "max_db_size": "10G",
    "memory_database": "~/.local/share/shepherd/memory.db"
}

# View configuration
shepherd config show

# Set values (key-first shortcut)
shepherd config streaming true       # Set streaming to true
shepherd config max_db_size 20G      # Set max_db_size

# Or use explicit set
shepherd config set streaming true

# View single value
shepherd config streaming            # Shows current streaming value

# In interactive mode
> /config show
> /config streaming true

MCP Servers

Configure Model Context Protocol servers for external tool integration:

# List MCP servers
shepherd mcp list

# Add an MCP server (action-first for creating new)
shepherd mcp add mydb python /path/to/mcp_server.py -e DB_HOST=localhost

# View/modify servers (name-first pattern)
shepherd mcp mydb show               # Show server details
shepherd mcp mydb test               # Test connection
shepherd mcp mydb remove             # Remove server
shepherd mcp mydb                    # Show help for this server

SMCP Servers (Secure Credentials)

SMCP passes credentials to MCP servers via stdin, never in environment variables or CLI args:

# Add SMCP server with credentials
shepherd smcp add database smcp-postgres --cred DB_URL=postgresql://user:pass@host/db

Credentials are sent via the SMCP protocol handshake, never exposed in /proc, ps, or config files.

Azure Key Vault

Load configuration from Azure Key Vault using Managed Identity:

./shepherd --config msi --kv my-vault-name

Store a secret named shepherd-config containing the unified JSON config. The VM's managed identity needs "Key Vault Secrets User" role.
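
As a rough sketch using the Azure CLI (vault name, config file path, and identity ID are placeholders), the secret and role assignment might look like:

# Store the unified JSON config as the shepherd-config secret
az keyvault secret set --vault-name my-vault-name --name shepherd-config --file config.json

# Grant the VM's managed identity read access to the vault's secrets
az role assignment create --assignee <identity-principal-id> \
           --role "Key Vault Secrets User" \
           --scope $(az keyvault show --name my-vault-name --query id -o tsv)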

Environment Variables

SHEPHERD_INTERACTIVE=1    # Force interactive mode (useful in scripts/pipes)
NO_COLOR=1                # Disable colored output
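
For example, a script might force interactive mode while piping in a prompt and disabling color (a sketch only; exact stdin handling depends on your shell and setup):

echo "Summarize the latest build log" | SHEPHERD_INTERACTIVE=1 NO_COLOR=1 ./shepherd --provider sonnet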

Server Modes

Shepherd can run as a server for remote access or persistent sessions.

🌐 API Server (OpenAI-Compatible)

Exposes an OpenAI-compatible REST API for remote access to your local Shepherd instance.

./shepherd --server --port 8000

Use cases:

  • Access your home server's GPU from your laptop
  • Use OpenAI-compatible tools with local models
  • Integration with any OpenAI client library

Endpoints:

  • POST /v1/chat/completions - Chat completions (streaming supported)
  • GET /v1/models - List available models
  • GET /health - Health check

Authentication:

Generate API keys for clients to authenticate against the server using the OpenAI-compatible Authorization: Bearer header.

./shepherd --server --auth-mode json
shepherd apikey add mykey    # Generates sk-shep-...
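
As a minimal sketch, a client could then call the chat endpoint with one of the generated keys (the model name and key below are placeholders):

curl http://localhost:8000/v1/chat/completions \
     -H "Authorization: Bearer sk-shep-..." \
     -H "Content-Type: application/json" \
     -d '{"model": "qwen3-30b-a3b", "messages": [{"role": "user", "content": "Hello"}]}'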

For full documentation, see docs/api_server.md.

🖥️ CLI Server (Persistent Session)

Runs a persistent AI session with server-side tool execution and multi-client access.

./shepherd --cliserver --port 8000

Use cases:

  • 24/7 AI assistant with full tool access
  • Query databases without exposing credentials to clients
  • Multiple clients see the same session via SSE streaming

Connect a client:

./shepherd --backend cli --api-base http://server:8000

For full documentation, see docs/cli_server.md.

🔗 Server Composability

Shepherd's architecture allows any backend with any frontend, and servers can be chained together.

Key principle: With API backends, each incoming connection creates a new backend connection - no session contention, fully scalable.

Example: API Proxy with Credential Isolation

Hide your Azure OpenAI credentials while adding tools and your own API keys:

# Shepherd connects to Azure OpenAI (credentials stay on server)
# Clients connect to Shepherd with your API keys
./shepherd --backend openai \
           --api-base https://mycompany.openai.azure.com/v1 \
           --api-key $AZURE_KEY \
           --server --port 8000 --auth-mode json --server-tools

# Generate keys for your clients
shepherd apikey add client1
shepherd apikey add client2

Clients get:

  • Access to Azure OpenAI without knowing the Azure credentials
  • Server-side tools (filesystem, shell, MCP servers)
  • Your access control via Shepherd API keys

Example: Persistent Session on vLLM

Use vLLM's multi-user capabilities with a persistent CLI session:

# vLLM server running on port 5000 (handles multiple users efficiently)
# Shepherd CLI server on top for persistent session + tools
./shepherd --backend openai \
           --api-base http://localhost:5000/v1 \
           --cliserver --port 8000

Now you have:

  • vLLM's PagedAttention for efficient multi-conversation handling
  • Shepherd's persistent session (all clients see same conversation)
  • Server-side tools executing locally

Example: Multi-Level Chaining

# Level 1: llamacpp backend
./shepherd --backend llamacpp -m /models/qwen-72b.gguf --server --port 5000

# Level 2: API server proxy (adds tools + API keys)
./shepherd --backend openai --api-base http://localhost:5000/v1 \
           --server --port 6000 --auth-mode json --server-tools

# Level 3: CLI server for persistent session
./shepherd --backend openai --api-base http://localhost:6000/v1 \
           --cliserver --port 7000
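
A client can then attach to the top of the chain using the CLI backend shown earlier (port 7000 follows the Level 3 example above):

./shepherd --backend cli --api-base http://localhost:7000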

Features

🔄 Multi-Backend Architecture

  • llama.cpp (Local): Llama, Qwen, Mistral, Gemma, etc.; context 8K-256K
  • TensorRT-LLM (Local): same models, NVIDIA optimized; context 2K-256K
  • OpenAI (Cloud): GPT-5, GPT-4o, GPT-4 Turbo; context 128K-200K
  • Anthropic (Cloud): Claude Opus 4.5, Sonnet 4, Haiku; context 200K
  • Gemini (Cloud): Gemini 3, 2.5 Pro/Flash; context 32K-2M
  • Azure OpenAI (Cloud): GPT models via deployment; context 128K-200K
  • Ollama (Local/Cloud): any Ollama model; context 8K-128K
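
Backends can also be selected directly with the documented --backend, --model, and --api-base flags. An illustrative sketch (model names are placeholders, and pointing the Ollama backend at a host via --api-base is an assumption):

# Local GGUF model via llama.cpp
./shepherd --backend llamacpp -m /models/qwen-72b.gguf

# Ollama instance on the default port
./shepherd --backend ollama --model llama3 --api-base http://localhost:11434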

📚 RAG System

Evicted messages are automatically archived to a SQLite database with FTS5 full-text search:

> Remember that the project deadline is March 15
* set_fact(key="project_deadline", value="March 15")

# Later, or in a new session...
> What's the project deadline?
* get_fact(key="project_deadline")

The project deadline is March 15.

Search archived conversations:

> Search my memory for discussions about authentication
* search_memory(query="authentication")

🤝 Multi-Model Collaboration

When multiple providers are configured, Shepherd creates ask_* tools for cross-model consultation:

# Using local model, ask Claude for code review
> ask_sonnet to read main.cpp and suggest improvements

* ask_sonnet(prompt="read main.cpp and suggest improvements")
  → Sonnet calls read(path="main.cpp")
  → Sonnet analyzes and responds

Claude's analysis appears in your local model's context.

Key feature: The ask_* tools have full tool access - the consulted model can read files, run commands, search memory, etc. You can chain consultations: ask Sonnet to ask GPT to analyze something.

The current provider is excluded (you don't ask yourself). Switch providers and the tools update automatically.

🧠 Automatic Session Eviction

Shepherd supports automatic eviction for indefinite conversations with any backend:

  • Local backends: Evicts when GPU KV cache fills
  • API backends: Evicts when the API returns a context-full error, then retries
  • Manual limit: Use --context-size N to set a limit smaller than the backend's maximum

# Force eviction at 32K tokens even if backend supports more
./shepherd --provider azure --context-size 32768

Eviction behavior:

  • Oldest messages first (LRU), protecting system prompt and current context
  • Automatic archival to RAG database before eviction
  • Seamless continuation - conversation keeps going

For local backend implementation details, see docs/llamacpp.md.

⏰ Scheduling

Shepherd includes a cron-like scheduler that injects prompts into the session automatically. Works with CLI, TUI, and CLI server modes.

# Add a scheduled task (action-first for creating new)
shepherd sched add morning-news "0 9 * * *" "Get me the top 5 tech news headlines"

# List scheduled tasks
shepherd sched list

# View/modify schedules (name-first pattern)
shepherd sched morning-news show     # Show schedule details
shepherd sched morning-news disable  # Disable schedule
shepherd sched morning-news enable   # Enable schedule
shepherd sched morning-news remove   # Remove schedule

24/7 Operation: Run a CLI server and schedules execute automatically, even with no clients connected:

./shepherd --cliserver --port 8000

# Scheduled prompts run in the session:
# - "Check server disk usage" every hour
# - "Summarize overnight logs" at 6am
# - "Generate daily report" at 5pm

Clients connect to see results from scheduled tasks in the conversation history.
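
For instance, the schedules described above could be registered with the sched add syntax shown earlier (task names and cron expressions are illustrative):

shepherd sched add disk-check "0 * * * *" "Check server disk usage"
shepherd sched add log-summary "0 6 * * *" "Summarize overnight logs"
shepherd sched add daily-report "0 17 * * *" "Generate daily report"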


Command Reference

Subcommands

  • shepherd provider <add|list|show|remove|use>: Manage providers
  • shepherd config <show|set>: View/modify configuration
  • shepherd tools <list|enable|disable>: Manage tools
  • shepherd mcp <add|remove|list>: Manage MCP servers
  • shepherd smcp <add|remove|list>: Manage SMCP servers
  • shepherd sched <list|add|remove|enable|disable>: Scheduled tasks
  • shepherd apikey <add|list|remove>: API key management
  • shepherd edit-system: Edit system prompt in $EDITOR

Common Flags

  • -p, --provider NAME: Use specific provider
  • -m, --model PATH: Model name or file
  • --backend TYPE: Backend (llamacpp, openai, anthropic, etc.)
  • --context-size N: Context window size (0 = model default)
  • --server: Start API server mode
  • --cliserver: Start CLI server mode
  • --port N: Server port (default: 8000)
  • --notools: Disable all tools
  • --nostream: Disable streaming output
  • --tui / --no-tui: Enable/disable TUI mode
  • --config msi --kv VAULT: Load config from Azure Key Vault

Run shepherd --help for the complete list.


Hardware Requirements

Minimum

  • GPU: NVIDIA GTX 1080 Ti (11GB VRAM) or better
  • RAM: 32GB system RAM
  • Storage: SATA SSD (500GB)

Recommended

  • GPU: 2x NVIDIA RTX 3090 (48GB VRAM)
  • RAM: 128GB system RAM
  • Storage: NVMe SSD (1TB+)

Cloud

  • AWS: g5.12xlarge (4x A10G)
  • GCP: a2-highgpu-4g (4x A100)
  • Azure: Standard_NC24ads_A100_v4

Performance

Throughput (70B model, batch_size=1)

  • TensorRT-LLM: 8000 tok/s prompt, 45 tok/s generation, ~50ms latency
  • llama.cpp (CUDA): 1200 tok/s prompt, 25 tok/s generation, ~80ms latency
  • llama.cpp (CPU): 150 tok/s prompt, 8 tok/s generation, ~200ms latency

Memory Usage (70B model)

  • Q4_K_M + 64K context: 38GB VRAM, 8GB system RAM (65536 tokens)
  • Q4_K_M + 128K context: 42GB VRAM, 12GB system RAM (131072 tokens)
  • Q8_0 + 64K context: 72GB VRAM, 16GB system RAM (65536 tokens)

Troubleshooting

Out of Memory During Inference

Reduce context size:

./shepherd --context-size 65536

Or use a more aggressive quantization (Q4_K_M instead of Q8_0).
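
A sketch of that suggestion (the model file name is a placeholder for whatever Q4_K_M quantization you have on disk):

./shepherd -m /models/qwen-72b-Q4_K_M.gguf --context-size 65536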

Slow Generation Speed

Increase GPU layers or switch backends:

./shepherd --gpu-layers 48

KV Cache Issues

If you see repetitive or nonsensical output, the KV cache may be corrupted. Restart Shepherd to clear the cache.

For debug builds, use -d=3 for verbose KV cache logging.


Development

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                        Frontend                              │
│  CLI, TUI, API Server, CLI Server                           │
└──────────────────────────┬──────────────────────────────────┘
                           │
┌──────────────────────────v──────────────────────────────────┐
│                     Session + Provider                       │
│  Message routing, provider switching, tool execution        │
└──────────────────────────┬──────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
┌───────v────┐      ┌──────v─────┐     ┌─────v─────────┐
│  LlamaCpp  │      │  TensorRT  │     │ API Backends  │
│  Backend   │      │  Backend   │     │ (5 types)     │
└────────────┘      └────────────┘     └───────────────┘

For detailed architecture, see docs/architecture.md.

Project Structure

shepherd/
├── main.cpp              # Application entry point
├── frontend.cpp/h        # Frontend abstraction
├── backend.cpp/h         # Backend base class
├── session.cpp/h         # Session management
├── provider.cpp/h        # Provider management
├── config.cpp/h          # Configuration
├── rag.cpp/h             # RAG system
├── server.cpp/h          # HTTP server base
│
├── backends/
│   ├── llamacpp.cpp/h    # llama.cpp backend
│   ├── tensorrt.cpp/h    # TensorRT-LLM backend
│   ├── openai.cpp/h      # OpenAI API
│   ├── anthropic.cpp/h   # Anthropic Claude
│   ├── gemini.cpp/h      # Google Gemini
│   ├── ollama.cpp/h      # Ollama
│   ├── api.cpp/h         # Base for API backends
│   └── factory.cpp/h     # Backend factory
│
├── frontends/
│   ├── cli.cpp/h         # CLI frontend
│   ├── tui.cpp/h         # TUI frontend
│   ├── api_server.cpp/h  # API server
│   └── cli_server.cpp/h  # CLI server
│
├── tools/                # Tool implementations
├── mcp/                  # MCP client/server
└── Makefile              # Build system

Extending Shepherd


Contributing

Contributions welcome! Areas of interest:

  • Additional backend integrations
  • New tool implementations
  • Performance optimizations
  • Documentation improvements

Testing

# Build with tests enabled
echo "TESTS=ON" >> ~/.shepherd_opts
make

# Run tests
cd build && make test_unit test_tools
./tests/test_unit
./tests/test_tools

See docs/testing.md for the full test plan.


License

PolyForm Shield License 1.0.0

  • ✅ Use for any purpose (personal, commercial, internal)
  • ✅ Modify and create derivative works
  • ✅ Distribute copies
  • ❌ Sell Shepherd as a standalone product
  • ❌ Offer Shepherd as a paid service (SaaS)
  • ❌ Create competing products

See LICENSE for full text.


Acknowledgments

  • llama.cpp: Georgi Gerganov and contributors
  • TensorRT-LLM: NVIDIA Corporation
  • Model Context Protocol: Anthropic
  • SQLite: D. Richard Hipp

Contact