Advanced Multi-Backend LLM System with Intelligent Memory Management
Shepherd is a production-grade C++ LLM inference system supporting both local models (llama.cpp, TensorRT-LLM) and cloud APIs (OpenAI, Anthropic, Gemini, Ollama). It features KV cache eviction for indefinite conversations, retrieval-augmented generation (RAG), and comprehensive tool/function calling.
Local model:
./shepherd -m /path/to/model.gguf
Cloud provider:
# Add a provider
shepherd provider add sonnet anthropic --model claude-sonnet-4 --api-key sk-ant-...
# Use it
./shepherd --provider sonnet
Build from source: See BUILD.md for prerequisites and installation.
$ ./shepherd --provider mylocal
Shepherd v2.1.0
Provider: mylocal (llamacpp)
Model: qwen3-30b-a3b
Context: 40960 tokens
Tools: 18 available
> What files are in the current directory?
* list_directory(path=".")
The current directory contains:
- main.cpp: Application entry point
- README.md: Project documentation
- Makefile: Build configuration
...
> /provider use sonnet
Switched to provider: sonnet
> Explain this code
...
Providers define backends you can switch between at runtime.
# List configured providers
shepherd provider list
# Add providers (action-first for creating new)
shepherd provider add local --type llamacpp --model /models/qwen-72b.gguf
shepherd provider add sonnet --type anthropic --model claude-sonnet-4 --api-key sk-ant-...
shepherd provider add gpt --type openai --model gpt-4o --api-key sk-...
# View/modify providers (name-first pattern)
shepherd provider sonnet show # Show details
shepherd provider sonnet set model claude-sonnet-4-20250514 # Modify setting
shepherd provider sonnet use # Switch to this provider
shepherd provider sonnet # Show help for this provider
# In interactive mode
> /provider local use
> /provider next
Shepherd includes built-in tools across several categories:
| Category | Tools |
|---|---|
| Filesystem | read, write, list_directory, delete_file, file_exists |
| Command | shell (execute commands with timeout) |
| HTTP | http_get, http_post, http_put, http_delete |
| JSON | json_parse, json_validate, json_extract |
| Memory | search_memory, set_fact, get_fact, store_memory |
| MCP | list_mcp_resources, read_mcp_resource, plus dynamic server:tool |
# List tools
shepherd tools list
# Enable/disable specific tools
shepherd tools enable shell
shepherd tools disable shell
# Disable all tools
./shepherd --notools
Configuration is stored at ~/.config/shepherd/config.json:
{
"streaming": true,
"thinking": false,
"tui": true,
"max_db_size": "10G",
"memory_database": "~/.local/share/shepherd/memory.db"
}
# View configuration
shepherd config show
# Set values (key-first shortcut)
shepherd config streaming true # Set streaming to true
shepherd config max_db_size 20G # Set max_db_size
# Or use explicit set
shepherd config set streaming true
# View single value
shepherd config streaming # Shows current streaming value
# In interactive mode
> /config show
> /config streaming true
Configure Model Context Protocol servers for external tool integration:
# List MCP servers
shepherd mcp list
# Add an MCP server (action-first for creating new)
shepherd mcp add mydb python /path/to/mcp_server.py -e DB_HOST=localhost
# View/modify servers (name-first pattern)
shepherd mcp mydb show # Show server details
shepherd mcp mydb test # Test connection
shepherd mcp mydb remove # Remove server
shepherd mcp mydb # Show help for this server
SMCP passes credentials to MCP servers via stdin, never in environment variables or CLI args:
# Add SMCP server with credentials
shepherd smcp add database smcp-postgres --cred DB_URL=postgresql://user:pass@host/db
Credentials are sent via the SMCP protocol handshake, never exposed in /proc, ps, or config files.
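SMCP servers share the same add/remove/list subcommands as MCP servers (see the command reference below), so a quick check that the server registered is:
# Confirm the SMCP server was added
shepherd smcp list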
Load configuration from Azure Key Vault using Managed Identity:
./shepherd --config msi --kv my-vault-name
Store a secret named shepherd-config containing the unified JSON config. The VM's managed identity needs the "Key Vault Secrets User" role.
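For example, the Azure CLI can upload an existing local config as that secret (assuming az is installed and logged in; the vault name and file path are illustrative):
# Store the unified config as the shepherd-config secret
az keyvault secret set --vault-name my-vault-name --name shepherd-config \
  --file ~/.config/shepherd/config.json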
SHEPHERD_INTERACTIVE=1 # Force interactive mode (useful in scripts/pipes)
NO_COLOR=1 # Disable colored output
Shepherd can run as a server for remote access or persistent sessions.
Exposes an OpenAI-compatible REST API for remote access to your local Shepherd instance.
./shepherd --server --port 8000
Use cases:
- Access your home server's GPU from your laptop
- Use OpenAI-compatible tools with local models
- Integration with any OpenAI client library
Endpoints:
- POST /v1/chat/completions - Chat completions (streaming supported)
- GET /v1/models - List available models
- GET /health - Health check
Authentication:
Generate API keys for clients to authenticate against the server (OpenAI-compatible Authorization: Bearer header). See docs/api_server.md for details.
./shepherd --server --auth-mode json
shepherd apikey add mykey # Generates sk-shep-...
For full documentation, see docs/api_server.md.
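As a quick smoke test of the endpoints above (illustrative; substitute a key generated with shepherd apikey add and a model name reported by GET /v1/models):
# OpenAI-compatible chat completion against a local Shepherd server
curl http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-shep-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-30b-a3b", "messages": [{"role": "user", "content": "Hello"}]}'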
Runs a persistent AI session with server-side tool execution and multi-client access.
./shepherd --cliserver --port 8000
Use cases:
- 24/7 AI assistant with full tool access
- Query databases without exposing credentials to clients
- Multiple clients see the same session via SSE streaming
Connect a client:
./shepherd --backend cli --api-base http://server:8000
For full documentation, see docs/cli_server.md.
Shepherd's architecture allows any backend with any frontend, and servers can be chained together.
Key principle: With API backends, each incoming connection creates a new backend connection - no session contention, fully scalable.
Hide your Azure OpenAI credentials while adding tools and your own API keys:
# Shepherd connects to Azure OpenAI (credentials stay on server)
# Clients connect to Shepherd with your API keys
./shepherd --backend openai \
--api-base https://mycompany.openai.azure.com/v1 \
--api-key $AZURE_KEY \
--server --port 8000 --auth-mode json --server-tools
# Generate keys for your clients
shepherd apikey add client1
shepherd apikey add client2
Clients get:
- Access to Azure OpenAI without knowing the Azure credentials
- Server-side tools (filesystem, shell, MCP servers)
- Your access control via Shepherd API keys
Use vLLM's multi-user capabilities with a persistent CLI session:
# vLLM server running on port 5000 (handles multiple users efficiently)
# Shepherd CLI server on top for persistent session + tools
./shepherd --backend openai \
--api-base http://localhost:5000/v1 \
--cliserver --port 8000
Now you have:
- vLLM's PagedAttention for efficient multi-conversation handling
- Shepherd's persistent session (all clients see same conversation)
- Server-side tools executing locally
# Level 1: llamacpp backend
./shepherd --backend llamacpp -m /models/qwen-72b.gguf --server --port 5000
# Level 2: API server proxy (adds tools + API keys)
./shepherd --backend openai --api-base http://localhost:5000/v1 \
--server --port 6000 --auth-mode json --server-tools
# Level 3: CLI server for persistent session
./shepherd --backend openai --api-base http://localhost:6000/v1 \
--cliserver --port 7000
| Backend | Type | Models | Context | Tools |
|---|---|---|---|---|
| llama.cpp | Local | Llama, Qwen, Mistral, Gemma, etc. | 8K-256K | ✓ |
| TensorRT-LLM | Local | Same (NVIDIA optimized) | 2K-256K | ✓ |
| OpenAI | Cloud | GPT-5, GPT-4o, GPT-4 Turbo | 128K-200K | ✓ |
| Anthropic | Cloud | Claude Opus 4.5, Sonnet 4, Haiku | 200K | ✓ |
| Gemini | Cloud | Gemini 3, 2.5 Pro/Flash | 32K-2M | ✓ |
| Azure OpenAI | Cloud | GPT models via deployment | 128K-200K | ✓ |
| Ollama | Local/Cloud | Any Ollama model | 8K-128K | ✓ |
Evicted messages are automatically archived to a SQLite database with FTS5 full-text search:
> Remember that the project deadline is March 15
* set_fact(key="project_deadline", value="March 15")
# Later, or in a new session...
> What's the project deadline?
* get_fact(key="project_deadline")
The project deadline is March 15.
Search archived conversations:
> Search my memory for discussions about authentication
* search_memory(query="authentication")
When multiple providers are configured, Shepherd creates ask_* tools for cross-model consultation:
# Using local model, ask Claude for code review
> ask_sonnet to read main.cpp and suggest improvements
* ask_sonnet(prompt="read main.cpp and suggest improvements")
→ Sonnet calls read(path="main.cpp")
→ Sonnet analyzes and responds
Claude's analysis appears in your local model's context.
Key feature: The ask_* tools have full tool access - the consulted model can read files, run commands, search memory, etc. You can chain consultations: ask Sonnet to ask GPT to analyze something.
The current provider is excluded (you don't ask yourself). Switch providers and the tools update automatically.
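A chained consultation might look like this (illustrative; the exact tool calls depend on which models and providers you have configured):
> ask_sonnet to get a second opinion from GPT on the error handling in session.cpp
* ask_sonnet(prompt="get a second opinion from GPT on the error handling in session.cpp")
  → Sonnet calls read(path="session.cpp")
  → Sonnet calls ask_gpt(prompt="review the error handling in this code...")
  → GPT responds, and Sonnet reports both assessments back into your local model's context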
Shepherd supports automatic eviction for indefinite conversations with any backend:
- Local backends: Evicts when GPU KV cache fills
- API backends: Evicts when API returns context full error, then retries
- Manual limit: Use --context-size N to set a limit smaller than the backend's maximum
# Force eviction at 32K tokens even if backend supports more
./shepherd --provider azure --context-size 32768
Eviction behavior:
- Oldest messages first (LRU), protecting system prompt and current context
- Automatic archival to RAG database before eviction
- Seamless continuation - conversation keeps going
For local backend implementation details, see docs/llamacpp.md.
Shepherd includes a cron-like scheduler that injects prompts into the session automatically. Works with CLI, TUI, and CLI server modes.
# Add a scheduled task (action-first for creating new)
shepherd sched add morning-news "0 9 * * *" "Get me the top 5 tech news headlines"
# List scheduled tasks
shepherd sched list
# View/modify schedules (name-first pattern)
shepherd sched morning-news show # Show schedule details
shepherd sched morning-news disable # Disable schedule
shepherd sched morning-news enable # Enable schedule
shepherd sched morning-news remove # Remove schedule
24/7 Operation: Run a CLI server and schedules execute automatically, even with no clients connected:
./shepherd --cliserver --port 8000
# Scheduled prompts run in the session:
# - "Check server disk usage" every hour
# - "Summarize overnight logs" at 6am
# - "Generate daily report" at 5pmClients connect to see results from scheduled tasks in the conversation history.
| Command | Description |
|---|---|
| shepherd provider <add\|list\|show\|remove\|use> | Manage providers |
| shepherd config <show\|set> | View/modify configuration |
| shepherd tools <list\|enable\|disable> | Manage tools |
| shepherd mcp <add\|remove\|list> | Manage MCP servers |
| shepherd smcp <add\|remove\|list> | Manage SMCP servers |
| shepherd sched <list\|add\|remove\|enable\|disable> | Scheduled tasks |
| shepherd apikey <add\|list\|remove> | API key management |
| shepherd edit-system | Edit system prompt in $EDITOR |
| Flag | Description |
|---|---|
| -p, --provider NAME | Use specific provider |
| -m, --model PATH | Model name or file |
| --backend TYPE | Backend: llamacpp, openai, anthropic, etc. |
| --context-size N | Context window size (0 = model default) |
| --server | Start API server mode |
| --cliserver | Start CLI server mode |
| --port N | Server port (default: 8000) |
| --notools | Disable all tools |
| --nostream | Disable streaming output |
| --tui / --no-tui | Enable/disable TUI mode |
| --config msi --kv VAULT | Load config from Azure Key Vault |
Run shepherd --help for the complete list.
Minimum:
- GPU: NVIDIA GTX 1080 Ti (11GB VRAM) or better
- RAM: 32GB system RAM
- Storage: SATA SSD (500GB)
Recommended:
- GPU: 2x NVIDIA RTX 3090 (48GB VRAM)
- RAM: 128GB system RAM
- Storage: NVMe SSD (1TB+)
Cloud instances:
- AWS: g5.12xlarge (4x A10G)
- GCP: a2-highgpu-4g (4x A100)
- Azure: Standard_NC24ads_A100_v4
| Backend | Prompt Speed | Generation Speed | Latency |
|---|---|---|---|
| TensorRT-LLM | 8000 tok/s | 45 tok/s | ~50ms |
| llama.cpp (CUDA) | 1200 tok/s | 25 tok/s | ~80ms |
| llama.cpp (CPU) | 150 tok/s | 8 tok/s | ~200ms |
| Configuration | VRAM | System RAM | Context |
|---|---|---|---|
| Q4_K_M + 64K ctx | 38GB | 8GB | 65536 |
| Q4_K_M + 128K ctx | 42GB | 12GB | 131072 |
| Q8_0 + 64K ctx | 72GB | 16GB | 65536 |
Reduce context size:
./shepherd --context-size 65536
Or use a more aggressive quantization (Q4_K_M instead of Q8_0).
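For example (hypothetical model path):
# Q4_K_M quant of the same model, with a smaller context window
./shepherd -m /models/qwen-72b-Q4_K_M.gguf --context-size 65536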
Increase GPU layers or switch backends:
./shepherd --gpu-layers 48
If you see repetitive or nonsensical output, the KV cache may be corrupted. Restart Shepherd to clear the cache.
For debug builds, use -d=3 for verbose KV cache logging.
┌─────────────────────────────────────────────────────────────┐
│ Frontend │
│ CLI, TUI, API Server, CLI Server │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────────────v──────────────────────────────────┐
│ Session + Provider │
│ Message routing, provider switching, tool execution │
└──────────────────────────┬──────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────v────┐ ┌──────v─────┐ ┌─────v─────────┐
│ LlamaCpp │ │ TensorRT │ │ API Backends │
│ Backend │ │ Backend │ │ (5 types) │
└────────────┘ └────────────┘ └───────────────┘
For detailed architecture, see docs/architecture.md.
shepherd/
├── main.cpp # Application entry point
├── frontend.cpp/h # Frontend abstraction
├── backend.cpp/h # Backend base class
├── session.cpp/h # Session management
├── provider.cpp/h # Provider management
├── config.cpp/h # Configuration
├── rag.cpp/h # RAG system
├── server.cpp/h # HTTP server base
│
├── backends/
│ ├── llamacpp.cpp/h # llama.cpp backend
│ ├── tensorrt.cpp/h # TensorRT-LLM backend
│ ├── openai.cpp/h # OpenAI API
│ ├── anthropic.cpp/h # Anthropic Claude
│ ├── gemini.cpp/h # Google Gemini
│ ├── ollama.cpp/h # Ollama
│ ├── api.cpp/h # Base for API backends
│ └── factory.cpp/h # Backend factory
│
├── frontends/
│ ├── cli.cpp/h # CLI frontend
│ ├── tui.cpp/h # TUI frontend
│ ├── api_server.cpp/h # API server
│ └── cli_server.cpp/h # CLI server
│
├── tools/ # Tool implementations
├── mcp/ # MCP client/server
└── Makefile # Build system
- Adding backends: See docs/backends.md
- Adding tools: See docs/tools.md (if it exists) or tools/tool.h
Contributions welcome! Areas of interest:
- Additional backend integrations
- New tool implementations
- Performance optimizations
- Documentation improvements
# Build with tests enabled
echo "TESTS=ON" >> ~/.shepherd_opts
make
# Run tests
cd build && make test_unit test_tools
./tests/test_unit
./tests/test_tools
See docs/testing.md for the full test plan.
PolyForm Shield License 1.0.0
- ✅ Use for any purpose (personal, commercial, internal)
- ✅ Modify and create derivative works
- ✅ Distribute copies
- ❌ Sell Shepherd as a standalone product
- ❌ Offer Shepherd as a paid service (SaaS)
- ❌ Create competing products
See LICENSE for full text.
- llama.cpp: Georgi Gerganov and contributors
- TensorRT-LLM: NVIDIA Corporation
- Model Context Protocol: Anthropic
- SQLite: D. Richard Hipp