LLM Playground

Overview

Self-hosted LLM inference platform serving 15 models (270M to 12B parameters) with OpenAI-compatible APIs. Experimental testbed for gesture-based interaction, AI-generated UIs, and multi-model collaboration modes (discussion, council, roundtable).

Models

Models ranked by overall capability based on December 2025 benchmarks (MMLU-Pro, GPQA, AIME, MATH, HumanEval):

| Rank | Model | Size | Key Benchmarks | Best For |
|------|-------|------|----------------|----------|
| 1 | Nanbeige4-3B-Thinking | 3B | AIME 2024: 90.4%, GPQA-Diamond: 82.2% (outperforms Qwen3-32B) | Step-by-step reasoning, complex math, competitive programming |
| 2 | DASD-4B Thinking | 4B | Reasoning with thinking capabilities | Step-by-step reasoning, problem solving |
| 2 | Qwen3-4B-Instruct-2507 | 4B | MMLU-Pro: 69.6%, GPQA: 62.0%, 262K context, 119 languages | Multilingual tasks, long-context analysis, agent workflows |
| 3 | AgentCPM-Explore 4B | 4B | Autonomous task exploration, agentic operations | Autonomous exploration, task planning |
| 3 | SmolLM3 3B | 3B | AIME 2025: 36.7%, BFCL: 92.3%, 64K context, hybrid reasoning | Tool-calling, reasoning with /think mode, multilingual (6 langs) |
| 4 | LFM2.5 1.2B | 1.2B | 8 languages, 32K context, hybrid LFM2 architecture, RL tuning | Edge deployment, instruction following, multilingual |
| 5 | DeepSeek R1 1.5B | 1.5B | AIME 2024: 28.9%, MATH-500: 83.9%, Codeforces: 954 rating | Math reasoning, algorithmic problems, code generation |
| 6 | Gemma 3 12B | 12B | Safety-aligned IT checkpoint, stronger instruction following, ~8K context | Fact-checking, educational content, safe generation |
| 7 | Mistral 7B v0.3 | 7B | MMLU: 63%, 32K context, native function calling | JSON generation, tool use, structured output |
| 8 | Phi-4 Mini | 3.8B | GSM8K: 88.6%, 128K context, 22 languages, function calling | Math reasoning, multilingual, tool use |
| 9 | RNJ-1 Instruct | 8B | SWE-Bench Verified: 20.8%, strong tool-use (BFCL ranked) | Code automation, agentic workflows, tool calling |
| 10 | Llama 3.2 3B | 3B | MMLU: 63.4%, 128K context, multilingual (8 languages) | Casual conversation, summarization, creative writing |
| 11 | FunctionGemma 270M | 270M | Edge-optimized (50 t/s on Pixel 8), 240MB RAM (Q4), 32K context | Edge device agents, mobile actions, offline function calling |
| 12 | GPT-OSS 20B | 20B | MoE (~3.6B active), function calling, agentic operations | Experimental MoE, agent operations (slow on CPU) |

Sources

| Model | Source |
|-------|--------|
| Nanbeige4-3B-Thinking | arXiv, Hugging Face, MarkTechPost |
| DASD-4B Thinking | Hugging Face |
| Qwen3-4B-Instruct-2507 | Hugging Face Model Card |
| AgentCPM-Explore 4B | Hugging Face |
| SmolLM3 3B | Hugging Face, Blog |
| LFM2.5 1.2B | Liquid AI Docs, Hugging Face, Playground |
| DeepSeek R1 1.5B | OpenRouter, DataCamp |
| Gemma 3 12B | Google Blog, Unsloth |
| Mistral 7B v0.3 | Mistral AI, Hugging Face |
| Phi-4 Mini | Hugging Face, Microsoft |
| RNJ-1 Instruct | Hugging Face, Ollama |
| Llama 3.2 3B | NVIDIA, Meta |
| FunctionGemma 270M | Google Blog, Unsloth |

API

Endpoints

Each model exposes OpenAI-compatible endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /health | GET | Health check and model status |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completion (streaming supported) |

Example Request

curl -X POST <YOUR_MODEL_API_URL>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Explain quantum computing"}],
    "max_tokens": 512,
    "temperature": 0.7,
    "stream": true
  }'
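
Because the endpoints are OpenAI-compatible, the official openai Python client can also be pointed at a model server. A minimal sketch, assuming a local server on port 8100 (the qwen development port below); the model id and api_key values are placeholders, not confirmed by this repository:

from openai import OpenAI

# Point the client at a local model server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8100/v1", api_key="not-needed")

# Stream a chat completion, mirroring the curl example above.
stream = client.chat.completions.create(
    model="qwen",  # placeholder id; query GET /v1/models for the real one
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=512,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)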

Performance Debugging

  • Check runtime settings via GET /health/details (includes n_ctx, n_threads, n_batch, max_concurrent)
  • Add "include_perf": true to /v1/chat/completions to return queue/compute timing (and TTFT for streaming)
  • Compare models with python3 scripts/bench_models.py --models qwen phi llama --stream --include-perf
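
As a sketch of the include_perf flag from Python, assuming a model server on localhost:8100 and using the requests library (the exact shape of the returned timing fields is not documented here, so the example prints the raw payload to inspect it):

import requests

# Non-streaming request with perf instrumentation enabled.
resp = requests.post(
    "http://localhost:8100/v1/chat/completions",  # substitute your server URL
    json={
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 32,
        "include_perf": True,  # ask the server for queue/compute timing
    },
    timeout=120,
)
resp.raise_for_status()
body = resp.json()
print(body["choices"][0]["message"]["content"])
# Print the rest of the payload to see where the perf data is attached.
print({k: v for k, v in body.items() if k != "choices"})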

Project Structure

llm-playground/
├── .github/workflows/          # CI/CD workflows
├── app/
│   ├── shared/                 # Shared inference server (base code for all models)
│   ├── lfm2-inference/         # LFM2.5 model config (native llama-server)
│   ├── rnj-inference/          # RNJ model config (native llama-server)
│   └── chat/                   # Web interface + API proxy
├── config/
│   └── models.py               # Model ports, metadata, and inference settings
├── scripts/                    # Automation scripts
└── README.md

Configuration

Centralized Config: All model and inference settings are managed in config/models.py (a sketch of an entry follows the list below):

  • n_ctx: Context window size (default: 4096)
  • n_threads: CPU threads (default: 4)
  • n_batch: Batch size (default: 256)
  • max_concurrent: Parallel requests per instance (default: 2)
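
A rough sketch of what a per-model entry in config/models.py might look like; the dict layout and the port field are assumptions for illustration, and only the four settings above come from the repository's documentation:

# Hypothetical sketch of config/models.py, not the repository's actual contents
MODELS = {
    "qwen": {
        "port": 8100,         # local development port (see Port Scheme below)
        "n_ctx": 4096,        # context window size
        "n_threads": 4,       # CPU threads
        "n_batch": 256,       # batch size
        "max_concurrent": 2,  # parallel requests per instance
    },
    "gemma": {
        "port": 8200,
        "n_ctx": 4096,
        "n_threads": 4,
        "n_batch": 256,
        "max_concurrent": 2,
    },
}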

Local Development

Run models and the web interface locally.

Prerequisites

pip install -r app/qwen-inference/requirements.txt

Port Scheme (Local Development)

For local development with multiple models on the same machine:

| Range | Category | Models |
|-------|----------|--------|
| 8080 | Core | Chat Interface |
| 81XX | Small (<7B) | qwen (8100), phi (8101), functiongemma (8103), smollm3 (8104), lfm2 (8105), dasd (8106), agentcpm (8107) |
| 82XX | Medium (7B-30B) | gemma (8200), llama (8201), mistral (8202), rnj (8203) |
| 83XX | Reasoning | r1qwen (8300), nanbeige (8301), gptoss (8303) |

Production deployment uses port 8000 for all inference models (each runs in a separate container).

See config/models.py for the authoritative configuration.

Start a Model Server

cd app/qwen-inference
python inference_server.py  # Runs on port 8100
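
Once it is up, the /health endpoint from the API table above can confirm the model's status; a quick check in Python (port 8100 matches the qwen example above):

import requests

# Query the health endpoint of the locally started server.
print(requests.get("http://localhost:8100/health", timeout=10).json())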

Start the Web Interface

cd app/chat
export QWEN_API_URL=http://localhost:8100
python chat_server.py  # Runs on port 8080

Run with Docker Compose

# Chat interface only
docker-compose up

# Chat + specific models
docker-compose --profile qwen --profile phi up

# All services
docker-compose --profile all up

License

MIT License - see LICENSE for details
