
PicoLM

Run a 1-billion parameter LLM on a $10 board with 256MB RAM.
Pure C. Zero dependencies. One binary. No Python. No cloud.

echo "Explain gravity" | ./picolm model.gguf -n 100 -j 4


The Perfect Match: PicoLM + PicoClaw


PicoLM was built as the local brain for PicoClaw, an ultra-lightweight AI assistant written in Go that runs on $10 hardware. Together they form a fully offline AI agent: no cloud, no API keys, no internet, no monthly bills.

Cloud LLM providers need the internet. PicoLM doesn't.

The Hardware: the $9.90 LicheeRV Nano is the entire server.
The Architecture: PicoLM powers the LLM box in PicoClaw's agent loop.

Why they're a perfect fit

            Cloud Provider (OpenAI, etc.)    PicoLM (Local)
Cost        Pay per token, forever           Free forever
Privacy     Your data sent to servers        Everything stays on-device
Internet    Required for every request       Not needed at all
Latency     Network round-trip + inference   Inference only
Hardware    Needs a $599 Mac Mini            Runs on a $10 board
Binary      N/A                              ~80KB single file
RAM         N/A                              45 MB total

How it works

PicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or the CLI; PicoClaw formats them into a chat template, pipes the prompt to picolm via stdin, and reads the response from stdout. When tools are needed, --json grammar mode guarantees valid JSON even from a 1B model.

Telegram / Discord / CLI
        │
        ▼
   ┌──────────┐    stdin: prompt     ┌───────────┐
   │ PicoClaw │ ──────────────────►  │  picolm   │
   │   (Go)   │ ◄──────────────────  │   (C)     │
   └──────────┘    stdout: response  │ + model   │
        │                            └───────────┘
        ▼                            45 MB RAM
   User gets reply                   No internet
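Reproducing this wiring takes nothing more than standard POSIX plumbing. Below is a minimal C sketch of the same protocol (write the prompt to picolm's stdin, read the completion from stdout); the model path, prompt, and flags are placeholders, not PicoClaw's actual code:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Spawn picolm, write the prompt to its stdin, stream its stdout back.
   Paths, prompt, and flags are illustrative placeholders. */
int main(void) {
    const char *prompt = "<|user|>\nWhat is photosynthesis?</s>\n<|assistant|>\n";
    int in[2], out[2];                      /* in: parent -> picolm, out: picolm -> parent */
    if (pipe(in) != 0 || pipe(out) != 0) return 1;

    pid_t pid = fork();
    if (pid == 0) {                         /* child: become picolm */
        dup2(in[0], STDIN_FILENO);
        dup2(out[1], STDOUT_FILENO);
        close(in[1]); close(out[0]);
        execlp("./picolm", "picolm", "model.gguf", "-n", "100", "-j", "4", (char *)NULL);
        _exit(127);                         /* exec failed */
    }

    close(in[0]); close(out[1]);
    write(in[1], prompt, strlen(prompt));   /* stdin: prompt */
    close(in[1]);                           /* EOF tells picolm the prompt is complete */

    char buf[4096];
    ssize_t n;
    while ((n = read(out[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);  /* stdout: response */
    close(out[0]);
    waitpid(pid, NULL, 0);
    return 0;
}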

Quick setup

# 1. Build PicoLM
cd picolm && make native    # or: make pi (Raspberry Pi)

# 2. Download model (one-time, 638 MB)
make model

# 3. Build PicoClaw
cd ../picoclaw && make deps && make build

# 4. Configure (~/.picoclaw/config.json)
{
  "agents": {
    "defaults": {
      "provider": "picolm",
      "model": "picolm-local"
    }
  },
  "providers": {
    "picolm": {
      "binary": "~/.picolm/bin/picolm",
      "model": "~/.picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
      "max_tokens": 256,
      "threads": 4,
      "template": "chatml"
    }
  }
}
# 5. Chat - fully offline!
picoclaw agent -m "What is photosynthesis?"

Or install everything in one line

curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash

Performance on real hardware

Device           Price   Generation Speed    RAM Used
Pi 5 (4-core)    $60     ~10 tok/s           45 MB
Pi 4 (4-core)    $35     ~8 tok/s            45 MB
Pi 3B+           $25     ~4 tok/s            45 MB
Pi Zero 2W       $15     ~2 tok/s            45 MB
LicheeRV Nano    $10     ~1 tok/s            45 MB

JSON tool calling

PicoClaw automatically activates --json grammar mode when it needs structured output. This guarantees syntactically valid JSON even from a 1B parameter model, which is essential for reliable tool calling on tiny hardware:

picoclaw agent -m "Search for weather in Tokyo"
# → PicoLM generates: {"tool_calls": [{"function": {"name": "web_search", "arguments": "{\"query\": \"weather Tokyo\"}"}}]}

For the full PicoClaw documentation, see the PicoClaw README.


What is PicoLM?

PicoLM is a minimal, from-scratch LLM inference engine written in ~2,500 lines of C11. It runs TinyLlama 1.1B (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:

  • Raspberry Pi Zero 2W ($15, 512MB RAM, ARM Cortex-A53)
  • Sipeed LicheeRV ($12, 512MB RAM, RISC-V)
  • Raspberry Pi 3/4/5 (1-8GB RAM, ARM NEON SIMD)
  • Any Linux/Windows/macOS x86-64 machine

The model file (638MB) stays on disk. PicoLM memory-maps it and streams one layer at a time through RAM. Total runtime memory: ~45MB including the FP16 KV cache.

                    ┌──────────────────────────────────────────┐
   What goes        │         45 MB Runtime RAM                │
   in RAM           │  ┌─────────┐ ┌──────────┐ ┌───────────┐  │
                    │  │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │
                    │  │  1.2 MB │ │ Cache    │ │   4.5 MB  │  │
                    │  │         │ │  ~40 MB  │ │           │  │
                    │  └─────────┘ └──────────┘ └───────────┘  │
                    └──────────────────────────────────────────┘

                    ┌──────────────────────────────────────────┐
   What stays       │        638 MB Model on Disk              │
   on disk          │       (mmap - OS pages in layers         │
   (via mmap)       │        as needed, ~1 at a time)          │
                    └──────────────────────────────────────────┘

Features

Feature                    Description
GGUF Native                Reads GGUF v2/v3 files directly; no conversion needed
K-Quant Support            Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32
mmap Layer Streaming       Model weights stay on disk; OS pages in one layer at a time
FP16 KV Cache              Halves KV cache memory (44 MB vs 88 MB for 2048 context)
Flash Attention            Online softmax; no O(seq_len) attention buffer needed
Pre-computed RoPE          cos/sin lookup tables eliminate transcendentals from the hot loop
SIMD Acceleration          ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD), auto-detected
Fused Dot Products         Dequantize + dot product in one pass; no intermediate buffer
Multi-threaded matmul      Parallel matrix-vector multiply across CPU cores
Grammar-Constrained JSON   --json flag forces valid JSON output (for tool calling)
KV Cache Persistence       --cache saves/loads prompt state; skip prefill on re-runs
BPE Tokenizer              Score-based byte-pair encoding, loaded from GGUF metadata
Top-p Sampling             Temperature + nucleus sampling with configurable seed
Pipe-friendly              Reads prompts from stdin: echo "Hello" | ./picolm model.gguf
Zero Dependencies          Only libc, libm, libpthread. No external libraries.
Cross-platform             Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V.

Quick Start

One-liner install (Raspberry Pi / Linux)

curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash

This will:

  1. Detect your platform (ARM64, ARMv7, x86-64)
  2. Install build dependencies (gcc, make, curl)
  3. Build PicoLM with optimal SIMD flags for your CPU
  4. Download TinyLlama 1.1B Q4_K_M (638 MB)
  5. Run a quick test
  6. Generate PicoClaw config
  7. Add picolm to your PATH

Build from source

git clone https://github.com/rightnow-ai/picolm.git
cd picolm/picolm

# Auto-detect CPU (enables SSE2/AVX on x86, NEON on ARM)
make native

# Download a model
make model

# Run it
./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    -p "The meaning of life is" -n 100

Build on Windows (MSVC)

cd picolm
build.bat
picolm.exe model.gguf -p "Hello world" -n 50

Platform-specific builds

make native      # x86/ARM auto-detect (recommended for local machine)
make pi          # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)
make pi-arm32    # Pi Zero / Pi 1 (32-bit ARM)
make cross-pi    # Cross-compile for Pi from x86 (static binary)
make riscv       # RISC-V (Sipeed LicheeRV, etc.)
make static      # Static binary for single-file deployment
make debug       # Debug build with symbols, no optimization

Usage

PicoLM - ultra-lightweight LLM inference engine

Usage: picolm <model.gguf> [options]

Generation options:
  -p <prompt>    Input prompt (or pipe via stdin)
  -n <int>       Max tokens to generate (default: 256)
  -t <float>     Temperature (default: 0.8, 0=greedy)
  -k <float>     Top-p / nucleus sampling (default: 0.9)
  -s <int>       RNG seed (default: 42)
  -c <int>       Context length override
  -j <int>       Number of threads (default: 4)

Advanced options:
  --json         Grammar-constrained JSON output mode
  --cache <file> KV cache file (saves/loads prompt state)

Examples

Basic generation:

./picolm model.gguf -p "Once upon a time" -n 200

Greedy decoding (deterministic, temperature=0):

./picolm model.gguf -p "The capital of France is" -n 20 -t 0
# Output: Paris. It is the largest city in France and...

Chat with TinyLlama (ChatML format):

./picolm model.gguf -n 200 -t 0.7 -p "<|user|>
What is photosynthesis?</s>
<|assistant|>
"

Force JSON output (for tool calling / structured data):

./picolm model.gguf --json -t 0.3 -n 100 -p "<|user|>
Return the current time as JSON.</s>
<|assistant|>
"
# Output: {"time": "12:00 PM"}

Pipe from stdin:

echo "Explain quantum computing in one sentence" | ./picolm model.gguf -n 50

KV cache β€” skip repeated prefill:

# First run: processes prompt + saves cache
./picolm model.gguf --cache prompt.kvc -p "Long system prompt here..." -n 50

# Second run: loads cache, skips prompt prefill (74% faster)
./picolm model.gguf --cache prompt.kvc -p "Long system prompt here..." -n 50
# Output: "Skipping 25 cached prompt tokens"

Multi-threaded on a Pi 4 (4 cores):

./picolm model.gguf -p "Hello" -n 100 -j 4

Performance

Measured on TinyLlama 1.1B Q4_K_M (638 MB model):

Metric         x86-64 (8 threads)    Pi 4 (4 cores, NEON)    Pi Zero 2W
Prefill        ~11 tok/s             ~6 tok/s                ~1.5 tok/s
Generation     ~13 tok/s             ~8 tok/s*               ~2 tok/s*
Runtime RAM    45 MB                 45 MB                   45 MB
First token    ~2.3s                 ~4s                     ~16s
Binary size    ~80 KB                ~70 KB                  ~65 KB

*Estimated with NEON SIMD enabled. Actual numbers depend on SD card speed and thermal throttling.

What makes it fast

 Raw C inference          ████████████░░░░░░░░  13.5 tok/s  (baseline: 1.6)
 + Fused dot products     ████████████████░░░░  (eliminate dequant buffer)
 + Multi-threaded matmul  █████████████████░░░  (4-8 cores in parallel)
 + FP16 KV cache          █████████████████░░░  (halve memory bandwidth)
 + Pre-computed RoPE      ██████████████████░░  (no sin/cos in hot loop)
 + Flash attention        ██████████████████░░  (no O(n) attention alloc)
 + NEON/SSE2 SIMD         ███████████████████░  (4-wide vector ops)
 + KV cache persistence   ████████████████████  (skip prefill entirely)

Architecture

                          ┌─────────────────────────────────┐
                          │           picolm.c              │
                          │     CLI + Generation Loop       │
                          └──────┬──────────────┬───────────┘
                                 │              │
                    ┌────────────┘              └────────────┐
                    │                                        │
           ┌────────┴────────┐                    ┌──────────┴──────────┐
           │    model.h/c    │                    │    sampler.h/c      │
           │  GGUF Parser    │                    │  Temperature +      │
           │  mmap Layer     │                    │  Top-p Sampling     │
           │  Streaming      │                    └──────────┬──────────┘
           │  Forward Pass   │                               │
           │  KV Cache I/O   │                    ┌──────────┴──────────┐
           └───┬────────┬────┘                    │    grammar.h/c      │
               │        │                         │  JSON Constraint    │
      ┌────────┘        └───────┐                 │  Logit Masking      │
      │                         │                 └─────────────────────┘
┌─────┴──────┐          ┌───────┴────────┐
│ tensor.h/c │          │ tokenizer.h/c  │
│ matmul     │          │ BPE Encode     │
│ rmsnorm    │          │ Decode         │
│ softmax    │          │ Vocab Lookup   │
│ rope       │          └────────────────┘
│ silu       │
│ threading  │
└─────┬──────┘
      │
┌─────┴──────┐
│  quant.h/c │
│ Q4_K, Q6_K │
│ Q3_K, Q2_K │
│ FP16, F32  │
│ NEON + SSE │
│ Fused Dots │
└────────────┘

The LLaMA Forward Pass (what happens for each token)

Input Token
    │
    ▼
┌───────────────┐
│ Embedding     │  Dequantize row from token_embd → x[2048]
│ Lookup        │
└───────┬───────┘
        │
        ▼
┌───────────────┐  ×22 layers
│ RMSNorm       │─────────────────────────────────────────┐
│               │                                         │
│ Q = xb @ Wq   │  Matrix-vector multiply (quantized)     │
│ K = xb @ Wk   │  Store K,V in FP16 KV cache             │
│ V = xb @ Wv   │                                         │
│               │                                         │
│ RoPE(Q, K)    │  Rotary position encoding (table lookup)│
│               │                                         │
│ Attention     │  Flash attention with online softmax    │
│ (GQA 32→4)    │  Grouped-query: 32 Q heads, 4 KV heads  │
│               │                                         │
│ x += Out@Wo   │  Output projection + residual           │
│               │                                         │
│ RMSNorm       │                                         │
│               │                                         │
│ SwiGLU FFN    │  gate=SiLU(xb@Wg), up=xb@Wu             │
│               │  x += (gate*up) @ Wd                    │
└───────┬───────┘─────────────────────────────────────────┘
        │
        ▼
┌───────────────┐
│ Final RMSNorm │
│ x @ W_output  │─→ logits[32000]
└───────┬───────┘
        │
        ▼
┌───────────────┐
│ Grammar Mask  │  (if --json: force valid JSON structure)
│ Sample Token  │  temperature → softmax → top-p → pick
└───────────────┘

Memory Budget

For TinyLlama 1.1B Q4_K_M with 2048 context length:

Component                    Size       Notes
FP16 KV cache                ~40 MB     22 layers x 2 x 2048 x 256 x 2 bytes
Tokenizer                    ~4.5 MB    32K vocab strings + scores + sorted index
Activation buffers           ~0.14 MB   x, xb, xb2, q, hb, hb2
Logits buffer                ~0.12 MB   32000 x 4 bytes
Dequant scratch              ~0.02 MB   Max(n_embd, n_ffn) floats
Norm weights (pre-dequant)   ~0.35 MB   45 norm vectors x 2048 x 4 bytes
RoPE tables                  ~0.03 MB   cos + sin x 2048 x 32 entries
Total runtime                ~45 MB
Model file (on disk)         638 MB     Memory-mapped, ~1 layer in RAM at a time

With 512 context (for constrained devices):

Component         Size
FP16 KV cache     ~10 MB
Everything else   ~5 MB
Total             ~15 MB
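Both KV cache figures follow directly from the model shape: 22 layers, K and V, the context length, 256 KV dims (4 KV heads x 64), and 2 bytes per FP16 value. A throwaway check, with the shape constants taken from the table above:

#include <stdio.h>

int main(void) {
    const long n_layers = 22, kv_dim = 256;   /* 4 KV heads x 64 head dim */
    const long fp16 = 2;                      /* bytes per cached element */
    long ctx[] = { 2048, 512 };
    for (int i = 0; i < 2; i++) {
        long bytes = n_layers * 2 * ctx[i] * kv_dim * fp16;   /* 2 = K and V */
        printf("context %4ld: %.1f MB KV cache\n", ctx[i], bytes / (1024.0 * 1024.0));
    }
    return 0;   /* prints ~44 MB for 2048 context and ~11 MB for 512 */
}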

Optimizations Deep-Dive

PicoLM implements 9 optimizations that brought generation speed from 1.6 tok/s to 13.5 tok/s on x86, with even larger gains expected on ARM with NEON:

1. ARM NEON SIMD

4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with vmovl_u8 → vmovl_u16 → vcvtq_f32_u32, and RoPE with interleaved vld2q_f32 / vst2q_f32.
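As a hedged sketch of that widening chain (not the actual quant.c kernel, which also applies the per-block scales and minimums of the Q4_K super-block layout):

#include <arm_neon.h>
#include <stdint.h>

/* Widen 8 unpacked nibble values (0..15, one per byte) to floats and scale.
   Sketch only: the real kernel works on packed nibbles inside super-blocks. */
static void nibbles_to_f32(const uint8_t nib[8], float out[8], float scale) {
    uint8x8_t   b8  = vld1_u8(nib);                           /* 8 x u8            */
    uint16x8_t  b16 = vmovl_u8(b8);                           /* widen to 8 x u16  */
    uint32x4_t  lo  = vmovl_u16(vget_low_u16(b16));           /* 4 x u32           */
    uint32x4_t  hi  = vmovl_u16(vget_high_u16(b16));
    float32x4_t flo = vmulq_n_f32(vcvtq_f32_u32(lo), scale);  /* to f32, scale     */
    float32x4_t fhi = vmulq_n_f32(vcvtq_f32_u32(hi), scale);
    vst1q_f32(out,     flo);
    vst1q_f32(out + 4, fhi);
}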

2. x86 SSE2 SIMD

Auto-detected on Intel/AMD. 4-wide __m128 operations for dot products, RMSNorm, and vector operations.
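A representative SSE2 dot product over float vectors, roughly the shape of the x86 path; the function name is illustrative and n is assumed to be a multiple of 4:

#include <emmintrin.h>   /* SSE2 */

/* 4-wide dot product; n is assumed to be a multiple of 4 in this sketch. */
static float dot_sse2(const float *a, const float *b, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    float tmp[4];
    _mm_storeu_ps(tmp, acc);                 /* horizontal sum of the 4 lanes */
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}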

3. FP16 KV Cache

Key and value vectors stored as 16-bit floats instead of 32-bit. Halves KV cache memory from ~88 MB to ~44 MB. Conversion uses software fp32_to_fp16() / fp16_to_fp32(), so no hardware FP16 support is required.
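A minimal sketch of such a software conversion, assuming truncation and flush-to-zero for subnormals (the real helpers may round differently):

#include <stdint.h>
#include <string.h>

/* Simplified software FP16 conversion: flush subnormals to zero, truncate the
   mantissa. Enough to show the idea; not bit-exact with any rounding mode. */
static uint16_t fp32_to_fp16(float f) {
    uint32_t bits; memcpy(&bits, &f, sizeof bits);
    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000u);
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFFu) - 127 + 15;   /* rebias */
    uint32_t mant = bits & 0x7FFFFFu;
    if (exp <= 0)  return sign;                        /* underflow -> signed zero */
    if (exp >= 31) return (uint16_t)(sign | 0x7C00u);  /* overflow  -> infinity    */
    return (uint16_t)(sign | ((uint32_t)exp << 10) | (mant >> 13));
}

static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = sign;                              /* zero / subnormal -> +-0  */
    if (exp == 31)     bits = sign | 0x7F800000u | (mant << 13);          /* inf / NaN */
    else if (exp != 0) bits = sign | ((exp + 112u) << 23) | (mant << 13); /* normal    */
    float f; memcpy(&f, &bits, sizeof f);
    return f;
}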

4. Pre-computed RoPE Tables

Sine and cosine values for all positions computed once at model load. The forward pass does a table lookup instead of calling sinf() / cosf() / powf() 64 times per token.
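A sketch of the table build at load time, assuming the standard LLaMA frequency schedule (base 10000); the buffer names are illustrative:

#include <math.h>

/* Fill cos/sin lookup tables once at model load; the forward pass then
   indexes them by (position, rotation pair) instead of calling cosf/sinf. */
static void build_rope_tables(float *rope_cos, float *rope_sin,
                              int max_seq_len, int head_dim) {
    int half = head_dim / 2;
    for (int pos = 0; pos < max_seq_len; pos++) {
        for (int i = 0; i < half; i++) {
            float freq  = powf(10000.0f, -(2.0f * i) / head_dim);
            float angle = pos * freq;
            rope_cos[pos * half + i] = cosf(angle);
            rope_sin[pos * half + i] = sinf(angle);
        }
    }
}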

5. Flash Attention (Online Softmax)

Single-pass attention with running-maximum rescaling. Eliminates the O(seq_len) attention score buffer, which is critical for long contexts on memory-constrained devices.
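For a single head, the online-softmax recurrence looks roughly like this (a simplified float-only sketch; the real kernel reads the FP16 KV cache and applies the GQA head mapping):

#include <math.h>

/* Single-pass attention for one head: keep a running max `m` and running
   denominator, rescaling the accumulated output whenever the max grows.
   No O(seq_len) score buffer is ever allocated. */
static void attend_online(const float *q, const float *k, const float *v,
                          float *out, int n_pos, int head_dim) {
    float m = -INFINITY, denom = 0.0f;
    for (int d = 0; d < head_dim; d++) out[d] = 0.0f;

    for (int t = 0; t < n_pos; t++) {
        float score = 0.0f;
        for (int d = 0; d < head_dim; d++)
            score += q[d] * k[t * head_dim + d];
        score /= sqrtf((float)head_dim);

        float m_new = score > m ? score : m;
        float scale = expf(m - m_new);          /* rescale what is already accumulated */
        float w     = expf(score - m_new);
        denom = denom * scale + w;
        for (int d = 0; d < head_dim; d++)
            out[d] = out[d] * scale + w * v[t * head_dim + d];
        m = m_new;
    }
    for (int d = 0; d < head_dim; d++) out[d] /= denom;
}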

6. Fused Dequantize + Dot Product

vec_dot_q4_K_f32() dequantizes and accumulates in one pass. No intermediate float buffer for the weight row. Reduces memory traffic by ~50% for matmul.
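The idea in miniature, using a deliberately simplified block format (one float scale per 32 int8 weights) rather than the real Q4_K super-block layout:

#include <stdint.h>

/* Simplified illustration of a fused dequantize + dot product. The real
   vec_dot_q4_K_f32() walks Q4_K super-blocks; the fusion principle is the
   same: never materialise the dequantized weight row. */
typedef struct { float scale; int8_t q[32]; } block_q8;

static float vec_dot_block_q8_f32(const block_q8 *row, const float *x, int n) {
    float sum = 0.0f;
    for (int b = 0; b < n / 32; b++) {
        float s = row[b].scale;
        for (int i = 0; i < 32; i++)
            sum += s * row[b].q[i] * x[b * 32 + i];   /* dequantize and accumulate in one pass */
    }
    return sum;
}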

7. Multi-threaded Matrix Multiply

matmul() distributes output rows across threads using pthreads. Each thread processes its chunk independently with fused dot products. Scales linearly up to ~8 cores.
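A sketch of the row-partitioning scheme with plain float weights; the real matmul runs the fused quantized dot product per row, and the 16-thread cap here is just for the example:

#include <pthread.h>

typedef struct { const float *w, *x; float *out; int n, row_start, row_end; } matmul_job;

static void *matmul_worker(void *arg) {
    matmul_job *j = (matmul_job *)arg;
    for (int r = j->row_start; r < j->row_end; r++) {
        float sum = 0.0f;
        for (int c = 0; c < j->n; c++) sum += j->w[r * j->n + c] * j->x[c];
        j->out[r] = sum;                       /* each thread owns a disjoint slice of rows */
    }
    return NULL;
}

static void matmul_threaded(float *out, const float *w, const float *x,
                            int rows, int n, int nthreads) {
    if (nthreads > 16) nthreads = 16;          /* example-only cap */
    pthread_t tid[16]; matmul_job job[16];
    int chunk = (rows + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; t++) {
        int start = t * chunk, end = start + chunk > rows ? rows : start + chunk;
        job[t] = (matmul_job){ w, x, out, n, start, end };
        pthread_create(&tid[t], NULL, matmul_worker, &job[t]);
    }
    for (int t = 0; t < nthreads; t++) pthread_join(&tid[t], NULL);
}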

8. Grammar-Constrained JSON

The --json mode pre-analyzes every token in the vocabulary at load time (brace delta, bracket delta, quote parity). During generation, it masks logits to guarantee syntactically valid JSON, which is essential for tool calling with small models.
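A sketch of the per-token pre-analysis; the struct and the masking rule described in the comment are illustrative, not the exact grammar.c representation:

/* Hypothetical per-token metadata, computed once over the vocabulary at load time. */
typedef struct { int brace_delta; int bracket_delta; int quotes; } tok_json_info;

static tok_json_info analyze_token(const char *s) {
    tok_json_info t = { 0, 0, 0 };
    for (; *s; s++) {
        if (*s == '{') t.brace_delta++;   else if (*s == '}') t.brace_delta--;
        if (*s == '[') t.bracket_delta++; else if (*s == ']') t.bracket_delta--;
        if (*s == '"') t.quotes++;
    }
    return t;
}
/* During sampling, any token whose deltas would drive the running brace or
   bracket depth negative (or unbalance an open string) gets its logit set
   to -INFINITY, so only structurally valid continuations can be picked. */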

9. KV Cache Persistence

--cache file.kvc saves the FP16 KV cache state after prompt processing. On the next run with the same prompt, it loads the cache and skips prefill entirely. 74% latency reduction for repeated system prompts.
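A sketch of what saving such a cache can look like; the field order and the kv_cache_save name are assumptions, not the actual .kvc format:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache layout: prompt token count, then the FP16 K and V arrays. */
static int kv_cache_save(const char *path, int n_prompt_tokens,
                         const uint16_t *k, const uint16_t *v, size_t kv_elems) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    int ok = fwrite(&n_prompt_tokens, sizeof n_prompt_tokens, 1, f) == 1
          && fwrite(k, sizeof *k, kv_elems, f) == kv_elems
          && fwrite(v, sizeof *v, kv_elems, f) == kv_elems;
    fclose(f);
    return ok ? 0 : -1;
}
/* Loading mirrors this: read the token count, fread both arrays, then start
   the forward pass at position n_prompt_tokens instead of 0. */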


Supported Models

PicoLM supports any LLaMA-architecture model in GGUF format:

Model             Parameters   GGUF Size (Q4_K_M)   RAM Needed
TinyLlama 1.1B    1.1B         638 MB               ~45 MB
Llama 2 7B        7B           4.1 GB               ~200 MB
Phi-2             2.7B         1.6 GB               ~90 MB

Recommended for embedded: TinyLlama 1.1B Q4_K_M fits comfortably on devices with 256MB+ RAM.

Supported quantization formats

Q2_K Q3_K Q4_K Q4_0 Q5_K Q6_K Q8_0 F16 F32


File Structure

PicoLM/
├── README.md              ← you are here
├── BLOG.md                ← technical deep-dive blog post
├── install.sh             ← one-liner Pi installer
│
├── picolm/                ← the inference engine (pure C)
│   ├── picolm.c           ← CLI entry point, generation loop (273 lines)
│   ├── model.h/c          ← GGUF parser, mmap, forward pass (146 + 833 lines)
│   ├── tensor.h/c         ← matmul, rmsnorm, softmax, rope (44 + 298 lines)
│   ├── quant.h/c          ← dequantization, SIMD kernels (140 + 534 lines)
│   ├── tokenizer.h/c      ← BPE tokenizer (32 + ~200 lines)
│   ├── sampler.h/c        ← temperature + top-p sampling (19 + ~100 lines)
│   ├── grammar.h/c        ← JSON grammar constraints (64 + 175 lines)
│   ├── Makefile           ← build targets for all platforms
│   └── build.bat          ← Windows MSVC build script
│
└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  ← model file (638 MB, not in git)

Total C source: ~2,500 lines. That's the entire inference engine: GGUF parsing, mmap, dequantization, matrix math, attention, tokenization, sampling, and grammar constraints.


How It Works

The mmap trick

Traditional inference engines load the entire model into RAM. PicoLM doesn't. Instead:

  1. The model file is memory-mapped (mmap on Linux/macOS, MapViewOfFile on Windows)
  2. Weight pointers point directly into the mapped file; nothing is copied
  3. During the forward pass, each layer's weights are accessed sequentially
  4. The OS automatically pages in the needed weights and evicts old ones
  5. madvise(MADV_SEQUENTIAL) hints the access pattern to the kernel

Result: A 638MB model runs on a device with 256MB RAM. Only ~30MB of the model is in physical memory at any time.
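A condensed sketch of steps 1-5 on the POSIX path (error handling trimmed; the map_model name is illustrative):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Map the model file read-only and hint sequential access; weight pointers
   become offsets into `base`, so nothing is copied into the heap. */
static void *map_model(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                                   /* the mapping stays valid after close */
    if (base == MAP_FAILED) return NULL;
    madvise(base, (size_t)st.st_size, MADV_SEQUENTIAL);   /* layers are read front-to-back */
    *size_out = (size_t)st.st_size;
    return base;
}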

Quantization

Weights are stored in 4-bit quantized format (Q4_K_M). For TinyLlama:

  • Original: 1.1B parameters x 4 bytes = 4.4 GB
  • Q4_K: 1.1B parameters x ~0.56 bytes = 638 MB
  • Quality loss: Minimal; Q4_K preserves 6-bit scales per 32-weight sub-block

Grouped-Query Attention (GQA)

TinyLlama uses 32 query heads but only 4 key/value heads. Each KV head is shared by 8 query heads. This reduces KV cache size by 8x compared to full multi-head attention.
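The head mapping itself is one integer division; a sketch with TinyLlama's head counts:

/* Map a query head to the KV head it shares. With 32 query heads and
   4 KV heads, query heads 0-7 read KV head 0, 8-15 read KV head 1, etc. */
static int kv_head_for(int q_head, int n_heads, int n_kv_heads) {
    int group = n_heads / n_kv_heads;   /* 8 query heads per KV head */
    return q_head / group;              /* index into the much smaller KV cache */
}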


Building & Testing

Prerequisites

Platform    Requirements
Linux/Pi    gcc, make (install via apt install build-essential)
macOS       Xcode Command Line Tools (xcode-select --install)
Windows     Visual Studio Build Tools (cl.exe)

Verify your build

# Build
make native

# Test with greedy decoding (deterministic output)
./picolm model.gguf -p "The capital of France is" -n 20 -t 0
# Expected: "Paris. It is the largest city in France..."

# Test JSON mode
./picolm model.gguf --json -p "Return JSON with name and age" -n 50 -t 0.3
# Expected: valid JSON like {"name": "...", "age": ...}

# Test KV cache
./picolm model.gguf --cache test.kvc -p "Hello" -n 10 -t 0
./picolm model.gguf --cache test.kvc -p "Hello" -n 10 -t 0
# Second run should say "Skipping N cached prompt tokens"

Memory verification

PicoLM prints memory stats to stderr:

Memory: 1.17 MB runtime state (FP16 KV cache separate)

Total = runtime state + FP16 KV cache. For TinyLlama with 2048 context: ~45 MB.


FAQ

Q: Can this run Llama 2 7B? A: Yes, if you have enough RAM for the KV cache (~1.4 GB for 7B with 4096 context). The model file stays on disk via mmap. On a Pi 4 with 4GB RAM, it works but is slow (~1-2 tok/s).

Q: Why not use llama.cpp? A: llama.cpp is excellent but requires ~200MB+ for the runtime on small models, has complex build dependencies, and targets desktop/server use cases. PicoLM is purpose-built for embedded: 45MB RAM, 80KB binary, zero dependencies.

Q: Is the output quality good? A: TinyLlama 1.1B is a small model; it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the --json grammar mode guarantees valid JSON regardless of model quality.

Q: What about GPU acceleration? A: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2) provides meaningful speedup.

Q: Can I use a different model? A: Any LLaMA-architecture GGUF model works. Download from HuggingFace and point PicoLM at it. Recommended quantizations: Q4_K_M (best quality/size balance) or Q2_K (smallest, lower quality).


Roadmap

  • AVX2/AVX-512 kernels for x86 (2-4x generation speed on modern CPUs)
  • Speculative decoding with a draft model
  • Context sliding window (infinite generation beyond max_seq_len)
  • Weight pruning for further memory reduction
  • Continuous batching for server mode
  • Mistral / Phi architecture support

Technical Blog

For a detailed writeup of the optimization journey (with code snippets and war stories), see BLOG.md.


License

MIT License. See LICENSE for details.


PicoLM: because intelligence shouldn't require a data center.
