PegaFlow

PegaFlow is a high-performance KV cache offloading solution for vLLM v1 on single-node multi-GPU setups.

What is PegaFlow?

PegaFlow enables efficient KV cache transfer and sharing for vLLM inference workloads:

  • Single-node KV cache offloading — offload KV cache to host memory and restore it back to GPU with minimal latency
  • Full parallelism support — works with Pipeline Parallel (PP), Tensor Parallel (TP), and Data Parallel (DP)
  • Layer-wise transfer — fine-grained, layer-by-layer KV cache operations for optimal memory utilization (see the sketch after this list)
  • P/D disaggregation — separate prefill and decode phases across GPUs for better resource utilization
  • ~9x TTFT improvement — warm-start requests achieve dramatically lower time-to-first-token compared to cold-start
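
The layer-wise path is easiest to picture as per-layer device-to-host copies issued on a dedicated CUDA stream, so the copy of one layer overlaps with the compute of the next. The sketch below illustrates only that general pattern in PyTorch; the block shape, pool layout, and function name are hypothetical, and this is not PegaFlow's actual implementation.

```python
# Illustrative sketch only, NOT PegaFlow's implementation: per-layer KV blocks
# are copied to pinned host memory on a side stream so transfers overlap compute.
import torch

NUM_LAYERS = 32
BLOCK_SHAPE = (2, 16, 8, 128)   # hypothetical (K/V, block_size, kv_heads, head_dim)

copy_stream = torch.cuda.Stream()
host_pool = [torch.empty(BLOCK_SHAPE, dtype=torch.float16, pin_memory=True)
             for _ in range(NUM_LAYERS)]

def offload_layer(layer_idx: int, gpu_kv_block: torch.Tensor) -> None:
    """Asynchronously copy one layer's KV block to pinned host memory."""
    # Ensure the producing (default) stream finished writing before we read.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host_pool[layer_idx].copy_(gpu_kv_block, non_blocking=True)

# A real engine would call offload_layer() as each layer's attention finishes,
# overlapping the copy of layer i with the compute of layer i+1.
for i in range(NUM_LAYERS):
    gpu_block = torch.randn(BLOCK_SHAPE, dtype=torch.float16, device="cuda")
    offload_layer(i, gpu_block)

copy_stream.synchronize()   # all KV blocks now reside in pinned host memory
```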

PegaFlow draws its name from Pegasus, the winged horse of ancient myth — a creature born to cross impossible distances with effortless grace.

Quick Start

1. Clone & Setup

git clone https://github.com/xiaguan/pegaflow
cd pegaflow

uv venv
source .venv/bin/activate
uv pip install maturin
uv pip install vllm

2. Start PegaEngine Server

# Set environment variables for PyO3
export PYO3_PYTHON=$(which python)
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH

# Start the server (keep running in a separate terminal)
cargo run -r --bin pegaflow-server -- --addr 0.0.0.0:50055 --device 0 --pool-size 30gb

Server Options:

  • --addr: Bind address (default: 127.0.0.1:50055)
  • --device: CUDA device ID (default: 0)
  • --pool-size: Pinned memory pool size (e.g., 10gb, 500mb, 1.5tb); see the sizing sketch below
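
A reasonable starting point for --pool-size is the KV cache footprint of the sequences you want to keep warm. The back-of-the-envelope sketch below uses the standard Llama-3.1-8B attention geometry (32 layers, 8 KV heads, head dim 128, FP16 KV cache); the target token count is an assumption to replace with your own workload.

```python
# Back-of-the-envelope sizing for the pinned host pool (--pool-size).
# Model constants are for Llama-3.1-8B (GQA); adjust them for your model.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_EL = 2          # FP16/BF16 KV cache

# K and V, per layer, per token
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_EL
print(kv_bytes_per_token)          # 131072 bytes = 128 KiB per token

tokens_to_keep_warm = 200_000      # assumption: ~20 requests x 10K-token prefills
pool_gib = kv_bytes_per_token * tokens_to_keep_warm / 2**30
print(f"suggest --pool-size ~{pool_gib:.0f}gb plus headroom")   # ~24 GiB
```

At roughly 128 KiB of KV cache per token, twenty 10K-token contexts already approach the 30gb pool used in the commands above.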

3. Build Python Bindings

cd python
maturin develop --release
cd ..

4. Run Hello World

uv run python examples/basic_vllm.py

The script loads GPT-2 through vLLM, runs a prompt twice (cold + warm), and shows the TTFT difference.
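
To reproduce that comparison by hand, the sketch below times the same prompt twice with max_tokens=1, so end-to-end latency approximates TTFT. The kv_transfer_config wiring is a guess at what examples/basic_vllm.py does: the connector name "PegaflowConnector" is hypothetical, so check the example script for the exact configuration.

```python
# Rough sketch of the cold/warm comparison in examples/basic_vllm.py.
# The kv_transfer_config below is hypothetical; use the example script
# for the real connector name and options.
import time
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="gpt2",
    kv_transfer_config=KVTransferConfig(
        kv_connector="PegaflowConnector",   # hypothetical connector name
        kv_role="kv_both",                  # save on prefill, load on reuse
    ),
)

prompt = "The winged horse Pegasus " * 64   # long enough to make prefill visible
params = SamplingParams(max_tokens=1)       # 1 output token ~= TTFT

for label in ("cold", "warm"):
    start = time.perf_counter()
    llm.generate([prompt], params)
    print(f"{label} TTFT ~ {(time.perf_counter() - start) * 1e3:.1f} ms")
```

Note that a rerun inside one process can also hit vLLM's own prefix cache, so treat numbers from this sketch as rough and use the example script for a proper comparison.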

Benchmarks

KV Cache Benchmark

examples/bench_kv_cache.py automates a full TTFT-focused benchmark. Provide your checkpoint path (we test with Llama-3.1-8B on an H800):

uv run python examples/bench_kv_cache.py \
  --model /path/to/your/Llama-3.1-8B \
  --num-prompts 10 \
  --input-len 4096 \
  --output-len 1 \
  --request-rate 1.0

H800 Reference Numbers

PegaFlow TTFT measurements from an H800 with Llama-3.1-8B (8 prompts, 10K-token prefill, 1-token decode, 4.0 req/s):

| Configuration   | TTFT mean (ms) | TTFT p99 (ms) |
|-----------------|----------------|---------------|
| PegaFlow (Cold) | 572.5          | 1113.7        |
| PegaFlow (Warm) | 61.5           | 77.0          |

The warm-start path achieves ~9x faster TTFT compared to cold-start, demonstrating effective KV cache sharing across requests.

vLLM Patch for Better I/O Performance

To maximize I/O throughput when using PegaFlow with vLLM, we recommend a small patch to vLLM's KV cache block allocation. Sorting block IDs ensures GPU memory addresses are as sequential and contiguous as possible, which improves DMA/RDMA transfer efficiency.

Upstream RFC: vllm-project/vllm#31371

Locate the file vllm/v1/core/kv_cache_utils.py in your vLLM installation (e.g., .venv/lib/python3.10/site-packages/vllm/v1/core/kv_cache_utils.py), find the append_n method in the FreeKVCacheBlockQueue class, and add a sorting line:

def append_n(self, blocks: list[KVCacheBlock]) -> None:
    """Put a list of blocks back into the free list

    Args:
        blocks: The blocks to append.
    """
    if len(blocks) == 0:
        return

    blocks.sort(key=lambda x: x.block_id)  # <-- Add this line
    last_block = self.fake_free_list_tail.prev_free_block
    ...

This simple change can noticeably reduce transfer latency by enabling more efficient memory access patterns during KV cache operations.
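
The intuition is that consecutive block IDs map to adjacent slabs of the GPU KV cache tensor, so a sorted free list tends to hand out runs of contiguous blocks that can be moved with a few large copies instead of many small ones. The snippet below is a toy illustration of that coalescing effect, not PegaFlow's transfer code.

```python
# Toy illustration: contiguous block IDs collapse into fewer copy ranges.
def coalesce(block_ids: list[int]) -> list[tuple[int, int]]:
    """Group sorted block IDs into (start, length) runs of contiguous blocks."""
    runs: list[tuple[int, int]] = []
    for bid in sorted(block_ids):
        if runs and bid == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend current run
        else:
            runs.append((bid, 1))                       # start a new run
    return runs

# Blocks allocated out of order but actually adjacent -> one large DMA copy.
print(coalesce([7, 5, 6, 4]))      # [(4, 4)]
# Scattered blocks -> several smaller copies.
print(coalesce([2, 9, 40, 41]))    # [(2, 1), (9, 1), (40, 2)]
```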

P/D Disaggregation

PegaFlow supports Prefill/Decode (P/D) disaggregation, where prefill and decode phases run on separate GPU nodes. A lightweight router coordinates the flow: requests first go to P nodes for prefill (KV cache generation), then to D nodes for decode (token generation).

Architecture

Client Request
      │
      ▼
   Router (:8000)
      │
      ├──► P Node (:8100) ─── prefill, generate KV cache ───┐
      │                                                     │
      │                                                     ▼
      │                                              PegaEngine Server
      │                                            (centralized KV store)
      │                                                     │
      └──► D Node (:8200) ◄─── load KV cache ───────────────┘
               │
               ▼
        Response to Client
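
The router itself can stay small. The sketch below shows the usual P/D forwarding pattern over OpenAI-compatible HTTP endpoints: send the request to the P node with max_tokens=1 to force the prefill (and the KV cache save), then replay it to the D node for the full decode. Ports follow the diagram above; the payload shape and error handling are simplified, and this is not the code in examples/run_vllm_pd_with_pega.py.

```python
# Minimal P/D forwarding sketch (not the actual router implementation).
# Assumes both nodes expose the OpenAI-compatible /v1/completions endpoint.
import requests

P_NODE = "http://localhost:8100/v1/completions"   # prefill node
D_NODE = "http://localhost:8200/v1/completions"   # decode node

def handle_request(prompt: str, model: str, max_tokens: int) -> dict:
    base = {"model": model, "prompt": prompt}

    # 1) Prefill on the P node: a 1-token request forces the full prefill,
    #    and the connector saves the resulting KV cache to the PegaEngine server.
    requests.post(P_NODE, json={**base, "max_tokens": 1}, timeout=300).raise_for_status()

    # 2) Decode on the D node: the connector loads the saved KV cache, so the
    #    D node skips most of the prefill and goes straight to decoding.
    resp = requests.post(D_NODE, json={**base, "max_tokens": max_tokens}, timeout=300)
    resp.raise_for_status()
    return resp.json()

print(handle_request("Explain KV cache offloading.", "Qwen/Qwen3-8B", 200)["choices"][0]["text"])
```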

Quick Start

  1. Start the PegaEngine server (centralized KV cache storage):

    cargo run -r --bin pegaflow-server -- --addr 0.0.0.0:50055 --device 0 --pool-size 30gb
  2. Launch P/D setup:

    uv run python examples/run_vllm_pd_with_pega.py --model Qwen/Qwen3-8B
  3. Run benchmark:

    vllm bench serve --port 8000 --seed 42 \
      --model Qwen/Qwen3-8B \
      --dataset-name random --random-input-len 5000 --random-output-len 200 \
      --num-prompts 200 --burstiness 100 --request-rate 1

Benchmark Results (H800, Qwen3-8B, 5K input tokens)

| Configuration  | TTFT mean (ms) | TPOT mean (ms) | TPOT p99 (ms) | ITL p99 (ms) |
|----------------|----------------|----------------|---------------|--------------|
| P/D (1P+1D)    | 573.78         | 15.68          | 15.89         | 21.71        |
| Baseline (DP2) | 438.24         | 22.67          | 24.32         | 142.70       |

P/D disaggregation trades higher TTFT for significantly more stable decode latency — TPOT p99 drops from 24.32ms to 15.89ms, and ITL p99 improves dramatically from 142.70ms to 21.71ms.

Goals

  1. A Data Path Purpose-Built for LLM Inference

    Focus exclusively on the typical data flows in large model inference: data movement between prefill and decode phases, between different compute roles, and high-throughput transport of weights and KV/activations. We only solve this specific class of problems—high-bandwidth, predictable, structured data paths.

  2. RDMA-First, High-Performance Implementation

    The initial version prioritizes RDMA, leveraging static topology, long-lived connections, and pre-allocated resources to push throughput, stability, and tail latency close to hardware limits—validating the value of a "dedicated transport layer".

  3. Developer-Friendly Abstractions

    Provide clear, minimal transport semantics and channel models: easy to understand, simple to integrate, and predictable in behavior. Avoid hidden policies that cause mysterious performance jitter, allowing users to make confident performance assumptions.

  4. Built-In Observability and Tunability

    Export key metrics and debugging information from day one (throughput, latency distribution, resource utilization, error signals, etc.), giving cluster operators data to guide topology and parameter tuning—rather than black-box trial-and-error.

  5. Embeddable in Existing Inference Systems

    Serve as an optional "transport backend" that can plug into existing inference/dispatch/scheduling components—without requiring a full rewrite of the upper layers—ensuring the PoC can be validated quickly in real production stacks.

Non-Goals

  1. Not a General-Purpose RPC or Service Framework

    No request routing, load balancing, IDL, or serialization format wars—these concerns belong to upper layers or other projects.

  2. Not a Universal Network Virtualization Layer

    No attempt to automatically adapt to all network environments, cloud providers, or dynamic topologies; the initial focus is deep optimization for known, controlled, performance-sensitive clusters.

  3. Not a Full-Featured Communication Middleware

    Does not cover collectives, group communication semantics, or a comprehensive flow control ecosystem—only focused on high-value point-to-point (or few-node) bulk transfer scenarios.

  4. Not a "Runs Everywhere" Compatibility Solution

    No compromising design sharpness for compatibility with low-spec or non-accelerated network environments; other protocols or software fallbacks are incremental extensions, not core promises.

  5. Not a Security or Compliance Component

    No built-in complex authentication, encryption, or multi-tenant isolation; default assumption is deployment in controlled environments, with security handled by infrastructure.
