Elastic-Cache is a training-free framework designed to accelerate diffusion language models through efficient KV caching. It features:
- Fast and accurate KV caching for diffusion LLMs, achieving up to 45× speedup over non-accelerated baselines with only a minor drop in accuracy.
- Layer-aware and time-aware KV caching that lets the model decide where and when caching is most effective.
- An automatic caching mechanism that analyzes attention drift, removing the need for the predefined cache schedules used in prior work (a minimal sketch of this idea follows this list).
- A ready-to-use, training-free component for accelerating diffusion LLMs.
- A controllable trade-off between accuracy and latency.
- Architecture-agnostic: supports various open-source diffusion LLMs, including LLaDA, Dream, and LLaDA-V.
- Scalable to long sequences, maintaining efficiency as input length grows.
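The automatic trigger mentioned above can be pictured with a small PyTorch sketch. This is a minimal illustration, not the repository's implementation: the function name `should_refresh_cache`, the cosine-similarity drift measure, and the default values for `gamma` and `track_num` are all assumptions.

```python
# Illustrative sketch of an attention-drift cache trigger.
# NOTE: assumption-laden reconstruction, not Elastic-Cache's actual code.
import torch
import torch.nn.functional as F

def should_refresh_cache(attn_now: torch.Tensor,
                         attn_cached: torch.Tensor,
                         gamma: float = 0.9,
                         track_num: int = 8) -> bool:
    """Return True when attention has drifted enough to rebuild the KV cache.

    attn_now / attn_cached: attention weights over the same positions, shape
    (seq_len,), from the current step and from the last cache refresh.
    """
    # Look only at the positions that currently receive the most attention.
    top = torch.topk(attn_now, k=min(track_num, attn_now.numel())).indices
    # Cosine similarity as one possible drift measure (an assumption here).
    sim = F.cosine_similarity(attn_now[top], attn_cached[top], dim=0)
    # Low similarity means the attention pattern has drifted: refresh.
    return sim.item() < gamma

# Toy check: a nearly identical attention pattern should keep the cache.
now = torch.softmax(torch.randn(64), dim=0)
cached = (now + 0.01 * torch.randn(64)).clamp(min=1e-6)
print(should_refresh_cache(now, cached))  # usually False
```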
[✅] Serve diffusion LLMs with Elastic-Cache and batch inference
[✅] Triton implementation
[🚀] Integrate into additional models (e.g., MMaDA)
[🚀] Elastic-Cache v2
```
.
├── dream/        # Dream model related code
├── llada/        # LLaDA model related code
└── .gitignore    # Git ignore configuration
```
- Clone the repository:

  ```bash
  git clone https://github.com/VILA-Lab/elastic-cache.git
  cd elastic-cache
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

Parameter descriptions:
- `--gen_length`: Maximum length of the generated text.
- `--window_size`: Sliding-window length, less than or equal to `gen_length`; a value smaller than `gen_length` enables semi-autoregressive remasking.
- `--threshold`: Confidence-aware decoding threshold.
- `--gamma`: Cache-update trigger threshold.
- `--track_num`: Number of most-attended tokens used for the cache-update trigger.
- `--block_caching`: Enable block caching of far-away `[MASK]` tokens.
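To see how these flags fit together, here is a minimal `argparse` sketch. The defaults are illustrative assumptions only; the actual interface is defined by the evaluation scripts under `llada/` and `dream/`.

```python
# Hypothetical wiring of the decoding flags described above.
# Defaults are illustrative assumptions, not the repository's values.
import argparse

parser = argparse.ArgumentParser(description="Elastic-Cache decoding options (sketch)")
parser.add_argument("--gen_length", type=int, default=256,
                    help="maximum length of the generated text")
parser.add_argument("--window_size", type=int, default=256,
                    help="sliding-window length; values below gen_length "
                         "enable semi-autoregressive remasking")
parser.add_argument("--threshold", type=float, default=0.9,
                    help="confidence-aware decoding threshold")
parser.add_argument("--gamma", type=float, default=0.9,
                    help="cache-update trigger threshold on attention drift")
parser.add_argument("--track_num", type=int, default=8,
                    help="number of most-attended tokens used for the trigger")
parser.add_argument("--block_caching", action="store_true",
                    help="enable block caching of far-away [MASK] tokens")

args = parser.parse_args()
assert args.window_size <= args.gen_length, "window_size must not exceed gen_length"
```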
Run evaluation for LLaDA:

```bash
cd llada
bash eval_{task}.sh
```

Run evaluation for Dream:

```bash
cd dream
bash eval_{task}.sh
```

This repository is built upon LLaDA, Dream, LLaDA-V, and the lm-evaluation-harness.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
```bibtex
@article{nguyen2025attention,
  title={Attention is all you need for {KV} cache in diffusion {LLMs}},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.14973},
  year={2025}
}
```