
Elastic-Cache


Elastic-Cache is a training-free framework that accelerates diffusion language models through efficient KV caching. This repository is the official PyTorch implementation of "Attention Is All You Need for KV Cache in Diffusion LLMs". Key features:

  • Fast and accurate KV caching for diffusion LLMs, achieving up to 45× speedup over non-accelerated baselines with only a minor drop in accuracy.

  • Layer-aware and time-aware KV caching, enabling the model to determine where and when caching is most effective.

  • An automatic caching mechanism that analyzes attention drift, removing the need for the predefined cache schedules used in prior work (sketched below).
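
A minimal sketch of this attention-drift trigger, assuming a standalone helper: the drift metric (cosine similarity over the most-attended tokens), the function signature, and the default values are illustrative only, not the repository's actual implementation.

import torch
import torch.nn.functional as F

def should_refresh_cache(prev_attn, curr_attn, gamma=0.9, track_num=32):
    # prev_attn / curr_attn: 1-D attention weights over the cached tokens at the
    # previous and current decoding steps (assumed to be averaged over heads).
    # Monitor only the tokens that received the most attention previously.
    top_idx = prev_attn.topk(track_num).indices
    # Low similarity between old and new attention on those tokens means the
    # attention has drifted and the cached keys/values are likely stale.
    sim = F.cosine_similarity(prev_attn[top_idx], curr_attn[top_idx], dim=0)
    return bool(sim < gamma)  # recompute the cache when drift exceeds gamma

In the layer-aware, time-aware setting described above, such a check would run per layer and per decoding step; see the paper and the code under llada/ and dream/ for the exact criterion.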

Why Elastic-Cache?

  • Ready-to-use, training-free component for accelerating diffusion LLMs.

  • Provides a controllable trade-off between accuracy and latency.

  • Architecture-agnostic, supporting various open-source diffusion LLMs, including LLaDA, Dream, and LLaDA-V.

  • Scalable to long sequences, maintaining efficiency as input length grows.

Next Steps

[✅] Serve diffusion LLMs with Elastic-Cache and batch inference

[✅] Triton implementation

[🚀] Integrate into additional models (e.g., MMaDA)

[🚀] Elastic-Cache v2

Project Structure

.
├── dream/          # Dream model related code
├── llada/          # LLaDA model related code
└── .gitignore      # Git ignore configuration

Installation

  1. Clone the repository:
git clone https://github.com/VILA-Lab/elastic-cache.git
cd elastic-cache
  2. Install dependencies:
pip install -r requirements.txt

Usage

Parameter descriptions (an illustrative mapping to Python keyword arguments follows this list):

  • --gen_length: Maximum length of generated text.
  • --window_size: Sliding-window length; must be less than or equal to --gen_length. Setting it smaller than --gen_length enables semi-autoregressive remasking.
  • --threshold: Confidence-aware decoding threshold.
  • --gamma: Cache update trigger threshold.
  • --track_num: Number of most-attended tokens monitored for the cache-update trigger.
  • --block_caching: Enable block-wise caching of far-away [MASK] tokens.
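
A hypothetical Python-side view of the same parameters (the values are illustrative, and the generate call in the comment is an assumed interface, not the repository's API):

gen_kwargs = dict(
    gen_length=256,      # maximum number of generated tokens
    window_size=64,      # smaller than gen_length, so semi-autoregressive remasking is used
    threshold=0.9,       # confidence-aware decoding threshold
    gamma=0.9,           # attention-drift threshold that triggers a cache update
    track_num=32,        # number of most-attended tokens monitored for drift
    block_caching=True,  # cache far-away [MASK] tokens block-wise
)
# out = model.diffusion_generate(prompt_ids, **gen_kwargs)  # hypothetical call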

1. Using LLaDA Model

cd llada
bash eval_{task}.sh

2. Using Dream Model

cd dream
bash eval_{task}.sh

Acknowledgements

This repository is built upon LLaDA, Dream, LLaDA-V, and the lm-evaluation-harness.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

@article{nguyen2025attention,
  title={Attention Is All You Need for {KV} Cache in Diffusion {LLMs}},
  author={Nguyen-Tri, Quan and Ranjan, Mukul and Shen, Zhiqiang},
  journal={arXiv preprint arXiv:2510.14973},
  year={2025}
}
