@shreed27 commented on Feb 1, 2026

Chat Sampler: Implement Rolling Cache for Infinite Conversations

Summary

Implements a rolling cache strategy in ChatSampler to support infinite multi-turn conversations. When the context window (cache) capacity is exceeded, the sampler efficiently truncates the oldest history while preserving the most recent turns and the new user prompt, allowing the conversation to continue indefinitely.
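For context, a minimal usage sketch. Only ChatSampler, chat, and cache_length come from this PR; the model/checkpoint constructors are assumptions based on the library's quickstart and may differ in detail.

```python
# Usage sketch. The model/params construction is an assumption, not part of
# this PR; the point is that the loop below no longer needs to stop or be
# restarted once the conversation outgrows cache_length.
from gemma import gm

model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
sampler = gm.text.ChatSampler(model=model, params=params, cache_length=4096)

while True:
    prompt = input('> ')
    # Once history + prompt would exceed cache_length, the oldest turns are
    # rolled out of the window and the conversation simply continues.
    print(sampler.chat(prompt))
```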

Problem Statement

Previously, ChatSampler utilized a fixed-size KV cache (defined by cache_length, default 4096).

  • Issue: If a conversation's total length (history + new prompt) exceeded this limit, the underlying sampler would fail (e.g., on the cache.is_full check) or generation would halt immediately; a minimal illustration follows this list.
  • Experience: Users hit a hard stop after ~4096 tokens, forcing them to restart the conversation or manually manage state.
  • Missing Feature: The codebase contained an explicit TODO(epot): Support and test rolling cache.
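For concreteness, here is the old limit in miniature. All names are hypothetical and only mirror the description above, not the actual code.

```python
# Illustrative only: the pre-change hard stop.
def fits_in_cache(used_cache_length: int, new_prompt_length: int,
                  cache_length: int = 4096) -> bool:
    """True while history + new prompt still fit in the fixed KV cache."""
    return used_cache_length + new_prompt_length <= cache_length

assert fits_in_cache(4000, 50)        # still room: generation proceeds
assert not fits_in_cache(4090, 50)    # cache full: the old behavior halted here
```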

Solution

Implemented a robust "prompt-based" rolling cache mechanism within ChatSampler.chat; a minimal sketch follows the algorithm below.

Algorithm

  1. Usage Check: Before sampling, we compare last_state.used_cache_length + new_prompt_length against cache_length.
  2. Rolling Trigger: If the limit is exceeded:
    • Reconstruct History: We effectively re-render the full conversation history from self.turns into a single string.
    • Truncate: We combine the history with the new prompt, tokenize the entire sequence, and keep only the suffix that fits within cache_length - 64 (a safety buffer).
    • Reset State: We force last_state = None. This instructs the underlying Sampler to discard the full KV cache and perform a fresh prefill on the truncated prompt.
    • Transparency: If print_stream is enabled, a message is printed to notify the user that rolling occurred.
  3. State Management: self.turns remains intact, preserving the logical history of the conversation (for record-keeping), even though the model's effective context window is truncated.
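A minimal sketch of the rolling step, distilled into a standalone helper. Names and signatures are illustrative; the actual logic lives inside ChatSampler.chat and operates on self.turns and last_state directly.

```python
# Sketch of the rolling logic described above, as a standalone helper.
SAFETY_BUFFER = 64  # head-room kept below cache_length when rolling

def roll_if_needed(
    prompt: str,
    history_text: str,        # full conversation history re-rendered as a string
    used_cache_length: int,   # tokens already consumed in the KV cache
    cache_length: int,
    encode,                   # tokenizer: str -> list[int]
    decode,                   # tokenizer: list[int] -> str
) -> tuple[str, bool]:
    """Returns (prompt_to_prefill, reset_state).

    reset_state=True means the caller must set last_state = None so the
    underlying Sampler discards its KV cache and prefills from scratch.
    """
    new_prompt_length = len(encode(prompt))

    # 1. Usage check: does the new prompt still fit on top of the cached history?
    if used_cache_length + new_prompt_length <= cache_length:
        return prompt, False

    # 2. Rolling trigger: tokenize history + prompt and keep only the suffix
    #    that fits within cache_length minus the safety buffer.
    tokens = encode(history_text + prompt)
    keep = cache_length - SAFETY_BUFFER
    truncated_prompt = decode(tokens[-keep:])

    # 3. The caller resets last_state (forcing a fresh prefill) and, if
    #    print_stream is enabled, notifies the user that rolling occurred.
    return truncated_prompt, True
```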

Impact

  • Infinite Conversations: Users can now chat indefinitely. The model naturally "forgets" the oldest parts of the conversation once the window is full, matching the sliding-window behavior common to other LLM chat implementations.
  • Robustness: Prevents runtime errors or unexpected silent failures when the context window fills up.

Verification

  • Logic Verification: Verified that the token counting and truncation logic always produces inputs that fit within the Sampler's cache (see the illustrative check below).
  • Behavior: Confirmed that last_state is reset correctly on overflow, ensuring the Sampler prefills the new truncated context.
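As an illustrative check (not the project's actual test suite), the truncation arithmetic can be exercised against the roll_if_needed sketch above using a character-level "tokenizer":

```python
# Hypothetical check, reusing the roll_if_needed sketch from the Solution section.
def test_roll_keeps_suffix_within_budget():
    encode = list            # one character == one token
    decode = ''.join
    history = 'h' * 5000
    prompt = 'p' * 200

    rolled, reset = roll_if_needed(
        prompt, history,
        used_cache_length=5000, cache_length=4096,
        encode=encode, decode=decode,
    )

    assert reset is True                     # overflow forces a fresh prefill
    assert len(encode(rolled)) <= 4096 - 64  # fits within the safety budget
    assert rolled.endswith(prompt)           # the new user prompt is preserved
```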

Additional Changes

  • Implemented caching for model loading (sketched below) to prevent redundant IO and parsing when creating multiple instances.
  • Added auto-download capability: remote files (e.g., gs://) are now automatically downloaded/copied to the local cache if missing.
  • Separated model loading logic into standalone cached functions.
  • Updated tests to verify the download behavior.
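The items above elide the exact symbols involved; the sketch below only illustrates the caching pattern they describe, with hypothetical names throughout.

```python
# Hypothetical sketch of the caching pattern; names do not correspond to
# actual symbols in the codebase.
import functools

@functools.cache
def load_params_cached(checkpoint_path: str) -> dict:
    """Expensive IO + parsing runs once per path; later calls reuse the result."""
    print(f'loading {checkpoint_path}')   # stand-in for download + parse
    return {'path': checkpoint_path}      # stand-in for the parsed params

# Creating multiple sampler/model instances now shares a single load:
p1 = load_params_cached('gs://bucket/ckpt')  # performs the "load" once
p2 = load_params_cached('gs://bucket/ckpt')  # cache hit: no redundant IO
assert p1 is p2
```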