@shreed27 commented on Feb 1, 2026

Chat Sampler: Implement Rolling Cache for Infinite Conversations

Summary

Implements a rolling cache strategy in ChatSampler to support infinite multi-turn conversations. When the context window (cache) capacity is exceeded, the sampler efficiently truncates the oldest history while preserving the most recent turns and the new user prompt, allowing the conversation to continue indefinitely.
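For context, a minimal usage sketch. Only ChatSampler, chat, and cache_length come from this PR; the model/checkpoint constructors are assumptions based on the library's quickstart and may differ in detail.

```python
# Usage sketch. The model/params construction is an assumption, not part of
# this PR; the point is that the loop below no longer needs to stop or be
# restarted once the conversation outgrows cache_length.
from gemma import gm

model = gm.nn.Gemma3_4B()
params = gm.ckpts.load_params(gm.ckpts.CheckpointPath.GEMMA3_4B_IT)
sampler = gm.text.ChatSampler(model=model, params=params, cache_length=4096)

while True:
    prompt = input('> ')
    # Once history + prompt would exceed cache_length, the oldest turns are
    # rolled out of the window and the conversation simply continues.
    print(sampler.chat(prompt))
```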

Problem Statement

Previously, ChatSampler utilized a fixed-size KV cache (defined by cache_length, default 4096).

  • Issue: If a conversation's total length (history + new prompt) exceeded this limit, the underlying sampler would fail (e.g., on the cache.is_full check) or generation would halt immediately; a minimal illustration follows this list.
  • Experience: Users hit a hard stop after ~4096 tokens, forcing them to restart the conversation or manually manage state.
  • Missing Feature: The codebase contained an explicit TODO(epot): Support and test rolling cache.
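For concreteness, here is the old limit in miniature. All names are hypothetical and only mirror the description above, not the actual code.

```python
# Illustrative only: the pre-change hard stop.
def fits_in_cache(used_cache_length: int, new_prompt_length: int,
                  cache_length: int = 4096) -> bool:
    """True while history + new prompt still fit in the fixed KV cache."""
    return used_cache_length + new_prompt_length <= cache_length

assert fits_in_cache(4000, 50)        # still room: generation proceeds
assert not fits_in_cache(4090, 50)    # cache full: the old behavior halted here
```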

Solution

Implemented a robust "prompt-based" rolling cache mechanism within ChatSampler.chat; a minimal sketch follows the algorithm below.

Algorithm

  1. Usage Check: Before sampling, we compare last_state.used_cache_length + new_prompt_length against cache_length.
  2. Rolling Trigger: If the limit is exceeded:
    • Reconstruct History: We effectively re-render the full conversation history from self.turns into a single string.
    • Truncate: We combine the history with the new prompt, tokenize the entire sequence, and keep only the suffix that fits within cache_length - 64 (a safety buffer).
    • Reset State: We force last_state = None. This instructs the underlying Sampler to discard the full KV cache and perform a fresh prefill on the truncated prompt.
    • Transparency: If print_stream is enabled, a message is printed to notify the user that rolling occurred.
  3. State Management: self.turns remains intact, preserving the logical history of the conversation (for record-keeping), even though the model's effective context window is truncated.
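A minimal sketch of the rolling step, distilled into a standalone helper. Names and signatures are illustrative; the actual logic lives inside ChatSampler.chat and operates on self.turns and last_state directly.

```python
# Sketch of the rolling logic described above, as a standalone helper.
SAFETY_BUFFER = 64  # head-room kept below cache_length when rolling

def roll_if_needed(
    prompt: str,
    history_text: str,        # full conversation history re-rendered as a string
    used_cache_length: int,   # tokens already consumed in the KV cache
    cache_length: int,
    encode,                   # tokenizer: str -> list[int]
    decode,                   # tokenizer: list[int] -> str
) -> tuple[str, bool]:
    """Returns (prompt_to_prefill, reset_state).

    reset_state=True means the caller must set last_state = None so the
    underlying Sampler discards its KV cache and prefills from scratch.
    """
    new_prompt_length = len(encode(prompt))

    # 1. Usage check: does the new prompt still fit on top of the cached history?
    if used_cache_length + new_prompt_length <= cache_length:
        return prompt, False

    # 2. Rolling trigger: tokenize history + prompt and keep only the suffix
    #    that fits within cache_length minus the safety buffer.
    tokens = encode(history_text + prompt)
    keep = cache_length - SAFETY_BUFFER
    truncated_prompt = decode(tokens[-keep:])

    # 3. The caller resets last_state (forcing a fresh prefill) and, if
    #    print_stream is enabled, notifies the user that rolling occurred.
    return truncated_prompt, True
```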

Impact

  • Infinite Conversations: Users can now chat indefinitely. The model naturally "forgets" the oldest parts of the conversation once the window is full, matching the sliding-window behavior common to other LLM chat implementations.
  • Robustness: Prevents runtime errors or unexpected silent failures when the context window fills up.

Verification

  • Logic Verification: Verified that the token counting and truncation logic always produces inputs that fit within the Sampler's cache (see the illustrative check below).
  • Behavior: Confirmed that last_state is reset correctly on overflow, ensuring the Sampler prefills the new truncated context.
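As an illustrative check (not the project's actual test suite), the truncation arithmetic can be exercised against the roll_if_needed sketch above using a character-level "tokenizer":

```python
# Hypothetical check, reusing the roll_if_needed sketch from the Solution section.
def test_roll_keeps_suffix_within_budget():
    encode = list            # one character == one token
    decode = ''.join
    history = 'h' * 5000
    prompt = 'p' * 200

    rolled, reset = roll_if_needed(
        prompt, history,
        used_cache_length=5000, cache_length=4096,
        encode=encode, decode=decode,
    )

    assert reset is True                     # overflow forces a fresh prefill
    assert len(encode(rolled)) <= 4096 - 64  # fits within the safety budget
    assert rolled.endswith(prompt)           # the new user prompt is preserved
```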

Additional Changes

  • Implemented caching for model loading (sketched below) to prevent redundant IO and parsing when creating multiple instances.
  • Added auto-download capability: remote files (e.g., gs://) are now automatically downloaded/copied to the local cache if missing.
  • Separated model loading logic into standalone cached functions.
  • Updated tests to verify the download behavior.
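The items above elide the exact symbols involved; the sketch below only illustrates the caching pattern they describe, with hypothetical names throughout.

```python
# Hypothetical sketch of the caching pattern; names do not correspond to
# actual symbols in the codebase.
import functools

@functools.cache
def load_params_cached(checkpoint_path: str) -> dict:
    """Expensive IO + parsing runs once per path; later calls reuse the result."""
    print(f'loading {checkpoint_path}')   # stand-in for download + parse
    return {'path': checkpoint_path}      # stand-in for the parsed params

# Creating multiple sampler/model instances now shares a single load:
p1 = load_params_cached('gs://bucket/ckpt')  # performs the "load" once
p2 = load_params_cached('gs://bucket/ckpt')  # cache hit: no redundant IO
assert p1 is p2
```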