Conversation

@shreed27 shreed27 commented Feb 1, 2026

Summary: This PR boosts Gemma's robustness and performance by validating Tokenizer.encode() inputs and caching model checkpoint loads.

Motivation/Context:

  1. Tokenizer Input Validation (Issue #513, "Tokenizer: clarify intended input validation and error handling for invalid inputs"): Tokenizer.encode() previously handled invalid inputs inconsistently, surfacing obscure downstream errors. This PR adds explicit, early TypeError validation with clear messages, improving developer experience and debugging.
  2. Checkpoint Loading Optimization: Repeatedly loading the same checkpoint incurred redundant disk I/O; caching the restored parameters removes that cost on subsequent loads.

Key Changes:

  • Tokenizer.encode(): Now strictly enforces str or list[str] inputs, raising an informative TypeError for invalid types (see the first sketch after this list).
  • Checkpoint Caching: Introduced _load_cached_params with @functools.lru_cache(maxsize=128) in _checkpoint.py. load_params now delegates the initial Orbax restoration to this cached function, avoiding redundant disk I/O (second sketch below).
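
To make the validation concrete, here is a minimal sketch of the check's shape as a standalone function. The helper name and exact message wording are illustrative, not the literal diff, which inlines the check in Tokenizer.encode():

```python
def _validate_encode_input(text) -> None:
  """Raises early so callers see a clear TypeError, not an opaque downstream failure."""
  if isinstance(text, str):
    return
  if isinstance(text, list) and all(isinstance(t, str) for t in text):
    return
  # Name both the expected types and the offending type in the message.
  raise TypeError(
      f'Tokenizer.encode() expects a str or list[str], got {type(text).__name__!r}.')
```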

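And a sketch of the caching shape, assuming the (path, text_only, quantize) signature implied by the reviewer guidance below; the Orbax restore call shown is one plausible entry point, not necessarily the one in the diff:

```python
import functools

import orbax.checkpoint as ocp


@functools.lru_cache(maxsize=128)
def _load_cached_params(path: str, text_only: bool, quantize: bool):
  # lru_cache requires hashable arguments, so callers normalize path to str.
  # text_only and quantize participate in the cache key; their effect on
  # restoration is elided in this illustrative sketch.
  return ocp.PyTreeCheckpointer().restore(path)


def load_params(path, text_only: bool = False, quantize: bool = False):
  # Repeated loads of the same checkpoint hit the in-memory cache
  # instead of going back to disk.
  return _load_cached_params(str(path), text_only, quantize)
```

One tradeoff worth noting: with maxsize=128 the cache can pin many large parameter trees in host memory, which is presumably why the maxsize choice is flagged for review below.
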
Impact/Benefits:

  • Improved DX: Clear TypeError messages on tokenizer misuse, and faster iteration thanks to cached checkpoint loading.
  • Enhanced Code Robustness: Prevents cascading errors from invalid tokenizer inputs.
  • Performance Optimization: Reduces redundant checkpoint loading.
  • Best Practices: Aligns with defensive programming and performance optimization principles.

Testing Strategy:

  • Tokenizer Validation: Unit tests (TestTokenizer.test_encode_invalid_inputs) confirm that a TypeError is raised with an accurate message for each invalid input (see the test sketch after this list).
  • Environment Stability: The use_hermetic_tokenizer fixture was temporarily disabled to isolate the tokenizer tests while a test model was missing, then restored.
  • Performance: The lru_cache speedup will be benchmarked separately.
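
For reference, a self-contained sketch of the shape of those tests, using a minimal fake tokenizer in place of the real class so the example runs standalone:

```python
import pytest


class _FakeTokenizer:
  # Minimal stand-in that mirrors the validation shape sketched above;
  # the real tests exercise the actual Tokenizer class.
  def encode(self, text):
    if isinstance(text, str) or (
        isinstance(text, list) and all(isinstance(t, str) for t in text)):
      return []
    raise TypeError(
        f'Tokenizer.encode() expects a str or list[str], '
        f'got {type(text).__name__!r}.')


@pytest.mark.parametrize('bad_input', [42, None, b'bytes', ['ok', 123]])
def test_encode_invalid_inputs(bad_input):
  # Both the exception type and a stable fragment of the message are checked.
  with pytest.raises(TypeError, match='expects a str or list'):
    _FakeTokenizer().encode(bad_input)
```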

Reviewer Guidance:

  • Assess clarity and conciseness of Tokenizer.encode() TypeError messages.
  • Verify _checkpoint.py caching logic for path, text_only, quantize combinations.
  • Evaluate whether maxsize=128 is an appropriate bound for the lru_cache.

Closes #513

google-cla bot commented Feb 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@shreed27 shreed27 force-pushed the fix/issue-513-tokenizer-checkpoints branch from 7073d6a to d367d5a on February 1, 2026 at 14:42
@shreed27 shreed27 changed the title from "Refactor: Enhanced Tokenizer Input Validation & Checkpoint Loading Optimization" to "Refactor: Enhanced Tokenizer Input Validation & Checkpoint Loading Optimization (Fixes #513)" on Feb 1, 2026
@shreed27 shreed27 force-pushed the fix/issue-513-tokenizer-checkpoints branch from d367d5a to af4886a on February 2, 2026 at 11:39