Conversation

@shreed27 shreed27 commented Feb 1, 2026

Summary: This PR boosts Gemma's robustness and performance by validating Tokenizer.encode() inputs and caching model checkpoint loads.

Motivation/Context:

  1. Tokenizer Input Validation (Issue #513, "Tokenizer: clarify intended input validation and error handling for invalid inputs"): Tokenizer.encode() previously handled invalid inputs inconsistently, surfacing obscure downstream errors. This PR adds explicit, early TypeError validation with clear messages, improving developer experience and debugging.
  2. Checkpoint Loading Optimization: Repeatedly loading the same checkpoint incurred redundant disk I/O; caching the restored parameters removes that cost on subsequent loads.

Key Changes:

  • Tokenizer.encode(): Now strictly enforces str or list[str] inputs, raising an informative TypeError for invalid types (see the first sketch after this list).
  • Checkpoint Caching: Introduced _load_cached_params with @functools.lru_cache(maxsize=128) in _checkpoint.py. load_params now delegates the initial Orbax restoration to this cached function, avoiding redundant disk I/O (second sketch below).
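
To make the validation concrete, here is a minimal sketch of the check's shape as a standalone function. The helper name and exact message wording are illustrative, not the literal diff, which inlines the check in Tokenizer.encode():

```python
def _validate_encode_input(text) -> None:
  """Raises early so callers see a clear TypeError, not an opaque downstream failure."""
  if isinstance(text, str):
    return
  if isinstance(text, list) and all(isinstance(t, str) for t in text):
    return
  # Name both the expected types and the offending type in the message.
  raise TypeError(
      f'Tokenizer.encode() expects a str or list[str], got {type(text).__name__!r}.')
```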

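And a sketch of the caching shape, assuming the (path, text_only, quantize) signature implied by the reviewer guidance below; the Orbax restore call shown is one plausible entry point, not necessarily the one in the diff:

```python
import functools

import orbax.checkpoint as ocp


@functools.lru_cache(maxsize=128)
def _load_cached_params(path: str, text_only: bool, quantize: bool):
  # lru_cache requires hashable arguments, so callers normalize path to str.
  # text_only and quantize participate in the cache key; their effect on
  # restoration is elided in this illustrative sketch.
  return ocp.PyTreeCheckpointer().restore(path)


def load_params(path, text_only: bool = False, quantize: bool = False):
  # Repeated loads of the same checkpoint hit the in-memory cache
  # instead of going back to disk.
  return _load_cached_params(str(path), text_only, quantize)
```

One tradeoff worth noting: with maxsize=128 the cache can pin many large parameter trees in host memory, which is presumably why the maxsize choice is flagged for review below.
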
Impact/Benefits:

  • Improved DX: Clear TypeError messages on tokenizer misuse, and faster iteration thanks to cached checkpoint loading.
  • Enhanced Code Robustness: Prevents cascading errors from invalid tokenizer inputs.
  • Performance Optimization: Reduces redundant checkpoint loading.
  • Best Practices: Aligns with defensive programming and performance optimization principles.

Testing Strategy:

  • Tokenizer Validation: Unit tests (TestTokenizer.test_encode_invalid_inputs) confirm that a TypeError is raised with an accurate message for each invalid input (see the test sketch after this list).
  • Environment Stability: The use_hermetic_tokenizer fixture was temporarily disabled to isolate the tokenizer tests while a test model was missing, then restored.
  • Performance: The lru_cache speedup will be benchmarked separately.
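
For reference, a self-contained sketch of the shape of those tests, using a minimal fake tokenizer in place of the real class so the example runs standalone:

```python
import pytest


class _FakeTokenizer:
  # Minimal stand-in that mirrors the validation shape sketched above;
  # the real tests exercise the actual Tokenizer class.
  def encode(self, text):
    if isinstance(text, str) or (
        isinstance(text, list) and all(isinstance(t, str) for t in text)):
      return []
    raise TypeError(
        f'Tokenizer.encode() expects a str or list[str], '
        f'got {type(text).__name__!r}.')


@pytest.mark.parametrize('bad_input', [42, None, b'bytes', ['ok', 123]])
def test_encode_invalid_inputs(bad_input):
  # Both the exception type and a stable fragment of the message are checked.
  with pytest.raises(TypeError, match='expects a str or list'):
    _FakeTokenizer().encode(bad_input)
```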

Reviewer Guidance:

  • Assess clarity and conciseness of Tokenizer.encode() TypeError messages.
  • Verify _checkpoint.py caching logic for path, text_only, quantize combinations.
  • Evaluate whether maxsize=128 is an appropriate bound for the lru_cache.

Closes #513

google-cla bot commented Feb 1, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@shreed27 shreed27 force-pushed the fix/issue-513-tokenizer-checkpoints branch from 7073d6a to d367d5a on February 1, 2026 at 14:42
@shreed27 shreed27 changed the title from "Refactor: Enhanced Tokenizer Input Validation & Checkpoint Loading Optimization" to "Refactor: Enhanced Tokenizer Input Validation & Checkpoint Loading Optimization (Fixes #513)" on Feb 1, 2026
@shreed27 shreed27 force-pushed the fix/issue-513-tokenizer-checkpoints branch from d367d5a to af4886a on February 2, 2026 at 11:39