Tokenizer Optimization: Global Caching & Auto-Download #523

shreed27 · 2026-02-01T15:11:36Z

Description

This PR addresses multiple TODOs related to tokenizer performance and usability. It introduces a global LRU cache for the SentencePiece model to prevent redundant loading and parsing when instantiating multiple Tokenizer objects. Additionally, it upgrades the file caching utility to transparently download or copy files from remote locations (like GCS) if they are not present in the local cache.

Changes

gemma/gm/text/_tokenizer.py:
- Refactored _sp property to use a standalone, globally cached function _load_sp_model.
- Decorated _load_sp_model with @functools.lru_cache to ensure the underlying C++ model is loaded only once per unique path + custom tokens combination.
- Extracted custom token application logic into _update_proto_with_custom_tokens for better modularity.
gemma/gm/utils/_file_cache.py:
- Enhanced maybe_get_from_cache to handle cache misses by attempting to copy the file from the remote_file_path.
- Added directory creation logic to ensure the cache path exists before writing.
gemma/gm/utils/_file_cache_test.py:
- Added test_cache_miss_downloads_file to verify that a missing cache file triggers a copy operation from the source.
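
The global-cache change to _tokenizer.py can be illustrated with a minimal sketch. It is not the PR's code: `load_model_cached` and the `Tokenizer` class below are hypothetical stand-ins, and a plain dict takes the place of the SentencePiece processor so the example runs without the library. It shows that `functools.lru_cache` needs hashable arguments, so a dict of custom tokens has to be frozen into a tuple before it can be part of the cache key.

```python
import functools


@functools.lru_cache(maxsize=None)
def load_model_cached(path: str, custom_tokens: tuple = ()):
    """Hypothetical stand-in for an expensive model load.

    A real loader would read and parse the model file at `path`; this
    sketch returns a dict so that it runs without SentencePiece.
    """
    return {"path": path, "custom_tokens": dict(custom_tokens)}


class Tokenizer:
    """Hypothetical tokenizer wrapper that shares one model per cache key."""

    def __init__(self, path: str, custom_tokens=None):
        self.path = path
        # Dicts are unhashable, so freeze them into a sorted tuple of items.
        self._custom_key = tuple(sorted((custom_tokens or {}).items()))

    @property
    def model(self):
        return load_model_cached(self.path, self._custom_key)


a = Tokenizer("/tmp/tokenizer.model")
b = Tokenizer("/tmp/tokenizer.model")
c = Tokenizer("/tmp/tokenizer.model", {0: "<my_token>"})

assert a.model is b.model      # same path and tokens: one shared object
assert a.model is not c.model  # different custom tokens: separate entry
print(load_model_cached.cache_info())
```

One consequence of this design is that every instance with the same key receives the same object, so the cached model should be treated as read-only.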

Impact

Performance: avoids repeated model loading and parsing, which should reduce initialization time and memory usage when working with multiple tokenizer instances (e.g., in distributed training, evaluation pipelines, or tests).
Usability: seamless handling of remote model paths without manual pre-downloading steps.
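
As a rough sketch of the cache-miss behavior described above, the following uses only the standard library. The function name `get_or_copy_to_cache`, its parameters, and the fallback to the original path on a failed copy are assumptions made for illustration; the actual maybe_get_from_cache may use a different signature and a different path library to reach locations such as GCS.

```python
import os
import shutil
import tempfile
from pathlib import Path


def get_or_copy_to_cache(remote_file_path, cache_dir):
    """Return a local cached copy of a file, copying it on a cache miss.

    Hypothetical helper: falls back to the original path if the copy fails.
    """
    remote = Path(remote_file_path)
    cached = Path(cache_dir) / remote.name
    if cached.exists():
        return cached  # cache hit: nothing to do

    # Make sure the cache directory exists before writing into it.
    cached.parent.mkdir(parents=True, exist_ok=True)

    # Copy to a temporary name first, then rename, so that readers never
    # observe a partially written cache file.
    tmp = cached.with_name(f"{cached.name}.{os.getpid()}.tmp")
    try:
        shutil.copyfile(remote, tmp)
        os.replace(tmp, cached)
    except OSError:
        tmp.unlink(missing_ok=True)
        return remote
    return cached


with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "remote" / "tokenizer.model"
    source.parent.mkdir()
    source.write_bytes(b"fake model bytes")
    cache_dir = Path(root) / "cache" / "tokenizer"

    first = get_or_copy_to_cache(source, cache_dir)   # miss: copies the file
    second = get_or_copy_to_cache(source, cache_dir)  # hit: reuses the copy
    assert first == second == cache_dir / "tokenizer.model"
    assert first.read_bytes() == b"fake model bytes"
```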

Verification

Unit Tests: Added coverage in _file_cache_test.py ensuring the download/copy logic works as expected.
Existing Tests: Verified that the logic in _tokenizer_test.py remains consistent (integration tests depend on environment setup, but the logic is unit-verified).
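
For reference, a test in the spirit of test_cache_miss_downloads_file might look like the sketch below. It is not the test from this PR: `fetch_into_cache` is a hypothetical, simplified helper defined inline so the example is self-contained, and the copy is done with `shutil.copyfile` rather than whatever the real utility uses.

```python
import shutil
import tempfile
import unittest
from pathlib import Path
from unittest import mock


def fetch_into_cache(source: Path, cache_dir: Path) -> Path:
    """Hypothetical stand-in for the cache helper under test."""
    cached = cache_dir / source.name
    if not cached.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, cached)
    return cached


class FileCacheTest(unittest.TestCase):

    def setUp(self):
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        root = Path(self._tmp.name)
        self.source = root / "remote" / "tokenizer.model"
        self.source.parent.mkdir()
        self.source.write_bytes(b"fake model bytes")
        self.cache_dir = root / "cache"

    def test_cache_miss_downloads_file(self):
        # `wraps` keeps the real copy while recording how it was called.
        with mock.patch.object(
            shutil, "copyfile", wraps=shutil.copyfile
        ) as copy:
            cached = fetch_into_cache(self.source, self.cache_dir)
        copy.assert_called_once_with(self.source, cached)
        self.assertEqual(cached.read_bytes(), b"fake model bytes")

    def test_cache_hit_skips_copy(self):
        fetch_into_cache(self.source, self.cache_dir)  # warm the cache
        with mock.patch.object(shutil, "copyfile") as copy:
            fetch_into_cache(self.source, self.cache_dir)
        copy.assert_not_called()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FileCacheTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```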

Checklist

Implemented global cache for Tokenizer.
Implemented auto-download for file cache.
Added/Updated tests.
Linted code.

…timization

- Implemented caching for model loading to prevent redundant IO and parsing when creating multiple instances.
- Added auto-download capability to the file cache: it now automatically downloads/copies remote files (e.g., gs://) to the local cache if missing.
- Refactored the tokenizer to separate model loading logic into standalone cached functions.
- Updated the tests to verify download behavior.

shreed27 added 3 commits February 1, 2026 20:11

Refactor: Enhanced Tokenizer Input Validation & Checkpoint Loading Op…

25bdcbc

…timization

Finalize tokenizer validation and checkpoint caching

d367d5a

shreed27 mentioned this pull request Feb 3, 2026

Tokenizer Optimization: Global Caching & Auto-Download #532

Open

4 tasks
