Tokenizer Optimization: Global Caching & Auto-Download #523
Description
This PR addresses multiple TODOs related to tokenizer performance and usability. It introduces a global LRU cache for the SentencePiece model to prevent redundant loading and parsing when instantiating multiple `Tokenizer` objects. Additionally, it upgrades the file caching utility to transparently download or copy files from remote locations (like GCS) if they are not present in the local cache.
Changes
gemma/gm/text/_tokenizer.py:
- `_sp` property: now uses a standalone, globally cached function, `_load_sp_model`.
- `_load_sp_model`: wrapped with `@functools.lru_cache` to ensure the underlying C++ model is loaded only once per unique path + custom tokens combination.
- `_update_proto_with_custom_tokens`: split out for better modularity.
gemma/gm/utils/_file_cache.py:
- `maybe_get_from_cache`: handles cache misses by attempting to copy the file from the `remote_file_path`.
gemma/gm/utils/_file_cache_test.py:
- `test_cache_miss_downloads_file`: verifies that a missing cache file triggers a copy operation from the source.
Impact
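To illustrate the effect of the global cache, here is a minimal, self-contained sketch of the pattern. It is not the code in this PR: `FakeModel`, `load_model`, `load_calls`, and this `Tokenizer` class are hypothetical stand-ins, and the real `_load_sp_model` loads a SentencePiece model rather than building a plain Python object. The sketch also assumes custom tokens arrive as a dict, which has to be frozen into a tuple because `functools.lru_cache` requires hashable arguments.

```python
import functools

# Records every real (uncached) load so the effect of the cache is visible.
load_calls = []


class FakeModel:
    """Hypothetical stand-in for an expensive-to-load tokenizer model."""

    def __init__(self, path, custom_tokens):
        self.path = path
        self.custom_tokens = custom_tokens


@functools.lru_cache(maxsize=None)
def load_model(path, custom_tokens=()):
    """Load a model once per unique (path, custom_tokens) combination."""
    load_calls.append(path)  # only runs on a cache miss
    return FakeModel(path, dict(custom_tokens))


class Tokenizer:
    """Hypothetical tokenizer whose model comes from the module-level cache."""

    def __init__(self, path, custom_tokens=None):
        self.path = path
        # Freeze the dict into a sorted tuple so it can be part of the cache key.
        self._custom_tokens = tuple(sorted((custom_tokens or {}).items()))

    @property
    def model(self):
        return load_model(self.path, self._custom_tokens)


first = Tokenizer("tokenizer.model")
second = Tokenizer("tokenizer.model")
print(first.model is second.model)  # True: both instances share one model
print(len(load_calls))  # 1: the loader body ran only once
```

A tokenizer created with a different set of custom tokens produces a different cache key and therefore a separate load. One trade-off of an unbounded cache (`maxsize=None`) is that every loaded model stays in memory for the life of the process.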
Verification
- `_file_cache_test.py`: ensures the download/copy logic works as expected.
- `_tokenizer_test.py`: logic remains consistent (integration tests depend on environment setup, but the logic is unit-verified).
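For reviewers who want a picture of the cache-miss path that `_file_cache_test.py` exercises, the behavior can be sketched roughly as follows. This is an assumption-laden illustration, not the implementation of `maybe_get_from_cache`: the function name `get_or_copy_to_cache`, the `cache_dir` parameter, keying the cache by file name, and falling back to the original path when the copy fails are all hypothetical. It also uses `pathlib` and `shutil`, which only handle local paths, so it does not cover remote locations such as GCS.

```python
import pathlib
import shutil


def get_or_copy_to_cache(remote_file_path, cache_dir):
    """Return a local path for the file, copying it into the cache on a miss."""
    remote = pathlib.Path(remote_file_path)
    # Assumption: the cache entry is keyed by the file name alone.
    cached = pathlib.Path(cache_dir) / remote.name

    if cached.exists():
        return cached  # cache hit: nothing to copy

    # Copy to a temporary name first so a partial copy is never seen as a hit.
    tmp = cached.with_name(cached.name + ".tmp")
    try:
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(remote, tmp)
        tmp.replace(cached)
        return cached
    except OSError:
        # Assumption: if the copy fails, fall back to the original path.
        tmp.unlink(missing_ok=True)
        return remote
```

A test in the spirit of `test_cache_miss_downloads_file` would create a source file in a temporary directory, call the function with an empty cache directory, and check that the returned path is inside the cache and has the same contents as the source.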