-
-
Notifications
You must be signed in to change notification settings - Fork 311
feat(dev): add file-based caching for library objects during development #3029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat(dev): add file-based caching for library objects during development #3029
Conversation
Add optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) that persist across Django server reloads during development. This significantly speeds up the development iteration cycle by avoiding re-initialization of these objects on every code change. New settings: - USE_DEV_FILE_CACHE: Enable/disable the feature (default: False) - DEV_FILE_CACHE_DIR: Cache directory (default: /tmp/sefaria_dev_cache) Modified methods to use file cache: - Library.build_full_auto_completer() - Library.build_lexicon_auto_completers() - Library.build_cross_lexicon_auto_completer() - Library.build_linker() - TopicManager.build_slug_to_pools_cache() Added management command: - python manage.py clear_dev_cache Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) during development to speed up Django server reloads. The feature uses pickle serialization to persist objects to disk at /tmp/sefaria_dev_cache.
Changes:
- Added file-based cache functions in
sefaria/system/cache.pyusing pickle serialization - Modified Library methods to check file cache before rebuilding objects
- Added management command
clear_dev_cacheto manage cached entries
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| sefaria/system/cache.py | Added file cache functions using pickle for save/load/clear operations |
| sefaria/settings.py | Added default settings for USE_DEV_FILE_CACHE and DEV_FILE_CACHE_DIR |
| sefaria/local_settings_example.py | Added documentation and example configuration for dev file cache |
| sefaria/model/text.py | Integrated file cache into AutoCompleter and Linker build methods |
| django_topics/models/topic.py | Integrated file cache into TopicManager's slug_to_pools cache building |
| reader/management/commands/clear_dev_cache.py | Added management command to clear and list cache entries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # File-based caching for library objects (AutoCompleter, Linker) during development. | ||
| # When enabled, these expensive objects are serialized to disk after first build | ||
| # and loaded from disk on subsequent server reloads, significantly speeding up | ||
| # development iteration. Should NOT be used in production. |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding documentation about when developers should clear the cache, such as after database changes, model updates, or when cached objects appear stale. This could be included in the docstring or as a comment in local_settings_example.py to help developers avoid debugging stale cache issues.
| # development iteration. Should NOT be used in production. | |
| # development iteration. Should NOT be used in production. | |
| # Cached files are stored under DEV_FILE_CACHE_DIR; if you change the database contents, | |
| # update models/serialization for these objects, or observe stale behavior, clear this | |
| # directory (delete its contents) so that fresh cache files will be rebuilt. |
| import pickle | ||
| from pathlib import Path |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pickle import should be placed at the top of the file with other standard library imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.
| try: | ||
| cache_path = _get_dev_cache_path(name) | ||
| with open(cache_path, 'wb') as f: | ||
| pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL) |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using pickle to deserialize untrusted data is a security risk as it can lead to arbitrary code execution. While this is intended for development use only, consider adding a warning in the documentation or using a safer serialization format like JSON where possible. At minimum, ensure the cache directory has appropriate permissions and is not world-writable.
| self.style.WARNING( | ||
| 'Dev file cache is not enabled. Set USE_DEV_FILE_CACHE = True in local_settings.py' | ||
| ) | ||
| ) |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The command will still attempt to list or clear the cache even when USE_DEV_FILE_CACHE is False, which could be confusing. Consider returning early after displaying the warning message to prevent executing the cache operations when the feature is disabled.
| ) | |
| ) | |
| return |
| # ============================================================================= | ||
|
|
||
| import pickle | ||
| from pathlib import Path |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Path import should be placed at the top of the file with other imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.
| # Default values for settings that may be overridden in local_settings. | ||
| # These ensure the application works even if local_settings doesn't define them. | ||
| USE_DEV_FILE_CACHE = False | ||
| DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache" |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using /tmp as the default cache directory may cause issues on systems where /tmp is cleaned on reboot or in containerized environments. Consider using a more persistent location or documenting that developers should configure DEV_FILE_CACHE_DIR in local_settings.py for reliable caching across system restarts.
| DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache" | |
| DEV_FILE_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".sefaria_dev_cache") |
| cached = load_from_dev_file_cache("topic_slug_to_pools") | ||
| if cached is not None: | ||
| self.slug_to_pools = cached |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After loading from cache, slug_to_pools is a regular dict, but when built from scratch it's a defaultdict(list). This inconsistency could cause issues. Consider either: 1) wrapping the cached dict in defaultdict(list, cached) after loading, or 2) always using a regular dict and avoiding defaultdict behavior.
| def _get_dev_file_cache_dir() -> Path: | ||
| """Get the development file cache directory from settings.""" | ||
| cache_dir = Path(getattr(settings, 'DEV_FILE_CACHE_DIR', '/tmp/sefaria_dev_cache')) | ||
| cache_dir.mkdir(parents=True, exist_ok=True) |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache directory is created with default permissions using mkdir(parents=True, exist_ok=True). For security, consider explicitly setting restrictive permissions (e.g., mode=0o700) to ensure the cache directory is only accessible by the application user, preventing unauthorized access to pickled objects.
| cache_dir.mkdir(parents=True, exist_ok=True) | |
| cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700) | |
| try: | |
| cache_dir.chmod(0o700) | |
| except Exception as e: | |
| # Log a warning but do not fail if we cannot enforce strict permissions | |
| logger.warning(f"Unable to set permissions on dev file cache dir {cache_dir}: {e}") |
🧪 CI InsightsHere's what we observed from your CI run for 1baf6d6. 🟢 All jobs passed!But CI Insights is watching 👀 |
Add optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) that persist across Django server reloads during development. This significantly speeds up the development iteration cycle by avoiding re-initialization of these objects on every code change.
New settings:
Modified methods to use file cache:
Added management command:
Description
A brief description of the PR
Code Changes
The following changes were made to the files below
Notes
Any additional notes go here