Skip to content

Conversation

@yitzhakc
Copy link
Contributor

Add optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) that persist across Django server reloads during development. This significantly speeds up the development iteration cycle by avoiding re-initialization of these objects on every code change.

New settings:

  • USE_DEV_FILE_CACHE: Enable/disable the feature (default: False)
  • DEV_FILE_CACHE_DIR: Cache directory (default: /tmp/sefaria_dev_cache)

Modified methods to use file cache:

  • Library.build_full_auto_completer()
  • Library.build_lexicon_auto_completers()
  • Library.build_cross_lexicon_auto_completer()
  • Library.build_linker()
  • TopicManager.build_slug_to_pools_cache()

Added management command:

  • python manage.py clear_dev_cache

Description

A brief description of the PR

Code Changes

The following changes were made to the files below

Notes

Any additional notes go here

Add optional file-based caching for expensive library objects
(AutoCompleter, Linker, topic pools) that persist across Django
server reloads during development. This significantly speeds up
the development iteration cycle by avoiding re-initialization of
these objects on every code change.

New settings:
- USE_DEV_FILE_CACHE: Enable/disable the feature (default: False)
- DEV_FILE_CACHE_DIR: Cache directory (default: /tmp/sefaria_dev_cache)

Modified methods to use file cache:
- Library.build_full_auto_completer()
- Library.build_lexicon_auto_completers()
- Library.build_cross_lexicon_auto_completer()
- Library.build_linker()
- TopicManager.build_slug_to_pools_cache()

Added management command:
- python manage.py clear_dev_cache

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) during development to speed up Django server reloads. The feature uses pickle serialization to persist objects to disk at /tmp/sefaria_dev_cache.

Changes:

  • Added file-based cache functions in sefaria/system/cache.py using pickle serialization
  • Modified Library methods to check file cache before rebuilding objects
  • Added management command clear_dev_cache to manage cached entries

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
sefaria/system/cache.py Added file cache functions using pickle for save/load/clear operations
sefaria/settings.py Added default settings for USE_DEV_FILE_CACHE and DEV_FILE_CACHE_DIR
sefaria/local_settings_example.py Added documentation and example configuration for dev file cache
sefaria/model/text.py Integrated file cache into AutoCompleter and Linker build methods
django_topics/models/topic.py Integrated file cache into TopicManager's slug_to_pools cache building
reader/management/commands/clear_dev_cache.py Added management command to clear and list cache entries

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# File-based caching for library objects (AutoCompleter, Linker) during development.
# When enabled, these expensive objects are serialized to disk after first build
# and loaded from disk on subsequent server reloads, significantly speeding up
# development iteration. Should NOT be used in production.
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding documentation about when developers should clear the cache, such as after database changes, model updates, or when cached objects appear stale. This could be included in the docstring or as a comment in local_settings_example.py to help developers avoid debugging stale cache issues.

Suggested change
# development iteration. Should NOT be used in production.
# development iteration. Should NOT be used in production.
# Cached files are stored under DEV_FILE_CACHE_DIR; if you change the database contents,
# update models/serialization for these objects, or observe stale behavior, clear this
# directory (delete its contents) so that fresh cache files will be rebuilt.

Copilot uses AI. Check for mistakes.
Comment on lines +287 to +288
import pickle
from pathlib import Path
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pickle import should be placed at the top of the file with other standard library imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.

Copilot uses AI. Check for mistakes.
Comment on lines +322 to +325
try:
cache_path = _get_dev_cache_path(name)
with open(cache_path, 'wb') as f:
pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pickle to deserialize untrusted data is a security risk as it can lead to arbitrary code execution. While this is intended for development use only, consider adding a warning in the documentation or using a safer serialization format like JSON where possible. At minimum, ensure the cache directory has appropriate permissions and is not world-writable.

Copilot uses AI. Check for mistakes.
self.style.WARNING(
'Dev file cache is not enabled. Set USE_DEV_FILE_CACHE = True in local_settings.py'
)
)
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The command will still attempt to list or clear the cache even when USE_DEV_FILE_CACHE is False, which could be confusing. Consider returning early after displaying the warning message to prevent executing the cache operations when the feature is disabled.

Suggested change
)
)
return

Copilot uses AI. Check for mistakes.
# =============================================================================

import pickle
from pathlib import Path
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Path import should be placed at the top of the file with other imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.

Copilot uses AI. Check for mistakes.
# Default values for settings that may be overridden in local_settings.
# These ensure the application works even if local_settings doesn't define them.
USE_DEV_FILE_CACHE = False
DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache"
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using /tmp as the default cache directory may cause issues on systems where /tmp is cleaned on reboot or in containerized environments. Consider using a more persistent location or documenting that developers should configure DEV_FILE_CACHE_DIR in local_settings.py for reliable caching across system restarts.

Suggested change
DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache"
DEV_FILE_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".sefaria_dev_cache")

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +34
cached = load_from_dev_file_cache("topic_slug_to_pools")
if cached is not None:
self.slug_to_pools = cached
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After loading from cache, slug_to_pools is a regular dict, but when built from scratch it's a defaultdict(list). This inconsistency could cause issues. Consider either: 1) wrapping the cached dict in defaultdict(list, cached) after loading, or 2) always using a regular dict and avoiding defaultdict behavior.

Copilot uses AI. Check for mistakes.
def _get_dev_file_cache_dir() -> Path:
"""Get the development file cache directory from settings."""
cache_dir = Path(getattr(settings, 'DEV_FILE_CACHE_DIR', '/tmp/sefaria_dev_cache'))
cache_dir.mkdir(parents=True, exist_ok=True)
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache directory is created with default permissions using mkdir(parents=True, exist_ok=True). For security, consider explicitly setting restrictive permissions (e.g., mode=0o700) to ensure the cache directory is only accessible by the application user, preventing unauthorized access to pickled objects.

Suggested change
cache_dir.mkdir(parents=True, exist_ok=True)
cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700)
try:
cache_dir.chmod(0o700)
except Exception as e:
# Log a warning but do not fail if we cannot enforce strict permissions
logger.warning(f"Unable to set permissions on dev file cache dir {cache_dir}: {e}")

Copilot uses AI. Check for mistakes.
@mergify
Copy link

mergify bot commented Jan 19, 2026

🧪 CI Insights

Here's what we observed from your CI run for 1baf6d6.

🟢 All jobs passed!

But CI Insights is watching 👀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant