Skip to content

Conversation

@klei22
Copy link
Collaborator

@klei22 klei22 commented Feb 1, 2026

This pull request adds new dataset variants and transformation methods for the Shakespeare character-level datasets, improves dataset preparation scripts, and enhances the flexibility and robustness of the data transformation utilities. The changes also update the demo script to use these new datasets and methods, and improve dependency checking and model loading in demo and sampling scripts.

New dataset variants and transformation methods:

  • Added three new dataset variants: shakespeare_char_case_map, shakespeare_char_lowercase, and shakespeare_char_newlines_mod, each with their own get_dataset.sh scripts and configuration to use new transformation methods. [1] [2] [3]
  • Implemented new transformation methods in char_convert.py: lowercase, case_map, and newlines_mod, allowing for more flexible character-level preprocessing. These methods emit appropriate token lists and handle conversion logic. [1] [2] [3] [4] [5]

Dataset preparation and utilities:

  • Updated dataset folders to use shared template scripts for prepare.py and utils, reducing code duplication and ensuring consistency across datasets. [1] [2] [3]
  • Improved spaCy model loading in char_convert.py by using a lazy initialization pattern and clear error messaging for missing models. [1] [2]

Demo and sampling script improvements:

  • Enhanced multicontext_demo.sh to check for spaCy and its English model before running, and to include the new dataset variants in training and sampling. Also updated sampling parameters for more comprehensive output. [1] [2]
  • Added a --weights_only option to sample.py and updated checkpoint loading logic for compatibility with different versions of PyTorch. [1] [2]

Minor fixes:

  • Fixed a bug in train loss and standard deviation calculation in train.py by removing unnecessary conversion to NumPy arrays.

@klei22 klei22 requested review from Copilot and gkielian February 1, 2026 07:43
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the multi-context character-level dataset framework by introducing three new Shakespeare dataset variants (case_map, lowercase, and newlines_mod) with corresponding transformation methods. The changes also improve error handling and dependency checking, update the demo script to showcase the new datasets, and fix a bug in train loss calculation.

Changes:

  • Added three new dataset variants with transformation methods: case_map (maps characters to 'L'/'U'/'_'), lowercase (converts text to lowercase), and newlines_mod (encodes position relative to newlines)
  • Improved spaCy dependency management with lazy loading and better error messages
  • Updated demo script to check for spaCy availability and include new datasets in training/sampling
  • Fixed train loss calculation bug and enhanced checkpoint loading compatibility

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
train.py Fixed bug by removing unnecessary np.array conversion in loss calculation
sample.py Added weights_only parameter and improved checkpoint loading compatibility
demos/multicontext_demo.sh Added spaCy dependency checking and integrated new dataset variants
data/template/utils/char_convert.py Implemented new transformation methods and lazy spaCy initialization
data/shakespeare_char_newlines_mod/* Added new dataset variant with symlinks to shared utilities
data/shakespeare_char_lowercase/* Added new dataset variant with symlinks to shared utilities
data/shakespeare_char_case_map/* Added new dataset variant with symlinks to shared utilities
Comments suppressed due to low confidence (1)

data/shakespeare_char_newlines_mod/utils:1

  • The symlink path '../template/utils/' has a trailing slash, which is inconsistent with other similar symlinks in the codebase (shakespeare_char_lowercase/utils and shakespeare_char_case_map/utils use '../template/utils' without a trailing slash). For consistency, remove the trailing slash.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gkielian and others added 3 commits February 4, 2026 00:13
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@gkielian gkielian merged commit cc4e365 into ReaLLMASIC:master Feb 4, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants