Add mc fixes #734
Conversation
Pull request overview
This PR enhances the multi-context character-level dataset framework by introducing three new Shakespeare dataset variants (case_map, lowercase, and newlines_mod) with corresponding transformation methods. The changes also improve error handling and dependency checking, update the demo script to showcase the new datasets, and fix a bug in train loss calculation.
Changes:
- Added three new dataset variants with transformation methods: case_map (maps characters to 'L'/'U'/'_'), lowercase (converts text to lowercase), and newlines_mod (encodes position relative to newlines)
- Improved spaCy dependency management with lazy loading and better error messages
- Updated demo script to check for spaCy availability and include new datasets in training/sampling
- Fixed train loss calculation bug and enhanced checkpoint loading compatibility
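The three transformations above can be sketched as plain functions. This is a hypothetical illustration of the described behavior, not the PR's `char_convert.py` implementation; the function names, the `mod` parameter, and the digit-per-character encoding in `newlines_mod` are assumptions.

```python
# Illustrative sketches of the three transformations described in the PR.
# The real methods live in data/template/utils/char_convert.py and may
# differ in naming and token conventions.

def case_map(text: str) -> str:
    """Map each character to 'U' (uppercase), 'L' (lowercase), or '_' (other)."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("U")
        elif ch.islower():
            out.append("L")
        else:
            out.append("_")
    return "".join(out)


def lowercase(text: str) -> str:
    """Convert the text to lowercase."""
    return text.lower()


def newlines_mod(text: str, mod: int = 10) -> str:
    """Encode each character's position since the last newline, modulo `mod`."""
    out = []
    col = 0
    for ch in text:
        out.append(str(col % mod))
        col = 0 if ch == "\n" else col + 1
    return "".join(out)
```

For example, `case_map("To be,")` yields `"UL_LL_"`, giving the model a parallel context channel that carries only letter-case structure.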
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Summary per file:
| File | Description |
|---|---|
| train.py | Fixed bug by removing unnecessary np.array conversion in loss calculation |
| sample.py | Added weights_only parameter and improved checkpoint loading compatibility |
| demos/multicontext_demo.sh | Added spaCy dependency checking and integrated new dataset variants |
| data/template/utils/char_convert.py | Implemented new transformation methods and lazy spaCy initialization |
| data/shakespeare_char_newlines_mod/* | Added new dataset variant with symlinks to shared utilities |
| data/shakespeare_char_lowercase/* | Added new dataset variant with symlinks to shared utilities |
| data/shakespeare_char_case_map/* | Added new dataset variant with symlinks to shared utilities |
Comments suppressed due to low confidence (1)
data/shakespeare_char_newlines_mod/utils:1
- The symlink path '../template/utils/' has a trailing slash, which is inconsistent with other similar symlinks in the codebase (shakespeare_char_lowercase/utils and shakespeare_char_case_map/utils use '../template/utils' without a trailing slash). For consistency, remove the trailing slash.
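The suggested fix can be demonstrated in a scratch copy of the layout; `ln -sfn` replaces the existing link with one whose target has no trailing slash. The `scratch` root below is hypothetical, and only the directory names mirror the repo.

```shell
# Recreate the symlink without the trailing slash so it matches the
# shakespeare_char_lowercase and shakespeare_char_case_map variants.
mkdir -p scratch/data/template/utils scratch/data/shakespeare_char_newlines_mod
cd scratch/data/shakespeare_char_newlines_mod
ln -sfn ../template/utils utils
readlink utils
```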
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This pull request adds new dataset variants and transformation methods for the Shakespeare character-level datasets, improves dataset preparation scripts, and enhances the flexibility and robustness of the data transformation utilities. The changes also update the demo script to use these new datasets and methods, and improve dependency checking and model loading in demo and sampling scripts.
New dataset variants and transformation methods:
- Added three new dataset variants, `shakespeare_char_case_map`, `shakespeare_char_lowercase`, and `shakespeare_char_newlines_mod`, each with its own `get_dataset.sh` script and configuration to use the new transformation methods. [1] [2] [3]
- Added new transformation methods to `char_convert.py`: `lowercase`, `case_map`, and `newlines_mod`, allowing for more flexible character-level preprocessing. These methods emit appropriate token lists and handle the conversion logic. [1] [2] [3] [4] [5]

Dataset preparation and utilities:
- The new variants symlink the shared `prepare.py` and `utils`, reducing code duplication and ensuring consistency across datasets. [1] [2] [3]
- Improved spaCy dependency handling in `char_convert.py` with a lazy initialization pattern and clear error messaging for missing models. [1] [2]

Demo and sampling script improvements:
- Updated `multicontext_demo.sh` to check for spaCy and its English model before running, and to include the new dataset variants in training and sampling; also updated sampling parameters for more comprehensive output. [1] [2]
- Added a `--weights_only` option to `sample.py` and updated the checkpoint loading logic for compatibility with different versions of PyTorch. [1] [2]

Minor fixes:
- Fixed the train loss calculation in `train.py` by removing an unnecessary conversion to NumPy arrays.
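The version-tolerant checkpoint loading described for `sample.py` can be sketched as below. This is an assumed pattern, not the PR's code: `torch.load` only accepts `weights_only` in newer PyTorch releases, so a `TypeError` fallback keeps older versions working.

```python
# Sketch of checkpoint loading that works across PyTorch versions.
import torch


def load_checkpoint(path, device="cpu", weights_only=True):
    """Load a checkpoint, passing weights_only where the torch version supports it."""
    try:
        # Newer PyTorch: weights_only restricts unpickling to plain
        # tensors/containers for safer loading of untrusted files.
        return torch.load(path, map_location=device, weights_only=weights_only)
    except TypeError:
        # Older PyTorch: torch.load has no weights_only argument.
        return torch.load(path, map_location=device)
```

A `--weights_only` CLI flag would then simply forward its value into this call.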