Add mc fixes #734

klei22 · 2026-02-01T07:43:43Z

This pull request adds new dataset variants and transformation methods for the Shakespeare character-level datasets, improves dataset preparation scripts, and enhances the flexibility and robustness of the data transformation utilities. The changes also update the demo script to use these new datasets and methods, and improve dependency checking and model loading in demo and sampling scripts.

New dataset variants and transformation methods:

Added three new dataset variants: shakespeare_char_case_map, shakespeare_char_lowercase, and shakespeare_char_newlines_mod, each with its own get_dataset.sh script and configuration that selects the matching transformation method.
Implemented new transformation methods in char_convert.py: lowercase, case_map, and newlines_mod, allowing for more flexible character-level preprocessing. These methods emit appropriate token lists and handle conversion logic.
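
The three transformations named above can be illustrated with a minimal sketch. This is not the code from char_convert.py: the function names, the convention of emitting exactly one output character per input character, and especially the newlines_mod behaviour (column offset since the last newline, taken modulo a hypothetical `mod` parameter) are assumptions made for illustration. Only the 'L'/'U'/'_' labels for case_map come from the review overview in this PR.

```python
def lowercase(text: str) -> str:
    """Lowercase each character, keeping the output the same length as the input."""
    out = []
    for ch in text:
        lowered = ch.lower()
        # A few Unicode characters lowercase to more than one character;
        # keep the original in that case so positions stay aligned (assumption).
        out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


def case_map(text: str) -> str:
    """Replace each character with 'L' (lowercase), 'U' (uppercase) or '_' (other)."""
    out = []
    for ch in text:
        if ch.islower():
            out.append("L")
        elif ch.isupper():
            out.append("U")
        else:
            out.append("_")
    return "".join(out)


def newlines_mod(text: str, mod: int = 10) -> str:
    """Hypothetical reading of newlines_mod: emit the offset since the last
    newline, modulo `mod`, as a single digit; newlines pass through unchanged."""
    if not 1 <= mod <= 10:
        raise ValueError("mod must be between 1 and 10 so each offset is one digit")
    out = []
    column = 0
    for ch in text:
        if ch == "\n":
            out.append("\n")
            column = 0  # restart counting after each newline
        else:
            out.append(str(column % mod))
            column += 1
    return "".join(out)


if __name__ == "__main__":
    sample = "Hi, Bob\nOk"
    print(lowercase(sample))
    print(case_map(sample))
    print(newlines_mod(sample, mod=3))
```

Keeping every output stream the same length as the input is a design choice of this sketch; it lets each transformed stream be read position by position alongside the original text.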

Dataset preparation and utilities:

Updated dataset folders to use shared template scripts for prepare.py and utils, reducing code duplication and ensuring consistency across datasets.
Improved spaCy model loading in char_convert.py by using a lazy initialization pattern and clear error messaging for missing models.
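
A lazy initialization pattern of the kind described above might look like the following sketch. The class name `LazyModel`, the loader function, the model name `en_core_web_sm`, and the error wording are all illustrative assumptions, not the contents of char_convert.py. The loader is injected so the pattern can be exercised without spaCy installed.

```python
class LazyModel:
    """Defer an expensive model load until the model is first requested."""

    def __init__(self, loader, name):
        self._loader = loader  # callable taking the model name
        self._name = name
        self._model = None     # nothing is loaded at construction time

    def get(self):
        if self._model is None:
            try:
                self._model = self._loader(self._name)
            except (ImportError, OSError) as exc:
                # ImportError: the library is missing.
                # OSError: spacy.load raises this when the model is not installed.
                raise RuntimeError(
                    f"Could not load '{self._name}'. Install spaCy with "
                    f"'pip install spacy' and the model with "
                    f"'python -m spacy download {self._name}'."
                ) from exc
        return self._model


def load_spacy_model(name):
    import spacy  # imported here so spaCy is only needed when actually used
    return spacy.load(name)


# Assumed model name; constructing the wrapper does not import spaCy.
nlp = LazyModel(load_spacy_model, "en_core_web_sm")
```

With this shape, transformation methods that never call `nlp.get()` run without spaCy, and the ones that do need it fail with an actionable message instead of a bare traceback.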

Demo and sampling script improvements:

Enhanced multicontext_demo.sh to check for spaCy and its English model before running, and to include the new dataset variants in training and sampling. Also updated sampling parameters for more comprehensive output.
Added a --weights_only option to sample.py and updated checkpoint loading logic for compatibility with different versions of PyTorch.
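
One way to make a --weights_only option coexist with different PyTorch versions is sketched below. It is an assumption about the general approach, not the logic in sample.py: the helper name `load_checkpoint`, the `load_fn` parameter, and the store_true form of the flag are hypothetical. The sketch passes `weights_only` only when the loader declares that parameter, which `torch.load` does from version 1.13 onward; PyTorch 2.6 and later also changed its default to True, which may be part of the motivation.

```python
import argparse
import inspect


def build_parser():
    parser = argparse.ArgumentParser(description="checkpoint loading sketch")
    # Flag form is an assumption; the PR only names the option itself.
    parser.add_argument(
        "--weights_only",
        action="store_true",
        help="restrict unpickling to tensors and plain containers",
    )
    return parser


def load_checkpoint(path, load_fn, device="cpu", weights_only=False):
    """Call load_fn (for example torch.load), forwarding weights_only only
    when the loader explicitly declares that parameter."""
    kwargs = {"map_location": device}
    if "weights_only" in inspect.signature(load_fn).parameters:
        kwargs["weights_only"] = weights_only
    return load_fn(path, **kwargs)
```

In a real script the call would be along the lines of `load_checkpoint("ckpt.pt", torch.load, weights_only=args.weights_only)`. Checking the signature rather than catching `TypeError` avoids masking unrelated errors raised during loading.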

Minor fixes:

Fixed a bug in train loss and standard deviation calculation in train.py by removing unnecessary conversion to NumPy arrays.

Copilot

Pull request overview

This PR enhances the multi-context character-level dataset framework by introducing three new Shakespeare dataset variants (case_map, lowercase, and newlines_mod) with corresponding transformation methods. The changes also improve error handling and dependency checking, update the demo script to showcase the new datasets, and fix a bug in train loss calculation.

Changes:

Added three new dataset variants with transformation methods: case_map (maps characters to 'L'/'U'/'_'), lowercase (converts text to lowercase), and newlines_mod (encodes position relative to newlines)
Improved spaCy dependency management with lazy loading and better error messages
Updated demo script to check for spaCy availability and include new datasets in training/sampling
Fixed train loss calculation bug and enhanced checkpoint loading compatibility
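
The spaCy availability check mentioned in the changes above lives in a shell script, but the idea can be sketched in Python. The helper names and the model package name `en_core_web_sm` are assumptions; the PR only says "its English model". `importlib.util.find_spec` reports whether a top-level module is installed without importing it.

```python
import importlib.util

# Assumed module names: spaCy itself plus one English model package.
REQUIRED_MODULES = ("spacy", "en_core_web_sm")


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def describe_missing(missing):
    """Build a short, actionable message for the user; empty if nothing is missing."""
    if not missing:
        return ""
    return "Missing dependencies: " + ", ".join(missing)
```

A demo script could run such a check before training starts and exit early with the message when the returned list is non-empty.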

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

File	Description
train.py	Fixed bug by removing unnecessary np.array conversion in loss calculation
sample.py	Added weights_only parameter and improved checkpoint loading compatibility
demos/multicontext_demo.sh	Added spaCy dependency checking and integrated new dataset variants
data/template/utils/char_convert.py	Implemented new transformation methods and lazy spaCy initialization
data/shakespeare_char_newlines_mod/*	Added new dataset variant with symlinks to shared utilities
data/shakespeare_char_lowercase/*	Added new dataset variant with symlinks to shared utilities
data/shakespeare_char_case_map/*	Added new dataset variant with symlinks to shared utilities

Comments suppressed due to low confidence (1)

data/shakespeare_char_newlines_mod/utils:1

The symlink path '../template/utils/' has a trailing slash, which is inconsistent with other similar symlinks in the codebase (shakespeare_char_lowercase/utils and shakespeare_char_case_map/utils use '../template/utils' without a trailing slash). For consistency, remove the trailing slash.

klei22 and others added 6 commits January 31, 2026 15:20

Add lowercase and case-map Shakespeare datasets

fa728f3

Add fixes for mc demo errors

cb9f3ab

Add newlines mod multicontext dataset

74422ce

Small arch and out dir updates

46770d1

Add lowercase and case map mc transforms

51ea93e

Add fix for char in word emb

cf43203

klei22 requested review from Copilot and gkielian February 1, 2026 07:43

Copilot AI reviewed Feb 1, 2026

sample.py (outdated, resolved)

sample.py (outdated, resolved)

data/template/utils/char_convert.py (outdated, resolved)

gkielian and others added 3 commits February 4, 2026 00:13

Update sample.py

d9451b3

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sample.py

6bbbda8

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update data/template/utils/char_convert.py

0a7b1be

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gkielian approved these changes Feb 4, 2026

gkielian merged commit cc4e365 into ReaLLMASIC:master Feb 4, 2026
10 checks passed
