Add PKL hyperparameter sweep exploration #673
Conversation
Pull Request Overview
This pull request introduces PKL number representations, which parameterize weights as a + b * scale rather than a single weight matrix, for linear layers and embeddings in the GPT model.
Key changes include:
- Implementation of `PKLLinear` and `PKLEmbedding` classes with dual-component weight representation
- Configuration options for controlling PKL usage in linear layers, embeddings, and the LM head
- Weight tying support for PKL representations with separate handling for dual components
- YAML configuration file for experimental PKL number sweeps
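To illustrate the idea, below is a minimal, self-contained sketch of an a + b * scale linear layer in PyTorch. It is not the PR's actual `PKLLinear` (which also handles config plumbing, FlashNorm wrapping, and registration in `linear_dictionary`); the attribute names mirror the review snippets, but everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoComponentLinear(nn.Module):
    """Illustrative linear layer whose effective weight is a + b * scale."""

    def __init__(self, in_features, out_features, scale=1.0):
        super().__init__()
        self.linear_a = nn.Linear(in_features, out_features, bias=False)
        self.linear_b = nn.Linear(in_features, out_features, bias=False)
        # Stored as a buffer so it follows .to(device) / dtype casts but is not trained.
        self.register_buffer("scale", torch.tensor(float(scale)))

    @property
    def weight(self):
        # Combined on the fly so gradients reach both components.
        scale = self.scale.to(self.linear_b.weight.dtype)
        return self.linear_a.weight + scale * self.linear_b.weight

    def forward(self, x):
        return F.linear(x, self.weight)
```

Because `weight` is exposed as a property, existing code that reads `module.weight` (logging, weight-tying checks, export) keeps working without changes.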
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| variations/linear_variations.py | Adds PKLLinear and PKLEmbedding classes, FlashNorm wrapper support for PKL, and registers PKL in the linear dictionary |
| train_args.py | Adds command-line arguments for PKL scale factors and feature flags |
| model.py | Integrates PKL embeddings and linear layers into model initialization with weight tying logic and validation checks |
| gpt_conf.py | Adds PKL configuration fields to GPTConfig class |
| explorations/pkl_number_sweep.yaml | Defines experimental sweep configurations for PKL variants |
```python
@property
def weight(self):
    scale = self.scale.to(self.embedding_b.weight.dtype)
    return self.embedding_a.weight + scale * self.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
```python
@property
def weight(self):
    scale = self.scale.to(self.linear_b.weight.dtype)
    return self.linear_a.weight + scale * self.linear_b.weight
```
Copilot AI commented on Oct 31, 2025
The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
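If the recomputation ever matters in practice, one option beyond caching is to fold the two components into a plain `nn.Linear` once for inference-only use, much as low-rank adapters are merged; during training the combination has to stay dynamic so gradients reach both parameters. A hypothetical helper, assuming a module that exposes the `weight` property shown above:

```python
import torch
import torch.nn as nn

def fold_pkl_linear(pkl_linear: nn.Module) -> nn.Linear:
    """Illustrative helper: materialize a + b * scale once for inference."""
    combined = pkl_linear.weight.detach().clone()  # uses the weight property above
    out_features, in_features = combined.shape
    folded = nn.Linear(in_features, out_features, bias=False)
    with torch.no_grad():
        folded.weight.copy_(combined)
    return folded
```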
```python
lm_head.linear_a.weight = wte.embedding_a.weight
lm_head.linear_b.weight = wte.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
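A hedged sketch of how the tie could be made consistent, for example by validating and then sharing the scale buffer at tying time (illustrative helper, not code from the PR; it assumes the `linear_a`/`linear_b`, `embedding_a`/`embedding_b`, and `scale` attributes shown in the snippets):

```python
import torch

def tie_pkl_weights(lm_head, wte):
    """Illustrative tying helper that keeps the effective weights identical."""
    if not torch.equal(lm_head.scale, wte.scale):
        raise ValueError(
            "PKL weight tying requires matching scales "
            f"(pkl_lm_head_scale={lm_head.scale.item()}, pkl_wte_scale={wte.scale.item()})."
        )
    # Share the underlying Parameters so optimizer updates affect both modules.
    lm_head.linear_a.weight = wte.embedding_a.weight
    lm_head.linear_b.weight = wte.embedding_b.weight
    # Share the buffer as well, so any later change to the scale is seen by both.
    lm_head.scale = wte.scale
```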
```python
self.lm_head.linear_a.weight = self.transformer.wte.embedding_a.weight
self.lm_head.linear_b.weight = self.transformer.wte.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
| """Linear layer that represents weights as a + b * scale.""" | ||
|
|
||
| def __init__(self, in_features, out_features, config=None, method=None, bits=None, bias=None, scale=None, **kwargs): | ||
| super().__init__() |
Copilot AI commented on Oct 31, 2025
The bias parameter is accepted but always set to None (line 422), which could be confusing to API consumers. Consider either removing the bias parameter from the signature or raising an error if bias is set to a non-None value to make the API clearer.
Suggested change:

```diff
 super().__init__()
+if bias is not None:
+    raise ValueError("PKLLinear does not support a bias term; bias must be None.")
```
```python
if config.use_pkl_wte != config.use_pkl_lm_head:
    raise ValueError("Weight tying with PKL numbers requires both WTE and LM head to use PKL representations.")
```
Copilot AI commented on Oct 31, 2025
[nitpick] This validation check only occurs when wte_weight_tying is True. Consider also validating this condition earlier (e.g., right after WTE and LM head initialization) to fail fast and provide clearer error messages before weight initialization occurs.
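A sketch of such a fail-fast check, run on the config before any modules are constructed (illustrative; the flag and field names follow those used in the PR's check above):

```python
def validate_pkl_config(config):
    """Illustrative early validation for PKL weight-tying settings."""
    if getattr(config, "wte_weight_tying", False):
        if config.use_pkl_wte != config.use_pkl_lm_head:
            raise ValueError(
                "Weight tying with PKL numbers requires both WTE and LM head to use PKL "
                f"representations (use_pkl_wte={config.use_pkl_wte}, "
                f"use_pkl_lm_head={config.use_pkl_lm_head})."
            )
```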
```diff
 from variations.position_encoding_variations import QuantizedEmbedding, RotaryEmbedding, SymmetricalOverlapAngularPositions, FIRE
 from variations.activation_variations import activation_dictionary
-from variations.linear_variations import linear_dictionary
+from variations.linear_variations import linear_dictionary, PKLEmbedding, PKLLinear
```
Copilot AI commented on Oct 31, 2025
Import of 'linear_dictionary' is not used.
Suggested change:

```diff
-from variations.linear_variations import linear_dictionary, PKLEmbedding, PKLLinear
+from variations.linear_variations import PKLEmbedding, PKLLinear
```
This pull request introduces support for PKL-number parameterization in both linear and embedding layers across the codebase, enabling new experimental configurations for model training. PKL-number layers represent weights as a combination of two tensors (a + b * scale), and this update adds configuration options, implementation, and integration for these layers. It also updates the training argument parser and sweep configuration to allow easy experimentation with PKL variants and scaling factors.

The most important changes are:

PKL-number Linear and Embedding Layer Implementation:
Added `PKLLinear` and `PKLEmbedding` classes to `variations/linear_variations.py`, which represent weights as `a + b * scale`, including support for custom scaling, and integrated them into the `linear_dictionary` for easy selection.

Model Integration and Logic Updates:
Updated `model.py` to support PKL-number embeddings (`use_pkl_wte`) and LM head (`use_pkl_lm_head`), including correct instantiation, weight tying, and error handling for unsupported operations (e.g., quantization and numpy import/export).

Configuration and Argument Parsing:
Extended `GPTConfig` in `gpt_conf.py` and the training argument parser in `train_args.py` with options for PKL-number layers, such as enabling PKL embeddings/LM head and specifying scaling factors for each (see the illustrative flag sketch below).

Experiment Sweep Configuration:
Added `explorations/pkl_number_sweep.yaml` to facilitate systematic experimentation with PKL-number variants, scaling factors, and baseline comparisons.

Weight Tying and Gain Fusion Support:
Updated the weight-tying logic in `model.py` and the gain fusion logic in `variations/linear_variations.py` to handle PKL-number layers correctly, ensuring proper sharing and scaling of weights in PKL mode.
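For the configuration and argument-parsing item above, the new flags would plausibly mirror the config fields named in the review (use_pkl_wte, use_pkl_lm_head, pkl_wte_scale, pkl_lm_head_scale). An illustrative argparse sketch follows; the exact flag names, defaults, and help text in train_args.py may differ.

```python
import argparse

def add_pkl_args(parser: argparse.ArgumentParser) -> None:
    """Illustrative PKL flags; defaults here are placeholders, not the PR's values."""
    parser.add_argument("--use_pkl_wte", action="store_true",
                        help="Represent token-embedding weights as a + b * scale.")
    parser.add_argument("--use_pkl_lm_head", action="store_true",
                        help="Represent LM-head weights as a + b * scale.")
    parser.add_argument("--pkl_wte_scale", type=float, default=1.0,
                        help="Scale applied to the b component of the embedding.")
    parser.add_argument("--pkl_lm_head_scale", type=float, default=1.0,
                        help="Scale applied to the b component of the LM head.")
```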