Add PKL hyperparameter sweep exploration #673
Conversation
Pull Request Overview
This pull request introduces PKL number representations, which parameterize weights as a + b * scale rather than a single weight matrix, for linear layers and embeddings in the GPT model.
Key changes include:
- Implementation of `PKLLinear` and `PKLEmbedding` classes with dual-component weight representation
- Configuration options for controlling PKL usage in linear layers, embeddings, and the LM head
- Weight tying support for PKL representations with separate handling for dual components
- YAML configuration file for experimental PKL number sweeps
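To illustrate the idea, below is a minimal, self-contained sketch of an a + b * scale linear layer in PyTorch. It is not the PR's actual `PKLLinear` (which also handles config plumbing, FlashNorm wrapping, and registration in `linear_dictionary`); the attribute names mirror the review snippets, but everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoComponentLinear(nn.Module):
    """Illustrative linear layer whose effective weight is a + b * scale."""

    def __init__(self, in_features, out_features, scale=1.0):
        super().__init__()
        self.linear_a = nn.Linear(in_features, out_features, bias=False)
        self.linear_b = nn.Linear(in_features, out_features, bias=False)
        # Stored as a buffer so it follows .to(device) / dtype casts but is not trained.
        self.register_buffer("scale", torch.tensor(float(scale)))

    @property
    def weight(self):
        # Combined on the fly so gradients reach both components.
        scale = self.scale.to(self.linear_b.weight.dtype)
        return self.linear_a.weight + scale * self.linear_b.weight

    def forward(self, x):
        return F.linear(x, self.weight)
```

Because `weight` is exposed as a property, existing code that reads `module.weight` (logging, weight-tying checks, export) keeps working without changes.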
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| variations/linear_variations.py | Adds PKLLinear and PKLEmbedding classes, FlashNorm wrapper support for PKL, and registers PKL in the linear dictionary |
| train_args.py | Adds command-line arguments for PKL scale factors and feature flags |
| model.py | Integrates PKL embeddings and linear layers into model initialization with weight tying logic and validation checks |
| gpt_conf.py | Adds PKL configuration fields to GPTConfig class |
| explorations/pkl_number_sweep.yaml | Defines experimental sweep configurations for PKL variants |
```python
@property
def weight(self):
    scale = self.scale.to(self.embedding_b.weight.dtype)
    return self.embedding_a.weight + scale * self.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
```python
@property
def weight(self):
    scale = self.scale.to(self.linear_b.weight.dtype)
    return self.linear_a.weight + scale * self.linear_b.weight
```
Copilot AI commented on Oct 31, 2025
The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
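If the recomputation ever matters in practice, one option beyond caching is to fold the two components into a plain `nn.Linear` once for inference-only use, much as low-rank adapters are merged; during training the combination has to stay dynamic so gradients reach both parameters. A hypothetical helper, assuming a module that exposes the `weight` property shown above:

```python
import torch
import torch.nn as nn

def fold_pkl_linear(pkl_linear: nn.Module) -> nn.Linear:
    """Illustrative helper: materialize a + b * scale once for inference."""
    combined = pkl_linear.weight.detach().clone()  # uses the weight property above
    out_features, in_features = combined.shape
    folded = nn.Linear(in_features, out_features, bias=False)
    with torch.no_grad():
        folded.weight.copy_(combined)
    return folded
```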
```python
lm_head.linear_a.weight = wte.embedding_a.weight
lm_head.linear_b.weight = wte.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
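A hedged sketch of how the tie could be made consistent, for example by validating and then sharing the scale buffer at tying time (illustrative helper, not code from the PR; it assumes the `linear_a`/`linear_b`, `embedding_a`/`embedding_b`, and `scale` attributes shown in the snippets):

```python
import torch

def tie_pkl_weights(lm_head, wte):
    """Illustrative tying helper that keeps the effective weights identical."""
    if not torch.equal(lm_head.scale, wte.scale):
        raise ValueError(
            "PKL weight tying requires matching scales "
            f"(pkl_lm_head_scale={lm_head.scale.item()}, pkl_wte_scale={wte.scale.item()})."
        )
    # Share the underlying Parameters so optimizer updates affect both modules.
    lm_head.linear_a.weight = wte.embedding_a.weight
    lm_head.linear_b.weight = wte.embedding_b.weight
    # Share the buffer as well, so any later change to the scale is seen by both.
    lm_head.scale = wte.scale
```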
```python
self.lm_head.linear_a.weight = self.transformer.wte.embedding_a.weight
self.lm_head.linear_b.weight = self.transformer.wte.embedding_b.weight
```
Copilot AI commented on Oct 31, 2025
Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
| """Linear layer that represents weights as a + b * scale.""" | ||
|
|
||
| def __init__(self, in_features, out_features, config=None, method=None, bits=None, bias=None, scale=None, **kwargs): | ||
| super().__init__() |
Copilot AI commented on Oct 31, 2025
The bias parameter is accepted but always set to None (line 422), which could be confusing to API consumers. Consider either removing the bias parameter from the signature or raising an error if bias is set to a non-None value to make the API clearer.
Suggested change:

```diff
 super().__init__()
+if bias is not None:
+    raise ValueError("PKLLinear does not support a bias term; bias must be None.")
```
```python
if config.use_pkl_wte != config.use_pkl_lm_head:
    raise ValueError("Weight tying with PKL numbers requires both WTE and LM head to use PKL representations.")
```
Copilot AI commented on Oct 31, 2025
[nitpick] This validation check only occurs when wte_weight_tying is True. Consider also validating this condition earlier (e.g., right after WTE and LM head initialization) to fail fast and provide clearer error messages before weight initialization occurs.
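A sketch of such a fail-fast check, run on the config before any modules are constructed (illustrative; the flag and field names follow those used in the PR's check above):

```python
def validate_pkl_config(config):
    """Illustrative early validation for PKL weight-tying settings."""
    if getattr(config, "wte_weight_tying", False):
        if config.use_pkl_wte != config.use_pkl_lm_head:
            raise ValueError(
                "Weight tying with PKL numbers requires both WTE and LM head to use PKL "
                f"representations (use_pkl_wte={config.use_pkl_wte}, "
                f"use_pkl_lm_head={config.use_pkl_lm_head})."
            )
```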
```diff
 from variations.position_encoding_variations import QuantizedEmbedding, RotaryEmbedding, SymmetricalOverlapAngularPositions, FIRE
 from variations.activation_variations import activation_dictionary
-from variations.linear_variations import linear_dictionary
+from variations.linear_variations import linear_dictionary, PKLEmbedding, PKLLinear
```
Copilot AI commented on Oct 31, 2025
Import of 'linear_dictionary' is not used.
Suggested change:

```diff
-from variations.linear_variations import linear_dictionary, PKLEmbedding, PKLLinear
+from variations.linear_variations import PKLEmbedding, PKLLinear
```
This pull request introduces support for PKL-number parameterization in both linear and embedding layers across the codebase, enabling new experimental configurations for model training. PKL-number layers represent weights as a combination of two tensors (a + b * scale), and this update adds configuration options, implementation, and integration for these layers. It also updates the training argument parser and sweep configuration to allow easy experimentation with PKL variants and scaling factors.

The most important changes are:

PKL-number Linear and Embedding Layer Implementation:
Added `PKLLinear` and `PKLEmbedding` classes to `variations/linear_variations.py`, which represent weights as `a + b * scale`, including support for custom scaling, and integrated them into the `linear_dictionary` for easy selection.

Model Integration and Logic Updates:
Updated `model.py` to support PKL-number embeddings (`use_pkl_wte`) and LM head (`use_pkl_lm_head`), including correct instantiation, weight tying, and error handling for unsupported operations (e.g., quantization and numpy import/export).

Configuration and Argument Parsing:
Extended `GPTConfig` in `gpt_conf.py` and the training argument parser in `train_args.py` with options for PKL-number layers, such as enabling PKL embeddings/LM head and specifying scaling factors for each (see the illustrative flag sketch below).

Experiment Sweep Configuration:
Added `explorations/pkl_number_sweep.yaml` to facilitate systematic experimentation with PKL-number variants, scaling factors, and baseline comparisons.

Weight Tying and Gain Fusion Support:
Updated the weight-tying logic in `model.py` and the gain fusion logic in `variations/linear_variations.py` to handle PKL-number layers correctly, ensuring proper sharing and scaling of weights in PKL mode.
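For the configuration and argument-parsing item above, the new flags would plausibly mirror the config fields named in the review (use_pkl_wte, use_pkl_lm_head, pkl_wte_scale, pkl_lm_head_scale). An illustrative argparse sketch follows; the exact flag names, defaults, and help text in train_args.py may differ.

```python
import argparse

def add_pkl_args(parser: argparse.ArgumentParser) -> None:
    """Illustrative PKL flags; defaults here are placeholders, not the PR's values."""
    parser.add_argument("--use_pkl_wte", action="store_true",
                        help="Represent token-embedding weights as a + b * scale.")
    parser.add_argument("--use_pkl_lm_head", action="store_true",
                        help="Represent LM-head weights as a + b * scale.")
    parser.add_argument("--pkl_wte_scale", type=float, default=1.0,
                        help="Scale applied to the b component of the embedding.")
    parser.add_argument("--pkl_lm_head_scale", type=float, default=1.0,
                        help="Scale applied to the b component of the LM head.")
```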