klei22 (Collaborator) commented Oct 31, 2025

This pull request introduces support for PKL-number parameterization in both linear and embedding layers across the codebase, enabling new experimental configurations for model training. PKL-number layers represent weights as a combination of two tensors (a + b * scale), and this update adds configuration options, implementation, and integration for these layers. It also updates the training argument parser and sweep configuration to allow easy experimentation with PKL variants and scaling factors.

The most important changes are:

PKL-number Linear and Embedding Layer Implementation:

  • Added PKLLinear and PKLEmbedding classes to variations/linear_variations.py, which represent weights as a + b * scale, including support for custom scaling, and integrated them into the linear_dictionary for easy selection.
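As a rough illustration of the parameterization, a minimal PKL-style linear layer might look like the following. This is only a sketch: the actual PKLLinear in variations/linear_variations.py also accepts config, method, bits, and bias arguments, and the class name here is chosen just for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PKLLinearSketch(nn.Module):
    """Minimal sketch of a PKL-number linear layer: weight = a + b * scale."""

    def __init__(self, in_features, out_features, scale=1.0):
        super().__init__()
        # Two independent weight components; their combination forms the
        # effective weight used in the forward pass.
        self.linear_a = nn.Linear(in_features, out_features, bias=False)
        self.linear_b = nn.Linear(in_features, out_features, bias=False)
        # The scaling factor is stored as a non-trainable buffer.
        self.register_buffer("scale", torch.tensor(float(scale)))

    @property
    def weight(self):
        # Recombine on every access so parameter updates are reflected.
        scale = self.scale.to(self.linear_b.weight.dtype)
        return self.linear_a.weight + scale * self.linear_b.weight

    def forward(self, x):
        return F.linear(x, self.weight)
```

The weight property above mirrors the combination quoted in the review comments below; everything else is illustrative scaffolding.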

Model Integration and Logic Updates:

  • Updated model.py to support PKL-number embeddings (use_pkl_wte) and LM head (use_pkl_lm_head), including correct instantiation, weight tying, and error handling for unsupported operations (e.g., quantization and numpy import/export).

Configuration and Argument Parsing:

  • Extended GPTConfig in gpt_conf.py and the training argument parser in train_args.py to include options for PKL-number layers, such as enabling PKL embeddings/LM head and specifying scaling factors for each.

Experiment Sweep Configuration:

  • Added a new sweep configuration file, explorations/pkl_number_sweep.yaml, to facilitate systematic experimentation with PKL-number variants, scaling factors, and baseline comparisons.
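The contents of explorations/pkl_number_sweep.yaml are not reproduced in this thread; conceptually, a sweep over the PKL flags and scales described above might look something like the fragment below. The keys and structure here are hypothetical and shown only to illustrate the kind of grid being explored; only the flag and scale names come from the PR itself.

```yaml
# Hypothetical sketch -- the real file may use different keys and structure.
# Sweeps PKL on/off with a few scaling factors against a non-PKL baseline.
parameter_groups:
  - use_pkl_wte: [true]
    use_pkl_lm_head: [true]
    pkl_wte_scale: [0.5, 1.0, 2.0]
    pkl_lm_head_scale: [0.5, 1.0, 2.0]
  - use_pkl_wte: [false]      # baseline without PKL layers
    use_pkl_lm_head: [false]
```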

Weight Tying and Gain Fusion Support:

  • Enhanced weight tying logic in model.py and gain fusion logic in variations/linear_variations.py to handle PKL-number layers correctly, ensuring proper sharing and scaling of weights in PKL mode.

Copilot AI left a comment

Pull Request Overview

This pull request introduces PKL number representations, which parameterize a weight matrix as a + b * scale, for the linear layers and embeddings in the GPT model, providing an alternative to storing a single weight matrix.

Key changes include:

  • Implementation of PKLLinear and PKLEmbedding classes with dual-component weight representation
  • Configuration options for controlling PKL usage in linear layers, embeddings, and the LM head
  • Weight tying support for PKL representations with separate handling for dual components
  • YAML configuration file for experimental PKL number sweeps

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:

  • variations/linear_variations.py: Adds PKLLinear and PKLEmbedding classes, FlashNorm wrapper support for PKL, and registers PKL in the linear dictionary
  • train_args.py: Adds command-line arguments for PKL scale factors and feature flags
  • model.py: Integrates PKL embeddings and linear layers into model initialization with weight tying logic and validation checks
  • gpt_conf.py: Adds PKL configuration fields to GPTConfig class
  • explorations/pkl_number_sweep.yaml: Defines experimental sweep configurations for PKL variants


Comment on lines +453 to +456
    @property
    def weight(self):
        scale = self.scale.to(self.embedding_b.weight.dtype)
        return self.embedding_a.weight + scale * self.embedding_b.weight
Copilot AI Oct 31, 2025

The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
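One way to act on this suggestion, sketched under the assumption that torch's internal Tensor._version counter can be relied on to detect in-place updates (it is incremented by optimizer steps and other in-place operations), would be a small cache around the combination. This helper is hypothetical and not part of the PR.

```python
import torch

class CachedPKLWeight:
    """Hypothetical cache for the combined a + b * scale weight.

    Reuses the last combined tensor while the underlying components are
    unchanged, detecting in-place mutation via torch's version counters.
    """

    def __init__(self):
        self._cached = None
        self._versions = None

    def get(self, a, b, scale):
        versions = (a._version, b._version)
        if self._cached is None or versions != self._versions:
            # Recompute only when a or b has been mutated in place.
            self._cached = a + scale.to(b.dtype) * b
            self._versions = versions
        return self._cached
```

Whether the saved recomputation outweighs the bookkeeping would need to be measured; documenting the intentional recomputation may be the simpler fix.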
Comment on lines +424 to +427
    @property
    def weight(self):
        scale = self.scale.to(self.linear_b.weight.dtype)
        return self.linear_a.weight + scale * self.linear_b.weight
Copilot AI Oct 31, 2025

The weight property recomputes the combined weight on every access. If this property is accessed frequently (e.g., during weight tying checks or logging), this could create unnecessary overhead. Consider caching the result or documenting that this is intentionally recomputed to reflect any parameter updates.
Comment on lines +281 to +282
lm_head.linear_a.weight = wte.embedding_a.weight
lm_head.linear_b.weight = wte.embedding_b.weight
Copilot AI Oct 31, 2025

Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
Comment on lines +287 to +288
self.lm_head.linear_a.weight = self.transformer.wte.embedding_a.weight
self.lm_head.linear_b.weight = self.transformer.wte.embedding_b.weight
Copilot AI Oct 31, 2025

Weight tying assigns linear_a.weight and linear_b.weight directly to embedding_a.weight and embedding_b.weight, but this doesn't tie the scale buffers. If the PKLEmbedding and PKLLinear have different scale values (via pkl_wte_scale vs pkl_lm_head_scale), the effective weights will differ even though the underlying components are tied. Either ensure scales match or document this behavior.
    """Linear layer that represents weights as a + b * scale."""

    def __init__(self, in_features, out_features, config=None, method=None, bits=None, bias=None, scale=None, **kwargs):
        super().__init__()
Copilot AI Oct 31, 2025

The bias parameter is accepted but always set to None (line 422), which could be confusing to API consumers. Consider either removing the bias parameter from the signature or raising an error if bias is set to a non-None value to make the API clearer.

Suggested change:

    super().__init__()
    if bias is not None:
        raise ValueError("PKLLinear does not support a bias term; bias must be None.")
Comment on lines +273 to +274
if config.use_pkl_wte != config.use_pkl_lm_head:
raise ValueError("Weight tying with PKL numbers requires both WTE and LM head to use PKL representations.")
Copilot AI Oct 31, 2025

[nitpick] This validation check only occurs when wte_weight_tying is True. Consider also validating this condition earlier (e.g., right after WTE and LM head initialization) to fail fast and provide clearer error messages before weight initialization occurs.
from variations.position_encoding_variations import QuantizedEmbedding, RotaryEmbedding, SymmetricalOverlapAngularPositions, FIRE
from variations.activation_variations import activation_dictionary
from variations.linear_variations import linear_dictionary
from variations.linear_variations import linear_dictionary, PKLEmbedding, PKLLinear
Copilot AI Oct 31, 2025

Import of 'linear_dictionary' is not used.

Suggested change:

    from variations.linear_variations import PKLEmbedding, PKLLinear