Add ResiDual experiment and support #692
Conversation
Pull request overview
This PR introduces a dual-residual (ResiDual) Transformer architecture as an experimental variant alongside the standard residual path. The implementation adds a second residual stream that accumulates attention and MLP outputs separately from the normalized stream, enabling comparative studies between architectural approaches.
Key Changes:
- Added `use_residual_dual` configuration flag to toggle the dual-residual architecture
- Implemented `residual_dual_forward` function that maintains two parallel streams: a normalized stream and an accumulating dual residual stream
- Modified all three forward methods in `model.py` to conditionally support dual-residual processing with appropriate normalization at initialization and output
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| gpt_conf.py | Added use_residual_dual boolean configuration field to GPTConfig |
| train_args.py | Added command-line argument --use_residual_dual to enable the ResiDual architecture |
| variations/block_variations.py | Implemented residual_dual_forward function, registered it in variation dictionaries, and added conditional logic for dual-residual normalization and gradient checkpointing |
| model.py | Modified initialization to auto-configure post-layer norms when dual-residual is enabled, added dual_input_norm and dual_output_norm layers, and updated all forward methods to handle tuple inputs/outputs for dual streams |
| explorations/residual_dual_comparison.yaml | Created experiment configuration to compare standard pre-LN baseline against the ResiDual variant with identical hyperparameters |
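For orientation, here is a minimal, self-contained sketch of the model-level wiring implied by the table above when `use_residual_dual` is enabled: two streams are threaded through every block and recombined at the output. The `dual_input_norm` and `dual_output_norm` names come from the `model.py` summary, but `DualBlockStub`, the stream initialization, and the final recombination are assumptions rather than the PR's literal code.

```python
import torch
import torch.nn as nn

# Hedged sketch, not the PR's code: shows only the dual-stream bookkeeping.
n_embd, n_layer, batch, seq = 8, 2, 1, 4

class DualBlockStub(nn.Module):
    """Stand-in for the repo's Block in dual-residual mode:
    takes and returns the (x_norm, dual) tuple."""
    def __init__(self):
        super().__init__()
        self.mix = nn.Linear(n_embd, n_embd)   # placeholder for attention + MLP
        self.dual_residual_norm = nn.LayerNorm(n_embd)

    def forward(self, x_norm, dual):
        update = self.mix(x_norm)               # sublayer outputs computed from the normalized stream
        dual = dual + update                    # dual stream accumulates raw sublayer outputs
        x_norm = self.dual_residual_norm(dual)  # normalized stream handed to the next layer
        return x_norm, dual

dual_input_norm = nn.LayerNorm(n_embd)
dual_output_norm = nn.LayerNorm(n_embd)
blocks = nn.ModuleList(DualBlockStub() for _ in range(n_layer))

x = torch.randn(batch, seq, n_embd)            # embedding output
x_norm, dual = dual_input_norm(x), x           # initialize both streams (assumption)
for block in blocks:
    x_norm, dual = block(x_norm, dual)         # tuple threaded through every block
out = x_norm + dual_output_norm(dual)          # streams recombined before the head (assumption)
```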
Comments suppressed due to low confidence (2)
model.py:492
- When `use_residual_dual` is enabled and learned position embeddings (LPE) are used, the LPE blocks will inherit the `use_residual_dual=True` setting (via deep copy on line 57), causing them to expect tuple inputs `(x_norm, dual_resid)`. However, on line 492, the LPE forward method is called with a single tensor `x`, which will cause a runtime error. The LPE blocks should either be explicitly configured to not use dual residual mode, or the input should be adapted to match the expected format.

```python
if self.config.n_lpe != 0 and self.config.target_layer_in_lpe == layer_idx:
    for lpe in self.learned_position_embeddings:
        out = lpe(b, t, x, iter_num)
```
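A minimal sketch of the first option the comment suggests: build the LPE blocks from a config copy with dual-residual mode disabled, so they keep the single-tensor interface. The helper name `build_lpe_blocks` and the `lpe_block_cls` parameter are hypothetical; only `config.n_lpe` and `use_residual_dual` come from the PR.

```python
import copy
import torch.nn as nn

def build_lpe_blocks(config, lpe_block_cls) -> nn.ModuleList:
    """Hypothetical helper: construct LPE blocks from a config copy that has
    dual-residual mode turned off, so each block expects a single tensor."""
    lpe_config = copy.deepcopy(config)
    lpe_config.use_residual_dual = False
    return nn.ModuleList(lpe_block_cls(lpe_config) for _ in range(config.n_lpe))
```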
variations/block_variations.py:390
- Variable `combined` is not used.

```python
combined = block.post_ln_mlp(combined)
```
```python
if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 715). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
```python
if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 835). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
| """Dual-residual forward used for ResiDual configuration.""" | ||
|
|
Copilot AI, Dec 2, 2025
[nitpick] The residual_dual_forward function signature differs from other block forward functions. It accepts and returns a tuple of tensors instead of a single tensor. Consider adding more detailed documentation about the dual-residual architecture, explaining what x_norm and dual_resid represent and how they flow through the network, to help future maintainers understand this architectural variant.
| """Dual-residual forward used for ResiDual configuration.""" | |
| """ | |
| Dual-residual forward used for ResiDual configuration. | |
| This function implements a dual-residual architecture, where two tensors are propagated: | |
| - `x_norm`: The normalized input tensor at the current layer, used for attention and MLP computations. | |
| - `dual_resid`: The accumulated residual tensor, which aggregates outputs from attention and MLP branches across layers. | |
| Unlike standard block forward functions that accept and return a single tensor, this function | |
| accepts and returns a tuple of tensors `(x_norm, dual_resid)`. This is necessary to support | |
| the dual-residual mechanism, where both the normalized input and the running residual are | |
| updated and passed to subsequent layers. | |
| Flow: | |
| - `attn_out` is computed from `x_norm` and added to both `x_norm` (to form `attn_res`) and `dual_resid`. | |
| - `mlp_out` is computed from `attn_res` and added to both `attn_res` (to form `combined`) and `dual_resid`. | |
| - `dual_resid` accumulates both `attn_out` and `mlp_out`. | |
| - `x_norm` is updated by normalizing the new `dual_resid` via `block.dual_residual_norm`. | |
| Returns: | |
| Tuple[torch.Tensor, torch.Tensor]: Updated (`x_norm`, `dual_resid`) for the next layer. | |
| This function should be used only in models configured for dual-residual flow. | |
| """ |
```python
combined = attn_res + mlp_out
if block.use_post_ln_mlp:
    combined = block.post_ln_mlp(combined)
```
Copilot AI, Dec 2, 2025
The combined variable is computed but never used. This appears to be dead code that should either be removed or the variable should be utilized in the return statement or dual_resid computation. Consider removing lines 388-390 if they serve no purpose in the dual-residual architecture.
Suggested change (replacing the three lines above):

```python
# Removed unused 'combined' computation (dead code)
# if block.use_post_ln_mlp:
#     combined = block.post_ln_mlp(combined)
```
```python
x = self.obtain_vector_from_layer_output(x)

if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 537). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
Suggested change:

```python
if self.config.use_residual_dual:
    layer_outputs.append(dual)
else:
    layer_outputs.append(x)
```
This pull request introduces support for a dual-residual (ResiDual) Transformer architecture alongside the standard residual path. The changes allow toggling between these two variants via configuration, add the necessary normalization and forward logic for the dual-residual path, and update argument parsing and experiment setup to enable comparative studies. The most important changes are grouped below:
ResiDual architecture implementation:
- Added the `use_residual_dual` configuration option to `GPTConfig`, argument parsing (`train_args.py`), and experiment YAML (`residual_dual_comparison.yaml`) to toggle the dual-residual path. [1] [2] [3]
- Implemented the dual-residual path in `Block` via the new `residual_dual_forward` function, with appropriate normalization and residual combining. This is late-bound to the block variations registry. [1] [2] [3] [4]
- Updated the forward methods (`model.py`) to support the dual-residual path, including additional normalization layers and handling of steering vectors and learned embeddings for both residual streams. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Experiment configuration and comparison:
- Added a new experiment configuration (`explorations/residual_dual_comparison.yaml`) to define experiment groups and shared parameters for comparing the standard and ResiDual architectures.

Block variations and normalization setup:
- Registered the `residual_dual` variant in block forward, normalization, and residual scaler setup registries to ensure the correct handling of dual-residual logic. [1] [2] [3]

These changes collectively enable flexible experimentation with the ResiDual architecture and ensure proper integration within the model and training pipeline.
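As a usage sketch, the new variant can presumably be toggled either with the `--use_residual_dual` command-line flag from `train_args.py` or directly on the config object. The snippet below assumes `GPTConfig` accepts keyword overrides and that `model.GPT` takes the config object, as in nanoGPT-style repositories; neither is confirmed by this PR.

```python
# Hedged usage sketch, assuming nanoGPT-style GPTConfig / GPT constructors.
from gpt_conf import GPTConfig
from model import GPT

baseline_cfg = GPTConfig()                              # standard pre-LN residual path
residual_dual_cfg = GPTConfig(use_residual_dual=True)   # ResiDual variant added by this PR

baseline_model = GPT(baseline_cfg)
residual_dual_model = GPT(residual_dual_cfg)
```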