Add ResiDual experiment and support #692
Conversation
Pull request overview
This PR introduces a dual-residual (ResiDual) Transformer architecture as an experimental variant alongside the standard residual path. The implementation adds a second residual stream that accumulates attention and MLP outputs separately from the normalized stream, enabling comparative studies between architectural approaches.
Key Changes:
- Added `use_residual_dual` configuration flag to toggle the dual-residual architecture
- Implemented `residual_dual_forward` function that maintains two parallel streams: a normalized stream and an accumulating dual residual stream
- Modified all three forward methods in `model.py` to conditionally support dual-residual processing with appropriate normalization at initialization and output
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| gpt_conf.py | Added use_residual_dual boolean configuration field to GPTConfig |
| train_args.py | Added command-line argument --use_residual_dual to enable the ResiDual architecture |
| variations/block_variations.py | Implemented residual_dual_forward function, registered it in variation dictionaries, and added conditional logic for dual-residual normalization and gradient checkpointing |
| model.py | Modified initialization to auto-configure post-layer norms when dual-residual is enabled, added dual_input_norm and dual_output_norm layers, and updated all forward methods to handle tuple inputs/outputs for dual streams |
| explorations/residual_dual_comparison.yaml | Created experiment configuration to compare standard pre-LN baseline against the ResiDual variant with identical hyperparameters |
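For orientation, here is a minimal, self-contained sketch of the model-level wiring implied by the table above when `use_residual_dual` is enabled: two streams are threaded through every block and recombined at the output. The `dual_input_norm` and `dual_output_norm` names come from the `model.py` summary, but `DualBlockStub`, the stream initialization, and the final recombination are assumptions rather than the PR's literal code.

```python
import torch
import torch.nn as nn

# Hedged sketch, not the PR's code: shows only the dual-stream bookkeeping.
n_embd, n_layer, batch, seq = 8, 2, 1, 4

class DualBlockStub(nn.Module):
    """Stand-in for the repo's Block in dual-residual mode:
    takes and returns the (x_norm, dual) tuple."""
    def __init__(self):
        super().__init__()
        self.mix = nn.Linear(n_embd, n_embd)   # placeholder for attention + MLP
        self.dual_residual_norm = nn.LayerNorm(n_embd)

    def forward(self, x_norm, dual):
        update = self.mix(x_norm)               # sublayer outputs computed from the normalized stream
        dual = dual + update                    # dual stream accumulates raw sublayer outputs
        x_norm = self.dual_residual_norm(dual)  # normalized stream handed to the next layer
        return x_norm, dual

dual_input_norm = nn.LayerNorm(n_embd)
dual_output_norm = nn.LayerNorm(n_embd)
blocks = nn.ModuleList(DualBlockStub() for _ in range(n_layer))

x = torch.randn(batch, seq, n_embd)            # embedding output
x_norm, dual = dual_input_norm(x), x           # initialize both streams (assumption)
for block in blocks:
    x_norm, dual = block(x_norm, dual)         # tuple threaded through every block
out = x_norm + dual_output_norm(dual)          # streams recombined before the head (assumption)
```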
Comments suppressed due to low confidence (2)
model.py:492
- When `use_residual_dual` is enabled and learned position embeddings (LPE) are used, the LPE blocks will inherit the `use_residual_dual=True` setting (via deep copy on line 57), causing them to expect tuple inputs `(x_norm, dual_resid)`. However, on line 492, the LPE forward method is called with a single tensor `x`, which will cause a runtime error. The LPE blocks should either be explicitly configured to not use dual residual mode, or the input should be adapted to match the expected format.

```python
if self.config.n_lpe != 0 and self.config.target_layer_in_lpe == layer_idx:
    for lpe in self.learned_position_embeddings:
        out = lpe(b, t, x, iter_num)
```
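A minimal sketch of the first option the comment suggests: build the LPE blocks from a config copy with dual-residual mode disabled, so they keep the single-tensor interface. The helper name `build_lpe_blocks` and the `lpe_block_cls` parameter are hypothetical; only `config.n_lpe` and `use_residual_dual` come from the PR.

```python
import copy
import torch.nn as nn

def build_lpe_blocks(config, lpe_block_cls) -> nn.ModuleList:
    """Hypothetical helper: construct LPE blocks from a config copy that has
    dual-residual mode turned off, so each block expects a single tensor."""
    lpe_config = copy.deepcopy(config)
    lpe_config.use_residual_dual = False
    return nn.ModuleList(lpe_block_cls(lpe_config) for _ in range(config.n_lpe))
```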
variations/block_variations.py:390
- Variable `combined` is not used.

```python
combined = block.post_ln_mlp(combined)
```
```python
if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 715). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
```python
if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 835). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
| """Dual-residual forward used for ResiDual configuration.""" | ||
|
|
Copilot AI, Dec 2, 2025
[nitpick] The residual_dual_forward function signature differs from other block forward functions. It accepts and returns a tuple of tensors instead of a single tensor. Consider adding more detailed documentation about the dual-residual architecture, explaining what x_norm and dual_resid represent and how they flow through the network, to help future maintainers understand this architectural variant.
| """Dual-residual forward used for ResiDual configuration.""" | |
| """ | |
| Dual-residual forward used for ResiDual configuration. | |
| This function implements a dual-residual architecture, where two tensors are propagated: | |
| - `x_norm`: The normalized input tensor at the current layer, used for attention and MLP computations. | |
| - `dual_resid`: The accumulated residual tensor, which aggregates outputs from attention and MLP branches across layers. | |
| Unlike standard block forward functions that accept and return a single tensor, this function | |
| accepts and returns a tuple of tensors `(x_norm, dual_resid)`. This is necessary to support | |
| the dual-residual mechanism, where both the normalized input and the running residual are | |
| updated and passed to subsequent layers. | |
| Flow: | |
| - `attn_out` is computed from `x_norm` and added to both `x_norm` (to form `attn_res`) and `dual_resid`. | |
| - `mlp_out` is computed from `attn_res` and added to both `attn_res` (to form `combined`) and `dual_resid`. | |
| - `dual_resid` accumulates both `attn_out` and `mlp_out`. | |
| - `x_norm` is updated by normalizing the new `dual_resid` via `block.dual_residual_norm`. | |
| Returns: | |
| Tuple[torch.Tensor, torch.Tensor]: Updated (`x_norm`, `dual_resid`) for the next layer. | |
| This function should be used only in models configured for dual-residual flow. | |
| """ |
```python
combined = attn_res + mlp_out
if block.use_post_ln_mlp:
    combined = block.post_ln_mlp(combined)
```
Copilot AI, Dec 2, 2025
The combined variable is computed but never used. This appears to be dead code that should either be removed or the variable should be utilized in the return statement or dual_resid computation. Consider removing lines 388-390 if they serve no purpose in the dual-residual architecture.
Suggested change (replacing the three lines above):

```python
# Removed unused 'combined' computation (dead code)
# if block.use_post_ln_mlp:
#     combined = block.post_ln_mlp(combined)
```
```python
x = self.obtain_vector_from_layer_output(x)

if self.use_ln_f_input_mixer:
    layer_outputs.append(x)
```
Copilot AI, Dec 2, 2025
When use_ln_f_input_mixer is enabled with dual residual mode, the mixer receives normalized intermediate representations (x) rather than the actual residual streams. This may not produce the intended behavior, as the mixer operates on the normalized stream and then the dual stream is added afterward (line 537). Consider whether the mixer should operate on dual instead of x, or if this feature combination should be explicitly disallowed.
Suggested change:

```python
if self.config.use_residual_dual:
    layer_outputs.append(dual)
else:
    layer_outputs.append(x)
```
This pull request introduces support for a dual-residual (ResiDual) Transformer architecture alongside the standard residual path. The changes allow toggling between these two variants via configuration, add the necessary normalization and forward logic for the dual-residual path, and update argument parsing and experiment setup to enable comparative studies. The most important changes are grouped below:
ResiDual architecture implementation:
- Added the `use_residual_dual` configuration option to `GPTConfig`, argument parsing (`train_args.py`), and experiment YAML (`residual_dual_comparison.yaml`) to toggle the dual-residual path. [1] [2] [3]
- Implemented the dual-residual path in `Block` via the new `residual_dual_forward` function, with appropriate normalization and residual combining. This is late-bound to the block variations registry. [1] [2] [3] [4]
- Updated the forward methods (`model.py`) to support the dual-residual path, including additional normalization layers and handling of steering vectors and learned embeddings for both residual streams. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Experiment configuration and comparison:
- Added a new experiment configuration (`explorations/residual_dual_comparison.yaml`) to define experiment groups and shared parameters for comparing the standard and ResiDual architectures.

Block variations and normalization setup:
- Registered the `residual_dual` variant in block forward, normalization, and residual scaler setup registries to ensure the correct handling of dual-residual logic. [1] [2] [3]

These changes collectively enable flexible experimentation with the ResiDual architecture and ensure proper integration within the model and training pipeline.
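As a usage sketch, the new variant can presumably be toggled either with the `--use_residual_dual` command-line flag from `train_args.py` or directly on the config object. The snippet below assumes `GPTConfig` accepts keyword overrides and that `model.GPT` takes the config object, as in nanoGPT-style repositories; neither is confirmed by this PR.

```python
# Hedged usage sketch, assuming nanoGPT-style GPTConfig / GPT constructors.
from gpt_conf import GPTConfig
from model import GPT

baseline_cfg = GPTConfig()                              # standard pre-LN residual path
residual_dual_cfg = GPTConfig(use_residual_dual=True)   # ResiDual variant added by this PR

baseline_model = GPT(baseline_cfg)
residual_dual_model = GPT(residual_dual_cfg)
```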