Skip to content

arm64: fix stack transform for complex functions and LR scratch usage#1

Open
tsenart wants to merge 14 commits intomhr3:mainfrom
tsenart:fix-arm64-stack-transform
Open

arm64: fix stack transform for complex functions and LR scratch usage#1
tsenart wants to merge 14 commits intomhr3:mainfrom
tsenart:fix-arm64-stack-transform

Conversation

@tsenart
Copy link

@tsenart tsenart commented Jan 12, 2026

Summary

Fixes ARM64 stack transform to correctly handle complex stack manipulation patterns and functions that use LR (x30) as a scratch register.

Changes

1. Fix asymmetric prologue/epilogue handling (c7fdb82)

  • NOPed SUB sp instructions in prologues but not corresponding ADD sp in epilogues
  • Now correctly NOPs both, leaving stack management to Go's TEXT declaration

2. Fix callee-saved register handling (c7fdb82)

  • Made isCalleeSavedRegPair robust: accepts both register orderings (x19,x20 or x20,x19)
  • Added isCalleeSavedReg helper for single-register STR/LDR
  • NOP all callee-saved register pairs consistently in both simple and complex paths

3. Fix non-writeback STP/LDP patterns (d699e28)

  • Handle stp x29, x30, [sp, #N] (offset mode) in addition to [sp, #-N]! (pre-index)
  • Functions using sub sp followed by offset-mode STP now work correctly

4. Panic on dynamic stack allocation (78db401)

  • Detect sub sp, sp, xN (register operand) patterns from alloca()/VLAs
  • Panic with clear error message since Go requires stack size at compile time

5. Preserve LR save/restore when used as scratch (03ed52d)

Per AAPCS64, x30 is NOT callee-saved and may be used as a GP scratch register.
When clang runs out of registers, it uses x30 for computation, corrupting the
return address if save/restore is NOPed.

  • Added usesLRAsScratch() to detect LR scratch usage (excludes RET, BL/BLR, STR/STP, LDR/LDP)
  • Added isLRStackSaveRestore() to identify SP-based LR save/restore patterns
  • Handle both arm64asm.Reg and arm64asm.RegSP types for x30/w30
  • Use isLRStackSaveRestore consistently in both code paths

Why This Is Correct

Go's ABI0 has no callee-saved registers - the Go assembler manages the stack frame via the TEXT declaration. We can safely NOP all C-compiler-generated callee-saved register save/restore sequences, except when the compiler uses a "callee-saved" register (like x30) as a scratch register.

Testing

Added comprehensive ARM64 stack transform tests:

  • TestArm64LeafFunction - no stack manipulation
  • TestArm64SimpleFramePointer - minimal 16-byte frame
  • TestArm64CalleeSavedRegisters - x19-x28 save/restore
  • TestArm64SimpleStackAllocation - symmetric sub/add pattern
  • TestArm64ComplexFramePointer - mov x29,sp / mov sp,x29 pattern
  • TestArm64FramePointerRestore - SP restore via frame pointer
  • TestArm64SingleRegisterCalleeSaved - individual STR/LDR
  • TestArm64RegisterPairOrdering - ascending/descending orderings
  • TestArm64NonCalleeSavedRegisters - x0-x15 should be kept
  • TestArm64LRUsedAsScratchSTR - STR-based LR save with scratch
  • TestArm64LRUsedAsScratchSTP - STP-based LR save with scratch
  • TestArm64LRNotUsedAsScratch - normal case, LR NOPed
  • TestArm64LRScratchWithSPInstruction - LR scratch in SP instruction
  • TestArm64LRSaveNotSPBased - non-SP-based LR save edge case
  • TestArm64DynamicStackAllocationPanics - alloca/VLA detection

All tests pass.

References

Fixes crashes in gocc-generated SIMD functions like indexFoldRabinKarp.

The stack transform for ARM64 functions with complex stack manipulation
(multiple callee-saved registers, stack alignment) was producing incorrect
assembly that caused 'unexpected return pc' crashes at runtime.

Two bugs were fixed:

1. Asymmetric prologue/epilogue handling: The transform NOPed SUB sp
   instructions in prologues but not the corresponding ADD sp instructions
   in epilogues. This left the stack pointer incorrectly adjusted on return.

2. Inconsistent callee-saved register handling: The 'simple' path correctly
   NOPed all callee-saved register save/restore (X19-X30 pairs), but the
   'complex' path kept them with incorrect SP-relative offsets.

The fix:
- Add ADD to the SP-modifying instruction case (alongside AND/SUB)
- NOP all callee-saved register pairs in both paths using shared helpers
- Make isCalleeSavedRegPair robust: accept both register orderings
  (ascending x19,x20 or descending x20,x19) and verify SP base
- Add isCalleeSavedReg helper for single-register STR/LDR
- Remove 'experimental transform' warnings since transform now works

This is correct for Go's ABI0 which has no callee-saved registers - the
Go assembler manages the stack frame via the TEXT declaration.

Fixes crashes in gocc-generated SIMD functions like indexFoldRabinKarp.

Amp-Thread-ID: https://ampcode.com/threads/T-019bb16a-c1a7-708b-bd03-d7c53379ff68
Co-authored-by: Amp <amp@ampcode.com>
@tsenart tsenart force-pushed the fix-arm64-stack-transform branch from 20553dc to c7fdb82 Compare January 12, 2026 09:20
tsenart and others added 3 commits January 12, 2026 10:57
Three fixes for ARM64 stack handling:

1. Remove -mstackrealign from ARM64 clang flags (clang.go)
   ARM64 doesn't need it because Go guarantees 16-byte stack alignment
   matching AAPCS64. The flag causes complex frame-pointer patterns.

2. Fix STP/LDP stack size detection (stack_transform.go)
   Only treat STP/LDP as stack allocation when using SP writeback mode
   (AddrPreIndex like [sp, #-N]!). Plain [sp, #N] without writeback is
   just a save/restore at an offset, not an allocation.

3. Remove -arch x86_64 and sysroot flags for Darwin (arch.go)
   These flags are Apple clang specific and break Homebrew clang.
   The --target flag is sufficient for cross-compilation.

Fixes stack corruption crashes when clang generates Pattern B:
  sub sp, sp, #160      ; explicit allocation
  stp x29, x30, [sp, #80]  ; save at offset (no writeback)
  ...
  add sp, sp, #160      ; explicit deallocation

Previously the STP was incorrectly overwriting extraStack to -80.
C code using alloca() or variable-length arrays generates dynamic stack
allocation like 'sub sp, sp, x8' where the size is in a register.
This is impossible to handle because Go requires stack size to be known
at compile time.

Detect this pattern and panic with a clear error message instead of
silently producing corrupted stack frames.
Per AAPCS64, x30 is NOT callee-saved and may be used as a general-purpose
scratch register by the callee, provided the original LR value is saved
and restored. When clang runs out of registers, it may use x30 for
computation, which corrupts the return address if the save/restore is NOPed.

This change:
- Adds usesLRAsScratch() to detect when LR is used as a scratch register
  (excludes RET, BL/BLR, STR/STP, LDR/LDP which have special semantics)
- Adds isLRStackSaveRestore() to identify SP-based LR save/restore patterns
- Uses isLRStackSaveRestore consistently in both !complexManip and
  complexManip paths to preserve LR save/restore when needed
- Handles both arm64asm.Reg and arm64asm.RegSP types for x30/w30

Tests added:
- TestArm64LRUsedAsScratchSTR: STR-based LR save with scratch use
- TestArm64LRUsedAsScratchSTP: STP-based LR save with scratch use
- TestArm64LRNotUsedAsScratch: Normal case where LR save is NOPed
- TestArm64LRScratchWithSPInstruction: LR scratch in SP-referencing instruction
- TestArm64LRSaveNotSPBased: Non-SP-based LR save edge case

Fixes incorrect binary encodings in test data for stp x20,x19 instructions.

References:
- AAPCS64: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
- ARM registers: https://developer.arm.com/documentation/102374/latest/

Amp-Thread-ID: https://ampcode.com/threads/T-019bb1c4-4ca2-75f9-85be-caab9358ab02
Co-authored-by: Amp <amp@ampcode.com>
@tsenart tsenart changed the title internal/asm: fix ARM64 stack transform for complex functions arm64: fix stack transform for complex functions and LR scratch usage Jan 12, 2026
Comment on lines 175 to 176
assert.NotEqual(t, "NOP", modified.Lines[6].Disassembled, "str x8,[sp] should be transformed, not NOPed")
assert.NotEqual(t, "NOP", modified.Lines[7].Disassembled, "ldr x8,[sp] should be transformed, not NOPed")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be testing the actual instruction, not != NOP

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plus the instruction needs to be using go's virtual SP, not the hardware one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bab0296 - tests now assert exact instructions like MOVD R8, 8(RSP) instead of != NOP.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bab0296 - tests now assert exact instructions like MOVD R8, 8(RSP) instead of != NOP.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, RSP is correct here. Go assembly has two stack pointer concepts:

  • SP (pseudo-register): Requires a symbol prefix like x0spill-32(SP) for frame-relative addressing. Per go.dev/doc/asm: "references to SP require a symbol"
  • RSP (hardware register): Used for direct stack memory access like 8(RSP)

The Go runtime itself uses RSP for direct stack operations - e.g., in runtime/asm_arm64.s:

STP (R7, R3), 8(RSP)

Using SP without a symbol causes "cannot reference SP without a symbol" errors.

The tests now show both patterns:

  • MOVD R8, 8(RSP) - direct stack access (complexManip path)
  • MOVD R0, x0spill-32(SP) - symbolic frame-relative (non-complexManip path with spill naming)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify when gocc uses each:

RSP (hardware SP): Used in the complexManip path when the function has dynamic stack patterns (like mov sp, x29 for stack restore). Here gocc can't easily compute symbolic offsets, so it preserves the raw hardware addressing.

SP (virtual SP with symbol): Used in the simpler non-complexManip path where gocc can compute frame-relative offsets and generate proper spill names like x0spill-32(SP).

The key constraint is Go's assembler - it requires SP references to have a symbol prefix. Using bare 8(SP) fails to assemble.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, Go requires a symbol prefix, so generate one

gocc should never ouput instructions working with hardware SP, C stack operations addressing <0(SP) to (stack_size)SP> need to be transformed into Golang's <name-stack_size(SP) to name-0(SP)>

tsenart and others added 4 commits January 12, 2026 12:50
Per review feedback, test assertions now verify the exact instruction
output rather than just checking it's not NOP. This better documents
the expected behavior and catches regressions more precisely.

For stack memory operations:
- Simple stack allocation uses RSP (hardware SP) for direct access
- Non-callee-saved spills use symbolic SP with spill naming
- LR save/restore when used as scratch uses RSP
Clang sometimes generates VLAs using an indirect pattern:
  mov x8, sp       ; copy SP to temp
  sub x8, x8, x12  ; subtract dynamic size from temp
  mov sp, x8       ; write back to SP

This pattern was not detected by the existing 'sub sp, sp, reg' check.

The fix detects 'mov sp, <reg>' where reg is NOT x29 (frame pointer).
'mov sp, x29' is valid for frame pointer restore in epilogues, but
any other register indicates dynamic stack allocation (VLA/alloca).

This fixes SIGBUS crashes in generated code by properly rejecting
unsupported VLA patterns at generation time with a clear error.
Use a two-pass approach to distinguish prologue/epilogue saves from
data stores:

Pass 1: Collect all callee-saved STP/LDP/STR/LDR instructions, tracking
their registers, stack offsets, and whether they are stores or loads.

Pass 2: Only NOP a callee-saved instruction if there is a matching
store+load pair with the same registers at the same offset. This ensures
data stores to callee-saved registers are preserved.

The x30UsedAsScratch logic is preserved: if x30 is used as scratch, its
save/restore is kept; otherwise it's NOPed if it's part of a matching pair.

Fixes incorrect NOPing of STP/LDP instructions that were data stores
rather than prologue/epilogue register saves.

Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99
Co-authored-by: Amp <amp@ampcode.com>
The previous fix only applied the callee-saved store/load pair matching
to the !complexManip path. Functions with SUB sp instructions (like
indexFoldNEONC with $368 frame size) go through the complexManip path,
which was still using the old isCalleeSavedRegPair logic that NOPed all
callee-saved pairs unconditionally.

Apply the same two-pass approach to the complexManip path:
- Pass 1: Collect callee-saved STP/LDP/STR/LDR info
- Pass 2: Only NOP when both store AND load exist at same (regs, offset)

This fixes nil pointer crashes in generated code where data stores using
callee-saved registers were incorrectly NOPed.

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99
@tsenart tsenart force-pushed the fix-arm64-stack-transform branch from bbe5ae0 to d9ca362 Compare January 12, 2026 16:41
tsenart and others added 6 commits January 12, 2026 17:52
Add TestArm64LRUsedAsScratchADD to verify that when x30 is used as
a scratch register via ADD (3-register form like 'add x30, x7, x11'),
the LR save/restore is correctly preserved.

This pattern is common in clang-generated code when register pressure
is high (e.g., veloz's indexFoldNEONC function).

Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99
Co-authored-by: Amp <amp@ampcode.com>
…ills)

Only match single-register STR/LDR for LR (x30) in the callee-save
matching logic. Single-register STR/LDR of x19-x28 are typically
intra-function spills (scratch register saves) that MUST be preserved.

Clang saves x19-x28 via STP (paired) in prologues, so single-register
STR/LDR of these registers are not prologue saves - they're mid-function
spills where a callee-saved register is temporarily used as scratch.

This fixes the bug described in ISSUE-incorrect-nop-stack-spills.md
where gocc was incorrectly NOPing intra-function spills like:
  str x26, [sp, #48]  // save x26 before using as scratch
  ...                 // use x26 as scratch
  ldr x26, [sp, #48]  // restore x26

These were being NOPed because they matched as a store+load pair,
but they're essential for correctness when the register is clobbered
between the save and restore.

Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99
Co-authored-by: Amp <amp@ampcode.com>
Add TestArm64IntraFunctionSpillPreserved which reproduces the bug where
single-register STR/LDR of x19-x28 with pre/post-index addressing were
incorrectly NOPed, causing data corruption.

The test uses pre-index STR and post-index LDR because the ARM64 decoder
returns different base register representations for different addressing modes:
- Offset mode: base = X15 (doesn't match SP comparison)
- Pre/post-index: base = SP (matches SP comparison)

The old buggy code would match both STR and LDR as callee-save pairs
(since both have the same register X26 and effective offset 0) and NOP them,
destroying intra-function spills.

This test fails with the old buggy code and passes with the fix.

Amp-Thread-ID: https://ampcode.com/threads/T-019bb33e-3b3f-73ec-87a0-06ffbe2deb4f
Functions using __attribute__((noinline, flatten)) or similar GNU
attributes were not being parsed because the C parser didn't have
a definition for __attribute__. Add a no-op macro definition.
ARM64 Clang emits .xword directives for 8-byte constants. Without
support for this directive, constants using .xword were silently
skipped, causing 'relocation target not defined' errors when the
generated code referenced them.

Add xword to:
- ConstAttrs lists for both amd64 and arm64 architectures
- constSizes map with size 8 (same as quad)
- directiveParseRegex pattern

Add test case with fixture demonstrating multiple constants
(one using .xword, one using .byte) attached to the same function.

Amp-Thread-ID: https://ampcode.com/threads/T-019bb4c5-2c57-779d-a7ad-f84605721ba7
Co-authored-by: Amp <amp@ampcode.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments