arm64: fix stack transform for complex functions and LR scratch usage#1
arm64: fix stack transform for complex functions and LR scratch usage#1
Conversation
The stack transform for ARM64 functions with complex stack manipulation (multiple callee-saved registers, stack alignment) was producing incorrect assembly that caused 'unexpected return pc' crashes at runtime. Two bugs were fixed: 1. Asymmetric prologue/epilogue handling: The transform NOPed SUB sp instructions in prologues but not the corresponding ADD sp instructions in epilogues. This left the stack pointer incorrectly adjusted on return. 2. Inconsistent callee-saved register handling: The 'simple' path correctly NOPed all callee-saved register save/restore (X19-X30 pairs), but the 'complex' path kept them with incorrect SP-relative offsets. The fix: - Add ADD to the SP-modifying instruction case (alongside AND/SUB) - NOP all callee-saved register pairs in both paths using shared helpers - Make isCalleeSavedRegPair robust: accept both register orderings (ascending x19,x20 or descending x20,x19) and verify SP base - Add isCalleeSavedReg helper for single-register STR/LDR - Remove 'experimental transform' warnings since transform now works This is correct for Go's ABI0 which has no callee-saved registers - the Go assembler manages the stack frame via the TEXT declaration. Fixes crashes in gocc-generated SIMD functions like indexFoldRabinKarp. Amp-Thread-ID: https://ampcode.com/threads/T-019bb16a-c1a7-708b-bd03-d7c53379ff68 Co-authored-by: Amp <amp@ampcode.com>
20553dc to
c7fdb82
Compare
Three fixes for ARM64 stack handling: 1. Remove -mstackrealign from ARM64 clang flags (clang.go) ARM64 doesn't need it because Go guarantees 16-byte stack alignment matching AAPCS64. The flag causes complex frame-pointer patterns. 2. Fix STP/LDP stack size detection (stack_transform.go) Only treat STP/LDP as stack allocation when using SP writeback mode (AddrPreIndex like [sp, #-N]!). Plain [sp, #N] without writeback is just a save/restore at an offset, not an allocation. 3. Remove -arch x86_64 and sysroot flags for Darwin (arch.go) These flags are Apple clang specific and break Homebrew clang. The --target flag is sufficient for cross-compilation. Fixes stack corruption crashes when clang generates Pattern B: sub sp, sp, #160 ; explicit allocation stp x29, x30, [sp, #80] ; save at offset (no writeback) ... add sp, sp, #160 ; explicit deallocation Previously the STP was incorrectly overwriting extraStack to -80.
C code using alloca() or variable-length arrays generates dynamic stack allocation like 'sub sp, sp, x8' where the size is in a register. This is impossible to handle because Go requires stack size to be known at compile time. Detect this pattern and panic with a clear error message instead of silently producing corrupted stack frames.
Per AAPCS64, x30 is NOT callee-saved and may be used as a general-purpose scratch register by the callee, provided the original LR value is saved and restored. When clang runs out of registers, it may use x30 for computation, which corrupts the return address if the save/restore is NOPed. This change: - Adds usesLRAsScratch() to detect when LR is used as a scratch register (excludes RET, BL/BLR, STR/STP, LDR/LDP which have special semantics) - Adds isLRStackSaveRestore() to identify SP-based LR save/restore patterns - Uses isLRStackSaveRestore consistently in both !complexManip and complexManip paths to preserve LR save/restore when needed - Handles both arm64asm.Reg and arm64asm.RegSP types for x30/w30 Tests added: - TestArm64LRUsedAsScratchSTR: STR-based LR save with scratch use - TestArm64LRUsedAsScratchSTP: STP-based LR save with scratch use - TestArm64LRNotUsedAsScratch: Normal case where LR save is NOPed - TestArm64LRScratchWithSPInstruction: LR scratch in SP-referencing instruction - TestArm64LRSaveNotSPBased: Non-SP-based LR save edge case Fixes incorrect binary encodings in test data for stp x20,x19 instructions. References: - AAPCS64: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst - ARM registers: https://developer.arm.com/documentation/102374/latest/ Amp-Thread-ID: https://ampcode.com/threads/T-019bb1c4-4ca2-75f9-85be-caab9358ab02 Co-authored-by: Amp <amp@ampcode.com>
| assert.NotEqual(t, "NOP", modified.Lines[6].Disassembled, "str x8,[sp] should be transformed, not NOPed") | ||
| assert.NotEqual(t, "NOP", modified.Lines[7].Disassembled, "ldr x8,[sp] should be transformed, not NOPed") |
There was a problem hiding this comment.
These should be testing the actual instruction, not != NOP
There was a problem hiding this comment.
plus the instruction needs to be using go's virtual SP, not the hardware one
There was a problem hiding this comment.
Fixed in bab0296 - tests now assert exact instructions like MOVD R8, 8(RSP) instead of != NOP.
There was a problem hiding this comment.
Fixed in bab0296 - tests now assert exact instructions like MOVD R8, 8(RSP) instead of != NOP.
There was a problem hiding this comment.
Actually, RSP is correct here. Go assembly has two stack pointer concepts:
- SP (pseudo-register): Requires a symbol prefix like
x0spill-32(SP)for frame-relative addressing. Per go.dev/doc/asm: "references to SP require a symbol" - RSP (hardware register): Used for direct stack memory access like
8(RSP)
The Go runtime itself uses RSP for direct stack operations - e.g., in runtime/asm_arm64.s:
STP (R7, R3), 8(RSP)
Using SP without a symbol causes "cannot reference SP without a symbol" errors.
The tests now show both patterns:
MOVD R8, 8(RSP)- direct stack access (complexManip path)MOVD R0, x0spill-32(SP)- symbolic frame-relative (non-complexManip path with spill naming)
There was a problem hiding this comment.
To clarify when gocc uses each:
RSP (hardware SP): Used in the complexManip path when the function has dynamic stack patterns (like mov sp, x29 for stack restore). Here gocc can't easily compute symbolic offsets, so it preserves the raw hardware addressing.
SP (virtual SP with symbol): Used in the simpler non-complexManip path where gocc can compute frame-relative offsets and generate proper spill names like x0spill-32(SP).
The key constraint is Go's assembler - it requires SP references to have a symbol prefix. Using bare 8(SP) fails to assemble.
There was a problem hiding this comment.
yes, Go requires a symbol prefix, so generate one
gocc should never ouput instructions working with hardware SP, C stack operations addressing <0(SP) to (stack_size)SP> need to be transformed into Golang's <name-stack_size(SP) to name-0(SP)>
Per review feedback, test assertions now verify the exact instruction output rather than just checking it's not NOP. This better documents the expected behavior and catches regressions more precisely. For stack memory operations: - Simple stack allocation uses RSP (hardware SP) for direct access - Non-callee-saved spills use symbolic SP with spill naming - LR save/restore when used as scratch uses RSP
Clang sometimes generates VLAs using an indirect pattern: mov x8, sp ; copy SP to temp sub x8, x8, x12 ; subtract dynamic size from temp mov sp, x8 ; write back to SP This pattern was not detected by the existing 'sub sp, sp, reg' check. The fix detects 'mov sp, <reg>' where reg is NOT x29 (frame pointer). 'mov sp, x29' is valid for frame pointer restore in epilogues, but any other register indicates dynamic stack allocation (VLA/alloca). This fixes SIGBUS crashes in generated code by properly rejecting unsupported VLA patterns at generation time with a clear error.
Use a two-pass approach to distinguish prologue/epilogue saves from data stores: Pass 1: Collect all callee-saved STP/LDP/STR/LDR instructions, tracking their registers, stack offsets, and whether they are stores or loads. Pass 2: Only NOP a callee-saved instruction if there is a matching store+load pair with the same registers at the same offset. This ensures data stores to callee-saved registers are preserved. The x30UsedAsScratch logic is preserved: if x30 is used as scratch, its save/restore is kept; otherwise it's NOPed if it's part of a matching pair. Fixes incorrect NOPing of STP/LDP instructions that were data stores rather than prologue/epilogue register saves. Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99 Co-authored-by: Amp <amp@ampcode.com>
The previous fix only applied the callee-saved store/load pair matching to the !complexManip path. Functions with SUB sp instructions (like indexFoldNEONC with $368 frame size) go through the complexManip path, which was still using the old isCalleeSavedRegPair logic that NOPed all callee-saved pairs unconditionally. Apply the same two-pass approach to the complexManip path: - Pass 1: Collect callee-saved STP/LDP/STR/LDR info - Pass 2: Only NOP when both store AND load exist at same (regs, offset) This fixes nil pointer crashes in generated code where data stores using callee-saved registers were incorrectly NOPed. Co-authored-by: Amp <amp@ampcode.com> Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99
bbe5ae0 to
d9ca362
Compare
Add TestArm64LRUsedAsScratchADD to verify that when x30 is used as a scratch register via ADD (3-register form like 'add x30, x7, x11'), the LR save/restore is correctly preserved. This pattern is common in clang-generated code when register pressure is high (e.g., veloz's indexFoldNEONC function). Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99 Co-authored-by: Amp <amp@ampcode.com>
…ills) Only match single-register STR/LDR for LR (x30) in the callee-save matching logic. Single-register STR/LDR of x19-x28 are typically intra-function spills (scratch register saves) that MUST be preserved. Clang saves x19-x28 via STP (paired) in prologues, so single-register STR/LDR of these registers are not prologue saves - they're mid-function spills where a callee-saved register is temporarily used as scratch. This fixes the bug described in ISSUE-incorrect-nop-stack-spills.md where gocc was incorrectly NOPing intra-function spills like: str x26, [sp, #48] // save x26 before using as scratch ... // use x26 as scratch ldr x26, [sp, #48] // restore x26 These were being NOPed because they matched as a store+load pair, but they're essential for correctness when the register is clobbered between the save and restore. Amp-Thread-ID: https://ampcode.com/threads/T-019bb306-be1e-77ab-a26e-3d7c1a731b99 Co-authored-by: Amp <amp@ampcode.com>
Add TestArm64IntraFunctionSpillPreserved which reproduces the bug where single-register STR/LDR of x19-x28 with pre/post-index addressing were incorrectly NOPed, causing data corruption. The test uses pre-index STR and post-index LDR because the ARM64 decoder returns different base register representations for different addressing modes: - Offset mode: base = X15 (doesn't match SP comparison) - Pre/post-index: base = SP (matches SP comparison) The old buggy code would match both STR and LDR as callee-save pairs (since both have the same register X26 and effective offset 0) and NOP them, destroying intra-function spills. This test fails with the old buggy code and passes with the fix. Amp-Thread-ID: https://ampcode.com/threads/T-019bb33e-3b3f-73ec-87a0-06ffbe2deb4f
Functions using __attribute__((noinline, flatten)) or similar GNU attributes were not being parsed because the C parser didn't have a definition for __attribute__. Add a no-op macro definition.
ARM64 Clang emits .xword directives for 8-byte constants. Without support for this directive, constants using .xword were silently skipped, causing 'relocation target not defined' errors when the generated code referenced them. Add xword to: - ConstAttrs lists for both amd64 and arm64 architectures - constSizes map with size 8 (same as quad) - directiveParseRegex pattern Add test case with fixture demonstrating multiple constants (one using .xword, one using .byte) attached to the same function. Amp-Thread-ID: https://ampcode.com/threads/T-019bb4c5-2c57-779d-a7ad-f84605721ba7 Co-authored-by: Amp <amp@ampcode.com>
Summary
Fixes ARM64 stack transform to correctly handle complex stack manipulation patterns and functions that use LR (x30) as a scratch register.
Changes
1. Fix asymmetric prologue/epilogue handling (c7fdb82)
SUB spinstructions in prologues but not correspondingADD spin epilogues2. Fix callee-saved register handling (c7fdb82)
isCalleeSavedRegPairrobust: accepts both register orderings (x19,x20 or x20,x19)isCalleeSavedReghelper for single-register STR/LDR3. Fix non-writeback STP/LDP patterns (d699e28)
stp x29, x30, [sp, #N](offset mode) in addition to[sp, #-N]!(pre-index)sub spfollowed by offset-mode STP now work correctly4. Panic on dynamic stack allocation (78db401)
sub sp, sp, xN(register operand) patterns from alloca()/VLAs5. Preserve LR save/restore when used as scratch (03ed52d)
Per AAPCS64, x30 is NOT callee-saved and may be used as a GP scratch register.
When clang runs out of registers, it uses x30 for computation, corrupting the
return address if save/restore is NOPed.
usesLRAsScratch()to detect LR scratch usage (excludes RET, BL/BLR, STR/STP, LDR/LDP)isLRStackSaveRestore()to identify SP-based LR save/restore patternsarm64asm.Regandarm64asm.RegSPtypes for x30/w30isLRStackSaveRestoreconsistently in both code pathsWhy This Is Correct
Go's ABI0 has no callee-saved registers - the Go assembler manages the stack frame via the TEXT declaration. We can safely NOP all C-compiler-generated callee-saved register save/restore sequences, except when the compiler uses a "callee-saved" register (like x30) as a scratch register.
Testing
Added comprehensive ARM64 stack transform tests:
TestArm64LeafFunction- no stack manipulationTestArm64SimpleFramePointer- minimal 16-byte frameTestArm64CalleeSavedRegisters- x19-x28 save/restoreTestArm64SimpleStackAllocation- symmetric sub/add patternTestArm64ComplexFramePointer- mov x29,sp / mov sp,x29 patternTestArm64FramePointerRestore- SP restore via frame pointerTestArm64SingleRegisterCalleeSaved- individual STR/LDRTestArm64RegisterPairOrdering- ascending/descending orderingsTestArm64NonCalleeSavedRegisters- x0-x15 should be keptTestArm64LRUsedAsScratchSTR- STR-based LR save with scratchTestArm64LRUsedAsScratchSTP- STP-based LR save with scratchTestArm64LRNotUsedAsScratch- normal case, LR NOPedTestArm64LRScratchWithSPInstruction- LR scratch in SP instructionTestArm64LRSaveNotSPBased- non-SP-based LR save edge caseTestArm64DynamicStackAllocationPanics- alloca/VLA detectionAll tests pass.
References
Fixes crashes in gocc-generated SIMD functions like
indexFoldRabinKarp.