ptx: Retain arrays until kernel is done executing #111
Description
The remote memory table implementation in Accelerate was not used properly in the PTX backend; as a result, arrays were marked as unused (and thus eligible to be freed or evicted from GPU memory) before the kernel that used those arrays had actually run. It seems (see the testing section below) that this caused unsoundness in practice: a program returned an incorrect result without any signal or error indicating that something was off.
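To make the hazard concrete, here is a minimal, self-contained Haskell model of the ordering bug. This is not the Accelerate API — every name below is hypothetical — and a forked thread stands in for the asynchronous GPU kernel:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar

-- Hypothetical model of a device array: True = still resident on the GPU.
newtype DeviceArray = DeviceArray (MVar Bool)

allocate :: IO DeviceArray
allocate = DeviceArray <$> newMVar True

-- Mark the array as unused; the memory manager may now free or evict it.
release :: DeviceArray -> IO ()
release (DeviceArray alive) = modifyMVar_ alive (\_ -> pure False)

-- Kernel launches are asynchronous: this returns immediately while the
-- "kernel" keeps running in the background and signals 'done' at the end.
launchKernel :: DeviceArray -> IO (MVar ())
launchKernel (DeviceArray alive) = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    threadDelay 10000            -- the kernel is still executing ...
    resident <- readMVar alive   -- ... and only now reads its input
    if resident
      then putMVar done ()
      else error "kernel read a freed array: silently wrong results"
  pure done

main :: IO ()
main = do
  arr  <- allocate
  done <- launchKernel arr
  -- Buggy ordering (the bug this PR fixes): releasing right after the
  -- launch call returns, while the kernel may still be executing:
  --   release arr; takeMVar done
  -- Fixed ordering: retain the array until the kernel is done executing.
  takeMVar done
  release arr
  putStrLn "kernel finished before its inputs were released"
```

Because the launch call returns before the kernel finishes, releasing at launch time races the kernel against the memory manager; whether the program misbehaves then depends on GC timing, which matches the flaky reproduction described under testing below.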
This PR fixes this problem. However, it does so in a somewhat hacky way in order to avoid having to rearchitect the entire PTX backend. See the comment above getCudaDevicePtr (at the location linked below) for more details on the design.

Motivation and context
accelerate-llvm/accelerate-llvm-ptx/src/Data/Array/Accelerate/LLVM/PTX/Execute/Marshal.hs, lines 51 to 55 at commit c469554
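For illustration only, here is a hedged sketch of what "retain until the kernel is done" can look like. The actual design is the one documented above getCudaDevicePtr in Marshal.hs; every name below is hypothetical, and the use count merely stands in for the remote memory table's bookkeeping:

```haskell
import Data.IORef

-- Hypothetical use count, standing in for the remote memory table's
-- notion of whether an array is still needed on the device.
type UseCount = IORef Int

retain, releaseOne :: UseCount -> IO ()
retain     uc = modifyIORef' uc (+ 1)
releaseOne uc = modifyIORef' uc (subtract 1)

-- While marshalling kernel arguments, bump each array's use count and
-- remember it, so the memory manager cannot evict it mid-kernel.
marshalArgument :: IORef [UseCount] -> UseCount -> IO ()
marshalArgument retained uc = do
  retain uc
  modifyIORef' retained (uc :)

-- Only after the kernel (or the stream it ran on) has synchronised do
-- we drop the references taken during marshalling.
releaseAfterKernel :: IORef [UseCount] -> IO ()
releaseAfterKernel retained = do
  mapM_ releaseOne =<< readIORef retained
  writeIORef retained []

main :: IO ()
main = do
  retained <- newIORef []
  input    <- newIORef (1 :: Int)  -- one array, already referenced once
  marshalArgument retained input
  -- ... launch the kernel and wait for it to complete here ...
  releaseAfterKernel retained
  print =<< readIORef input        -- back to 1: now safe to free/evict
```

The point of the pattern is simply that the decrement is deferred from "after marshalling" to "after kernel completion", which is what keeps the arrays alive for the kernel's whole execution.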
How has this been tested?
The existing full test suite passes, and, for the first time, so do all 4 tests in the current version of accelerate-tests.
The only reproducer we had for this problem, despite its apparent gravity, was a very large, unreadable program that sometimes gave incorrect results on my GTX 1050 Ti (which is no longer even supported by CUDA 13). The only solace is that with a smaller nursery (and thus more frequent GCs), e.g. +RTS -A1M, the problem reproduced very often, almost every run.

I have not fully ascertained that this PR fixes all unsoundness issues in that adbench-gmmgrad program, nor have I fully ascertained that the PR itself is completely sound. However, I believe that it is, and the adbench-gmmgrad program does reliably fail before this PR and succeed after it. Something, at least, improves with this PR.
The nice thing is that, after this PR, I no longer know of any reproducible unsoundness issues in Accelerate.