Skip to content

Conversation

@tomsmeding
Copy link
Member

@tomsmeding tomsmeding commented Nov 16, 2025

Description

The remote memory table implementation in Accelerate was not used properly in the PTX backend; the result was that arrays were marked as unused (and thus eligible for freeing or evicting from GPU memory) before the kernel that used those arrays actually ran. It seems (see below in the testing section) that this resulted in unsoundness in practice where a program returned an incorrect result without any signal or error indicating something was off.

This PR fixes this problem. However, it does so in a somewhat hacky way in order to avoid having to rearchitect the entire PTX backend. See the comment above getCudaDevicePtr here for more details on the design.

Motivation and context

How has this been tested?

The existing full testsuite passes, as well as now (for the first time!) all 4 tests in the current version of accelerate-tests.

The only reproducer we had for this problem, despite its seeming gravity, was a very large unreadable program that gave incorrect results sometimes on my GTX 1050 Ti (which isn't even supported any more by Cuda 13). The only solace is that especially with a smaller nursery (and thus more frequent GCs) such as +RTS -A1M, the problem reproduced very often, almost every run.

I have not fully ascertained that this PR really fixes all unsoundness issues in that adbench-gmmgrad program, nor have I really ascertained that this PR is fully sound. However, I believe that the PR is indeed sound, and the adbench-gmmgrad program does reliably fail before this PR and succeed after this PR. It seems something, at least, improves with this PR.

The nice thing is that after this PR, I do not know of any reproducible unsoundness issues any more in Accelerate.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

The monad was originally a State monad, but only 'get' was ever used.
This is because the state is actually in mutable variables (MVars and
IORefs) inside the monad state. Changing the monad to a Reader monad is
1. more accurate, 2. allows parallelism, and 3. allows unlifting the
LLVM monad into IO, which is helpful (see the next commit).
This fixes a bug where on my GTX 1050 Ti, adbench-gmmgrad in
accelerate-tests nondeterministically returns incorrect results in the
following way:

Correct:
(Vector (Z :. 5) [-1.9073486e-3,-1.9073486e-3,-1.9073486e-3,-1.9073486e-3,-1.9073486e-3],Matrix (Z :. 5 :. 2)
  [ 0.0, 0.0,
    0.0, 0.0,
    0.0, 0.0,
    0.0, 0.0,
    0.0, 0.0],Matrix (Z :. 5 :. 3)
  [ 200.13342, 200.13342, -1.0,
    200.13342, 200.13342, -1.0,
    200.13342, 200.13342, -1.0,
    200.13342, 200.13342, -1.0,
    200.13342, 200.13342, -1.0])

Produced by PTX:
(Vector (Z :. 5) [-4.5776367e-5,-4.5776367e-5,-4.5776367e-5,-4.5776367e-5,-4.5776367e-5],Matrix (Z :. 5 :. 2)
  [ 6.770671, 6.770671,
    6.770671, 6.770671,
    6.770671,      0.0,
         0.0,      0.0,
         0.0,      0.0],Matrix (Z :. 5 :. 3)
  [ 200.13528, 200.13528, -1.0,
    200.13528, 200.13528, -1.0,
    200.13528, 200.13528, -1.0,
    200.13528, 200.13528, -1.0,
    200.13528, 200.13528, -1.0])

The differences in the first vector are likely just floating point
reordering inaccuracies, but the second result is complete bogus.
@ivogabe ivogabe merged commit 19bf351 into AccelerateHS:master Nov 26, 2025
0 of 27 checks passed
@tomsmeding tomsmeding deleted the ptx-fix-deviceptr branch November 26, 2025 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants