ptx: Retain arrays until kernel is done executing #111
Description
The remote memory table implementation in Accelerate was not used properly in the PTX backend; as a result, arrays were marked as unused (and thus eligible to be freed or evicted from GPU memory) before the kernel that used those arrays had actually run. It seems (see the testing section below) that this caused unsoundness in practice: a program returned an incorrect result without any signal or error indicating that something was off.
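To make the hazard concrete, here is a minimal, self-contained Haskell model of the ordering bug. This is not the Accelerate API — every name below is hypothetical — and a forked thread stands in for the asynchronous GPU kernel:

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar

-- Hypothetical model of a device array: True = still resident on the GPU.
newtype DeviceArray = DeviceArray (MVar Bool)

allocate :: IO DeviceArray
allocate = DeviceArray <$> newMVar True

-- Mark the array as unused; the memory manager may now free or evict it.
release :: DeviceArray -> IO ()
release (DeviceArray alive) = modifyMVar_ alive (\_ -> pure False)

-- Kernel launches are asynchronous: this returns immediately while the
-- "kernel" keeps running in the background and signals 'done' at the end.
launchKernel :: DeviceArray -> IO (MVar ())
launchKernel (DeviceArray alive) = do
  done <- newEmptyMVar
  _ <- forkIO $ do
    threadDelay 10000            -- the kernel is still executing ...
    resident <- readMVar alive   -- ... and only now reads its input
    if resident
      then putMVar done ()
      else error "kernel read a freed array: silently wrong results"
  pure done

main :: IO ()
main = do
  arr  <- allocate
  done <- launchKernel arr
  -- Buggy ordering (the bug this PR fixes): releasing right after the
  -- launch call returns, while the kernel may still be executing:
  --   release arr; takeMVar done
  -- Fixed ordering: retain the array until the kernel is done executing.
  takeMVar done
  release arr
  putStrLn "kernel finished before its inputs were released"
```

Because the launch call returns before the kernel finishes, releasing at launch time races the kernel against the memory manager; whether the program misbehaves then depends on GC timing, which matches the flaky reproduction described under testing below.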
This PR fixes this problem. However, it does so in a somewhat hacky way in order to avoid having to rearchitect the entire PTX backend. See the comment above getCudaDevicePtr (at the location linked below) for more details on the design.

Motivation and context
accelerate-llvm/accelerate-llvm-ptx/src/Data/Array/Accelerate/LLVM/PTX/Execute/Marshal.hs, lines 51 to 55 at commit c469554
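For illustration only, here is a hedged sketch of what "retain until the kernel is done" can look like. The actual design is the one documented above getCudaDevicePtr in Marshal.hs; every name below is hypothetical, and the use count merely stands in for the remote memory table's bookkeeping:

```haskell
import Data.IORef

-- Hypothetical use count, standing in for the remote memory table's
-- notion of whether an array is still needed on the device.
type UseCount = IORef Int

retain, releaseOne :: UseCount -> IO ()
retain     uc = modifyIORef' uc (+ 1)
releaseOne uc = modifyIORef' uc (subtract 1)

-- While marshalling kernel arguments, bump each array's use count and
-- remember it, so the memory manager cannot evict it mid-kernel.
marshalArgument :: IORef [UseCount] -> UseCount -> IO ()
marshalArgument retained uc = do
  retain uc
  modifyIORef' retained (uc :)

-- Only after the kernel (or the stream it ran on) has synchronised do
-- we drop the references taken during marshalling.
releaseAfterKernel :: IORef [UseCount] -> IO ()
releaseAfterKernel retained = do
  mapM_ releaseOne =<< readIORef retained
  writeIORef retained []

main :: IO ()
main = do
  retained <- newIORef []
  input    <- newIORef (1 :: Int)  -- one array, already referenced once
  marshalArgument retained input
  -- ... launch the kernel and wait for it to complete here ...
  releaseAfterKernel retained
  print =<< readIORef input        -- back to 1: now safe to free/evict
```

The point of the pattern is simply that the decrement is deferred from "after marshalling" to "after kernel completion", which is what keeps the arrays alive for the kernel's whole execution.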
How has this been tested?
The existing full test suite passes, and, for the first time, so do all 4 tests in the current version of accelerate-tests.
The only reproducer we had for this problem, despite its apparent gravity, was a very large, unreadable program that sometimes gave incorrect results on my GTX 1050 Ti (which is no longer even supported by CUDA 13). The only solace is that with a smaller nursery (and thus more frequent GCs), e.g. +RTS -A1M, the problem reproduced very often, almost every run.

I have not fully ascertained that this PR fixes all unsoundness issues in that adbench-gmmgrad program, nor have I fully ascertained that the PR itself is completely sound. However, I believe that it is, and the adbench-gmmgrad program does reliably fail before this PR and succeed after it. Something, at least, improves with this PR.
The nice thing is that, after this PR, I no longer know of any reproducible unsoundness issues in Accelerate.