ptx: Prevent Context use-after-free in finalizers #113
Description
See the Note in `accelerate-llvm-ptx/src/Data/Array/Accelerate/LLVM/PTX/Context.hs`.
How has this been tested?
This is somewhat tricky to test, as one needs to let GC run after the PTX context has no users any more. This was my test file:
```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Concurrent (threadDelay)
import Control.Monad
import System.IO (hFlush, stdout)

import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.Debug.Internal as A
import qualified Data.Array.Accelerate.LLVM.PTX as GPU

main :: IO ()
main = do
  print $ GPU.run $ A.sum (A.generate (A.I1 10000) (\(A.I1 i) -> A.toFloating i :: A.Exp Float))
  forM_ [1..5] $ \_ -> do
    threadDelay 1000000
    putChar '*' >> hFlush stdout
  A.traceM A.verbose "done"
```

Furthermore, I added additional debug prints in the finalizers of arrays and modules — as far as I can tell, these are the only places where a finalizer uses a CUDA context. These manual prints were necessary because simply passing `+ACC -ddump-gc` made the problem disappear, seemingly because more things were retained somehow.

The program above reliably fails on my machine (CUDA 12) and on Jizo (CUDA 13) before this PR, and reliably succeeds after it. Furthermore, my debug prints indicate that finalization order is indeed nondeterministic between the 1 module, 2 arrays and 1 context allocated in the above program — I've observed every possible order (apart from the two arrays, which I didn't bother to distinguish in the output). The STM-based synchronisation introduced in this PR seems to properly ensure that resources are explicitly freed only if the `Context` isn't already destroyed.

No automated test was added because this is tricky to do in an automated setting where the context is retained over invocations; creating a new context for the test would be possible. Do we want that?
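For context, here is a minimal sketch of the kind of STM guard this synchronisation relies on. This is an illustration only, not the code added in this PR, and all names (`ContextGuard`, `finalizeResource`, `finalizeContext`, ...) are hypothetical:

```haskell
import Control.Concurrent.STM
import Control.Exception (finally)
import Control.Monad (when)

-- Shared state coordinating the Context finalizer with the finalizers of
-- resources (device arrays, loaded modules) that live inside that context.
data ContextGuard = ContextGuard
  { alive    :: TVar Bool  -- ^ is the underlying CUDA context still alive?
  , inFlight :: TVar Int   -- ^ number of resource finalizers currently freeing
  }

newContextGuard :: IO ContextGuard
newContextGuard = ContextGuard <$> newTVarIO True <*> newTVarIO 0

-- Resource finalizer: only free explicitly if the context is still alive,
-- and register the free so the context cannot be destroyed underneath it.
finalizeResource :: ContextGuard -> IO () -> IO ()
finalizeResource g freeResource = do
  ok <- atomically $ do
    a <- readTVar (alive g)
    when a $ modifyTVar' (inFlight g) (+ 1)
    pure a
  when ok $
    freeResource `finally` atomically (modifyTVar' (inFlight g) (subtract 1))

-- Context finalizer: mark the context dead, wait for any in-flight frees to
-- drain, then destroy the CUDA context (destruction itself elided here).
finalizeContext :: ContextGuard -> IO () -> IO ()
finalizeContext g destroyContext = do
  atomically $ do
    writeTVar (alive g) False
    n <- readTVar (inFlight g)
    when (n /= 0) retry
  destroyContext
```

With a guard like this, a resource finalizer that happens to run after the context has been finalized simply becomes a no-op, which matches the behaviour described above: explicit frees only happen while the `Context` still exists.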
Types of changes
Checklist: