Resolved locally in #8, but we should have a GPU runner to verify correctness in CI, and ideally do some fuzz testing.