Is FlashGS really faster than the original 3DGS?

Hi, thanks for the great paper. I have a question: if the unit of parallelization is a warp (32 threads) instead of a tile, won't the Gaussians be loaded from global memory many more times, which should make the latency longer? Is parallelizing over warps instead of tiles really enough to deliver the speedup?
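To make the concern concrete, here is a rough back-of-the-envelope count. It is only a sketch under my own assumptions, not a description of how FlashGS actually works: a Gaussian covers a full 16x16 pixel region (the tile size in the original 3DGS rasterizer), every thread group fetches that Gaussian from global memory exactly once, and caches are ignored. The function name `global_loads_per_gaussian` is hypothetical.

```python
def global_loads_per_gaussian(region_pixels: int, threads_per_group: int) -> int:
    """Count the thread groups needed to cover a region, one pixel per thread.

    Assumption: each group fetches the Gaussian from global memory once,
    so the number of groups equals the number of global loads.
    """
    return -(-region_pixels // threads_per_group)  # ceiling division


TILE_PIXELS = 16 * 16  # 16x16 tile, as in the original 3DGS rasterizer
WARP_SIZE = 32         # threads per warp on NVIDIA GPUs

# One thread block per tile: the Gaussian is fetched once and shared.
tile_loads = global_loads_per_gaussian(TILE_PIXELS, TILE_PIXELS)

# One warp per 32 pixels: each warp covering the region fetches it again.
warp_loads = global_loads_per_gaussian(TILE_PIXELS, WARP_SIZE)

print(f"tile-level loads: {tile_loads}, warp-level loads: {warp_loads}")
```

Under these assumptions the warp-level scheme issues 8 global loads per Gaussian where the tile-level scheme issues 1, which is why I would expect more memory traffic unless something else (caching, culling, or fewer Gaussians per warp) offsets it.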