Improve performance of RealSHT / InverseRealSHT #835
I ran some local tests at different resolutions. At 1/4 degree with a batch size of 32 I got 1485s -> 947s, and at 2 degrees with a batch size I can't recall (larger than 512, maybe 1024) I got 491s -> 275s. The improvement therefore appears to hold across grid resolutions, as long as the GPU is occupied (I saw a 0% speedup in some experiments with very low batch size). These are also lower bounds on the speedup, since I did not guarantee full GPU occupancy in these benchmarks.
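To make the two timing pairs above easier to compare, here is a minimal sketch that converts them into a speedup factor and a fractional reduction in run time. The helper name `speedup` and the run labels are illustrative only and are not part of the PR; the numbers are the ones quoted in the comment.

```python
def speedup(before_s, after_s):
    """Return (speedup factor, fractional reduction in run time)."""
    return before_s / after_s, 1.0 - after_s / before_s


# Timings quoted in the comment above, in seconds.
# The labels are illustrative; the exact batch size of the
# 2-degree run is not known.
runs = {
    "1/4 degree, batch 32": (1485.0, 947.0),
    "2 degrees, large batch": (491.0, 275.0),
}

for label, (before, after) in runs.items():
    factor, reduction = speedup(before, after)
    print(f"{label}: {factor:.2f}x faster, {reduction:.0%} less time")
```

The two cases come out at roughly 1.57x and 1.79x, so the gains are of similar magnitude at both resolutions rather than identical.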
@@ -0,0 +1,12 @@
{
TODO: these benchmark files are added to help with the review, but need to be deleted before merging.
Unfortunately the speedup is more modest (5%) in the benchmark I ran on Jupiter on H100s, but it is still a speedup: https://wandb.ai/ai2cm/fme-core-benchmarks (baseline run is 13f5b2, the run for this branch is f51f0e). Update: on Titan the result is the opposite, about a 5% slowdown. For both of these benchmarks it's not clear to me whether the GPU is occupied: the run time is a few ms, compared to ~50 ms on the T4.
This PR updates RealSHT and InverseRealSHT to improve performance. I see speedups of ~50% (in the SHT itself, not in the total) on benchmarks replicating the size of our data during n=512 production runs.
Changes:
Added benchmark classes for RealSHT and InverseRealSHT
Added CPU timing to benchmarking, to corroborate GPU total times
Tests added
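For readers unfamiliar with the idea of corroborating GPU timings with a CPU-side clock, here is a minimal sketch of host-side timing. It is not the benchmark code added in this PR: the function name `time_on_cpu` and its parameters are hypothetical, and it uses only the standard library. On a GPU one would typically pass `torch.cuda.synchronize` as `sync`, so that queued kernels have finished before the clock is read.

```python
import time


def time_on_cpu(fn, sync=None, warmup=3, iters=10):
    """Mean wall-clock seconds per call of fn, measured on the host.

    sync is an optional callable that blocks until asynchronous work
    has finished (hypothetical parameter; e.g. torch.cuda.synchronize).
    """
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work is done before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for all timed work before stopping the clock
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    # CPU-only stand-in workload, purely for illustration.
    mean_s = time_on_cpu(lambda: sum(range(100_000)))
    print(f"mean time per call: {mean_s * 1e3:.3f} ms")
```

If the host-side mean agrees with the total measured on the device, that suggests the device-side figure is not missing launch or synchronization overhead.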