Improve performance of RealSHT / InverseRealSHT#835

Open
mcgibbon wants to merge 16 commits into main from feature/benchmark_sht

Conversation

@mcgibbon
Contributor

@mcgibbon mcgibbon commented Feb 13, 2026

This PR updates RealSHT and InverseRealSHT to improve performance. I see speed-ups of ~50% (in the SHT itself, not in total run time) on benchmarks that replicate the size of our data in n=512 production runs.

Changes:

  • Added benchmark classes for RealSHT and InverseRealSHT

  • Added CPU timing to benchmarking, to corroborate GPU total times

  • Added tests
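
For reviewers, here is a minimal sketch of the idea behind the CPU timing: measure wall-clock time around the whole loop, synchronizing before starting and before stopping the clock so that asynchronously queued GPU work is included. This is not the benchmark code in this PR; `time_wall_clock` and its parameters are hypothetical names, and the `sync` hook is an assumption (for a PyTorch workload on GPU it would typically be `torch.cuda.synchronize`).

```python
import time
from typing import Callable


def time_wall_clock(
    fn: Callable[[], object],
    n_warmup: int = 2,
    n_iters: int = 10,
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` is called before the clock starts and again before it stops,
    so that any asynchronously queued work is counted in the total.
    """
    if n_iters <= 0:
        raise ValueError("n_iters must be positive")
    for _ in range(n_warmup):
        fn()  # warm-up calls are not timed
    sync()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(n_iters):
        fn()
    sync()  # wait for the timed work to finish
    return (time.perf_counter() - start) / n_iters


# Stand-in CPU workload; a real benchmark would call the transform here.
mean_seconds = time_wall_clock(lambda: sum(i * i for i in range(10_000)))
print(f"mean time per call: {mean_seconds * 1e3:.3f} ms")
```

If the wall-clock mean and the GPU-side total disagree substantially, that is a hint that launch overhead or host-side work, rather than the kernels themselves, dominates the run time.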

@mcgibbon
Contributor Author

I ran some local tests at different resolutions. At 1/4 degree with a batch size of 32 I got 1485s -> 947s, and at 2 degrees with a batch size I can't recall (larger than 512, maybe 1024) I got 491s -> 275s. The improvement therefore appears to be independent of grid resolution, as long as the GPU is occupied (I got a 0% speed-up on some experiments with very low batch size). These are also lower bounds on the speed-up, since I have not guaranteed full GPU occupancy in these benchmarks.
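
To make the two results easier to compare, the timings quoted above can be converted into a speed-up factor and a fractional reduction in run time. The helper below is only an illustrative sketch (the `speedup` function and the labels are hypothetical, not part of this PR); the numbers are the ones reported in this comment.

```python
def speedup(before: float, after: float) -> tuple[float, float]:
    """Return (speed-up factor, fractional reduction in run time)."""
    if before <= 0 or after <= 0:
        raise ValueError("timings must be positive")
    return before / after, 1.0 - after / before


# Timings as quoted in the comment above (same units as reported there).
timings = {
    "1/4 degree, batch size 32": (1485.0, 947.0),
    "2 degrees, large batch size": (491.0, 275.0),
}

for label, (before, after) in timings.items():
    factor, reduction = speedup(before, after)
    print(f"{label}: {factor:.2f}x faster, {reduction:.0%} less time")
```

This works out to roughly 1.57x and 1.79x, so both resolutions land in the same range as the ~50% figure in the PR description.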

@mcgibbon mcgibbon changed the title from "Feature/benchmark sht" to "Improve performance of RealSHT / InverseRealSHT" Feb 13, 2026
@@ -0,0 +1,12 @@
{
Contributor Author

TODO: these benchmark files are added to help with the review, but need to be deleted before merging.

@mcgibbon mcgibbon marked this pull request as ready for review February 13, 2026 20:05
@mcgibbon mcgibbon changed the base branch from main to feature/benchmark_to_wandb February 17, 2026 16:05
@mcgibbon
Contributor Author

mcgibbon commented Feb 17, 2026

Unfortunately, the benchmark I ran on Jupiter on H100s shows a more modest speed-up (5%), but it is still a speed-up: https://wandb.ai/ai2cm/fme-core-benchmarks. The baseline run is 13f5b2 and the run for this branch is f51f0e.

Update: looks like on Titan the result is the opposite, about a 5% slow-down.

For both of these benchmarks, it's not clear to me whether the GPU is occupied: the run time is a few ms, compared to ~50 ms on the T4.

Base automatically changed from feature/benchmark_to_wandb to main February 18, 2026 16:34