Some perf optimizations by wtfnukee · Pull Request #1 · Kripner/nanoproof

wtfnukee · 2026-01-05T11:53:16Z

Great work on nanoproof! Really enjoyed reading this.
However, I found some inefficiencies and fixed them:

Cache prior_sum() in UCB calculation (avoids O(n^2) recomputation)
Fuse KV cache replication with repeat_interleave (single GPU kernel)
Pre-allocate decode_mask to avoid allocations in decode loop
Add torch.compile with reduce-overhead mode for CUDA
I tried to keep the changes minimally disruptive. They seem to work and give some speedup; I'll benchmark properly and report later.
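To illustrate the first item, here is a minimal sketch of the idea rather than the actual nanoproof code. The `Node` class, `ucb_score`, `select_child_naive`, `select_child_cached`, and the `c_puct` constant are hypothetical names made up for this example; only `prior_sum()` comes from the change list above, and its body here is assumed. Calling `prior_sum()` inside the per-child score does O(n) work for each of the n children, while computing it once per selection keeps the whole step at O(n).

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical search-tree node; field names are assumptions.
    prior: float
    visit_count: int = 0
    value_sum: float = 0.0
    children: list["Node"] = field(default_factory=list)

    def value(self) -> float:
        # Mean value of the node, 0.0 if it has never been visited.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

    def prior_sum(self) -> float:
        # O(n) in the number of children.
        return sum(child.prior for child in self.children)


def ucb_score(parent: Node, child: Node, prior_sum: float, c_puct: float = 1.25) -> float:
    # PUCT-style score with the child's prior normalized by the sum of sibling priors.
    normalized_prior = child.prior / prior_sum if prior_sum > 0 else 0.0
    exploration = c_puct * normalized_prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.value() + exploration


def select_child_naive(parent: Node) -> Node:
    # Recomputes prior_sum() for every child: O(n^2) per selection.
    return max(parent.children, key=lambda c: ucb_score(parent, c, parent.prior_sum()))


def select_child_cached(parent: Node) -> Node:
    # Computes prior_sum() once and reuses it: O(n) per selection.
    total = parent.prior_sum()
    return max(parent.children, key=lambda c: ucb_score(parent, c, total))
```

Both functions pick the same child; the cached version only changes how often the sum is computed, so it should be safe as long as the children's priors do not change during a single selection.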

some perf optimizations

66e931d
