Some perf optimizations #1

Open
wtfnukee wants to merge 1 commit into Kripner:main from wtfnukee:perf
wtfnukee commented Jan 5, 2026

Great work on nanoproof! I really enjoyed reading it.
However, I found a few inefficiencies and fixed them:

  • Cache prior_sum() in UCB calculation (avoids O(n^2) recomputation)
  • Fuse KV cache replication with repeat_interleave (single GPU kernel)
  • Pre-allocate decode_mask to avoid allocations in decode loop
  • Add torch.compile with reduce-overhead mode for CUDA

I tried to keep the changes minimally disruptive. They seem to work and give some speedup; I'll benchmark properly and report back later.
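
To illustrate the first item in the list, here is a minimal sketch of why caching `prior_sum()` helps. It assumes `prior_sum()` sums the priors of a node's children and is used to normalize each child's prior inside the UCB score. The `Node` class, `ucb_score`, `c_puct`, and both `select_child_*` functions are hypothetical names for illustration, not nanoproof's actual code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical search-tree node; field names are illustrative only.
    prior: float
    visit_count: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    def value(self) -> float:
        # Mean value of the node, 0.0 if it has never been visited.
        return self.value_sum / self.visit_count if self.visit_count else 0.0

    def prior_sum(self) -> float:
        # Linear in the number of children.
        return sum(child.prior for child in self.children)


def ucb_score(parent, child, prior_total, c_puct=1.25):
    # PUCT-style score; the exact formula in the repository may differ.
    normalized_prior = child.prior / prior_total if prior_total > 0 else 0.0
    exploration = (
        c_puct * normalized_prior * math.sqrt(parent.visit_count)
        / (1 + child.visit_count)
    )
    return child.value() + exploration


def select_child_naive(parent):
    # prior_sum() is recomputed for every child: O(n) work n times, so O(n^2).
    return max(
        parent.children,
        key=lambda ch: ucb_score(parent, ch, parent.prior_sum()),
    )


def select_child_cached(parent):
    # prior_sum() is computed once and reused: O(n) overall.
    prior_total = parent.prior_sum()
    return max(
        parent.children,
        key=lambda ch: ucb_score(parent, ch, prior_total),
    )
```

Both functions pick the same child; the cached version only changes how often the sum is computed, so the benefit grows with the number of children per node.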
