Add dual-chunkstate mode to P2P transport for local busy-polling - up to 10% latency reduction in SendRecv bench #508
Summary:
This diff adds an experimental dual-chunkstate mode to P2pNvlTransportDevice that enables local busy-polling instead of remote NVLink polling, achieving up to 10% latency reduction.
Motivation
In the original single-chunkstate design, the sender busy-waits on REMOTE memory via NVLink to check whether the receiver has freed the buffer. Each poll is an NVLink round-trip, which adds latency and may congest the NVLink path, especially for larger message sizes.
Solution: Dual-Chunkstate Mode
When useDualStateBuffer = true, we use two ChunkState buffers per peer.
This allows BOTH sender and receiver to poll LOCAL memory, eliminating NVLink round-trip latency during busy-wait synchronization.
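The difference in where the sender polls can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the P2pNvlTransportDevice code: the names `PollStats`, `simulate`, `ready_on_receiver`, `freed_on_receiver`, and `ack_on_sender` are hypothetical, the two peers are stepped in lockstep on one thread, and the model counts polls rather than measuring latency.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side model; none of these names come from the PR.
struct PollStats {
    int sender_remote_polls = 0;  // sender reads of receiver-owned memory
    int sender_local_polls = 0;   // sender reads of sender-owned memory
    std::vector<int> received;    // payloads in the order the receiver consumed them
};

// Transfers `rounds` chunks through a one-slot buffer. The receiver needs
// `consume_ticks` extra ticks per chunk before it releases the buffer.
PollStats simulate(bool use_dual_state, uint32_t rounds, int consume_ticks) {
    PollStats stats;
    uint32_t ready_on_receiver = 0;  // receiver-owned: step of the newest chunk written
    uint32_t freed_on_receiver = 0;  // receiver-owned: newest step freed (single-state model)
    uint32_t ack_on_sender = 0;      // sender-owned: ack pushed by the receiver (dual-state model)
    int chunk = 0;                   // the one-slot staging buffer
    uint32_t send_step = 1;
    uint32_t recv_step = 1;
    int busy = -1;                   // ticks until the receiver frees the chunk; -1 means idle

    while (recv_step <= rounds) {
        // Sender turn: check whether the previous chunk has been released.
        if (send_step <= rounds) {
            uint32_t released;
            if (use_dual_state) {
                released = ack_on_sender;      // local read
                ++stats.sender_local_polls;
            } else {
                released = freed_on_receiver;  // remote read (NVLink in the real transport)
                ++stats.sender_remote_polls;
            }
            if (released == send_step - 1) {
                chunk = static_cast<int>(send_step) * 10;
                ready_on_receiver = send_step;  // remote write in both modes
                ++send_step;
            }
        }
        // Receiver turn: it polls its own memory in both modes.
        if (busy < 0 && ready_on_receiver == recv_step) {
            busy = consume_ticks;
        }
        if (busy == 0) {
            stats.received.push_back(chunk);
            if (use_dual_state) {
                ack_on_sender = recv_step;      // remote write into sender memory
            } else {
                freed_on_receiver = recv_step;  // local write; the sender must fetch it
            }
            ++recv_step;
            busy = -1;
        } else if (busy > 0) {
            --busy;
        }
    }
    return stats;
}
```

Calling `simulate(false, 3, 2)` and `simulate(true, 3, 2)` delivers the same payloads in both modes. In the dual-state model every sender poll lands on sender-owned memory, and the receiver's one write per chunk is the only traffic that crosses to the sender.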
State Machine Comparison
Single State Mode (original):
Dual State Mode (new):
chunkRecvAck_ is a persistent local tracking array that tells the sender which ack value to wait for from the receiver. It tracks the expected ack value for each chunk. For the sender and receiver to agree on the same ack value, we place the following requirement on the user:
IMPORTANT: User Requirements for Dual State Mode
To achieve both correctness and efficiency, the dual-state mode requires users to maintain a global call_index counter.
This requirement exists because:
The tracking array (chunkRecvAck_) persists in device memory across kernel launches.
I acknowledge this sacrifices user experience for correctness and efficiency. Suggestions for alternative designs are welcome.
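One general hazard of synchronization state that outlives a kernel launch can be sketched on the host as follows. The PR text shown here does not spell out how ack values are derived from call_index, so this model assumes the simplest scheme, in which the ack value is the call index itself. `AckSlot`, `receiver_ack`, `sender_sees_ack`, and `stale_ack_visible` are hypothetical names, not part of the transport.

```cpp
#include <cstdint>

// Hypothetical model of one sender-local ack slot that survives across launches.
struct AckSlot {
    uint32_t value = 0;
};

// Assumption: the receiver acks a launch by writing that launch's call index.
void receiver_ack(AckSlot& slot, uint32_t call_index) {
    slot.value = call_index;
}

// Sender-side check: has the receiver acked the launch with this call index?
bool sender_sees_ack(const AckSlot& slot, uint32_t call_index) {
    return slot.value == call_index;
}

// Returns true if the sender in the second launch observes an ack
// before the receiver of that launch has written anything.
bool stale_ack_visible(uint32_t first_call_index, uint32_t second_call_index) {
    AckSlot slot;
    receiver_ack(slot, first_call_index);  // launch 1 runs to completion
    // Launch 2 starts; its receiver has not acked yet.
    return sender_sees_ack(slot, second_call_index);
}
```

Under this assumption, `stale_ack_visible(1, 1)` is true: reusing an index lets the value left behind by the previous launch satisfy the wait immediately. `stale_ack_visible(1, 2)` is false, which is the behavior a counter incremented on every call would give.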
Performance Results
Benchmarked using p2p_nvl_benchmark with bidirectional transfer; see the test plan for details.
Differential Revision: D91619478