
Conversation

@Regina8023
Contributor

Summary:
This diff adds an experimental dual-chunkstate mode to P2pNvlTransportDevice that enables local busy-polling instead of remote NVLink polling, achieving up to 10% latency reduction.

## Motivation

In the original single-chunkstate design, the sender busy-waits on REMOTE memory via NVLink to check whether the receiver has freed the buffer. This NVLink round-trip adds latency and can congest the NVLink path, especially for larger message sizes.

## Solution: Dual-Chunkstate Mode

When `useDualStateBuffer = true`, we use two ChunkState buffers per peer:

- **myChunkStateBuffer**: Receiver waits here (sender writes via NVLink)
- **peerChunkStateBuffer**: Sender waits here (receiver acks via NVLink)

This allows BOTH sender and receiver to poll LOCAL memory, eliminating NVLink round-trip latency during busy-wait synchronization.
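For concreteness, here is a minimal sketch of what the per-peer state could look like. This is an illustration, not the diff's actual layout: the two buffer names come from the description above, but the types, the remote-mapped pointer fields, and the `ChunkWord` alias are assumptions.

```cuda
// Hedged sketch of a possible per-peer dual-chunkstate layout (illustrative,
// not the actual P2pNvlTransportDevice definition).
#include <cuda/atomic>
#include <cstdint>

using ChunkWord = cuda::atomic<uint64_t, cuda::thread_scope_system>;

struct DualChunkState {
  // Local device memory this rank polls while RECEIVING; the remote sender
  // signals readiness by writing into it over NVLink.
  ChunkWord* myChunkStateBuffer;

  // Local device memory this rank polls while SENDING; the remote receiver
  // writes its ack into it over NVLink.
  ChunkWord* peerChunkStateBuffer;

  // NVLink-mapped views of the PEER's two buffers above, used only for
  // remote writes (signal / ack). These two field names are assumptions.
  ChunkWord* remoteMyChunkStateBuffer;
  ChunkWord* remotePeerChunkStateBuffer;

  // Persistent per-chunk record of the ack value this rank (as sender)
  // expects to observe before reusing each chunk.
  uint64_t* chunkRecvAck_;
};
```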

### State Machine Comparison

**Single State Mode** (original):

```
Sender                           Receiver
──────                           ────────
Wait REMOTE for READY_TO_SEND    Wait LOCAL for stepId
  (NVLink poll - slow)             (local poll - fast)
Copy data via NVLink             Copy from local buffer
Signal REMOTE with stepId        Signal LOCAL with READY_TO_SEND
```

**Dual State Mode** (new):

```
Sender                           Receiver
──────                           ────────
Wait LOCAL for ack value         Wait LOCAL for currentStep
  (local poll - fast)              (local poll - fast)
Copy data via NVLink             Copy from local buffer
Update local chunkRecvAck_       Ack sender via NVLink write
Signal peer via NVLink             with currentStep value
```
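The following is a hedged CUDA sketch of the dual-state handshake in the diagram, reusing the illustrative `DualChunkState` layout above. Everything other than the names `myChunkStateBuffer`, `peerChunkStateBuffer`, `chunkRecvAck_`, and the `currentStep` value is an assumption about how such a protocol could look, not the diff's code. It assumes a single-block call and that `peerChunkStateBuffer[chunk]` and `chunkRecvAck_[chunk]` start at the same value, so the first wait passes immediately.

```cuda
// Illustrative single-block device routines; not the diff's implementation.
__device__ void dualStateSend(DualChunkState& st, int chunk, uint64_t currentStep,
                              const char* localSrc, char* peerDst, size_t bytes) {
  // Wait LOCAL: spin on this rank's own peerChunkStateBuffer until the
  // receiver's NVLink-written ack matches the value recorded in chunkRecvAck_.
  while (st.peerChunkStateBuffer[chunk].load(cuda::memory_order_acquire) !=
         st.chunkRecvAck_[chunk]) {
  }
  // Copy the payload into the peer's buffer over NVLink (naive strided copy).
  for (size_t i = threadIdx.x; i < bytes; i += blockDim.x) {
    peerDst[i] = localSrc[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    // Record the ack value expected the next time this chunk is reused: the
    // receiver will ack with this call's currentStep.
    st.chunkRecvAck_[chunk] = currentStep;
    // Make the payload visible system-wide before signalling the peer.
    __threadfence_system();
    // Signal the peer's local myChunkStateBuffer over NVLink.
    st.remoteMyChunkStateBuffer[chunk].store(currentStep, cuda::memory_order_release);
  }
}

__device__ void dualStateRecv(DualChunkState& st, int chunk, uint64_t currentStep,
                              const char* localStaging, char* dst, size_t bytes) {
  // Wait LOCAL: spin until the sender's NVLink write of currentStep lands in
  // this rank's own myChunkStateBuffer.
  while (st.myChunkStateBuffer[chunk].load(cuda::memory_order_acquire) != currentStep) {
  }
  // Data is already in local memory; copy it to the destination.
  for (size_t i = threadIdx.x; i < bytes; i += blockDim.x) {
    dst[i] = localStaging[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    // Ack the sender over NVLink with the currentStep value, freeing the chunk.
    st.remotePeerChunkStateBuffer[chunk].store(currentStep, cuda::memory_order_release);
  }
}
```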

`chunkRecvAck_` is a persistent local array that tells the sender which ack value to wait for from the receiver on each chunk. For the sender and receiver to agree on the same ack value, we have the following requirement for users:

## IMPORTANT: User Requirements for Dual State Mode

To achieve both correctness and efficiency, the dual-state mode requires users to maintain a global `call_index` counter:

1. **call_index must be globally unique** across ALL send/recv calls, both within a kernel and across multiple kernel launches
2. **The same call_index must be passed to matching send/recv pairs** (see the host-side sketch below)

This requirement exists because:

- The receiver ack (`chunkRecvAck_`) persists in device memory across kernel launches
- Mismatched or overlapping call_index values can cause deadlock or data corruption
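To make the two rules concrete, here is a hedged host-side sketch of one way a caller could satisfy them. The transport entry points (`deviceSend` / `deviceRecv`) and the helper are hypothetical stand-ins, not the API added by this diff.

```cuda
// Hedged sketch of the call_index contract; not the diff's API.
#include <cstdint>
#include <cstddef>

// Rule 1: a single process-wide counter that is never reset, so every
// send/recv call (within a kernel or across kernel launches) gets a unique
// call_index.
static uint64_t g_call_index = 0;

// Hypothetical transport calls that accept the user-provided call_index.
void deviceSend(int peer, const void* buf, size_t bytes, uint64_t call_index);
void deviceRecv(int peer, void* buf, size_t bytes, uint64_t call_index);

// Rule 2: matching send/recv must see the same call_index. If every rank
// executes the same sequence of exchanges, a deterministic assignment based
// on rank order keeps both sides in agreement.
void exchangeWithPeer(int myRank, int peerRank,
                      const void* sendBuf, void* recvBuf, size_t bytes) {
  const uint64_t base = g_call_index;
  g_call_index += 2;  // reserve two globally unique indices for this exchange

  // The lower rank's send pairs with the higher rank's recv on `base`;
  // the opposite direction uses `base + 1`.
  const uint64_t sendIdx = (myRank < peerRank) ? base : base + 1;
  const uint64_t recvIdx = (myRank < peerRank) ? base + 1 : base;

  deviceSend(peerRank, sendBuf, bytes, sendIdx);
  deviceRecv(peerRank, recvBuf, bytes, recvIdx);
}
```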

**I acknowledge this sacrifices user experience for correctness and efficiency. Suggestions for alternative designs are welcome.**

## Performance Results

Benchmarked using `p2p_nvl_benchmark` with bidirectional transfer; see the test plan for details:

| Mode    | Message Size | Latency Reduction |
|---------|--------------|-------------------|
| Block   | 128KB-1GB    | 1-4%              |
| Cluster | 512MB-1GB    | **10-11%**        |

## Next

- Is there a more user-friendly design for dual-chunkstate that is still as efficient?
- Integrate into Ctran SendRecv, alltoallv, send_one/multiple, and dispatch to measure the perf gain

Differential Revision: D91619478

meta-codesync bot commented Jan 31, 2026

@Regina8023 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91619478.
