
Conversation

@Regina8023
Contributor

Summary:
This diff adds an experimental dual-chunkstate mode to P2pNvlTransportDevice that enables local busy-polling instead of remote NVLink polling, achieving up to 10% latency reduction.

## Motivation

In the original single-chunkstate design, the sender busy-waits on REMOTE memory via NVLink to check whether the receiver has freed the buffer. This NVLink round-trip adds latency and can congest the NVLink path, especially for larger message sizes.

## Solution: Dual-Chunkstate Mode

When `useDualStateBuffer = true`, we use two ChunkState buffers per peer:

- **myChunkStateBuffer**: Receiver waits here (sender writes via NVLink)
- **peerChunkStateBuffer**: Sender waits here (receiver acks via NVLink)

This allows BOTH sender and receiver to poll LOCAL memory, eliminating NVLink round-trip latency during busy-wait synchronization.
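For concreteness, here is a minimal sketch of what the per-peer state could look like. This is an illustration, not the diff's actual layout: the two buffer names come from the description above, but the types, the remote-mapped pointer fields, and the `ChunkWord` alias are assumptions.

```cuda
// Hedged sketch of a possible per-peer dual-chunkstate layout (illustrative,
// not the actual P2pNvlTransportDevice definition).
#include <cuda/atomic>
#include <cstdint>

using ChunkWord = cuda::atomic<uint64_t, cuda::thread_scope_system>;

struct DualChunkState {
  // Local device memory this rank polls while RECEIVING; the remote sender
  // signals readiness by writing into it over NVLink.
  ChunkWord* myChunkStateBuffer;

  // Local device memory this rank polls while SENDING; the remote receiver
  // writes its ack into it over NVLink.
  ChunkWord* peerChunkStateBuffer;

  // NVLink-mapped views of the PEER's two buffers above, used only for
  // remote writes (signal / ack). These two field names are assumptions.
  ChunkWord* remoteMyChunkStateBuffer;
  ChunkWord* remotePeerChunkStateBuffer;

  // Persistent per-chunk record of the ack value this rank (as sender)
  // expects to observe before reusing each chunk.
  uint64_t* chunkRecvAck_;
};
```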

### State Machine Comparison

**Single State Mode** (original):

```
Sender                           Receiver
──────                           ────────
Wait REMOTE for READY_TO_SEND    Wait LOCAL for stepId
  (NVLink poll - slow)             (local poll - fast)
Copy data via NVLink             Copy from local buffer
Signal REMOTE with stepId        Signal LOCAL with READY_TO_SEND
```

**Dual State Mode** (new):

```
Sender                           Receiver
──────                           ────────
Wait LOCAL for ack value         Wait LOCAL for currentStep
  (local poll - fast)              (local poll - fast)
Copy data via NVLink             Copy from local buffer
Update local chunkRecvAck_       Ack sender via NVLink write
Signal peer via NVLink             with currentStep value
```
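The following is a hedged CUDA sketch of the dual-state handshake in the diagram, reusing the illustrative `DualChunkState` layout above. Everything other than the names `myChunkStateBuffer`, `peerChunkStateBuffer`, `chunkRecvAck_`, and the `currentStep` value is an assumption about how such a protocol could look, not the diff's code. It assumes a single-block call and that `peerChunkStateBuffer[chunk]` and `chunkRecvAck_[chunk]` start at the same value, so the first wait passes immediately.

```cuda
// Illustrative single-block device routines; not the diff's implementation.
__device__ void dualStateSend(DualChunkState& st, int chunk, uint64_t currentStep,
                              const char* localSrc, char* peerDst, size_t bytes) {
  // Wait LOCAL: spin on this rank's own peerChunkStateBuffer until the
  // receiver's NVLink-written ack matches the value recorded in chunkRecvAck_.
  while (st.peerChunkStateBuffer[chunk].load(cuda::memory_order_acquire) !=
         st.chunkRecvAck_[chunk]) {
  }
  // Copy the payload into the peer's buffer over NVLink (naive strided copy).
  for (size_t i = threadIdx.x; i < bytes; i += blockDim.x) {
    peerDst[i] = localSrc[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    // Record the ack value expected the next time this chunk is reused: the
    // receiver will ack with this call's currentStep.
    st.chunkRecvAck_[chunk] = currentStep;
    // Make the payload visible system-wide before signalling the peer.
    __threadfence_system();
    // Signal the peer's local myChunkStateBuffer over NVLink.
    st.remoteMyChunkStateBuffer[chunk].store(currentStep, cuda::memory_order_release);
  }
}

__device__ void dualStateRecv(DualChunkState& st, int chunk, uint64_t currentStep,
                              const char* localStaging, char* dst, size_t bytes) {
  // Wait LOCAL: spin until the sender's NVLink write of currentStep lands in
  // this rank's own myChunkStateBuffer.
  while (st.myChunkStateBuffer[chunk].load(cuda::memory_order_acquire) != currentStep) {
  }
  // Data is already in local memory; copy it to the destination.
  for (size_t i = threadIdx.x; i < bytes; i += blockDim.x) {
    dst[i] = localStaging[i];
  }
  __syncthreads();
  if (threadIdx.x == 0) {
    // Ack the sender over NVLink with the currentStep value, freeing the chunk.
    st.remotePeerChunkStateBuffer[chunk].store(currentStep, cuda::memory_order_release);
  }
}
```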

`chunkRecvAck_` is a persistent local array that tells the sender which ack value to wait for from the receiver on each chunk. For the sender and receiver to agree on the same ack value, we have the following requirement for users:

## IMPORTANT: User Requirements for Dual State Mode

To achieve both correctness and efficiency, the dual-state mode requires users to maintain a global `call_index` counter:

1. **call_index must be globally unique** across ALL send/recv calls, both within a kernel and across multiple kernel launches
2. **The same call_index must be passed to matching send/recv pairs** (see the host-side sketch below)

This requirement exists because:

- The receiver ack (`chunkRecvAck_`) persists in device memory across kernel launches
- Mismatched or overlapping call_index values can cause deadlock or data corruption
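To make the two rules concrete, here is a hedged host-side sketch of one way a caller could satisfy them. The transport entry points (`deviceSend` / `deviceRecv`) and the helper are hypothetical stand-ins, not the API added by this diff.

```cuda
// Hedged sketch of the call_index contract; not the diff's API.
#include <cstdint>
#include <cstddef>

// Rule 1: a single process-wide counter that is never reset, so every
// send/recv call (within a kernel or across kernel launches) gets a unique
// call_index.
static uint64_t g_call_index = 0;

// Hypothetical transport calls that accept the user-provided call_index.
void deviceSend(int peer, const void* buf, size_t bytes, uint64_t call_index);
void deviceRecv(int peer, void* buf, size_t bytes, uint64_t call_index);

// Rule 2: matching send/recv must see the same call_index. If every rank
// executes the same sequence of exchanges, a deterministic assignment based
// on rank order keeps both sides in agreement.
void exchangeWithPeer(int myRank, int peerRank,
                      const void* sendBuf, void* recvBuf, size_t bytes) {
  const uint64_t base = g_call_index;
  g_call_index += 2;  // reserve two globally unique indices for this exchange

  // The lower rank's send pairs with the higher rank's recv on `base`;
  // the opposite direction uses `base + 1`.
  const uint64_t sendIdx = (myRank < peerRank) ? base : base + 1;
  const uint64_t recvIdx = (myRank < peerRank) ? base + 1 : base;

  deviceSend(peerRank, sendBuf, bytes, sendIdx);
  deviceRecv(peerRank, recvBuf, bytes, recvIdx);
}
```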

**I acknowledge this sacrifices user experience for correctness and efficiency. Suggestions for alternative designs are welcome.**

## Performance Results

Benchmarked using `p2p_nvl_benchmark` with bidirectional transfer; see the test plan for details:

| Mode    | Message Size | Latency Reduction |
|---------|--------------|-------------------|
| Block   | 128KB-1GB    | 1-4%              |
| Cluster | 512MB-1GB    | **10-11%**        |

## Next

- Is there a more user-friendly design for dual-chunkstate that is still as efficient?
- Integrate into Ctran SendRecv, alltoallv, send_one/multiple, and dispatch to measure the perf gain

Differential Revision: D91619478

meta-codesync bot commented Jan 31, 2026

@Regina8023 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D91619478.
