Merged
12 changes: 7 additions & 5 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# DeepXTrace

-DeepXTrace is a lightweight system tool designed to efficiently and precisely locate slow ranks in DeepEP-based environments by enhancing the [DeepEP](https://github.com/deepseek-ai/DeepEP) communication library. It is composed of two core components: *DeepEP Metrics Probe* and *DeepXTrace Metrics Analysis*.
+DeepXTrace is a lightweight diagnostic tool designed to efficiently and precisely locate slow ranks in MoE-based distributed environments through instrumentation of communication libraries (e.g., [DeepEP for GPU](https://github.com/deepseek-ai/DeepEP), [MC2 for NPU](https://gitcode.com/cann/ops-transformer)). It is composed of two core components: *MoE COMM Metrics Probe* and *DeepXTrace Metrics Analysis*.

DeepXTrace supports diagnosis of various slowdown scenarios, including:

-* *Comp-Slow*: Slowdown caused by the destination rank(e.g., GPU/CPU compute latency).
+* *Comp-Slow*: Slowdown caused by the destination rank (e.g., xPU compute latency).
* *Mixed-Slow*: Slowdown caused by the source rank (e.g., uneven expert distribution or hotspot congestion).
* *Comm-Slow*: Slowdown caused by the communication path between specific source and destination ranks (e.g., communication link issues).
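
As a rough illustration of how the three categories above differ, the sketch below classifies anomalies in a hypothetical latency matrix `latency[src][dst]`. The matrix orientation, the median-based threshold, the `factor` parameter, and the function name `classify_slowdowns` are all assumptions made for this example; they are not taken from DeepXTrace's analysis code, which may work differently.

```python
from statistics import median


def classify_slowdowns(latency, factor=3.0):
    """Classify slow ranks from a square matrix latency[src][dst].

    Hypothetical heuristic (not DeepXTrace's actual algorithm):
    a cell is "slow" if it exceeds factor * median of all cells.
    """
    n = len(latency)
    cells = [latency[s][d] for s in range(n) for d in range(n)]
    threshold = factor * median(cells)

    def is_slow(value):
        return value > threshold

    # Comp-Slow: most sources see high latency towards the same destination,
    # so a whole column of the matrix is slow.
    slow_dst = [
        d for d in range(n)
        if sum(is_slow(latency[s][d]) for s in range(n)) > n / 2
    ]
    # Mixed-Slow: one source is slow towards most destinations,
    # so a whole row of the matrix is slow.
    slow_src = [
        s for s in range(n)
        if sum(is_slow(latency[s][d]) for d in range(n)) > n / 2
    ]
    # Comm-Slow: isolated slow cells not explained by a slow row or column.
    slow_links = [
        (s, d)
        for s in range(n)
        for d in range(n)
        if is_slow(latency[s][d]) and s not in slow_src and d not in slow_dst
    ]
    return {
        "comp_slow_dst": slow_dst,
        "mixed_slow_src": slow_src,
        "comm_slow_links": slow_links,
    }


if __name__ == "__main__":
    # 4 ranks with a uniform baseline; destination rank 2 is slow for everyone.
    matrix = [[10.0] * 4 for _ in range(4)]
    for src in range(4):
        matrix[src][2] = 100.0
    print(classify_slowdowns(matrix))
```

In this sketch a slow column points to the destination rank, a slow row points to the source rank, and an isolated slow cell points to the link between one source and one destination. Real measurements would likely need a more robust outlier test than a fixed multiple of the median.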

@@ -21,13 +21,15 @@ The following figure shows the latency matrix for the Combine operator's token r

![combine](figures/combine.png)

-## DeepEP-Metrics-Probe
+## MoE-COMM-Metrics-Probe

suggestion (typo): Consider aligning the section title with the earlier component name for consistency.

Earlier in the README this is called `MoE COMM Metrics Probe`, but this header uses `MoE-COMM-Metrics-Probe`. Please pick one form (hyphenated or spaced) and use it consistently so it's clear they refer to the same component.

Suggested implementation:

## MoE COMM Metrics Probe

If there are other mentions of this component elsewhere in the README (e.g., in introductions, diagrams, or bullet lists) using a different variant (`MoE-COMM-Metrics-Probe`, `MoE COMM Metrics-Probe`, etc.), they should also be updated to `MoE COMM Metrics Probe` for full consistency.


-A low-overhead module for measuring critical diagnostic indicators during DeepEP communication. See also: [DeepEP Diagnose PR](https://github.com/deepseek-ai/DeepEP/pull/311).
+A low-overhead module for measuring critical diagnostic indicators during MoE communication. Supported Implementations:
+- **DeepEP (GPU)**: Integrated metrics probe via [DeepEP Diagnose PR #311](https://github.com/deepseek-ai/DeepEP/pull/311)
+- **MC2 (NPU)**: Native instrumentation through [MC2 Diagnose PR #288](https://gitcode.com/cann/ops-transformer/pull/288). See also [Ascend and DeepXTrace Blog](https://mp.weixin.qq.com/s/AaZ3pgM-brWw8-DMxS54Wg)

## DeepXTrace-Metrics-Analysis

-An analysis module that locates the slow rank issues by processing the collected metrics.
+A cross-platform analysis module that identifies slow-rank bottlenecks across GPU/NPU clusters through metric processing.

### Build
```shell