2 changes: 1 addition & 1 deletion docs/blog/posts/gpu-health-checks.md
@@ -51,7 +51,7 @@ A healthy instance is ready for workloads. A warning means you should monitor it

This release focuses on passive checks using DCGM background health checks. These run continuously and do not interrupt workloads.

For active checks today, you can run [NCCL tests](../../examples/clusters/nccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.
For active checks today, you can run [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.

## Supported backends

2 changes: 1 addition & 1 deletion docs/blog/posts/mpi.md
@@ -100,5 +100,5 @@ as well as use MPI for other tasks.

!!! info "What's next?"
1. Learn more about [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
2. Check the [NCCL tests](../../examples/clusters/nccl-tests/index.md) example
2. Check the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example
3. Join [Discord](https://discord.gg/u8SmfwPpMd)
2 changes: 1 addition & 1 deletion docs/docs/concepts/tasks.md
@@ -144,7 +144,7 @@ Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODES_IPS`, `DSTACK_NODE_RANK`, and other

!!! info "MPI"
If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
See the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example.

> For detailed examples, see [distributed training](../../examples.md#distributed-training) examples.

5 changes: 2 additions & 3 deletions docs/docs/guides/clusters.md
@@ -50,7 +50,7 @@ Within the task's `commands`, it's possible to use `DSTACK_MASTER_NODE_IP`, `DST

??? info "MPI"
If you want to use MPI, you can set `startup_order` to `workers-first` and `stop_criteria` to `master-done`, and use `DSTACK_MPI_HOSTFILE`.
See the [NCCL](../../examples/clusters/nccl-tests/index.md) or [RCCL](../../examples/clusters/rccl-tests/index.md) examples.
See the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) example.

!!! info "Retry policy"
By default, if any of the nodes fails, `dstack` terminates the entire run. Configure a [retry policy](../concepts/tasks.md#retry-policy) to restart the run if any node fails.
@@ -59,8 +59,7 @@ Refer to [distributed tasks](../concepts/tasks.md#distributed-tasks) for an exam

## NCCL/RCCL tests

To test the interconnect of a created fleet, ensure you run [NCCL](../../examples/clusters/nccl-tests/index.md)
(for NVIDIA) or [RCCL](../../examples/clusters/rccl-tests/index.md) (for AMD) tests using MPI.
To test the interconnect of a created fleet, run the [NCCL/RCCL tests](../../examples/clusters/nccl-rccl-tests/index.md) using MPI.

## Volumes

30 changes: 10 additions & 20 deletions docs/examples.md
@@ -80,26 +80,6 @@ hide:
## Clusters

<div class="tx-landing__highlights_grid">
<a href="/examples/clusters/nccl-tests"
class="feature-cell sky">
<h3>
NCCL tests
</h3>

<p>
Run multi-node NCCL tests with MPI
</p>
</a>
<a href="/examples/clusters/rccl-tests"
class="feature-cell sky">
<h3>
RCCL tests
</h3>

<p>
Run multi-node RCCL tests with MPI
</p>
</a>
<a href="/examples/clusters/gcp"
class="feature-cell sky">
<h3>
@@ -130,6 +110,16 @@ hide:
Set up Crusoe clusters with optimized networking
</p>
</a>
<a href="/examples/clusters/nccl-rccl-tests"
class="feature-cell sky">
<h3>
NCCL/RCCL tests
</h3>

<p>
Run multi-node NCCL/RCCL tests with MPI
</p>
</a>
</div>

## Inference
Empty file.
144 changes: 144 additions & 0 deletions examples/clusters/nccl-rccl-tests/README.md
@@ -0,0 +1,144 @@
# NCCL/RCCL tests

This example shows how to run [NCCL](https://github.com/NVIDIA/nccl-tests) or [RCCL](https://github.com/ROCm/rccl-tests) tests on a cluster using [distributed tasks](https://dstack.ai/docs/concepts/tasks#distributed-tasks).

!!! info "Prerequisites"
Before running a distributed task, make sure to create a fleet with `placement` set to `cluster` (can be a [managed fleet](https://dstack.ai/docs/concepts/fleets#backend-placement) or an [SSH fleet](https://dstack.ai/docs/concepts/fleets#ssh-placement)).
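
As a rough sketch of the prerequisite above, the snippet below writes a minimal fleet configuration with `placement: cluster` and two nodes. The file name `fleet.dstack.yml`, the fleet name `cluster-fleet`, and the `resources` values are assumptions chosen for illustration; adjust them to your backend and hardware.

```shell
# Write a minimal fleet configuration.
# The file name and fleet name are hypothetical, not taken from the example.
cat > fleet.dstack.yml <<'EOF'
type: fleet
name: cluster-fleet   # hypothetical name

nodes: 2
placement: cluster    # needed so the nodes are interconnected

resources:
  gpu: nvidia:1..8    # assumption: mirrors the GPU request of the NCCL task
EOF

# Then create the fleet (not run here; requires a configured dstack server):
# dstack apply -f fleet.dstack.yml
```

The final `dstack apply` call is left commented out so the snippet only produces the file; run it once the configuration matches your setup.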

## Running as a task

Here's an example of a task that runs the AllReduce test on 2 nodes, each with 4 GPUs (8 processes in total).

=== "NCCL tests"

<div editor-title="examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml">

```yaml
type: task
name: nccl-tests

nodes: 2

startup_order: workers-first
stop_criteria: master-done

env:
  - NCCL_DEBUG=INFO
commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun \
        --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --bind-to none \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

# Uncomment if the `kubernetes` backend requires it for `/dev/infiniband` access
#privileged: true

resources:
  gpu: nvidia:1..8
  shm_size: 16GB
```

</div>

!!! info "Default image"
If you don't specify `image`, `dstack` uses its [base](https://github.com/dstackai/dstack/tree/master/docker/base) Docker image pre-configured with
`uv`, `python`, `pip`, essential CUDA drivers, `mpirun`, and NCCL tests (under `/opt/nccl-tests/build`).

=== "RCCL tests"

<div editor-title="examples/clusters/nccl-rccl-tests/rccl-tests.dstack.yml">

```yaml
type: task
name: rccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

# Mount the system libraries folder from the host
volumes:
  - /usr/local/lib:/mnt/lib

image: rocm/dev-ubuntu-22.04:6.4-complete
env:
  - NCCL_DEBUG=INFO
  - OPEN_MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
commands:
  # Setup MPI and build RCCL tests
  - apt-get install -y git libopenmpi-dev openmpi-bin
  - git clone https://github.com/ROCm/rccl-tests.git
  - cd rccl-tests
  - make MPI=1 MPI_HOME=$OPEN_MPI_HOME

  # Preload the RoCE driver library from the host (for Broadcom driver compatibility)
  - export LD_PRELOAD=/mnt/lib/libbnxt_re-rdmav34.so

  # Run RCCL tests via MPI
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun --allow-run-as-root \
        --hostfile $DSTACK_MPI_HOSTFILE \
        -n $DSTACK_GPUS_NUM \
        -N $DSTACK_GPUS_PER_NODE \
        --mca btl_tcp_if_include ens41np0 \
        -x LD_PRELOAD \
        -x NCCL_IB_HCA=mlx5_0/1,bnxt_re0,bnxt_re1,bnxt_re2,bnxt_re3,bnxt_re4,bnxt_re5,bnxt_re6,bnxt_re7 \
        -x NCCL_IB_GID_INDEX=3 \
        -x NCCL_IB_DISABLE=0 \
        ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 20 -c 0;
    else
      sleep infinity
    fi

resources:
  gpu: MI300X:8
```

</div>

!!! info "RoCE library"
Broadcom RoCE drivers require the `libbnxt_re` userspace library inside the container to be compatible with the host’s Broadcom
kernel driver `bnxt_re`. To ensure this compatibility, we mount `libbnxt_re-rdmav34.so` from the host and preload it
using `LD_PRELOAD` when running MPI.


!!! info "Privileged"
In some cases, the backend (e.g., `kubernetes`) may require `privileged: true` to access the high-speed interconnect (e.g., InfiniBand).

### Apply a configuration

To run a configuration, use the [`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply/) command.

<div class="termy">

```shell
$ dstack apply -f examples/clusters/nccl-rccl-tests/nccl-tests.dstack.yml

# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 aws us-east-1 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912
2 aws us-west-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912
3 aws us-east-2 g4dn.12xlarge 48xCPU, 192GB, 4xT4 (16GB), 100.0GB (disk) no $3.912

Submit the run nccl-tests? [y/n]: y
```

</div>

## Source code

The source code of this example can be found in
[`examples/clusters/nccl-rccl-tests`](https://github.com/dstackai/dstack/blob/master/examples/clusters/nccl-rccl-tests).

## What's next?

1. Check [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks),
[services](https://dstack.ai/docs/concepts/services), and [fleets](https://dstack.ai/docs/concepts/fleets).
83 changes: 0 additions & 83 deletions examples/clusters/nccl-tests/README.md

This file was deleted.

9 changes: 0 additions & 9 deletions examples/clusters/nccl-tests/fleet.dstack.yml

This file was deleted.
