diff --git a/docs/docs/guides/kubernetes.md b/docs/docs/guides/kubernetes.md
index fa90e3c316..85dc22a80f 100644
--- a/docs/docs/guides/kubernetes.md
+++ b/docs/docs/guides/kubernetes.md
@@ -18,11 +18,11 @@ projects:
- name: main
backends:
- type: kubernetes
- kubeconfig:
- filename: ~/.kube/config
- proxy_jump:
- hostname: 204.12.171.137
- port: 32000
+ kubeconfig:
+ filename: ~/.kube/config
+ proxy_jump:
+ hostname: 204.12.171.137
+ port: 32000
```
diff --git a/docs/examples.md b/docs/examples.md
index 26b95b075a..a5d689e5ac 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -140,6 +140,16 @@ hide:
Set up AWS EFA clusters with optimized networking
+
+
+ Crusoe
+
+
+
+ Set up Crusoe clusters with optimized networking
+
+
## Inference
diff --git a/docs/examples/clusters/crusoe/index.md b/docs/examples/clusters/crusoe/index.md
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/examples/clusters/crusoe/README.md b/examples/clusters/crusoe/README.md
new file mode 100644
index 0000000000..7ba6f0ce25
--- /dev/null
+++ b/examples/clusters/crusoe/README.md
@@ -0,0 +1,293 @@
+# Crusoe
+
+Crusoe offers two ways to use clusters with fast interconnect:
+
+* [Kubernetes](#kubernetes) – Lets you interact with clusters through the Kubernetes API and includes support for NVIDIA GPU operators and related tools.
+* [Virtual Machines (VMs)](#vms) – Gives you direct access to clusters in the form of virtual machines.
+
+Both options use the same underlying networking infrastructure. This example shows how to set up Crusoe clusters for use with `dstack`.
+
+## Kubernetes
+
+!!! info "Prerequisites"
+    1. Go to `Networking` → `Firewall Rules`, click `Create Firewall Rule`, and allow ingress traffic on port `30022`. The `dstack` server will use this port to access the jump host.
+    2. Go to `Orchestration` and click `Create Cluster`. Make sure to enable the `NVIDIA GPU Operator` add-on.
+    3. Go to the cluster and click `Create Node Pool`. Select the right instance type. If you intend to auto-scale the cluster, set `Desired Number of Nodes` to at least `1`, since `dstack` doesn't currently support clusters that scale down to `0` nodes.
+    4. Wait until at least one node is running.
+
+### Configure the backend
+
+Follow the standard instructions for setting up a [Kubernetes](https://dstack.ai/docs/concepts/backends/#kubernetes) backend:
+
+
+
+```yaml
+projects:
+  - name: main
+    backends:
+      - type: kubernetes
+        kubeconfig:
+          filename:
+        proxy_jump:
+          port: 30022
+
+
+
+### Create a fleet
+
+Once the Kubernetes cluster and the `dstack` server are running, you can create a fleet:
+
+
+
+```yaml
+type: fleet
+name: crusoe-fleet
+
+placement: cluster
+nodes: 0..
+
+backends: [kubernetes]
+
+resources:
+  # Specify requirements to filter nodes
+  gpu: 1..8
+```
+
+
+
+Pass the fleet configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f crusoe-fleet.dstack.yml
+```
+
+
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+## VMs
+
+Another way to work with Crusoe clusters is through VMs. While `dstack` typically supports VM-based compute providers via [dedicated backends](https://dstack.ai/docs/concepts/backends#vm-based) that automate provisioning, Crusoe does not yet have [such a backend](https://github.com/dstackai/dstack/issues/3378). As a result, to use a VM-based Crusoe cluster with `dstack`, you should use [SSH fleets](https://dstack.ai/docs/concepts/fleets).
+
+!!! info "Prerequisites"
+    1. Go to `Compute`, then `Instances`, and click `Create Instance`. Select an instance type and VM image that [support interconnect](https://docs.crusoecloud.com/networking/infiniband/managing-infiniband-networks/index.html), and create as many instances as you need.
+
+### Create a fleet
+
+Follow the standard instructions for setting up an [SSH fleet](https://dstack.ai/docs/concepts/fleets/#ssh-fleets):
+
+
+
+```yaml
+type: fleet
+name: crusoe-fleet
+
+placement: cluster
+
+# SSH credentials for the Crusoe VMs
+ssh_config:
+  user: ubuntu
+  identity_file: ~/.ssh/id_rsa
+  hosts:
+    - 3.255.177.51
+    - 3.255.177.52
+```
+
+
+
+Pass the fleet configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f crusoe-fleet.dstack.yml
+```
+
+
+
+Once the fleet is created, you can run [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services).
+
+## Run NCCL tests
+
+Use a [distributed task](https://dstack.ai/docs/concepts/tasks#distributed-task) that runs NCCL tests to validate cluster network bandwidth.
+
+=== "Kubernetes"
+
+    If you’re running on Crusoe’s Kubernetes, make sure to install HPC-X and provide an up-to-date topology file.
+
+
+
+    ```yaml
+    type: task
+    name: nccl-tests
+
+    nodes: 2
+    startup_order: workers-first
+    stop_criteria: master-done
+
+    commands:
+      # Install NCCL topology files
+      - curl -sSL https://gist.github.com/un-def/48df8eea222fa9547ad4441986eb15af/archive/df51d56285c5396a0e82bb42f4f970e7bb0a9b65.tar.gz -o nccl_topo.tar.gz
+      - mkdir -p /etc/crusoe/nccl_topo
+      - tar -C /etc/crusoe/nccl_topo -xf nccl_topo.tar.gz --strip-components=1
+      # Install and initialize HPC-X
+      - curl -sSL https://content.mellanox.com/hpc/hpc-x/v2.21.3/hpcx-v2.21.3-gcc-doca_ofed-ubuntu22.04-cuda12-x86_64.tbz -o hpcx.tar.bz
+      - mkdir -p /opt/hpcx
+      - tar -C /opt/hpcx -xf hpcx.tar.bz --strip-components=1 --checkpoint=10000
+      - . /opt/hpcx/hpcx-init.sh
+      - hpcx_load
+      # Run NCCL Tests
+      - |
+        if [ $DSTACK_NODE_RANK -eq 0 ]; then
+          mpirun \
+            --allow-run-as-root \
+            --hostfile $DSTACK_MPI_HOSTFILE \
+            -n $DSTACK_GPUS_NUM \
+            -N $DSTACK_GPUS_PER_NODE \
+            --bind-to none \
+            -mca btl tcp,self \
+            -mca coll_hcoll_enable 0 \
+            -x PATH \
+            -x LD_LIBRARY_PATH \
+            -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
+            -x NCCL_SOCKET_NTHREADS=4 \
+            -x NCCL_NSOCKS_PERTHREAD=8 \
+            -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
+            -x NCCL_IB_MERGE_VFS=0 \
+            -x NCCL_IB_AR_THRESHOLD=0 \
+            -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
+            -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
+            -x NCCL_IB_QPS_PER_CONNECTION=2 \
+            -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
+            -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
+            /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
+        else
+          sleep infinity
+        fi
+
+    # Required for IB
+    privileged: true
+
+    resources:
+      gpu: A100:8
+      shm_size: 16GB
+    ```
+
+
+
+    > The task above downloads an A100 topology file from a Gist. The most reliable way to obtain the latest topology is to copy it from a Crusoe-provisioned VM (see [VMs](#vms)).
+
+    ??? info "Privileged"
+        When running on Kubernetes, set `privileged` to `true` to ensure access to InfiniBand.
+
+=== "SSH fleets"
+
+    With Crusoe VMs, HPC-X and up-to-date topology files are already available on the hosts. When using SSH fleets, simply mount them via [instance volumes](https://dstack.ai/docs/concepts/volumes#instance-volumes).
+
+    ```yaml
+    type: task
+    name: nccl-tests
+
+    nodes: 2
+    startup_order: workers-first
+    stop_criteria: master-done
+
+    volumes:
+      - /opt/hpcx:/opt/hpcx
+      - /etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo
+
+    commands:
+      - . /opt/hpcx/hpcx-init.sh
+      - hpcx_load
+      # Run NCCL Tests
+      - |
+        if [ $DSTACK_NODE_RANK -eq 0 ]; then
+          mpirun \
+            --allow-run-as-root \
+            --hostfile $DSTACK_MPI_HOSTFILE \
+            -n $DSTACK_GPUS_NUM \
+            -N $DSTACK_GPUS_PER_NODE \
+            --bind-to none \
+            -mca btl tcp,self \
+            -mca coll_hcoll_enable 0 \
+            -x PATH \
+            -x LD_LIBRARY_PATH \
+            -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
+            -x NCCL_SOCKET_NTHREADS=4 \
+            -x NCCL_NSOCKS_PERTHREAD=8 \
+            -x NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/a100-80gb-sxm-ib-cloud-hypervisor.xml \
+            -x NCCL_IB_MERGE_VFS=0 \
+            -x NCCL_IB_AR_THRESHOLD=0 \
+            -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
+            -x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
+            -x NCCL_IB_QPS_PER_CONNECTION=2 \
+            -x NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
+            -x UCX_NET_DEVICES=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1 \
+            /opt/nccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -t 1 -g 1 -c 1 -n 100
+        else
+          sleep infinity
+        fi
+
+    resources:
+      gpu: A100:8
+      shm_size: 16GB
+    ```
+
+Pass the configuration to `dstack apply`:
+
+
+
+```shell
+$ dstack apply -f crusoe-nccl-tests.dstack.yml
+
+Provisioning...
+---> 100%
+
+nccl-tests provisioning completed (running)
+
+# out-of-place in-place
+# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
+# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
+ 8 2 float sum -1 27.70 0.00 0.00 0 29.82 0.00 0.00 0
+ 16 4 float sum -1 28.78 0.00 0.00 0 28.99 0.00 0.00 0
+ 32 8 float sum -1 28.49 0.00 0.00 0 28.16 0.00 0.00 0
+ 64 16 float sum -1 28.41 0.00 0.00 0 28.69 0.00 0.00 0
+ 128 32 float sum -1 28.94 0.00 0.01 0 28.58 0.00 0.01 0
+ 256 64 float sum -1 29.46 0.01 0.02 0 29.45 0.01 0.02 0
+ 512 128 float sum -1 30.23 0.02 0.03 0 29.85 0.02 0.03 0
+ 1024 256 float sum -1 30.79 0.03 0.06 0 34.03 0.03 0.06 0
+ 2048 512 float sum -1 37.90 0.05 0.10 0 33.22 0.06 0.12 0
+ 4096 1024 float sum -1 35.91 0.11 0.21 0 35.30 0.12 0.22 0
+ 8192 2048 float sum -1 36.84 0.22 0.42 0 38.30 0.21 0.40 0
+ 16384 4096 float sum -1 47.08 0.35 0.65 0 37.26 0.44 0.82 0
+ 32768 8192 float sum -1 45.20 0.72 1.36 0 48.70 0.67 1.26 0
+ 65536 16384 float sum -1 49.43 1.33 2.49 0 50.97 1.29 2.41 0
+ 131072 32768 float sum -1 51.08 2.57 4.81 0 50.17 2.61 4.90 0
+ 262144 65536 float sum -1 192.78 1.36 2.55 0 100.00 2.62 4.92 0
+ 524288 131072 float sum -1 68.02 7.71 14.45 0 69.40 7.55 14.16 0
+ 1048576 262144 float sum -1 81.71 12.83 24.06 0 88.58 11.84 22.20 0
+ 2097152 524288 float sum -1 113.03 18.55 34.79 0 102.21 20.52 38.47 0
+ 4194304 1048576 float sum -1 123.50 33.96 63.68 0 131.71 31.84 59.71 0
+ 8388608 2097152 float sum -1 189.42 44.29 83.04 0 183.01 45.84 85.95 0
+ 16777216 4194304 float sum -1 274.05 61.22 114.79 0 265.91 63.09 118.30 0
+ 33554432 8388608 float sum -1 490.77 68.37 128.20 0 490.53 68.40 128.26 0
+ 67108864 16777216 float sum -1 854.62 78.52 147.23 0 853.49 78.63 147.43 0
+ 134217728 33554432 float sum -1 1483.43 90.48 169.65 0 1479.22 90.74 170.13 0
+ 268435456 67108864 float sum -1 2700.36 99.41 186.39 0 2700.49 99.40 186.38 0
+ 536870912 134217728 float sum -1 5300.49 101.29 189.91 0 5314.91 101.01 189.40 0
+ 1073741824 268435456 float sum -1 10472.2 102.53 192.25 0 10485.6 102.40 192.00 0
+ 2147483648 536870912 float sum -1 20749.1 103.50 194.06 0 20745.7 103.51 194.09 0
+# Out of bounds values : 0 OK
+# Avg bus bandwidth : 53.7387
+```
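+
+The `algbw` and `busbw` columns can be cross-checked against the `size` and `time` columns. In NCCL tests, algorithm bandwidth is the message size divided by the elapsed time, and for all-reduce the bus bandwidth is `algbw * 2 * (n - 1) / n`, where `n` is the number of ranks (16 in this run: 2 nodes with 8 GPUs each). The sketch below is only an illustration; `allreduce_bw` is a hypothetical helper name, not part of `dstack` or NCCL tests.
+
+```shell
+# allreduce_bw is a hypothetical helper, not part of dstack or nccl-tests.
+# Usage: allreduce_bw <size in bytes> <time in microseconds> <number of ranks>
+allreduce_bw() {
+  awk -v size="$1" -v time_us="$2" -v ranks="$3" 'BEGIN {
+    algbw = size / time_us / 1000            # bytes per microsecond -> GB/s
+    busbw = algbw * 2 * (ranks - 1) / ranks  # all-reduce correction factor
+    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
+  }'
+}
+
+# Last row of the sample output: 2147483648 bytes in 20749.1 us across 16 ranks
+allreduce_bw 2147483648 20749.1 16
+```
+
+For that row the helper prints `algbw=103.50 busbw=194.06`, which matches the out-of-place columns in the table.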
+
+
+
+## What's next
+
+1. Learn about [dev environments](https://dstack.ai/docs/concepts/dev-environments), [tasks](https://dstack.ai/docs/concepts/tasks), and [services](https://dstack.ai/docs/concepts/services)
+2. Read the [Kubernetes](https://dstack.ai/docs/guides/kubernetes) and [Clusters](https://dstack.ai/docs/guides/clusters) guides
+3. Check Crusoe's docs on [networking](https://docs.crusoecloud.com/networking/infiniband/) and [Kubernetes](https://docs.crusoecloud.com/orchestration/cmk/index.html)
diff --git a/mkdocs.yml b/mkdocs.yml
index 1652c62bbb..fa67b333e1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -326,6 +326,7 @@ nav:
- GCP A3 Mega: examples/clusters/a3mega/index.md
- GCP A3 High: examples/clusters/a3high/index.md
- AWS EFA: examples/clusters/efa/index.md
+ - Crusoe: examples/clusters/crusoe/index.md
- Inference:
- SGLang: examples/inference/sglang/index.md
- vLLM: examples/inference/vllm/index.md