diff --git a/docs/blog/archive/ambassador-program.md b/docs/blog/archive/ambassador-program.md
deleted file mode 100644
index 778f2cb37d..0000000000
--- a/docs/blog/archive/ambassador-program.md
+++ /dev/null
@@ -1,65 +0,0 @@
----
-title: "Get involved as a community ambassador"
-date: 2024-12-18
-description: "Join dstack as an ambassador to grow the community, share knowledge, and help others use dstack."
-slug: ambassador-program
-image: https://dstack.ai/static-assets/static-assets/images/ambassador-program.png
-categories:
-  - Community
----
-
-# Get involved as a community ambassador
-
-As we wrap up an exciting year at `dstack`, we’re thrilled to introduce our Ambassador Program. This initiative invites AI
-infrastructure enthusiasts and those passionate about open-source AI to share their knowledge, contribute to the growth
-of the `dstack` community, and play a key role in advancing the open AI ecosystem.
-
-[//]: # (What community is about:)
-[//]: # (- Open-source)
-[//]: # (- AI infrastructure)
-[//]: # (- AI containers)
-[//]: # (- Openness)
-
-[//]: # (Mention:)
-[//]: # (Who we are looking for)
-
-
-
-
-## What will you do as an ambassador?
-
-As an ambassador, you’ll play a vital role in sharing best practices for using containers in AI workflows, advocating
-for open-source tools for AI model training and inference, and helping the community use `dstack` with
-various cloud providers, data centers, and GPU vendors.
-
-Your contributions might include writing technical blog posts, delivering talks, organizing `dstack` meetups, and
-championing the open AI ecosystem within the broader community.
-
-## Who is the program for?
-
-Whether you’re new to `dstack` or already experienced, the ambassador program is open to anyone passionate
-about open-source AI, eager to share knowledge, and excited to engage with the AI community.
-
-## How do we support ambassadors?
-
-At `dstack`, we are committed to supporting ambassadors through recognition, amplifying their content, and providing
-cloud GPU credits to power their projects.
-
-## How to apply?
-
-If you’re interested in becoming an ambassador, fill out a quick form with details about
-yourself and your experience. We’ll reach out with a starter kit and next steps.
-
-
-    Get involved
-
-
-Have questions? Reach out via [Discord](https://discord.gg/u8SmfwPpMd)!
-
-> 💜 In the meantime, we’re thrilled to
-> welcome [Park Chansung](https://x.com/algo_diver), the
-> first `dstack` ambassador.
diff --git a/docs/blog/archive/efa.md b/docs/blog/archive/efa.md
deleted file mode 100644
index 6841cd976b..0000000000
--- a/docs/blog/archive/efa.md
+++ /dev/null
@@ -1,173 +0,0 @@
----
-title: Efficient distributed training with AWS EFA
-date: 2025-02-20
-description: "The latest release of dstack allows you to use AWS EFA for your distributed training tasks."
-slug: efa
-image: https://dstack.ai/static-assets/static-assets/images/distributed-training-with-aws-efa-v2.png
-categories:
-  - Cloud fleets
----
-
-# Efficient distributed training with AWS EFA
-
-[Amazon Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) is a high-performance network interface designed for AWS EC2 instances, enabling
-ultra-low latency and high-throughput communication between nodes. This makes it an ideal solution for scaling
-distributed training workloads across multiple GPUs and instances.
-
-With the latest release of `dstack`, you can now leverage AWS EFA to supercharge your distributed training tasks.
-
-
-
-
-## About EFA
-
-AWS EFA delivers up to 400 Gbps of bandwidth, enabling lightning-fast GPU-to-GPU communication across nodes. By
-bypassing the kernel and providing direct network access, EFA minimizes latency and maximizes throughput. Its native
-integration with the `nccl` library ensures optimal performance for large-scale distributed training.
-
-With EFA, you can scale your training tasks to thousands of nodes.
-
-To use AWS EFA with `dstack`, follow these steps to run your distributed training tasks.
-
-## Configure the backend
-
-Before using EFA, ensure the `aws` backend is properly configured.
-
-If you're using P4 or P5 instances with multiple
-network interfaces, you’ll need to disable public IPs. Note, the `dstack`
-server in this case should have access to the private subnet of the VPC.
-
-You’ll also need to specify an AMI that includes the GDRCopy drivers. For example, you can use the
-[AWS Deep Learning Base GPU AMI](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/).
-
-Here’s an example backend configuration:
-
-```yaml
-projects:
-- name: main
-  backends:
-  - type: aws
-    creds:
-      type: default
-    regions: ["us-west-2"]
-    public_ips: false
-    vpc_name: my-vpc
-    os_images:
-      nvidia:
-        name: Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20241115
-        owner: 898082745236
-        user: ubuntu
-```
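As a quick sanity check, you can confirm that the AMI referenced above is actually available in your target region before pointing the backend at it. A minimal sketch using the AWS CLI (an illustrative invocation, assuming the CLI is installed and credentials are configured; the name and owner are taken from the config above):

```shell
# Look up the referenced Deep Learning Base GPU AMI in us-west-2
aws ec2 describe-images \
    --region us-west-2 \
    --owners 898082745236 \
    --filters "Name=name,Values=Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20241115" \
    --query "Images[].ImageId" \
    --output text
```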
-
-## Create a fleet
-
-Once the backend is configured, you can create a fleet for distributed training. Here’s an example fleet
-configuration:
-
-```yaml
-type: fleet
-name: my-efa-fleet
-
-# Specify the number of instances
-nodes: 2
-placement: cluster
-
-resources:
-  gpu: H100:8
-```
-
-To provision the fleet, use the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:
-
-```shell
-$ dstack apply -f examples/misc/efa/fleet.dstack.yml
-
-Provisioning...
----> 100%
-
- FLEET         INSTANCE  BACKEND          GPU          PRICE   STATUS  CREATED
- my-efa-fleet  0         aws (us-west-2)  8xH100:80GB  $98.32  idle    3 mins ago
-               1         aws (us-west-2)  8xH100:80GB  $98.32  idle    3 mins ago
-```
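Once provisioning finishes, it can be worth confirming the fleet state from the CLI before submitting any work. A minimal sketch (assuming the `dstack fleet` listing command available in recent releases):

```shell
# List fleets and their instances; both nodes should be reported as idle
dstack fleet
```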
-
-## Submit the task
-
-With the fleet provisioned, you can now submit your distributed training task. Here’s an example task configuration:
-
-```yaml
-type: task
-name: efa-task
-
-# The size of the cluster
-nodes: 2
-
-python: 3.12
-
-# Commands to run on each node
-commands:
-  - pip install -r requirements.txt
-  - accelerate launch
-    --num_processes $DSTACK_GPUS_NUM
-    --num_machines $DSTACK_NODES_NUM
-    --machine_rank $DSTACK_NODE_RANK
-    --main_process_ip $DSTACK_MASTER_NODE_IP
-    --main_process_port 29500
-    task.py
-
-env:
-  - LD_LIBRARY_PATH=/opt/nccl/build/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
-  - FI_PROVIDER=efa
-  - FI_EFA_USE_HUGE_PAGE=0
-  - OMPI_MCA_pml=^cm,ucx
-  - NCCL_TOPO_FILE=/opt/amazon/efa/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml # Typically loaded automatically, might not be necessary
-  - OPAL_PREFIX=/opt/amazon/openmpi
-  - NCCL_SOCKET_IFNAME=^docker0,lo
-  - FI_EFA_USE_DEVICE_RDMA=1
-  - NCCL_DEBUG=INFO # Optional debugging for NCCL communication
-  - NCCL_DEBUG_SUBSYS=TUNING
-
-resources:
-  gpu: H100:8
-  shm_size: 24GB
-```
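Before kicking off a long run, you may also want to verify that the EFA devices are visible from inside the container, since a silent fallback to TCP can be hard to spot. A minimal sketch (assuming the libfabric utilities shipped with the EFA installer are present in the image):

```shell
# List libfabric providers; the EFA interfaces should be reported here
fi_info -p efa
```

With `NCCL_DEBUG=INFO` set as above, the NCCL logs typically also mention the libfabric network plugin when EFA is actually in use.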
-
-Submit the task using the [`dstack apply`](../../docs/reference/cli/dstack/apply.md) command:
-
-```shell
-$ dstack apply -f examples/misc/efa/task.dstack.yml -R
-```
-
-`dstack` will automatically run the container on each node of the cluster, passing the necessary environment variables.
-`nccl` will leverage the EFA drivers and the specified environment variables to enable high-performance communication via
-EFA.
-
-> Have questions? You're welcome to join
-> our [Discord](https://discord.gg/u8SmfwPpMd) or talk
-> directly to [our team](https://calendly.com/dstackai/discovery-call).
-
-!!! info "What's next?"
-    1. Check [fleets](../../docs/concepts/fleets.md), [tasks](../../docs/concepts/tasks.md), and [volumes](../../docs/concepts/volumes.md)
-    2. Also see [dev environments](../../docs/concepts/dev-environments.md) and [services](../../docs/concepts/services.md)
-    3. Join [Discord](https://discord.gg/u8SmfwPpMd)
diff --git a/docs/blog/posts/changelog-07-25.md b/docs/blog/posts/changelog-07-25.md
index 909fa28593..a065ef37c7 100644
--- a/docs/blog/posts/changelog-07-25.md
+++ b/docs/blog/posts/changelog-07-25.md
@@ -144,7 +144,7 @@ resources:
 
 #### AWS EFA
 
-EFA is a network interface for EC2 that enables low-latency, high-bandwidth communication between nodes—crucial for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when using supported instance types in fleets. Check out our [example](../../examples/clusters/efa/index.md)
+EFA is a network interface for EC2 that enables low-latency, high-bandwidth communication between nodes—crucial for scaling distributed deep learning. With `dstack`, EFA is automatically enabled when using supported instance types in fleets. Check out our [example](../../examples/clusters/aws/index.md).
 
 #### Default Docker images
diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md
index 9a22fada75..99912cd75b 100644
--- a/docs/docs/concepts/fleets.md
+++ b/docs/docs/concepts/fleets.md
@@ -107,7 +107,7 @@ This ensures all instances are provisioned with optimal inter-node connectivity.
 
     Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration. Otherwise, instances are only connected by the default VPC subnet.
-    Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
+    Refer to the [AWS](../../examples/clusters/aws/index.md) example for more details.
 
 ??? info "GCP"
     When you create a fleet with GCP, `dstack` automatically configures [GPUDirect-TCPXO and GPUDirect-TCPX](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot) networking for the A3 Mega and A3 High instance types, as well as RoCE networking for the A4 instance type.
diff --git a/docs/docs/guides/clusters.md b/docs/docs/guides/clusters.md
index 81e356edf4..311db648c0 100644
--- a/docs/docs/guides/clusters.md
+++ b/docs/docs/guides/clusters.md
@@ -22,7 +22,7 @@ For cloud fleets, fast interconnect is currently supported only on the `aws`, `g
 
     !!! info "Backend configuration"
         Note, EFA requires the `public_ips` to be set to `false` in the `aws` backend configuration.
-        Refer to the [EFA](../../examples/clusters/efa/index.md) example for more details.
+        Refer to the [AWS](../../examples/clusters/aws/index.md) example for more details.
 
 === "GCP"
     When you create a cloud fleet with GCP, `dstack` automatically configures [GPUDirect-TCPXO and GPUDirect-TCPX](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot) networking for the A3 Mega and A3 High instance types, as well as RoCE networking for the A4 instance type.
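Since both docs above hinge on EFA being picked up correctly, a common way to validate the interconnect on a freshly provisioned cluster is to run the standard NCCL benchmarks before any real training. A rough sketch (assuming the `nccl-tests` binaries are built inside the container; the flags are the usual `all_reduce_perf` options):

```shell
# Sweep all-reduce sizes from 8 B to 8 GB; typically launched with mpirun, one rank per GPU
all_reduce_perf -b 8 -e 8G -f 2 -g 1
```

Bus bandwidth far below the expected EFA figures usually means traffic is falling back to TCP.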
diff --git a/docs/examples.md b/docs/examples.md
index b6fbe7d470..9f26bd0a2f 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -113,7 +113,7 @@ hide:
 
-   AWS EFA
+   AWS
 
diff --git a/docs/examples/clusters/efa/index.md b/docs/examples/clusters/aws/index.md
similarity index 100%
rename from docs/examples/clusters/efa/index.md
rename to docs/examples/clusters/aws/index.md
diff --git a/examples/clusters/efa/README.md b/examples/clusters/aws/README.md
similarity index 90%
rename from examples/clusters/efa/README.md
rename to examples/clusters/aws/README.md
index 98198a8388..f7cb622704 100644
--- a/examples/clusters/efa/README.md
+++ b/examples/clusters/aws/README.md
@@ -1,4 +1,4 @@
-# AWS EFA
+# AWS
 
 In this guide, we’ll walk through how to run high-performance distributed training on AWS
 using [Amazon Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) with `dstack`.
@@ -37,11 +37,11 @@ projects:
 
 Once your backend is ready, define a fleet configuration.
 
 ```yaml
 type: fleet
-name: my-efa-fleet
+name: efa-fleet
 
 nodes: 2
 placement: cluster
@@ -57,14 +57,14 @@ Provision the fleet with `dstack apply`:
 
 ```shell
-$ dstack apply -f examples/clusters/efa/fleet.dstack.yml
+$ dstack apply -f examples/clusters/aws/fleet.dstack.yml
 
 Provisioning...
 ---> 100%
 
- FLEET         INSTANCE  BACKEND          INSTANCE TYPE  GPU          PRICE   STATUS  CREATED
- my-efa-fleet  0         aws (us-west-2)  p4d.24xlarge   H100:8:80GB  $98.32  idle    3 mins ago
-               1         aws (us-west-2)  p4d.24xlarge   H100:8:80GB  $98.32  idle    3 mins ago
+ FLEET      INSTANCE  BACKEND          INSTANCE TYPE  GPU          PRICE   STATUS  CREATED
+ efa-fleet  0         aws (us-west-2)  p4d.24xlarge   H100:8:80GB  $98.32  idle    3 mins ago
+            1         aws (us-west-2)  p4d.24xlarge   H100:8:80GB  $98.32  idle    3 mins ago
 ```
@@ -76,7 +76,7 @@
 
 ```yaml
 type: fleet
-name: my-efa-fleet
+name: efa-fleet
 
 nodes: 2
 placement: cluster
diff --git a/examples/clusters/efa/fleet.dstack.yml b/examples/clusters/aws/fleet.dstack.yml
similarity index 100%
rename from examples/clusters/efa/fleet.dstack.yml
rename to examples/clusters/aws/fleet.dstack.yml
diff --git a/mkdocs.yml b/mkdocs.yml
index ef95e9c8dd..7a657b1fd4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -145,11 +145,10 @@ plugins:
       'docs/examples/deployment/tgi/index.md': 'examples/inference/tgi/index.md'
       'providers.md': 'partners.md'
       'backends.md': 'partners.md'
-      'blog/ambassador-program.md': 'blog/archive/ambassador-program.md'
       'blog/monitoring-gpu-usage.md': 'blog/posts/dstack-metrics.md'
       'blog/inactive-dev-environments-auto-shutdown.md': 'blog/posts/inactivity-duration.md'
       'blog/data-centers-and-private-clouds.md': 'blog/posts/gpu-blocks-and-proxy-jump.md'
-      'blog/distributed-training-with-aws-efa.md': 'examples/clusters/efa/index.md'
+      'blog/distributed-training-with-aws-efa.md': 'examples/clusters/aws/index.md'
       'blog/dstack-stats.md': 'blog/posts/dstack-metrics.md'
       'docs/concepts/metrics.md': 'docs/guides/metrics.md'
       'docs/guides/monitoring.md': 'docs/guides/metrics.md'
@@ -166,11 +165,12 @@
       'examples/deployment/trtllm/index.md': 'examples/inference/trtllm/index.md'
       'examples/fine-tuning/trl/index.md': 'examples/single-node-training/trl/index.md'
       'examples/fine-tuning/axolotl/index.md': 'examples/single-node-training/axolotl/index.md'
-      'blog/efa.md': 'examples/clusters/efa/index.md'
+      'blog/efa.md': 'examples/clusters/aws/index.md'
       'docs/concepts/repos.md': 'docs/concepts/dev-environments.md#repos'
       'examples/clusters/a3high/index.md': 'examples/clusters/gcp/index.md'
       'examples/clusters/a3mega/index.md': 'examples/clusters/gcp/index.md'
       'examples/clusters/a4/index.md': 'examples/clusters/gcp/index.md'
+      'examples/clusters/efa/index.md': 'examples/clusters/aws/index.md'
   - typeset
   - gen-files:
       scripts:  # always relative to mkdocs.yml
@@ -326,7 +326,7 @@ nav:
       - NCCL tests: examples/clusters/nccl-tests/index.md
       - RCCL tests: examples/clusters/rccl-tests/index.md
       - GCP: examples/clusters/gcp/index.md
-      - AWS EFA: examples/clusters/efa/index.md
+      - AWS: examples/clusters/aws/index.md
       - Crusoe: examples/clusters/crusoe/index.md
   - Inference:
       - SGLang: examples/inference/sglang/index.md
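One way to check the new redirect mappings from this change is to build the docs locally; the redirects plugin writes a stub page at each old URL. A minimal sketch (assuming `mkdocs` and the project's plugins are installed):

```shell
# Build the site and confirm a stub exists at the old EFA location
mkdocs build
ls site/examples/clusters/efa/index.html
```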