14 changes: 11 additions & 3 deletions fine-tune.mdx
@@ -88,9 +88,17 @@ For a list of working configuration examples, check out the [Axolotl examples re

Your training environment is located in the `/workspace/fine-tuning/` directory and has the following structure:

* `examples/`: Sample configurations and scripts.
* `outputs/`: Where your training results and model outputs will be saved.
* `config.yaml`: The main configuration file for your training parameters.
<Tree>
<Tree.Folder name="workspace" defaultOpen>
<Tree.Folder name="fine-tuning" defaultOpen>
<Tree.Folder name="examples" defaultOpen comment="Sample configurations and scripts" />
<Tree.Folder name="outputs" defaultOpen comment="Where your training results and model outputs will be saved" />
<Tree.File name="config.yaml" comment="The main configuration file for your training parameters" />
</Tree.Folder>
</Tree.Folder>
</Tree>

The system generates an initial `config.yaml` based on your selected base model and dataset. This is where you define all the hyperparameters for your fine-tuning job. You may need to experiment with these settings to achieve the best results.
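The sketch below shows one way to adjust a few of those hyperparameters programmatically before launching a job. It assumes the PyYAML package is available, and the keys shown (`learning_rate`, `num_epochs`, `lora_r`) are illustrative Axolotl-style assumptions; check your generated file for the actual keys.

```python
# Minimal sketch: tweak hyperparameters in the generated config.yaml.
# Assumes PyYAML and Axolotl-style keys; your file's keys may differ.
import yaml

CONFIG_PATH = "/workspace/fine-tuning/config.yaml"

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

# Adjust a few commonly tuned values before starting the run.
config["learning_rate"] = 2e-4  # lower this if training loss diverges
config["num_epochs"] = 3
config["lora_r"] = 16           # LoRA rank: higher = more trainable parameters

with open(CONFIG_PATH, "w") as f:
    yaml.dump(config, f, sort_keys=False)

print(yaml.dump(config, sort_keys=False))  # review the final settings
```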

12 changes: 8 additions & 4 deletions get-started.mdx
@@ -4,7 +4,9 @@ sidebarTitle: "Quickstart"
description: "Run code on a remote GPU in minutes."
---

Follow this guide to learn how to create an account, deploy your first GPU Pod, and use it to execute code remotely.
import { PodTooltip, NetworkVolumeTooltip, TemplateTooltip } from "/snippets/tooltips.jsx";

Follow this guide to learn how to create an account, deploy your first GPU <PodTooltip />, and use it to execute code remotely.

## Step 1: Create an account

@@ -46,7 +48,7 @@ Take a minute to explore the other tabs:
- **Details**: Information about your Pod, such as hardware specs, pricing, and storage.
- **Telemetry**: Realtime utilization metrics for your Pod's CPU, memory, and storage.
- **Logs**: Logs streamed from your container (including stdout from any applications inside) and the Pod management system.
- **Template Readme**: Details about the template your Pod is running. Your Pod is configured with the latest official Runpod PyTorch template.
- **Template Readme**: Details about the <TemplateTooltip /> your Pod is running. Your Pod is configured with the latest official Runpod PyTorch template.

## Step 4: Execute code on your Pod with JupyterLab

@@ -55,7 +57,9 @@ Take a minute to explore the other tabs:
3. Type `print("Hello, world!")` in the first line of the notebook.
4. Click the play button to run your code.

And that's it—congrats! You just ran your first line of code on Runpod.
<Check>
Congratulations! You just ran your first line of code on Runpod.
</Check>

## Step 5: Clean up

@@ -74,7 +78,7 @@ To terminate your Pod:

<Warning>

Terminating a Pod permanently deletes all data that isn't stored in a [network volume](/storage/network-volumes). Be sure that you've saved any data you might need to access again.
Terminating a Pod permanently deletes all data that isn't stored in a <NetworkVolumeTooltip />. Be sure that you've saved any data you might need to access again.

To learn more about how storage works, see the [Pod storage overview](/pods/storage/types).

4 changes: 3 additions & 1 deletion get-started/api-keys.mdx
@@ -3,6 +3,8 @@ title: "Manage API keys"
description: "Learn how to create, edit, and disable Runpod API keys."
---

import { ServerlessTooltip } from "/snippets/tooltips.jsx";

<Note>

Legacy API keys generated before November 11, 2024 have either Read/Write or Read Only access to GraphQL based on what was set for that key. All legacy keys have full access to AI API. To improve security, generate a new key with **Restricted** permission and select the minimum permission needed for your use case.
@@ -20,7 +22,7 @@ Follow these steps to create a new Runpod API key:
3. Give your key a name and set its permissions (**All**, **Restricted**, or **Read Only**). If you choose **Restricted**, you can customize access for each Runpod API:

* **None**: No access
* **Restricted**: Customize access for each of your [Serverless endpoints](/serverless/overview). (Default: None.)
* **Restricted**: Customize access for each of your <ServerlessTooltip /> endpoints. (Default: None.)
* **Read/Write**: Full access to your endpoints.
* **Read Only**: Read access without write access.

4 changes: 3 additions & 1 deletion get-started/concepts.mdx
@@ -3,6 +3,8 @@ title: "Concepts"
description: "Key concepts and terminology for understanding Runpod's platform and products."
---

import { PodsTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";

## [Runpod console](https://console.runpod.io)

The web interface for managing your compute resources, account, teams, and billing.
@@ -25,7 +27,7 @@ A managed compute cluster with high-speed networking for multi-node distributed

## [Network volume](/storage/network-volumes)

Persistent storage that exists independently of your other compute resources and can be attached to multiple Pods or Serverless endpoints to share data between machines.
Persistent storage that exists independently of your other compute resources and can be attached to multiple <PodsTooltip /> or <ServerlessTooltip /> endpoints to share data between machines.

## [S3-compatible API](/storage/s3-api)

6 changes: 4 additions & 2 deletions get-started/connect-to-runpod.mdx
@@ -3,11 +3,13 @@ title: "Choose a workflow"
description: "Review the available methods for accessing and managing Runpod resources."
---

import { PodsTooltip, EndpointTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";

Runpod offers multiple ways to access and manage your compute resources. Choose the method that best fits your workflow:

## Runpod console

The Runpod console provides an intuitive web interface to manage Pods and endpoints, access Pod terminals, send endpoint requests, monitor resource usage, and view billing and usage history.
The Runpod console provides an intuitive web interface to manage <PodsTooltip /> and <EndpointTooltip />s, access Pod terminals, send endpoint requests, monitor resource usage, and view billing and usage history.

[Launch the Runpod console →](https://www.console.runpod.io)

@@ -19,7 +21,7 @@ You can connect directly to your running Pods and execute code on them using a v

## REST API

The Runpod REST API allows you to programmatically manage and control compute resources. Use the API to manage Pod lifecycles and Serverless endpoints, monitor resource utilization, and integrate Runpod into your applications.
The Runpod REST API allows you to programmatically manage and control compute resources. Use the API to manage Pod lifecycles and <ServerlessTooltip /> endpoints, monitor resource utilization, and integrate Runpod into your applications.

[Explore the API reference →](/api-reference/docs/GET/openapi-json)
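As a quick illustration, here is a minimal sketch of an authenticated call. It assumes the `requests` package, an API key in the `RUNPOD_API_KEY` environment variable, and a `/pods` listing route (an assumption; confirm base URL and paths against the API reference):

```python
# Minimal sketch: list your Pods via the REST API using an API key.
# The base URL and /pods route are assumptions; verify them in the API reference.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]

response = requests.get(
    "https://rest.runpod.io/v1/pods",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

pods = response.json()
print(pods)  # the exact response shape may vary; inspect it before parsing
```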

4 changes: 3 additions & 1 deletion get-started/manage-accounts.mdx
@@ -3,13 +3,15 @@ title: "Manage accounts"
description: "Create accounts, manage teams, and configure user permissions in Runpod."
---

import { PodsTooltip, ServerlessTooltip, InstantClusterTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx";

To access Runpod resources, you need to either create your own account or join an existing team through an invitation. This guide explains how to set up and manage accounts, teams, and user roles.

## Create an account

Sign up for a Runpod account at [console.runpod.io/signup](https://www.console.runpod.io/signup).

Once created, you can use your account to deploy Pods, create Serverless endpoints, and access other Runpod services. Personal accounts can be converted to team accounts at any time to enable collaboration features.
Once created, you can use your account to deploy <PodsTooltip />, create <ServerlessTooltip /> endpoints, and access other Runpod services. Personal accounts can be converted to team accounts at any time to enable collaboration features.

## Convert to a team account

10 changes: 6 additions & 4 deletions get-started/products.mdx
@@ -4,23 +4,25 @@ sidebarTitle: "Product overview"
description: "Explore Runpod's major offerings and find the right solution for your workload."
---

Runpod offers cloud computing resources for AI and machine learning workloads. You can choose from instant GPUs for development, auto-scaling Serverless computing, pre-deployed AI models, or multi-node clusters for distributed training.
import { ServerlessTooltip, PodsTooltip, PublicEndpointTooltip, InstantClusterTooltip, WorkerTooltip } from "/snippets/tooltips.jsx";

Runpod offers cloud computing resources for AI and machine learning workloads. You can choose from instant GPUs for development, auto-scaling <ServerlessTooltip /> computing, pre-deployed AI models, or multi-node clusters for distributed training.

## [Serverless](/serverless/overview)

Serverless provides pay-per-second computing with automatic scaling for production AI workloads. You only pay for actual compute time when your code runs, with no idle costs, making Serverless ideal for variable workloads and cost-efficient production deployments.

## [Pods](/pods/overview)

Pods give you dedicated GPU or CPU instances for containerized workloads. Pods are billed by the minute and stay available as long as you keep them running, making them perfect for development, training, and workloads that need continuous access.
<PodsTooltip /> give you dedicated GPU or CPU instances for containerized workloads. Pods are billed by the minute and stay available as long as you keep them running, making them perfect for development, training, and workloads that need continuous access.

## [Public Endpoints](/hub/public-endpoints)

Public Endpoints provide instant API access to pre-deployed AI models for image, video, and text generation without any setup. You only pay for what you generate, making it easy to integrate AI into your applications without managing infrastructure.
<PublicEndpointTooltip />s provide instant API access to pre-deployed AI models for image, video, and text generation without any setup. You only pay for what you generate, making it easy to integrate AI into your applications without managing infrastructure.

## [Instant Clusters](/instant-clusters)

Instant Clusters deliver fully managed multi-node compute clusters for large-scale distributed workloads. With high-speed networking between nodes, you can run multi-node training, fine-tune large language models, and handle other tasks that require multiple GPUs working in parallel.
<InstantClusterTooltip />s deliver fully managed multi-node compute clusters for large-scale distributed workloads. With high-speed networking between nodes, you can run multi-node training, fine-tune large language models, and handle other tasks that require multiple GPUs working in parallel.


## Choosing the right option
10 changes: 6 additions & 4 deletions hub/overview.mdx
@@ -4,7 +4,9 @@ sidebarTitle: "Overview"
description: "Discover, deploy, and share preconfigured AI repos using the Runpod Hub."
---

The [Runpod Hub](https://console.runpod.io/hub) is a centralized repository that enables users to discover, share, and deploy preconfigured AI repos optimized for Runpod's [Serverless](/serverless/overview/) and [Pod](/pods/overview) infrastructure. It offers a catalog of vetted, open-source repositories that can be deployed with minimal setup, creating a collaborative ecosystem for AI developers and users.
import { ServerlessTooltip, PodTooltip, EndpointTooltip, PublicEndpointTooltip, HandlerFunctionTooltip, WorkerTooltip } from "/snippets/tooltips.jsx";

The [Runpod Hub](https://console.runpod.io/hub) is a centralized repository that enables users to discover, share, and deploy preconfigured AI repos optimized for Runpod's <ServerlessTooltip /> and <PodTooltip /> infrastructure. It offers a catalog of vetted, open-source repositories that can be deployed with minimal setup, creating a collaborative ecosystem for AI developers and users.

Whether you're a developer looking to share your work or a user seeking preconfigured solutions, the Hub makes discovering and deploying AI projects seamless and efficient.

@@ -32,7 +34,7 @@ The Hub simplifies the entire lifecycle of repo sharing and deployment, from ini

## Public Endpoints

In addition to official and community-submitted repos, the Hub also offers [Public Endpoints](/hub/public-endpoints) for popular AI models. These are ready-to-use APIs that you can integrate directly into your applications without needing to manage any of the underlying infrastructure.
In addition to official and community-submitted repos, the Hub also offers <PublicEndpointTooltip />s for popular AI models. These are ready-to-use APIs that you can integrate directly into your applications without needing to manage any of the underlying infrastructure.

Public Endpoints provide:

@@ -63,7 +65,7 @@ You can deploy a repo from the Hub in seconds, choosing between Serverless endpo
4. Click the **Deploy** button in the top-right of the repo page. You can also use the dropdown menu to deploy an older version.
5. Click **Create Endpoint**.

Within minutes you'll have access to a new Serverless endpoint, ready for integration with your applications or experimentation.
Within minutes you'll have access to a new Serverless <EndpointTooltip />, ready for integration with your applications or experimentation.
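For example, here is a minimal sketch of calling the endpoint from Python. `ENDPOINT_ID` is a hypothetical placeholder for the ID shown in the console, and the `input` payload shape depends entirely on the repo's handler:

```python
# Minimal sketch: send a synchronous request to a deployed Serverless endpoint.
# ENDPOINT_ID is a placeholder; shape the "input" payload to match the repo's handler.
import os
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello, world!"}},  # example payload only
    timeout=120,
)
response.raise_for_status()
print(response.json())
```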

### Deploy as a Pod

@@ -96,7 +98,7 @@ Where `POD_ID` is your Pod's actual ID.

## Publish your own repo

You can [publish your own repo](/hub/publishing-guide) on the Hub by preparing your GitHub repository with a working [Serverless endpoint](/serverless/overview) implementation, comprised of a [worker handler function](/serverless/workers/handler-functions) and `Dockerfile`.
You can [publish your own repo](/hub/publishing-guide) on the Hub by preparing your GitHub repository with a working Serverless endpoint implementation, composed of a <WorkerTooltip /> <HandlerFunctionTooltip /> and a `Dockerfile`.
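A handler can be very small. The sketch below is a minimal echo worker using the `runpod` Python SDK; your `Dockerfile` would install the SDK and run this script as the container's entrypoint:

```python
# Minimal sketch of a Serverless worker: a handler that echoes its input.
import runpod

def handler(job):
    """Receive a job, read its input, and return a result."""
    job_input = job["input"]
    return {"echo": job_input}

# Start the worker and register the handler with the Serverless runtime.
runpod.serverless.start({"handler": handler})
```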

<Tip>
To learn how to build your first worker, [follow this guide](/serverless/workers/custom-worker).
6 changes: 4 additions & 2 deletions instant-clusters.mdx
@@ -4,6 +4,8 @@ sidebarTitle: "Overview"
description: "Fully managed compute clusters for multi-node training and AI inference."
---

import { DataCenterTooltip, PyTorchTooltip } from "/snippets/tooltips.jsx";

<Tip>

Runpod offers custom Instant Cluster pricing plans for large scale and enterprise workloads. If you're interested in learning more, [contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA).
@@ -37,15 +39,15 @@ Instant Clusters feature high-speed local networking for efficient data movement
* Most clusters include 3200 Gbps networking.
* A100 clusters offer up to 1600 Gbps networking.

This fast networking enables efficient scaling of distributed training and inference workloads. Runpod ensures nodes selected for clusters are within the same data center for optimal performance.
This fast networking enables efficient scaling of distributed training and inference workloads. Runpod ensures nodes selected for clusters are within the same <DataCenterTooltip /> for optimal performance.

## Zero configuration

Runpod automates cluster setup so you can focus on your workloads:

* Clusters are pre-configured with static IP address management.
* All necessary [environment variables](#environment-variables) for distributed training are pre-configured (see the sketch after this list).
* Supports popular frameworks like PyTorch, TensorFlow, and Slurm.
* Supports popular frameworks like <PyTorchTooltip />, TensorFlow, and Slurm.
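As a quick illustration, here is a sketch of a training script reading those variables at startup. The names below (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) are the standard PyTorch rendezvous variables and are assumptions here; see the [environment variables](#environment-variables) section for the authoritative list.

```python
# Sketch: read the pre-configured rendezvous variables inside a cluster node.
# Variable names assume the standard PyTorch convention; consult the
# environment variables section for the exact names Runpod sets.
import os

master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = os.environ.get("MASTER_PORT", "29500")
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

print(f"Node rank {rank}/{world_size}, rendezvous at {master_addr}:{master_port}")
```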

## Get started

4 changes: 3 additions & 1 deletion instant-clusters/axolotl.mdx
@@ -86,7 +86,9 @@ After running the command on the last Pod, you should see output similar to this
[2025-04-01 19:24:22,603] [INFO] [axolotl.train.save_trained_model:211] [PID:1009] [RANK:0] Training completed! Saving pre-trained model to ./outputs/lora-out.
```

Congrats! You've successfully trained a model using Axolotl on an Instant Cluster. Your fine-tuned model has been saved to the `./outputs/lora-out` directory. You can now use this model for inference or continue training with different parameters.
<Check>
Congratulations! You've successfully trained a model using Axolotl on an Instant Cluster. Your fine-tuned model has been saved to the `./outputs/lora-out` directory. You can now use this model for inference or continue training with different parameters.
</Check>
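If you want to try the adapter right away, here is a minimal inference sketch. It assumes the `transformers` and `peft` packages are installed, and `BASE_MODEL` is a placeholder for the `base_model` value from your Axolotl config:

```python
# Sketch: load the fine-tuned LoRA adapter for inference.
# BASE_MODEL is a placeholder for the base model your config trained against.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "your-base-model"  # use the `base_model` value from config.yaml

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
model = PeftModel.from_pretrained(model, "./outputs/lora-out")  # apply the adapter

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```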

## Step 4: Clean up

4 changes: 3 additions & 1 deletion instant-clusters/pytorch.mdx
@@ -3,7 +3,9 @@ title: "Deploy an Instant Cluster with PyTorch"
sidebarTitle: "PyTorch"
---

This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
import { PyTorchTooltip } from "/snippets/tooltips.jsx";

This tutorial demonstrates how to use Instant Clusters with <PyTorchTooltip /> to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.

Follow the steps below to deploy a cluster and start running distributed PyTorch workloads efficiently.
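As a preview of what a distributed workload looks like, here is a minimal sketch that initializes a process group and performs an all-reduce across GPUs. It assumes a PyTorch environment and is launched with `torchrun`, which supplies the rendezvous environment variables:

```python
# Sketch: verify multi-GPU communication with an all-reduce.
# Launch with: torchrun --nproc_per_node=NUM_GPUS this_script.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # torchrun provides RANK/WORLD_SIZE/etc.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes its rank; after all-reduce every rank holds the
# sum 0 + 1 + ... + (world_size - 1).
tensor = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}: sum = {tensor.item()}")

dist.destroy_process_group()
```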

4 changes: 3 additions & 1 deletion pods/choose-a-pod.mdx
@@ -4,6 +4,8 @@ description: "Select the right Pod by evaluating your resource requirements."
sidebar_position: 3
---

import { CUDATooltip } from "/snippets/tooltips.jsx";

Selecting the appropriate Pod configuration is a crucial step in maximizing performance and efficiency for your specific workloads. This guide will help you understand the key factors to consider when choosing a Pod that meets your requirements.

## Understanding your workload needs
@@ -28,7 +30,7 @@ There are several online tools that can help you estimate your resource requirem

### GPU selection

The GPU is the cornerstone of computational performance for many workloads. When selecting your GPU, consider the architecture that best suits your software requirements. NVIDIA GPUs with CUDA support are essential for most machine learning frameworks, while some applications might perform better on specific GPU generations. Evaluate both the raw computing power (CUDA cores, tensor cores) and the memory bandwidth to ensure optimal performance for your specific tasks.
The GPU is the cornerstone of computational performance for many workloads. When selecting your GPU, consider the architecture that best suits your software requirements. NVIDIA GPUs with <CUDATooltip /> support are essential for most machine learning frameworks, while some applications might perform better on specific GPU generations. Evaluate both the raw computing power (CUDA cores, tensor cores) and the memory bandwidth to ensure optimal performance for your specific tasks.

For machine learning inference, a mid-range GPU might be sufficient, while training large models requires more powerful options. Check framework-specific recommendations, as PyTorch, TensorFlow, and other frameworks may perform differently across GPU types.
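Once a Pod is running, you can sanity-check what its GPU actually offers. Here is a minimal sketch, assuming PyTorch is installed (as it is in the official PyTorch template):

```python
# Sketch: inspect the GPU a Pod exposes, assuming PyTorch is installed.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")
    print(f"Streaming multiprocessors: {props.multi_processor_count}")
    print(f"CUDA capability: {props.major}.{props.minor}")
else:
    print("No CUDA device visible")
```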
