diff --git a/docs.json b/docs.json
index eb5cb51c..bc8939df 100644
--- a/docs.json
+++ b/docs.json
@@ -119,6 +119,19 @@
}
]
},
+ {
+ "group": "Flash",
+ "pages": [
+ "flash/overview",
+ "flash/quickstart",
+ "flash/pricing",
+ "flash/remote-functions",
+ "flash/api-endpoints",
+ "flash/deploy-apps",
+ "flash/resource-configuration",
+ "flash/monitoring"
+ ]
+ },
{
"group": "Pods",
"pages": [
diff --git a/flash/api-endpoints.mdx b/flash/api-endpoints.mdx
new file mode 100644
index 00000000..b8f04631
--- /dev/null
+++ b/flash/api-endpoints.mdx
@@ -0,0 +1,236 @@
+---
+title: "Create a Flash API endpoint"
+sidebarTitle: "Create an endpoint"
+description: "Build and serve HTTP APIs using FastAPI with Flash."
+tag: "BETA"
+---
+
+Flash API endpoints let you build HTTP APIs with FastAPI that run on Runpod Serverless workers. Use them to deploy production APIs that need GPU or CPU acceleration.
+
+Unlike a standalone script that runs once and returns a result, an API endpoint persists to handle incoming HTTP requests. Each request is processed by a Serverless worker using the same remote functions you'd use in a standalone script.
+
+
+
+Flash API endpoints are currently available for local testing only. Run `flash run` to start the API server on your local machine. Production deployment support is coming in future updates.
+
+
+
+## Step 1: Initialize a new project
+
+Use the `flash init` command to generate a structured project template with a preconfigured FastAPI application entry point.
+
+Run this command to initialize a new project directory:
+
+```bash
+flash init my_project
+```
+
+You can also initialize your current directory:
+
+```bash
+flash init
+```
+
+## Step 2: Explore the project template
+
+This is the structure of the project template created by `flash init`:
+
+```text
+my_project/
+├── main.py # FastAPI application entry point
+├── workers/
+│ ├── gpu/ # GPU worker example
+│ │ ├── __init__.py # FastAPI router
+│ │ └── endpoint.py # GPU script with @remote decorated function
+│ └── cpu/ # CPU worker example
+│ ├── __init__.py # FastAPI router
+│ └── endpoint.py # CPU script with @remote decorated function
+├── .env # Environment variable template
+├── .gitignore # Git ignore patterns
+├── .flashignore # Flash deployment ignore patterns
+├── requirements.txt # Python dependencies
+└── README.md # Project documentation
+```
+
+This template includes:
+
+- A FastAPI application entry point and routers.
+- Templates for Python dependencies, `.env`, `.gitignore`, etc.
+- Flash scripts (`endpoint.py`) for both GPU and CPU workers, which include:
+ - Pre-configured worker scaling limits using the `LiveServerless()` object.
+ - A `@remote` decorated function that returns a response from a worker.
+
+When you start the FastAPI server, it creates API endpoints at `/gpu/hello` and `/cpu/hello`, which call the remote function described in their respective `endpoint.py` files.
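+
+As a rough sketch (the exact contents of the generated template may differ), the GPU router in `workers/gpu/__init__.py` wires a FastAPI route to the remote function from `endpoint.py` along these lines. The function name `hello` is a placeholder:
+
+```python
+from fastapi import APIRouter, Body
+from .endpoint import hello  # placeholder name for the @remote function in endpoint.py
+
+router = APIRouter()
+
+@router.post("/hello")
+async def gpu_hello(message: str = Body(..., embed=True)):
+    # Accepts JSON like {"message": "..."} and awaits the remote function,
+    # which submits the job to a Serverless GPU worker
+    return await hello(message)
+```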
+
+## Step 3: Install Python dependencies
+
+After initializing the project, navigate into the project directory:
+
+```bash
+cd my_project
+```
+
+Install required dependencies:
+
+```bash
+pip install -r requirements.txt
+```
+
+## Step 4: Configure your API key
+
+Open the `.env` template file in a text editor and add your [Runpod API key](/get-started/api-keys):
+
+```bash
+# Use your text editor of choice, e.g.
+cursor .env
+```
+
+Remove the `#` symbol from the beginning of the `RUNPOD_API_KEY` line and replace `your_api_key_here` with your actual Runpod API key:
+
+```text
+RUNPOD_API_KEY=your_api_key_here
+# FLASH_HOST=localhost
+# FLASH_PORT=8888
+# LOG_LEVEL=INFO
+```
+
+Save the file and close it.
+
+## Step 5: Start the local API server
+
+Use `flash run` to start the API server:
+
+```bash
+flash run
+```
+
+Open a new terminal tab or window and test your GPU API using cURL:
+
+```bash
+curl -X POST http://localhost:8888/gpu/hello \
+ -H "Content-Type: application/json" \
+ -d '{"message": "Hello from the GPU!"}'
+```
+
+If you switch back to the terminal tab where you used `flash run`, you'll see the details of the job's progress.
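+
+You can test the CPU endpoint the same way, assuming the same request shape:
+
+```bash
+curl -X POST http://localhost:8888/cpu/hello \
+  -H "Content-Type: application/json" \
+  -d '{"message": "Hello from the CPU!"}'
+```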
+
+### Faster testing with auto-provisioning
+
+For development with multiple endpoints, use `--auto-provision` to deploy all resources before testing:
+
+```bash
+flash run --auto-provision
+```
+
+This eliminates cold-start delays by provisioning all serverless endpoints upfront. Endpoints are cached and reused across server restarts, making subsequent runs faster. Resources are identified by name, so the same endpoint won't be re-deployed if the configuration hasn't changed.
+
+## Step 6: Open the API explorer
+
+Besides starting the API server, `flash run` also starts an interactive API explorer. Point your web browser at [http://localhost:8888/docs](http://localhost:8888/docs) to explore the API.
+
+To run remote functions in the explorer:
+
+1. Expand one of the functions under **GPU Workers** or **CPU Workers**.
+2. Click **Try it out** and then **Execute**.
+
+You'll get a response from your workers right in the explorer.
+
+## Step 7: Customize your API
+
+To customize your API endpoint and functionality:
+
+1. Add or edit remote functions in your `endpoint.py` files.
+2. Test the scripts individually by running `python endpoint.py`.
+3. Configure your FastAPI routers by editing the `__init__.py` files.
+4. Add any new endpoints to your `main.py` file.
+
+### Example: Adding a custom endpoint
+
+To add a new GPU endpoint for image generation:
+
+1. Create a new file at `workers/gpu/image_gen.py`:
+
+```python
+from tetra_rp import remote, LiveServerless, GpuGroup
+
+config = LiveServerless(
+ name="image-generator",
+ gpus=[GpuGroup.AMPERE_24],
+ workersMax=2
+)
+
+@remote(
+ resource_config=config,
+ dependencies=["diffusers", "torch", "transformers"]
+)
+def generate_image(prompt: str, width: int = 512, height: int = 512):
+ import torch
+ from diffusers import StableDiffusionPipeline
+ import base64
+ import io
+
+ pipeline = StableDiffusionPipeline.from_pretrained(
+ "runwayml/stable-diffusion-v1-5",
+ torch_dtype=torch.float16
+ ).to("cuda")
+
+ image = pipeline(prompt=prompt, width=width, height=height).images[0]
+
+ buffered = io.BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+
+ return {"image": img_str, "prompt": prompt}
+```
+
+2. Add a route in `workers/gpu/__init__.py`:
+
+```python
+from fastapi import APIRouter
+from .image_gen import generate_image
+
+router = APIRouter()
+
+@router.post("/generate")
+async def generate(prompt: str, width: int = 512, height: int = 512):
+ result = await generate_image(prompt, width, height)
+ return result
+```
+
+3. Include the router in `main.py` if not already included.
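+
+If you need to add it yourself, a minimal sketch looks like this (the import path and `/gpu` prefix are assumptions based on the template layout):
+
+```python
+# main.py (sketch)
+from fastapi import FastAPI
+from workers.gpu import router as gpu_router
+
+app = FastAPI()
+
+# Mount the GPU router so /generate is served at /gpu/generate
+app.include_router(gpu_router, prefix="/gpu")
+```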
+
+## Load-balanced endpoints
+
+For API endpoints requiring low-latency HTTP access with direct routing, use load-balanced endpoints:
+
+```python
+from tetra_rp import LiveLoadBalancer, remote
+
+api = LiveLoadBalancer(name="api-service")
+
+@remote(api, method="POST", path="/api/process")
+async def process_data(x: int, y: int):
+ return {"result": x + y}
+
+@remote(api, method="GET", path="/api/health")
+def health_check():
+ return {"status": "ok"}
+
+# Call functions directly
+result = await process_data(5, 3) # → {"result": 8}
+```
+
+Key differences from queue-based endpoints:
+
+- **Direct HTTP routing**: Requests routed directly to workers, no queue.
+- **Lower latency**: No queuing overhead.
+- **Custom HTTP methods**: GET, POST, PUT, DELETE, PATCH support.
+- **No automatic retries**: Users handle errors directly.
+
+Load-balanced endpoints are ideal for REST APIs, webhooks, and real-time services. Queue-based endpoints are better for batch processing and fault-tolerant workflows.
+
+## Next steps
+
+- [Deploy Flash applications](/flash/deploy-apps) for production use.
+- [Configure resources](/flash/resource-configuration) for your endpoints.
+- [Monitor and debug](/flash/monitoring) your endpoints.
diff --git a/flash/deploy-apps.mdx b/flash/deploy-apps.mdx
new file mode 100644
index 00000000..ff01b646
--- /dev/null
+++ b/flash/deploy-apps.mdx
@@ -0,0 +1,175 @@
+---
+title: "Build and deploy Flash apps"
+sidebarTitle: "Deploy Flash apps"
+description: "Package and deploy Flash applications for production with `flash build`."
+tag: "BETA"
+---
+
+Flash uses a build process to package your application for deployment. This page covers how the build process works, including handler generation, cross-platform builds, and troubleshooting common issues.
+
+## Build process and handler generation
+
+When you run `flash build`, the following happens:
+
+1. **Discovery**: Flash scans your code for `@remote` decorated functions.
+2. **Grouping**: Functions are grouped by their `resource_config`.
+3. **Handler generation**: For each resource config, Flash generates a lightweight handler file.
+4. **Manifest creation**: A `flash_manifest.json` file maps functions to their endpoints.
+5. **Dependency installation**: Python packages are installed with Linux `x86_64` compatibility.
+6. **Packaging**: Everything is bundled into `archive.tar.gz` for deployment.
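+
+To start the build, run `flash build` from your project root. Add `--keep-build` to keep the intermediate build directory for inspection:
+
+```bash
+flash build
+
+# Keep the .flash/.build/ directory after packaging
+flash build --keep-build
+```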
+
+### Handler architecture
+
+Flash uses a factory pattern for handlers to eliminate code duplication:
+
+```python
+# Generated handler (handler_gpu_config.py)
+from tetra_rp.runtime.generic_handler import create_handler
+from workers.gpu import process_data
+
+FUNCTION_REGISTRY = {
+ "process_data": process_data,
+}
+
+handler = create_handler(FUNCTION_REGISTRY)
+```
+
+This approach provides:
+
+- **Single source of truth**: All handler logic in one place.
+- **Easier maintenance**: Bug fixes don't require rebuilding projects.
+
+## Cross-platform builds
+
+Flash automatically handles cross-platform builds, ensuring your deployments work correctly regardless of your development platform:
+
+- **Automatic platform targeting**: Dependencies are installed for Linux `x86_64` (required for [Runpod Serverless](/serverless/overview)), even when building on macOS or Windows.
+- **Python version matching**: The build uses your current Python version to ensure package compatibility.
+- **Binary wheel enforcement**: Only pre-built binary wheels are used, preventing platform-specific compilation issues.
+
+This means you can build on macOS ARM64, Windows, or any other platform, and the resulting package will run correctly on [Runpod Serverless](/serverless/overview).
+
+## Cross-endpoint function calls
+
+Flash enables functions on different endpoints to call each other:
+
+```python
+# CPU endpoint function
+@remote(resource_config=cpu_config)
+def preprocess(data):
+ # Preprocessing logic
+ return clean_data
+
+# GPU endpoint function
+@remote(resource_config=gpu_config)
+async def inference(data):
+ # Can call CPU endpoint function
+ clean = await preprocess(data)
+ # Run inference on clean data
+ return result
+```
+
+The runtime automatically discovers endpoints and routes calls appropriately using the [`flash_manifest.json`](#build-artifacts) file generated during the build process. This lets you build pipelines that use CPU workers for preprocessing and GPU workers for inference, optimizing costs by using the appropriate hardware for each task.
+
+## Build artifacts
+
+After running `flash build`, you'll find these artifacts in the `.flash/` directory:
+
+| Artifact | Description |
+|----------|-------------|
+| `.flash/.build/` | Temporary build directory (removed unless `--keep-build`) |
+| `.flash/archive.tar.gz` | Deployment package |
+| `.flash/flash_manifest.json` | Service discovery configuration |
+
+### Managing bundle size
+
+Runpod Serverless has a **500MB deployment limit**. Exceeding this limit will cause your build to fail.
+
+Use `--exclude` to skip packages that are already included in your base worker image:
+
+```bash
+# For GPU deployments (PyTorch pre-installed)
+flash build --exclude torch,torchvision,torchaudio
+```
+
+Which packages to exclude depends on your [resource config](/flash/resource-configuration):
+
+- **GPU resources** use PyTorch as the base image, which has `torch`, `torchvision`, and `torchaudio` pre-installed.
+- **CPU resources** use Python slim images, which have no ML frameworks pre-installed.
+- **Load-balancer** resources use the same base image as their GPU/CPU counterparts.
+
+
+ You can find details about the Flash worker image in the [runpod-workers/worker-tetra](https://github.com/runpod/worker-tetra) repository. Find the `Dockerfile` for your endpoint type: `Dockerfile` (for GPU workers), `Dockerfile-cpu` (for CPU workers), or `Dockerfile-lb` (for load balancing workers).
+
+
+## Troubleshooting
+
+### No @remote functions found
+
+If the build process can't find your remote functions:
+
+- Ensure your functions are decorated with `@remote(resource_config)`.
+- Check that Python files are not excluded by `.gitignore` or `.flashignore`.
+- Verify function decorators have valid syntax.
+
+### Handler generation failed
+
+If handler generation fails:
+
+- Check for syntax errors in your Python files (they should be logged in the terminal).
+- Verify all imports in your worker modules are available.
+- Ensure resource config variables (e.g., `gpu_config`) are defined before a function references them.
+- Use `--keep-build` to inspect generated handler files in `.flash/.build/`.
+
+### Build succeeded but deployment failed
+
+If the build succeeds but deployment fails:
+
+- Verify all function imports work in the deployment environment.
+- Check that environment variables required by your functions are available.
+- Review the generated `flash_manifest.json` for correct function mappings.
+
+### Dependency installation failed
+
+If dependency installation fails during the build:
+
+- If a package doesn't have pre-built Linux `x86_64` wheels, the build will fail with an error.
+- For newer Python versions (3.13+), some packages may require `manylinux_2_27` or higher.
+- Ensure you have standard pip installed (`python -m ensurepip --upgrade`) for best compatibility.
+- Check PyPI to verify the package supports your Python version on Linux.
+
+### Authentication errors
+
+If you're seeing authentication errors:
+
+Verify your API key is set correctly:
+
+```bash
+echo $RUNPOD_API_KEY # Should show your key
+```
+
+### Import errors in remote functions
+
+Remember to import packages inside remote functions:
+
+```python
+@remote(dependencies=["requests"])
+def fetch_data(url):
+ import requests # Import here, not at top of file
+ return requests.get(url).json()
+```
+
+## Performance optimization
+
+To optimize performance:
+
+- Set `workersMin=1` to keep workers warm and avoid cold starts.
+- Use `idleTimeout` to balance cost and responsiveness.
+- Choose appropriate GPU types for your workload.
+- Use `--auto-provision` with `flash run` to eliminate cold-start delays during development.
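+
+As a sketch, a configuration that keeps one worker warm might look like this (parameter values are illustrative):
+
+```python
+from tetra_rp import LiveServerless, GpuGroup
+
+config = LiveServerless(
+    name="low-latency",
+    gpus=[GpuGroup.ADA_24],
+    workersMin=1,    # Keep one worker warm to avoid cold starts
+    idleTimeout=30,  # Keep workers around longer between requests
+)
+```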
+
+## Next steps
+
+- [View the resource configuration reference](/flash/resource-configuration) for all available options.
+- [Monitor and debug](/flash/monitoring) your deployments.
+- [Learn about pricing](/flash/pricing) to optimize costs.
diff --git a/flash/monitoring.mdx b/flash/monitoring.mdx
new file mode 100644
index 00000000..fb206f58
--- /dev/null
+++ b/flash/monitoring.mdx
@@ -0,0 +1,191 @@
+---
+title: "Monitoring and debugging"
+sidebarTitle: "Monitoring and debugging"
+description: "Monitor, debug, and troubleshoot Flash deployments."
+tag: "BETA"
+---
+
+This page covers how to monitor and debug your Flash deployments, including viewing logs, troubleshooting common issues, and optimizing performance.
+
+## Viewing logs
+
+When running Flash functions, logs are displayed in your terminal. The output includes:
+
+- Endpoint creation and reuse status.
+- Job submission and queue status.
+- Execution progress.
+- Worker information (delay time, execution time).
+
+Example output:
+
+```text
+2025-11-19 12:35:15,109 | INFO | Created endpoint: rb50waqznmn2kg - flash-quickstart-fb
+2025-11-19 12:35:15,112 | INFO | URL: https://console.runpod.io/serverless/user/endpoint/rb50waqznmn2kg
+2025-11-19 12:35:15,114 | INFO | LiveServerless:rb50waqznmn2kg | API /run
+2025-11-19 12:35:15,655 | INFO | LiveServerless:rb50waqznmn2kg | Started Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2
+2025-11-19 12:35:15,762 | INFO | Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2 | Status: IN_QUEUE
+2025-11-19 12:36:09,983 | INFO | Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2 | Status: COMPLETED
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Delay Time: 51842 ms
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Execution Time: 1533 ms
+```
+
+### Log levels
+
+You can control log verbosity using the `LOG_LEVEL` environment variable:
+
+```bash
+LOG_LEVEL=DEBUG python your_script.py
+```
+
+Available log levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`.
+
+## Monitoring in the Runpod console
+
+View detailed metrics and logs in the [Runpod console](https://www.runpod.io/console/serverless):
+
+1. Navigate to the **Serverless** section.
+2. Click on your endpoint to view:
+ - Active workers and queue depth.
+ - Request history and job status.
+ - Worker logs and execution details.
+ - Metrics (requests, latency, errors).
+
+### Endpoint metrics
+
+The console provides metrics including:
+
+- **Request rate**: Number of requests per minute.
+- **Queue depth**: Number of pending requests.
+- **Latency**: Average response time.
+- **Worker count**: Active and idle workers.
+- **Error rate**: Failed requests percentage.
+
+## Debugging common issues
+
+### Cold start delays
+
+If you're experiencing slow initial responses:
+
+- **Cause**: Workers need time to start, load dependencies, and initialize models.
+- **Solutions**:
+ - Set `workersMin=1` to keep at least one worker warm.
+ - Use smaller models or optimize model loading.
+ - Use `--auto-provision` with `flash run` for development.
+
+```python
+config = LiveServerless(
+ name="always-warm",
+ workersMin=1, # Keep one worker always running
+ idleTimeout=30 # Longer idle timeout
+)
+```
+
+### Timeout errors
+
+If requests are timing out:
+
+- **Cause**: Execution taking longer than the timeout limit.
+- **Solutions**:
+ - Increase `executionTimeoutMs` in your configuration.
+ - Optimize your function to run faster.
+ - Break long operations into smaller chunks.
+
+```python
+config = LiveServerless(
+ name="long-running",
+ executionTimeoutMs=600000 # 10 minutes
+)
+```
+
+### Memory errors
+
+If you're seeing out-of-memory errors:
+
+- **Cause**: Model or data too large for available GPU/CPU memory.
+- **Solutions**:
+ - Use a larger GPU type (e.g., `GpuGroup.AMPERE_80` for 80GB VRAM).
+ - Use model quantization or smaller batch sizes.
+ - Clear GPU memory between operations.
+
+```python
+config = LiveServerless(
+ name="large-model",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ template=PodTemplate(containerDiskInGb=100) # More disk space
+)
+```
+
+### Dependency errors
+
+If packages aren't being installed correctly:
+
+- **Cause**: Missing or incompatible dependencies.
+- **Solutions**:
+ - Verify package names and versions in the `dependencies` list.
+ - Check that packages have Linux `x86_64` wheels available.
+ - Import packages inside the function, not at the top of the file.
+
+```python
+@remote(
+ resource_config=config,
+ dependencies=["torch==2.0.0", "transformers==4.36.0"] # Pin versions
+)
+def my_function(data):
+ import torch # Import inside the function
+ import transformers
+ # ...
+```
+
+### Authentication errors
+
+If you're seeing API key errors:
+
+- **Cause**: Missing or invalid Runpod API key.
+- **Solutions**:
+ - Verify your API key is set in the environment.
+ - Check that the `.env` file is in the correct directory.
+ - Ensure the API key has the required permissions.
+
+```bash
+# Check if API key is set
+echo $RUNPOD_API_KEY
+
+# Set API key directly
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+## Performance optimization
+
+### Reducing cold starts
+
+- Set `workersMin=1` for endpoints that need fast responses.
+- Use `idleTimeout` to balance cost and warm worker availability.
+- Cache models on network volumes to reduce loading time.
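+
+As a sketch, a configuration combining a warm worker with a network volume for cached models (the volume ID is a placeholder):
+
+```python
+from tetra_rp import LiveServerless, GpuGroup
+
+config = LiveServerless(
+    name="warm-cached-models",
+    gpus=[GpuGroup.AMPERE_24],
+    workersMin=1,                  # Keep one worker warm
+    networkVolumeId="vol_abc123",  # Placeholder; models stored here persist across workers
+)
+```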
+
+### Optimizing execution time
+
+- Profile your functions to identify bottlenecks.
+- Use appropriate GPU types for your workload.
+- Batch multiple inputs into a single request when possible.
+- Use async operations to parallelize independent tasks.
+
+### Managing costs
+
+- Set appropriate `workersMax` limits to control scaling.
+- Use CPU workers for non-GPU tasks.
+- Monitor usage in the console to identify optimization opportunities.
+- Use shorter `idleTimeout` for sporadic workloads.
+
+## Endpoint management
+
+As you work with Flash, endpoints accumulate in your Runpod account. To manage them:
+
+1. Go to the [Serverless section](https://www.runpod.io/console/serverless) in the Runpod console.
+2. Review your endpoints and delete unused ones.
+3. Note that a `flash undeploy` command is in development for easier cleanup.
+
+
+
+Endpoints persist until manually deleted through the Runpod console. Regularly clean up unused endpoints to avoid hitting your account's maximum worker capacity limits.
+
+
\ No newline at end of file
diff --git a/flash/overview.mdx b/flash/overview.mdx
new file mode 100644
index 00000000..8845b165
--- /dev/null
+++ b/flash/overview.mdx
@@ -0,0 +1,191 @@
+---
+title: "Overview"
+sidebarTitle: "Overview"
+description: "Rapidly develop and deploy AI/ML apps with the Flash Python SDK."
+tag: "BETA"
+---
+
+
+Flash is currently in beta. [Join our Discord](https://discord.gg/cUpRmau42V) to provide feedback and get support.
+
+
+Flash is a Python SDK for developing and deploying AI workflows on [Runpod Serverless](/serverless/overview). You write Python functions locally, and Flash handles infrastructure management, GPU/CPU provisioning, dependency installation, and data transfer automatically.
+
+There are two ways to run workloads with Flash:
+
+- **Standalone scripts:** Add the `@remote` decorator to Python functions, and they'll run automatically on Runpod's cloud infrastructure when you run the script locally.
+- **API endpoints:** Convert those functions into persistent endpoints that can be accessed via HTTP, scaling GPU/CPU resources automatically based on demand.
+
+Ready to try it out? Check out the quickstart guide and examples repository:
+
+
+
+ Follow the quickstart to create your first Flash function in minutes.
+
+
+
+ Check out our repository of prebuilt Flash applications.
+
+
+
+ Learn about resource configuration, dependencies, and parallel execution.
+
+
+ Build HTTP APIs with FastAPI and Flash.
+
+
+
+
+## Why use Flash?
+
+**Flash is the easiest and fastest way to test and deploy AI/ML workloads on Runpod.** It's designed for local development and live-testing workflows, but can also be used to deploy production-ready applications.
+
+When you run a `@remote` function, Flash:
+- Automatically provisions resources on Runpod's infrastructure.
+- Installs your dependencies automatically.
+- Runs your function on a remote GPU/CPU.
+- Returns the result to your local environment.
+
+You can specify the exact GPU hardware you need, from RTX 4090s to A100 80GB GPUs, for AI inference, training, and other compute-intensive tasks. Functions scale automatically based on demand and can run in parallel across multiple resources.
+
+Flash uses [Runpod's Serverless pricing](/serverless/pricing) with per-second billing. You're only charged for actual compute time; there are no costs when your code isn't running.
+
+
+## Install Flash
+
+Install Flash with `pip`:
+
+```bash
+pip install tetra_rp
+```
+
+In your project directory, create a `.env` file and add your Runpod API key, replacing `YOUR_API_KEY` with your actual API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+## Concepts
+
+### Remote functions
+
+The `@remote` decorator marks functions for execution on Runpod's infrastructure. Code inside the decorated function runs remotely on a Serverless worker, while code outside the function runs locally on your machine.
+
+```python
+@remote(resource_config=config, dependencies=["pandas"])
+def process_data(data):
+ # This code runs remotely on Runpod
+ import pandas as pd
+ df = pd.DataFrame(data)
+ return df.describe().to_dict()
+
+async def main():
+ # This code runs locally
+ result = await process_data(my_data)
+```
+
+### Resource configuration
+
+Flash provides fine-grained control over hardware allocation through configuration objects. You can configure GPU types, worker counts, idle timeouts, environment variables, and more.
+
+```python
+from tetra_rp import LiveServerless, GpuGroup
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ workersMax=5
+)
+```
+
+### Dependency management
+
+Specify Python packages in the decorator, and Flash installs them automatically on the remote worker:
+
+```python
+@remote(
+ resource_config=gpu_config,
+ dependencies=["transformers==4.36.0", "torch", "pillow"]
+)
+def generate_image(prompt):
+ # Import inside the function
+ from transformers import pipeline
+ # ...
+```
+
+Imports should be placed inside the function body because they need to happen on the remote worker, not in your local environment.
+
+### Parallel execution
+
+Run multiple remote functions concurrently using Python's async capabilities:
+
+```python
+results = await asyncio.gather(
+ process_item(item1),
+ process_item(item2),
+ process_item(item3)
+)
+```
+
+## How it works
+
+Flash orchestrates workflow execution through a multi-step process:
+
+1. **Function identification**: The `@remote` decorator marks functions for remote execution, enabling Flash to distinguish between local and remote operations.
+2. **Dependency analysis**: Flash automatically analyzes function dependencies to construct an optimal execution order.
+3. **Resource provisioning and execution**: For each remote function, Flash:
+ - Dynamically provisions endpoint and worker resources on Runpod's infrastructure.
+ - Serializes and securely transfers input data to the remote worker.
+ - Executes the function on the remote infrastructure with the specified GPU or CPU resources.
+ - Returns results to your local environment.
+4. **Data orchestration**: Results flow seamlessly between functions according to your local Python code structure.
+
+## Use cases
+
+Flash is well-suited for a range of AI and data processing workloads:
+
+- **Multi-modal AI pipelines**: Orchestrate unified workflows combining text, image, and audio models with GPU acceleration.
+- **Distributed model training**: Scale training operations across multiple GPU workers for faster model development.
+- **AI research experimentation**: Rapidly prototype and test complex model combinations without infrastructure overhead.
+- **Production inference systems**: Deploy multi-stage inference pipelines for real-world applications.
+- **Data processing workflows**: Process large datasets using CPU workers for general computation and GPU workers for accelerated tasks.
+- **Hybrid GPU/CPU workflows**: Optimize cost and performance by combining CPU preprocessing with GPU inference.
+
+## Development workflow
+
+A typical Flash development workflow looks like this:
+
+1. Write Python functions with the `@remote` decorator.
+2. Specify resource requirements and dependencies in the decorator.
+3. Run your script locally. Flash handles remote execution automatically.
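+
+Put together, a minimal standalone script follows this shape (mirroring the quickstart example):
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from tetra_rp import remote, LiveServerless, GpuGroup
+
+load_dotenv()  # Loads RUNPOD_API_KEY from .env
+
+config = LiveServerless(name="hello-flash", gpus=[GpuGroup.ANY], workersMax=1)
+
+@remote(resource_config=config, dependencies=["numpy"])
+def remote_sum(values):
+    import numpy as np  # Imported on the remote worker
+    return float(np.sum(values))
+
+async def main():
+    result = await remote_sum([1, 2, 3])  # Runs on a Serverless worker
+    print(result)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```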
+
+For API deployments, use `flash init` to create a project, then `flash run` to start your server. For a full walkthrough, see [Create a Flash API endpoint](/flash/api-endpoints).
+
+## Limitations
+
+- Serverless deployments using Flash are currently restricted to the `EU-RO-1` datacenter.
+- Flash is designed primarily for local development and live-testing workflows.
+- Endpoints created by Flash persist until manually deleted through the Runpod console. A `flash undeploy` command is currently in development to clean up unused endpoints.
+- Be aware of your account's maximum worker capacity limits. Flash can rapidly scale workers across multiple endpoints, and you may hit capacity constraints. Contact [Runpod support](https://www.runpod.io/contact) to increase your account's capacity allocation if needed.
+
+## Next steps
+
+
+
+ Get started with your first Flash function.
+
+
+ Complete reference for resource configuration options.
+
+
+
+
+- [View the resource configuration reference](/flash/resource-configuration) for all available options.
+- [Learn about pricing](/flash/pricing) to optimize costs.
+- [Deploy Flash applications](/flash/deploy-apps) for production.
+
+## Getting help
+
+- Join the [Runpod community on Discord](https://discord.gg/cUpRmau42V) for support and discussion.
diff --git a/flash/pricing.mdx b/flash/pricing.mdx
new file mode 100644
index 00000000..f6312c72
--- /dev/null
+++ b/flash/pricing.mdx
@@ -0,0 +1,109 @@
+---
+title: "Pricing"
+sidebarTitle: "Pricing"
+description: "Understand Flash pricing and optimize your costs."
+tag: "BETA"
+---
+
+Flash follows the same pricing model as [Runpod Serverless](/serverless/pricing). You pay per second of compute time, with no charges when your code isn't running. Pricing depends on the GPU or CPU type you configure for your endpoints.
+
+## How pricing works
+
+You're billed from when a worker starts until it completes your request, plus any idle time before scaling down. If a worker is already warm, you skip the cold start and only pay for execution time.
+
+### Compute cost breakdown
+
+Flash workers incur charges during these periods:
+
+1. **Start time**: The time required to initialize a worker and load models into GPU memory. This includes starting the container, installing dependencies, and preparing the runtime environment.
+2. **Execution time**: The time spent processing your request (running your `@remote` decorated function).
+3. **Idle time**: The period a worker remains active after completing a request, waiting for additional requests before scaling down.
+
+### Pricing by resource type
+
+Flash supports both GPU and CPU workers. Pricing varies based on the hardware type:
+
+- **GPU workers**: Use `LiveServerless` or `ServerlessEndpoint` with GPU configurations. Pricing depends on the GPU type (e.g., RTX 4090, A100 80GB).
+- **CPU workers**: Use `LiveServerless` or `CpuServerlessEndpoint` with CPU configurations. Pricing depends on the CPU instance type.
+
+See the [Serverless pricing page](/serverless/pricing) for current rates by GPU and CPU type.
+
+## How to estimate and optimize costs
+
+To estimate costs for your Flash workloads, consider:
+
+- How long each function takes to execute.
+- How many concurrent workers you need (`workersMax` setting).
+- Which GPU or CPU types you'll use.
+- Your idle timeout configuration (`idleTimeout` setting).
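+
+As a back-of-the-envelope sketch (the per-second rate below is a placeholder; see the [Serverless pricing page](/serverless/pricing) for actual rates):
+
+```python
+# Rough cost estimate for a single request; all numbers are illustrative placeholders
+price_per_second = 0.00044  # Placeholder GPU rate in USD per second
+start_time_s = 10           # Cold start (skipped when a worker is already warm)
+execution_time_s = 2        # Time spent running the @remote function
+idle_timeout_s = 5          # Idle period before the worker scales down
+
+cost_per_request = (start_time_s + execution_time_s + idle_timeout_s) * price_per_second
+print(f"~${cost_per_request:.4f} per cold request")  # ~$0.0075
+```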
+
+### Cost optimization strategies
+
+#### Choose appropriate hardware
+
+Select the smallest GPU or CPU that meets your performance requirements. For example, if your workload fits in 24GB of VRAM, use `GpuGroup.ADA_24` or `GpuGroup.AMPERE_24` instead of larger GPUs.
+
+```python
+# Cost-effective configuration for workloads that fit in 24GB VRAM
+config = LiveServerless(
+ name="cost-optimized",
+ gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24], # RTX 4090, L4, A5000, 3090
+)
+```
+
+#### Configure idle timeouts
+
+Balance responsiveness and cost by adjusting the `idleTimeout` parameter. Shorter timeouts reduce idle costs but increase cold starts for sporadic traffic.
+
+```python
+# Lower idle timeout for cost savings (more cold starts)
+config = LiveServerless(
+ name="low-idle",
+ idleTimeout=5, # 5 seconds (default)
+)
+
+# Higher idle timeout for responsiveness (higher idle costs)
+config = LiveServerless(
+ name="responsive",
+ idleTimeout=30, # 30 seconds
+)
+```
+
+#### Use CPU workers for non-GPU tasks
+
+For data preprocessing, postprocessing, or other tasks that don't require GPU acceleration, use CPU workers instead of GPU workers.
+
+```python
+from tetra_rp import LiveServerless, CpuInstanceType
+
+# CPU configuration for non-GPU tasks
+cpu_config = LiveServerless(
+ name="data-processor",
+ instanceIds=[CpuInstanceType.CPU5C_2_4], # 2 vCPU, 4GB RAM
+)
+```
+
+#### Limit maximum workers
+
+Set `workersMax` to prevent runaway scaling and unexpected costs:
+
+```python
+config = LiveServerless(
+ name="controlled-scaling",
+ workersMax=3, # Limit to 3 concurrent workers
+)
+```
+
+### Monitoring costs
+
+Monitor your usage in the [Runpod console](https://www.runpod.io/console/serverless) to track:
+
+- Total compute time across endpoints.
+- Worker utilization and idle time.
+- Cost breakdown by endpoint.
+
+## Next steps
+
+- [Create remote functions](/flash/remote-functions) with optimized resource configurations.
+- [View Serverless pricing details](/serverless/pricing) for current rates.
+- [Configure resources](/flash/resource-configuration) for your workloads.
diff --git a/flash/quickstart.mdx b/flash/quickstart.mdx
new file mode 100644
index 00000000..7bdfc665
--- /dev/null
+++ b/flash/quickstart.mdx
@@ -0,0 +1,325 @@
+---
+title: "Get started with Flash"
+sidebarTitle: "Quickstart"
+description: "Set up your development environment and run your first GPU workload with Flash."
+tag: "BETA"
+---
+
+This tutorial shows you how to set up Flash and run a GPU workload on Runpod Serverless. You'll create a remote function that performs matrix operations on a GPU and returns the results to your local machine.
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Set up your development environment for Flash.
+- Configure a Serverless endpoint using a `LiveServerless` object.
+- Create and define remote functions with the `@remote` decorator.
+- Deploy a GPU-based workload using Runpod resources.
+- Pass data between your local environment and remote workers.
+- Run multiple operations in parallel.
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.9 (or higher)](https://www.python.org/downloads/).
+
+## Step 1: Install Flash
+
+Use `pip` to install Flash:
+
+```bash
+pip install tetra_rp
+```
+
+## Step 2: Add your API key to the environment
+
+Add your Runpod API key to your development environment before using Flash to run workloads.
+
+Run this command to create a `.env` file, replacing `YOUR_API_KEY` with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+
+
+You can create this in your project's root directory or in the `/examples` folder. Make sure your `.env` file is in the same folder as the Python file you create in the next step.
+
+
+
+## Step 3: Create your project file
+
+Create a new file called `matrix_operations.py` in the same directory as your `.env` file:
+
+```bash
+touch matrix_operations.py
+```
+
+Open this file in your code editor. The following steps walk through building a matrix multiplication example that demonstrates Flash's remote execution and parallel processing capabilities.
+
+## Step 4: Add imports and load the .env file
+
+Add the necessary import statements:
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from tetra_rp import remote, LiveServerless, GpuGroup
+
+# Load environment variables from .env file
+load_dotenv()
+```
+
+This imports:
+
+- `asyncio`: Python's asynchronous programming library, which Flash uses for non-blocking execution.
+- `dotenv`: Loads environment variables from your `.env` file, including your Runpod API key.
+- `remote` and `LiveServerless`: The core Flash components for defining remote functions and their resource requirements.
+
+`load_dotenv()` reads your API key from the `.env` file and makes it available to Flash.
+
+## Step 5: Add Serverless endpoint configuration
+
+Define the Serverless endpoint configuration for your Flash workload:
+
+```python
+# Configuration for a Serverless endpoint using GPU workers
+gpu_config = LiveServerless(
+ gpus=[GpuGroup.AMPERE_24, GpuGroup.ADA_24], # Use any 24GB GPU
+ workersMax=3,
+ name="tetra_gpu",
+)
+```
+
+This `LiveServerless` object defines:
+
+- `gpus=[GpuGroup.AMPERE_24, GpuGroup.ADA_24]`: The GPUs that can be used by workers on this endpoint. This restricts workers to using any 24 GB GPU (L4, A5000, 3090, or 4090). See [GPU pools](/references/gpu-types#gpu-pools) for available GPU pool IDs. Removing this parameter allows the endpoint to use any available GPUs.
+- `workersMax=3`: The maximum number of worker instances.
+- `name="tetra_gpu"`: The name of the endpoint that will be created/used in the Runpod console.
+
+If you run a Flash function that uses an identical `LiveServerless` configuration to a prior run, Runpod reuses your existing endpoint rather than creating a new one. If any configuration value changes, not only the `name`, a new endpoint is created.
+
+## Step 6: Define your remote function
+
+Define the function that will run on the GPU worker:
+
+```python
+@remote(
+ resource_config=gpu_config,
+ dependencies=["numpy", "torch"]
+)
+def tetra_matrix_operations(size):
+ """Perform large matrix operations using NumPy and check GPU availability."""
+ import numpy as np
+ import torch
+
+ # Get GPU count and name
+ device_count = torch.cuda.device_count()
+ device_name = torch.cuda.get_device_name(0)
+
+ # Create large random matrices
+ A = np.random.rand(size, size)
+ B = np.random.rand(size, size)
+
+ # Perform matrix multiplication
+ C = np.dot(A, B)
+
+ return {
+ "matrix_size": size,
+ "result_shape": C.shape,
+ "result_mean": float(np.mean(C)),
+ "result_std": float(np.std(C)),
+ "device_count": device_count,
+ "device_name": device_name
+ }
+```
+
+This code demonstrates several key concepts:
+
+- `@remote`: The decorator that marks the function for remote execution on Runpod's infrastructure.
+- `resource_config=gpu_config`: The function runs using the GPU configuration defined earlier.
+- `dependencies=["numpy", "torch"]`: Python packages that must be installed on the remote worker.
+
+The `tetra_matrix_operations` function:
+
+- Gets GPU details using PyTorch's CUDA utilities.
+- Creates two large random matrices using NumPy.
+- Performs matrix multiplication.
+- Returns statistics about the result and information about the GPU.
+
+Notice that `numpy` and `torch` are imported inside the function, not at the top of the file. These imports need to happen on the remote worker, not in your local environment.
+
+## Step 7: Add the main function
+
+Add a `main` function to execute your GPU workload:
+
+```python
+async def main():
+ # Run the GPU matrix operations
+ print("Starting large matrix operations on GPU...")
+ result = await tetra_matrix_operations(1000)
+
+ # Print the results
+ print("\nMatrix operations results:")
+ print(f"Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"Result shape: {result['result_shape']}")
+ print(f"Result mean: {result['result_mean']:.4f}")
+ print(f"Result standard deviation: {result['result_std']:.4f}")
+
+ # Print GPU information
+ print("\nGPU Information:")
+ print(f"GPU device count: {result['device_count']}")
+ print(f"GPU device name: {result['device_name']}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+The `main` function:
+
+- Calls the remote function with `await`, which runs it asynchronously on Runpod's infrastructure.
+- Prints the results of the matrix operations.
+- Displays information about the GPU that was used.
+
+`asyncio.run(main())` is Python's standard way to execute an asynchronous `main` function from synchronous code.
+
+All code outside of the `@remote` decorated function runs on your local machine. The `main` function acts as a bridge between your local environment and Runpod's cloud infrastructure, allowing you to send input data to remote functions, wait for remote execution to complete without blocking your local process, and process returned results locally.
+
+The `await` keyword pauses execution of the `main` function until the remote operation completes, but doesn't block the entire Python process.
+
+## Step 8: Run your GPU example
+
+Run the example:
+
+```bash
+python matrix_operations.py
+```
+
+You should see output similar to this:
+
+```text
+Starting large matrix operations on GPU...
+Resource LiveServerless_33e1fa59c64b611c66c5a778e120c522 already exists, reusing.
+Registering RunPod endpoint: server_LiveServerless_33e1fa59c64b611c66c5a778e120c522 at https://api.runpod.ai/xvf32dan8rcilp
+Initialized RunPod stub for endpoint: https://api.runpod.ai/xvf32dan8rcilp (ID: xvf32dan8rcilp)
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+
+Matrix operations results:
+Matrix size: 1000x1000
+Result shape: (1000, 1000)
+Result mean: 249.8286
+Result standard deviation: 6.8704
+
+GPU Information:
+GPU device count: 1
+GPU device name: NVIDIA GeForce RTX 4090
+```
+
+
+If you're having trouble running your code due to authentication issues:
+
+1. Verify your `.env` file is in the same directory as your `matrix_operations.py` file.
+2. Check that the API key in your `.env` file is correct and properly formatted.
+
+Alternatively, you can set the API key directly in your terminal:
+
+
+
+```bash
+export RUNPOD_API_KEY=[YOUR_API_KEY]
+```
+
+
+```bash
+set RUNPOD_API_KEY=[YOUR_API_KEY]
+```
+
+
+
+
+## Step 9: Understand what's happening
+
+When you run this script:
+
+1. Flash reads your GPU resource configuration and provisions a worker on Runpod.
+2. It installs the required dependencies (NumPy and PyTorch) on the worker.
+3. Your `tetra_matrix_operations` function runs on the remote worker.
+4. The function creates and multiplies large matrices, then calculates statistics.
+5. Your local `main` function receives these results and displays them in your terminal.
+
+## Step 10: Run multiple operations in parallel
+
+Flash makes it easy to run multiple remote operations in parallel.
+
+Replace your `main` function with this code:
+
+```python
+async def main():
+ # Run multiple matrix operations in parallel
+ print("Starting large matrix operations on GPU...")
+
+ # Run all matrix operations in parallel
+ results = await asyncio.gather(
+ tetra_matrix_operations(500),
+ tetra_matrix_operations(1000),
+ tetra_matrix_operations(2000)
+ )
+
+ print("\nMatrix operations results:")
+
+ # Print the results for each matrix size
+ for result in results:
+ print(f"\nMatrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"Result shape: {result['result_shape']}")
+ print(f"Result mean: {result['result_mean']:.4f}")
+ print(f"Result standard deviation: {result['result_std']:.4f}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+This updated `main` function demonstrates Flash's ability to run multiple operations in parallel using `asyncio.gather()`. Instead of running one matrix operation at a time, you're launching three operations with different matrix sizes (500, 1000, and 2000) simultaneously. This parallel execution significantly improves efficiency when you have multiple independent tasks.
+
+Run the example again:
+
+```bash
+python matrix_operations.py
+```
+
+You should see results for all three matrix sizes after the operations complete:
+
+```text
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Job completed, output received
+Job completed, output received
+Job completed, output received
+
+Matrix size: 500x500
+Result shape: (500, 500)
+Result mean: 125.3097
+Result standard deviation: 5.0425
+
+Matrix size: 1000x1000
+Result shape: (1000, 1000)
+Result mean: 249.9442
+Result standard deviation: 7.1072
+
+Matrix size: 2000x2000
+Result shape: (2000, 2000)
+Result mean: 500.1321
+Result standard deviation: 9.8879
+```
+
+## Next steps
+
+You've successfully used Flash to run a GPU workload on Runpod. Now you can:
+
+- [Create more complex remote functions](/flash/remote-functions) with custom dependencies and resource configurations.
+- [Build API endpoints](/flash/api-endpoints) using FastAPI.
+- [Deploy Flash applications](/flash/deploy-apps) for production use.
+- Explore more examples on the [runpod/flash-examples](https://github.com/runpod/flash-examples) GitHub repository.
diff --git a/flash/remote-functions.mdx b/flash/remote-functions.mdx
new file mode 100644
index 00000000..b8cc20bf
--- /dev/null
+++ b/flash/remote-functions.mdx
@@ -0,0 +1,262 @@
+---
+title: "Create remote functions"
+sidebarTitle: "Create remote functions"
+description: "Learn how to create and configure remote functions with Flash."
+tag: "BETA"
+---
+
+Remote functions are the core building blocks of Flash. The `@remote` decorator marks Python functions for execution on Runpod's Serverless infrastructure, handling resource provisioning, dependency installation, and data transfer automatically.
+
+## Resource configuration
+
+Every remote function requires a resource configuration that specifies the compute resources to use. Flash provides several configuration classes for different use cases.
+
+### LiveServerless
+
+`LiveServerless` is the primary configuration class for Flash. It supports full remote code execution, allowing you to run arbitrary Python functions on Runpod's infrastructure.
+
+```python
+from tetra_rp import LiveServerless, GpuGroup
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ workersMax=5,
+ idleTimeout=10
+)
+
+@remote(resource_config=gpu_config, dependencies=["torch"])
+def run_inference(data):
+ import torch
+ # Your inference code here
+ return result
+```
+
+Common configuration options:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `name` | Name for your endpoint (required) | - |
+| `gpus` | GPU pool IDs that can be used | `[GpuGroup.ANY]` |
+| `workersMax` | Maximum number of workers | 3 |
+| `workersMin` | Minimum number of workers | 0 |
+| `idleTimeout` | Seconds before scaling down | 5 |
+
+See the [resource configuration reference](/flash/resource-configuration) for all available options.
+
+### CPU configuration
+
+For CPU-only workloads, specify `instanceIds` instead of `gpus`:
+
+```python
+from tetra_rp import LiveServerless, CpuInstanceType
+
+cpu_config = LiveServerless(
+ name="data-processor",
+ instanceIds=[CpuInstanceType.CPU5C_4_8], # 4 vCPU, 8GB RAM
+ workersMax=3
+)
+
+@remote(resource_config=cpu_config, dependencies=["pandas"])
+def process_data(data):
+ import pandas as pd
+ df = pd.DataFrame(data)
+ return df.describe().to_dict()
+```
+
+## Dependency management
+
+Specify Python packages in the `dependencies` parameter of the `@remote` decorator. Flash installs these packages on the remote worker before executing your function.
+
+```python
+@remote(
+ resource_config=config,
+ dependencies=["transformers==4.36.0", "torch", "pillow"]
+)
+def generate_image(prompt):
+ from transformers import pipeline
+ import torch
+ from PIL import Image
+ # Your code here
+```
+
+### Important notes about dependencies
+
+**Import inside the function**: Always import packages inside the decorated function body, not at the top of your file. These imports need to happen on the remote worker, not in your local environment.
+
+```python
+# Correct - imports inside the function
+@remote(resource_config=config, dependencies=["numpy"])
+def compute(data):
+ import numpy as np # Import here
+ return np.sum(data)
+
+# Incorrect - imports at top of file won't work
+import numpy as np # This import happens locally, not on the worker
+
+@remote(resource_config=config, dependencies=["numpy"])
+def compute(data):
+ return np.sum(data) # numpy not available on worker
+```
+
+**Version pinning**: You can pin specific versions using standard pip syntax:
+
+```python
+dependencies=["transformers==4.36.0", "torch>=2.0.0"]
+```
+
+**Pre-installed packages**: Some packages (like PyTorch) are pre-installed on GPU workers. Including them in dependencies ensures the correct version is available.
+
+## Parallel execution
+
+Flash functions are asynchronous by default. Use Python's `asyncio` to run multiple functions in parallel:
+
+```python
+import asyncio
+
+async def main():
+ # Run three functions in parallel
+ results = await asyncio.gather(
+ process_item(item1),
+ process_item(item2),
+ process_item(item3)
+ )
+ return results
+```
+
+This is particularly useful for:
+
+- Batch processing multiple inputs.
+- Running different models on the same data.
+- Parallelizing independent pipeline stages.
+
+### Example: Parallel batch processing
+
+```python
+import asyncio
+from tetra_rp import remote, LiveServerless, GpuGroup
+
+config = LiveServerless(
+ name="batch-processor",
+ gpus=[GpuGroup.ADA_24],
+ workersMax=5 # Allow up to 5 parallel workers
+)
+
+@remote(resource_config=config, dependencies=["torch"])
+def process_batch(batch_id, data):
+ import torch
+ # Process batch
+ return {"batch_id": batch_id, "result": len(data)}
+
+async def main():
+ batches = [
+ (1, [1, 2, 3]),
+ (2, [4, 5, 6]),
+ (3, [7, 8, 9])
+ ]
+
+ # Process all batches in parallel
+ results = await asyncio.gather(*[
+ process_batch(batch_id, data)
+ for batch_id, data in batches
+ ])
+
+ print(results)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Custom Docker images
+
+For specialized environments that require a custom Docker image, use `ServerlessEndpoint` or `CpuServerlessEndpoint` instead of `LiveServerless`:
+
+```python
+from tetra_rp import ServerlessEndpoint, GpuGroup
+
+custom_gpu = ServerlessEndpoint(
+ name="custom-ml-env",
+ imageName="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime",
+ gpus=[GpuGroup.AMPERE_80]
+)
+```
+
+
+
+Unlike `LiveServerless`, `ServerlessEndpoint` and `CpuServerlessEndpoint` only support dictionary payloads in the form of `{"input": {...}}` (similar to a traditional [Serverless endpoint request](/serverless/endpoints/send-requests)). They cannot execute arbitrary Python functions remotely.
+
+
+
+Use custom Docker images when you need:
+
+- Pre-installed system-level dependencies.
+- Specific CUDA or cuDNN versions.
+- Custom base images with large models baked in.
+
+## Using persistent storage
+
+Attach [network volumes](/storage/network-volumes) for persistent storage across workers and endpoints. This is useful for sharing large models or datasets between workers without downloading them each time.
+
+```python
+config = LiveServerless(
+ name="model-server",
+ networkVolumeId="vol_abc123", # Your network volume ID
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+To find your network volume ID:
+
+1. Go to the [Storage page](https://www.runpod.io/console/storage) in the Runpod console.
+2. Click on your network volume.
+3. Copy the volume ID from the URL or volume details.
+
+### Example: Using a network volume for model storage
+
+```python
+from tetra_rp import LiveServerless, GpuGroup, PodTemplate
+
+config = LiveServerless(
+ name="model-inference",
+ gpus=[GpuGroup.AMPERE_80],
+ networkVolumeId="vol_abc123",
+ template=PodTemplate(containerDiskInGb=100)
+)
+
+@remote(resource_config=config, dependencies=["torch", "transformers"])
+def run_inference(prompt):
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model from network volume
+ model_path = "/runpod-volume/models/llama-7b"
+ model = AutoModelForCausalLM.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Run inference
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs)
+ return tokenizer.decode(outputs[0])
+```
+
+## Environment variables
+
+Pass environment variables to remote functions using the `env` parameter:
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ env={"HF_TOKEN": "your_token", "MODEL_ID": "gpt2"}
+)
+```
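+
+Inside the remote function, these values are available as ordinary environment variables on the worker, for example via `os.environ` (a minimal sketch):
+
+```python
+@remote(resource_config=config)
+def show_env():
+    import os
+    # Values passed via the `env` parameter are set in the worker's environment
+    return {"model_id": os.environ.get("MODEL_ID"), "has_token": "HF_TOKEN" in os.environ}
+```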
+
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, which allows different processes to load environment variables from `.env` files without causing false drift detection.
+
+
+
+## Next steps
+
+- [Create API endpoints](/flash/api-endpoints) using FastAPI.
+- [Deploy Flash applications](/flash/deploy-apps) for production.
+- [View the resource configuration reference](/flash/resource-configuration) for all available options.
diff --git a/flash/resource-configuration.mdx b/flash/resource-configuration.mdx
new file mode 100644
index 00000000..4623953e
--- /dev/null
+++ b/flash/resource-configuration.mdx
@@ -0,0 +1,269 @@
+---
+title: "Resource configuration reference"
+sidebarTitle: "Configuration reference"
+description: "Complete reference for Flash resource configuration options."
+tag: "BETA"
+---
+
+Flash provides several resource configuration classes for different use cases. This reference covers all available parameters and options.
+
+## LiveServerless
+
+`LiveServerless` is the primary configuration class for Flash. It supports full remote code execution, allowing you to run arbitrary Python functions on Runpod's infrastructure.
+
+```python
+from tetra_rp import LiveServerless, GpuGroup, CpuInstanceType, PodTemplate
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80],
+ workersMax=5,
+ idleTimeout=10,
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `name` | `string` | Name for your endpoint (required) | - |
+| `gpus` | `list[GpuGroup]` | GPU pool IDs that can be used by workers | `[GpuGroup.ANY]` |
+| `gpuCount` | `int` | Number of GPUs per worker | 1 |
+| `instanceIds` | `list[CpuInstanceType]` | CPU instance types (forces CPU endpoint) | `None` |
+| `workersMin` | `int` | Minimum number of workers | 0 |
+| `workersMax` | `int` | Maximum number of workers | 3 |
+| `idleTimeout` | `int` | Seconds before scaling down | 5 |
+| `env` | `dict` | Environment variables | `None` |
+| `networkVolumeId` | `string` | Persistent storage volume ID | `None` |
+| `executionTimeoutMs` | `int` | Max execution time in milliseconds | 0 (no limit) |
+| `scalerType` | `string` | Scaling strategy | `QUEUE_DELAY` |
+| `scalerValue` | `int` | Scaling parameter value | 4 |
+| `locations` | `string` | Preferred datacenter locations | `None` |
+| `template` | `PodTemplate` | Pod template overrides | `None` |
+
+### GPU configuration example
+
+```python
+from tetra_rp import LiveServerless, GpuGroup, PodTemplate
+
+config = LiveServerless(
+ name="gpu-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ gpuCount=1,
+ workersMin=0,
+ workersMax=5,
+ idleTimeout=10,
+ template=PodTemplate(containerDiskInGb=100),
+ env={"MODEL_ID": "llama-7b"}
+)
+```
+
+### CPU configuration example
+
+```python
+from tetra_rp import LiveServerless, CpuInstanceType
+
+config = LiveServerless(
+ name="cpu-processor",
+ instanceIds=[CpuInstanceType.CPU5C_4_8], # 4 vCPU, 8GB RAM
+ workersMax=3,
+ idleTimeout=5
+)
+```
+
+## ServerlessEndpoint
+
+`ServerlessEndpoint` is for GPU workloads that require custom Docker images. Unlike `LiveServerless`, it only supports dictionary payloads and cannot execute arbitrary Python functions.
+
+```python
+from tetra_rp import ServerlessEndpoint, GpuGroup
+
+config = ServerlessEndpoint(
+ name="custom-ml-env",
+ imageName="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime",
+ gpus=[GpuGroup.AMPERE_80]
+)
+```
+
+### Parameters
+
+All parameters from `LiveServerless` are available, plus:
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `imageName` | `string` | Custom Docker image | - |
+
+### Limitations
+
+- Only supports dictionary payloads in the form of `{"input": {...}}`.
+- Cannot execute arbitrary Python functions remotely.
+- Requires a custom Docker image with a handler that processes the input dictionary.
+
+### Example
+
+```python
+from tetra_rp import ServerlessEndpoint, GpuGroup
+
+# Custom image with pre-installed models
+config = ServerlessEndpoint(
+ name="stable-diffusion",
+ imageName="my-registry/stable-diffusion:v1.0",
+ gpus=[GpuGroup.AMPERE_24],
+ workersMax=3
+)
+
+# Send requests as dictionaries
+result = await config.run({
+ "input": {
+ "prompt": "A beautiful sunset over mountains",
+ "width": 512,
+ "height": 512
+ }
+})
+```
+
+## CpuServerlessEndpoint
+
+`CpuServerlessEndpoint` is for CPU workloads that require custom Docker images. Like `ServerlessEndpoint`, it only supports dictionary payloads.
+
+```python
+from tetra_rp import CpuServerlessEndpoint, CpuInstanceType
+
+config = CpuServerlessEndpoint(
+ name="data-processor",
+ imageName="python:3.11-slim",
+ instanceIds=[CpuInstanceType.CPU5C_4_8]
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `name` | `string` | Name for your endpoint (required) | - |
+| `imageName` | `string` | Custom Docker image | - |
+| `instanceIds` | `list[CpuInstanceType]` | CPU instance types | - |
+| `workersMin` | `int` | Minimum number of workers | 0 |
+| `workersMax` | `int` | Maximum number of workers | 3 |
+| `idleTimeout` | `int` | Seconds before scaling down | 5 |
+| `env` | `dict` | Environment variables | `None` |
+| `networkVolumeId` | `string` | Persistent storage volume ID | `None` |
+| `executionTimeoutMs` | `int` | Max execution time in milliseconds | 0 (no limit) |
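+
+Requests follow the same dictionary-payload pattern as `ServerlessEndpoint` (a sketch, assuming the same `run` interface):
+
+```python
+# Send a dictionary payload to the CPU endpoint
+result = await config.run({
+    "input": {
+        "operation": "summarize",
+        "text": "Some long document..."
+    }
+})
+```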
+
+## Resource class comparison
+
+| Feature | LiveServerless | ServerlessEndpoint | CpuServerlessEndpoint |
+|---------|----------------|--------------------|-----------------------|
+| Remote code execution | ✅ Full Python function execution | ❌ Dictionary payload only | ❌ Dictionary payload only |
+| Custom Docker images | ❌ Fixed optimized images | ✅ Any Docker image | ✅ Any Docker image |
+| Use case | Dynamic remote functions | Traditional API endpoints | Traditional CPU endpoints |
+| Function returns | Any Python object | Dictionary only | Dictionary only |
+| `@remote` decorator | Full functionality | Limited to payload passing | Limited to payload passing |
+
+## Available GPU types
+
+The `GpuGroup` enum provides access to GPU pools. Some common options:
+
+| GpuGroup | Description | VRAM |
+|----------|-------------|------|
+| `GpuGroup.ANY` | Any available GPU (default) | Varies |
+| `GpuGroup.ADA_24` | RTX 4090 | 24GB |
+| `GpuGroup.AMPERE_24` | RTX A5000, L4, RTX 3090 | 24GB |
+| `GpuGroup.AMPERE_48` | A40, RTX A6000 | 48GB |
+| `GpuGroup.AMPERE_80` | A100 80GB | 80GB |
+
+See [GPU types](/references/gpu-types#gpu-pools) for the complete list of available GPU pools.
+
+## Available CPU instance types
+
+The `CpuInstanceType` enum provides access to CPU configurations:
+
+### 3rd generation general purpose
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU3G_1_4` | cpu3g-1-4 | 1 | 4GB |
+| `CPU3G_2_8` | cpu3g-2-8 | 2 | 8GB |
+| `CPU3G_4_16` | cpu3g-4-16 | 4 | 16GB |
+| `CPU3G_8_32` | cpu3g-8-32 | 8 | 32GB |
+
+### 3rd generation compute-optimized
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU3C_1_2` | cpu3c-1-2 | 1 | 2GB |
+| `CPU3C_2_4` | cpu3c-2-4 | 2 | 4GB |
+| `CPU3C_4_8` | cpu3c-4-8 | 4 | 8GB |
+| `CPU3C_8_16` | cpu3c-8-16 | 8 | 16GB |
+
+### 5th generation compute-optimized
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU5C_1_2` | cpu5c-1-2 | 1 | 2GB |
+| `CPU5C_2_4` | cpu5c-2-4 | 2 | 4GB |
+| `CPU5C_4_8` | cpu5c-4-8 | 4 | 8GB |
+| `CPU5C_8_16` | cpu5c-8-16 | 8 | 16GB |
+
+## PodTemplate
+
+Use `PodTemplate` to configure additional pod settings:
+
+```python
+from tetra_rp import LiveServerless, PodTemplate
+
+config = LiveServerless(
+ name="custom-template",
+ template=PodTemplate(
+ containerDiskInGb=100,
+ env=[{"key": "PYTHONPATH", "value": "/workspace"}]
+ )
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `containerDiskInGb` | `int` | Container disk size in GB | 20 |
+| `env` | `list[dict]` | Environment variables as key-value pairs | `None` |
+
+## Environment variables
+
+Environment variables can be set in two ways:
+
+### Using the `env` parameter
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ env={"HF_TOKEN": "your_token", "MODEL_ID": "gpt2"}
+)
+```
+
+### Using PodTemplate
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ template=PodTemplate(
+ env=[
+ {"key": "HF_TOKEN", "value": "your_token"},
+ {"key": "MODEL_ID", "value": "gpt2"}
+ ]
+ )
+)
+```
+
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, which allows different processes to load environment variables from `.env` files without causing false drift detection. Only structural changes (like GPU type, image, or template modifications) trigger endpoint updates.
+
+
+
+## Next steps
+
+- [Create remote functions](/flash/remote-functions) using these configurations.
+- [Deploy Flash applications](/flash/deploy-apps) for production.
+- [Learn about pricing](/flash/pricing) to optimize costs.