2 changes: 1 addition & 1 deletion docs/en/advanced/pd-disaggregation.md
@@ -1,5 +1,5 @@
# PD Disaggregation

-Miles supports Prefill and Decode disaggregation (PD Disaggregation).
+miles supports Prefill and Decode disaggregation (PD Disaggregation).

You can set the number of servers used for Prefill by setting the `--prefill-num-servers` argument.
2 changes: 1 addition & 1 deletion docs/en/examples/glm4-9B.md
@@ -8,7 +8,7 @@ After pulling the `radixark/miles:latest` image, initialize the image environmen
cd /root/
git clone https://github.com/radixark/miles.git
cd miles/
-pip install -e .
+pip install -e . --no-deps
```

Download the model and data:
2 changes: 1 addition & 1 deletion docs/en/examples/qwen3-30B-A3B.md
@@ -9,7 +9,7 @@ To convert huggingface checkpoint to torch_dist, please try:

```bash
cd miles/
-pip install -e .
+pip install -e . --no-deps
source scripts/models/qwen3-30B-A3B.sh
PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
2 changes: 1 addition & 1 deletion docs/en/examples/qwen3-4B.md
@@ -8,7 +8,7 @@ After pulling the `radixark/miles:latest` image, initialize the image environmen
cd /root/
git clone https://github.com/radixark/miles.git
cd miles/
-pip install -e .
+pip install -e . --no-deps
```

Download the model and data:
2 changes: 1 addition & 1 deletion docs/en/get_started/quick_start.md
@@ -45,7 +45,7 @@ miles is already installed in the docker image. To update to the latest verison,
# Path can be adjusted according to actual situation
cd /root/miles
git pull
-pip install -e .
+pip install -e . --no-deps
```

## Model and Dataset Download
4 changes: 2 additions & 2 deletions docs/en/get_started/usage.md
@@ -187,7 +187,7 @@ Additionally, we provide a `metadata_key`, which defaults to `"metadata"`. When
- `reinforce_plus_plus` and `reinforce_plus_plus_baseline` ([https://arxiv.org/abs/2501.03262](https://arxiv.org/abs/2501.03262))
- `ppo` ([https://arxiv.org/abs/1707.06347](https://arxiv.org/abs/1707.06347))
- `on_policy_distillation`
-- `--calculate-per-token-loss`: By default, Miles calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`.
+- `--calculate-per-token-loss`: By default, miles calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`.
- `--use-tis`: Enable this setting to use TIS (Truncated Importance Sampling) (https://fengyao.notion.site/off-policy-rl).
- `--true-on-policy-mode`: Enable True On-Policy mode, which strictly ensures that data is generated by the current policy during training.
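
To make the difference between the two reductions in `--calculate-per-token-loss` concrete, here is a minimal sketch of both formulas over a toy batch; this is illustrative code, not miles internals, and the tensor names are assumptions:

```python
import torch

# Per-token losses for two variable-length samples (illustrative values).
sample_losses = [
    torch.tensor([0.5, 0.7, 0.9]),        # sample 0: 3 tokens
    torch.tensor([0.2, 0.4, 0.6, 0.8]),   # sample 1: 4 tokens
]

# Default (per-sample): mean(sum(sample_i) / len(sample_i))
per_sample = torch.stack([s.sum() / s.numel() for s in sample_losses]).mean()

# With --calculate-per-token-loss: sum(sum(sample_i)) / sum(len(sample_i))
per_token = sum(s.sum() for s in sample_losses) / sum(s.numel() for s in sample_losses)

print(per_sample.item(), per_token.item())  # 0.6 vs. ~0.586: longer samples weigh more per-token
```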

@@ -374,7 +374,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \
# Clone code and install dependencies
git clone https://github.com/radixark/miles.git
cd miles
-pip install -e .
+pip install -e . --no-deps


# FSDP does not require weight conversion, natively supports huggingface format
4 changes: 2 additions & 2 deletions docs/en/platform_support/amd_tutorial.md
@@ -54,7 +54,7 @@ Then, download and install miles:
```bash
git clone https://github.com/radixark/miles.git
cd miles
-pip install -e .
+pip install -e . --no-deps
```

Download the model and data:
@@ -93,7 +93,7 @@ PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \

Note: We implemented a dedicated AMD conversion script that forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development.
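
The essence of that CPU-only workflow is initializing `torch.distributed` with the Gloo backend and keeping all checkpoint tensors on CPU. A minimal sketch of the idea (not the actual miles script; the setup values here are placeholders):

```python
import os
import torch
import torch.distributed as dist

def init_cpu_only_gloo() -> None:
    """Start a single-process Gloo group so no ROCm/CUDA kernels are involved."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Gloo runs its collectives on CPU, sidestepping GPU-specific conversion issues.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

if __name__ == "__main__":
    init_cpu_only_gloo()
    # Checkpoint tensors would be loaded with map_location="cpu" and re-saved here.
    x = torch.ones(4)
    dist.all_reduce(x)  # works on CPU tensors under the Gloo backend
    print(x)
    dist.destroy_process_group()
```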

-⚠️ If you encounter an issue where miles cannot be found, please run `pip install -e .` in the miles directory.
+⚠️ If you encounter an issue where miles cannot be found, please run `pip install -e . --no-deps` in the miles directory.


### Example: Qwen3-4B
2 changes: 1 addition & 1 deletion examples/README.md
@@ -1,6 +1,6 @@
# Examples

-These examples provide concrete examples to leverage Miles in your own RL workflow. Some examples are just demonstrative, but most of them are verifiable with a concrete performance score.
+These examples provide concrete examples to leverage miles in your own RL workflow. Some examples are just demonstrative, but most of them are verifiable with a concrete performance score.

## Directory Structure

24 changes: 12 additions & 12 deletions examples/eval/README.md
@@ -4,10 +4,10 @@ This directory contains configuration and utilities for offloading complex evalu

## Overview

-The setup allows Miles to delegate evaluation tasks to a dedicated "Skills" server. This creates a clear separation of concerns:
+The setup allows miles to delegate evaluation tasks to a dedicated "Skills" server. This creates a clear separation of concerns:

-1. **Miles Container**: Runs the main training loop and hosts the model using SGLang.
-2. **Skills Container**: Hosts the `nemo_skills` environment, runs the evaluation logic, and queries the model running in the Miles container.
+1. **miles Container**: Runs the main training loop and hosts the model using SGLang.
+2. **Skills Container**: Hosts the `nemo_skills` environment, runs the evaluation logic, and queries the model running in the miles container.

## Prerequisites

@@ -18,15 +18,15 @@ The setup allows Miles to delegate evaluation tasks to a dedicated "Skills" serv

### Prepare Host Network

-Create a Docker network to allow communication between the Miles and Skills containers.
+Create a Docker network to allow communication between the miles and Skills containers.

```bash
docker network create skills-net
```

-### Launch the Miles Container
+### Launch the miles Container

-Start the main container where Miles and the model will run. Replace `<miles container name>` with your desired name (e.g., `miles_main`).
+Start the main container where miles and the model will run. Replace `<miles container name>` with your desired name (e.g., `miles_main`).

```bash
docker run \
@@ -76,7 +76,7 @@ git clone -b miles https://github.com/guapisolo/Skills.git /opt/Skills

# Install Skills package
cd /opt/Skills
-pip install -e .
+pip install -e . --no-deps
```

**b) Prepare Datasets**
@@ -92,7 +92,7 @@ python3 arena-hard/prepare.py

**c) Start the Evaluation Server**

-Start the server that listens for evaluation requests from Miles.
+Start the server that listens for evaluation requests from miles.

```bash
cd /opt/miles
@@ -105,20 +105,20 @@ python examples/eval/nemo_skills/skills_server.py \
--max-concurrent-requests 512 \
--openai-model-name miles-openai-model
```
-*Note: You can now connect to the server at `skills_server:9050` from within the `skills-net` Docker network. The server always proxies evaluation traffic to an OpenAI-compatible sglang router (Miles starts and manage the router), so adjust `--openai-model-name` and `--max-concurrent-requests` as needed for your deployment.
+*Note: You can now connect to the server at `skills_server:9050` from within the `skills-net` Docker network. The server always proxies evaluation traffic to an OpenAI-compatible sglang router (miles starts and manage the router), so adjust `--openai-model-name` and `--max-concurrent-requests` as needed for your deployment.

## Running Evaluation

The example scripts are located in `examples/eval/scripts`. Here is an example workflow for training Qwen3-4B with delegated evaluation.

-### Prepare Miles Container
+### Prepare miles Container

-Enter the **Miles container** and install the package.
+Enter the **miles container** and install the package.

```bash
cd /root/miles
git pull
-pip install -e .
+pip install -e . --no-deps
```

### Download Model and Data
2 changes: 1 addition & 1 deletion examples/low_precision/README.md
@@ -89,7 +89,7 @@ This guide provides examples for INT4 STE (Straight-Through Estimator) training
First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace:
[https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1)

-Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. Miles will automatically utilize INT4 quantization during weight updates.
+Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. miles will automatically utilize INT4 quantization during weight updates.

```bash
python tools/convert_hf_to_hf_int4.py \
4 changes: 2 additions & 2 deletions examples/on_policy_distillation/README.md
@@ -1,6 +1,6 @@
# On-Policy Distillation Example

-This example shows how to run **on-policy distillation** using Miles. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities.
+This example shows how to run **on-policy distillation** using miles. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities.

In this example, the teacher model acts as a reward model (RM) by providing teacher log probabilities as the supervision signal.

@@ -50,7 +50,7 @@ Using Qwen3-8B-Base model sfted on part of the [OpenThoughts3-1.2M](https://hugg

# FAQ
1. **Why are teacher logits computed via a sglang server instead of inside the training backend?**
-The teacher runs on an independent SGLang server that Miles treats as a reward model. Hosting it inside Megatron/FSDP would require maintaining a second, fully configured training stack for the teacher.
+The teacher runs on an independent SGLang server that miles treats as a reward model. Hosting it inside Megatron/FSDP would require maintaining a second, fully configured training stack for the teacher.


# References
2 changes: 1 addition & 1 deletion examples/retool/README.md
@@ -21,7 +21,7 @@ The retool example provides:
1. Setup and download datasets:
```bash
cd miles
-pip install -e .
+pip install -e . --no-deps
# For SFT part, you can use later model to RL directly and skip SFT.
hf download --repo-type dataset JoeYing/ReTool-SFT --local-dir /root/JoeYing/ReTool-SFT
hf download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen/Qwen3-4B-Instruct-2507
2 changes: 1 addition & 1 deletion examples/search-r1/README.md
@@ -9,7 +9,7 @@ Use the `radixark/miles:latest` image and initialize the environment required fo
```bash
cd /root/
git clone https://github.com/radixark/miles.git
-pip install -e .
+pip install -e . --no-deps
# for Search R1
pip install chardet
```
4 changes: 2 additions & 2 deletions examples/strands_sglang/README.md
@@ -1,4 +1,4 @@
-# Miles x Strands-SGLang
+# miles x Strands-SGLang

This example connects `miles` with [`strands-sglang`](https://github.com/horizon-rl/strands-sglang) (SGLang extension for the agentic scaffolding [`strands`](https://github.com/strands-agents/sdk-python)) for agentic RL training.

@@ -20,7 +20,7 @@ This example connects `miles` with [`strands-sglang`](https://github.com/horizon

1. Pull the `radixark/miles:latest` image and enter it
2. Go to miles folder: `cd /root/miles`
-3. Install Miles: `pip install -e .`
+3. Install miles: `pip install -e . --no-deps`
4. Go to the example folder: `cd /root/miles/examples/strands_sglang`
5. Install other dependencies: `pip install -r requirements.txt`

4 changes: 2 additions & 2 deletions examples/tau-bench/README.md
@@ -9,13 +9,13 @@ Use the `zhuzilin/miles:latest` image and initialize the environment required fo
cd /root/
git clone https://github.com/radixark/miles.git
cd miles
-pip install -e .
+pip install -e . --no-deps
# for tau bench
cd /root/
git clone https://github.com/JD-ETH/tau-bench.git
cd tau-bench
git checkout feature/litellm-retry
-pip install -e .
+pip install -e . --no-deps
```

Use the following script to generate mock data for miles training.