diff --git a/docs/en/advanced/pd-disaggregation.md b/docs/en/advanced/pd-disaggregation.md index a4bca69bb..f509d72d8 100644 --- a/docs/en/advanced/pd-disaggregation.md +++ b/docs/en/advanced/pd-disaggregation.md @@ -1,5 +1,5 @@ # PD Disaggregation -Miles supports Prefill and Decode disaggregation (PD Disaggregation). +miles supports Prefill and Decode disaggregation (PD Disaggregation). You can set the number of servers used for Prefill by setting the `--prefill-num-servers` argument. diff --git a/docs/en/examples/glm4-9B.md b/docs/en/examples/glm4-9B.md index 8f9bff6e7..36629568f 100644 --- a/docs/en/examples/glm4-9B.md +++ b/docs/en/examples/glm4-9B.md @@ -8,7 +8,7 @@ After pulling the `radixark/miles:latest` image, initialize the image environmen cd /root/ git clone https://github.com/radixark/miles.git cd miles/ -pip install -e . +pip install -e . --no-deps ``` Download the model and data: diff --git a/docs/en/examples/qwen3-30B-A3B.md b/docs/en/examples/qwen3-30B-A3B.md index 35be2773a..2f731d461 100644 --- a/docs/en/examples/qwen3-30B-A3B.md +++ b/docs/en/examples/qwen3-30B-A3B.md @@ -9,7 +9,7 @@ To convert huggingface checkpoint to torch_dist, please try: ```bash cd miles/ -pip install -e . +pip install -e . --no-deps source scripts/models/qwen3-30B-A3B.sh PYTHONPATH=/root/Megatron-LM/ torchrun --nproc-per-node 8 \ tools/convert_hf_to_torch_dist.py \ diff --git a/docs/en/examples/qwen3-4B.md b/docs/en/examples/qwen3-4B.md index d66df38f3..8374de4be 100644 --- a/docs/en/examples/qwen3-4B.md +++ b/docs/en/examples/qwen3-4B.md @@ -8,7 +8,7 @@ After pulling the `radixark/miles:latest` image, initialize the image environmen cd /root/ git clone https://github.com/radixark/miles.git cd miles/ -pip install -e . +pip install -e . 
--no-deps ``` Download the model and data: diff --git a/docs/en/get_started/quick_start.md b/docs/en/get_started/quick_start.md index 08c41caf2..180075ed3 100644 --- a/docs/en/get_started/quick_start.md +++ b/docs/en/get_started/quick_start.md @@ -45,7 +45,7 @@ miles is already installed in the docker image. To update to the latest version, ```bash # Path can be adjusted according to actual situation cd /root/miles git pull -pip install -e . +pip install -e . --no-deps ``` ## Model and Dataset Download diff --git a/docs/en/get_started/usage.md b/docs/en/get_started/usage.md index 9738131c7..97f6449fe 100644 --- a/docs/en/get_started/usage.md +++ b/docs/en/get_started/usage.md @@ -187,7 +187,7 @@ Additionally, we provide a `metadata_key`, which defaults to `"metadata"`. When - `reinforce_plus_plus` and `reinforce_plus_plus_baseline` ([https://arxiv.org/abs/2501.03262](https://arxiv.org/abs/2501.03262)) - `ppo` ([https://arxiv.org/abs/1707.06347](https://arxiv.org/abs/1707.06347)) - `on_policy_distillation` -- `--calculate-per-token-loss`: By default, Miles calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`. +- `--calculate-per-token-loss`: By default, miles calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`. - `--use-tis`: Enable this setting to use TIS (Truncated Importance Sampling) (https://fengyao.notion.site/off-policy-rl). - `--true-on-policy-mode`: Enable True On-Policy mode, which strictly ensures that data is generated by the current policy during training. @@ -374,7 +374,7 @@ hf download --repo-type dataset zhuzilin/aime-2024 \ # Clone code and install dependencies git clone https://github.com/radixark/miles.git cd miles -pip install -e . +pip install -e . 
--no-deps # FSDP does not require weight conversion, natively supports huggingface format diff --git a/docs/en/platform_support/amd_tutorial.md b/docs/en/platform_support/amd_tutorial.md index 53df91b09..c22fdb2ae 100644 --- a/docs/en/platform_support/amd_tutorial.md +++ b/docs/en/platform_support/amd_tutorial.md @@ -54,7 +54,7 @@ Then, download and install miles: ```bash git clone https://github.com/radixark/miles.git cd miles -pip install -e . +pip install -e . --no-deps ``` Download the model and data: @@ -93,7 +93,7 @@ PYTHONPATH=${MEGATRON_LM_PATH} python tools/convert_hf_to_torch_dist.py \ Note: We implemented a dedicated AMD conversion script that forces a CPU-only conversion workflow using the Gloo backend to bypass hardware-specific issues. A GPU-based script for ROCm is currently in development. -⚠️ If you encounter an issue where miles cannot be found, please run `pip install -e .` in the miles directory. +⚠️ If you encounter an issue where miles cannot be found, please run `pip install -e . --no-deps` in the miles directory. ### Example: Qwen3-4B diff --git a/examples/README.md b/examples/README.md index c38642fbd..88f135241 100644 --- a/examples/README.md +++ b/examples/README.md @@ -1,6 +1,6 @@ # Examples -These examples provide concrete examples to leverage Miles in your own RL workflow. Some examples are just demonstrative, but most of them are verifiable with a concrete performance score. +These examples show concrete ways to leverage miles in your own RL workflow. Some examples are just demonstrative, but most of them are verifiable with a concrete performance score. ## Directory Structure diff --git a/examples/eval/README.md b/examples/eval/README.md index adafe9a5b..4271a8c7d 100644 --- a/examples/eval/README.md +++ b/examples/eval/README.md @@ -4,10 +4,10 @@ This directory contains configuration and utilities for offloading complex evalu ## Overview -The setup allows Miles to delegate evaluation tasks to a dedicated "Skills" server. 
This creates a clear separation of concerns: +The setup allows miles to delegate evaluation tasks to a dedicated "Skills" server. This creates a clear separation of concerns: -1. **Miles Container**: Runs the main training loop and hosts the model using SGLang. -2. **Skills Container**: Hosts the `nemo_skills` environment, runs the evaluation logic, and queries the model running in the Miles container. +1. **miles Container**: Runs the main training loop and hosts the model using SGLang. +2. **Skills Container**: Hosts the `nemo_skills` environment, runs the evaluation logic, and queries the model running in the miles container. ## Prerequisites @@ -18,15 +18,15 @@ The setup allows Miles to delegate evaluation tasks to a dedicated "Skills" serv ### Prepare Host Network -Create a Docker network to allow communication between the Miles and Skills containers. +Create a Docker network to allow communication between the miles and Skills containers. ```bash docker network create skills-net ``` -### Launch the Miles Container +### Launch the miles Container -Start the main container where Miles and the model will run. Replace `<container_name>` with your desired name (e.g., `miles_main`). +Start the main container where miles and the model will run. Replace `<container_name>` with your desired name (e.g., `miles_main`). ```bash docker run \ @@ -76,7 +76,7 @@ git clone -b miles https://github.com/guapisolo/Skills.git /opt/Skills # Install Skills package cd /opt/Skills -pip install -e . +pip install -e . --no-deps ``` **b) Prepare Datasets** @@ -92,7 +92,7 @@ python3 arena-hard/prepare.py **c) Start the Evaluation Server** -Start the server that listens for evaluation requests from Miles. +Start the server that listens for evaluation requests from miles. 
```bash cd /opt/miles @@ -105,20 +105,20 @@ python examples/eval/nemo_skills/skills_server.py \ --max-concurrent-requests 512 \ --openai-model-name miles-openai-model ``` -*Note: You can now connect to the server at `skills_server:9050` from within the `skills-net` Docker network. The server always proxies evaluation traffic to an OpenAI-compatible sglang router (Miles starts and manage the router), so adjust `--openai-model-name` and `--max-concurrent-requests` as needed for your deployment. +*Note: You can now connect to the server at `skills_server:9050` from within the `skills-net` Docker network. The server always proxies evaluation traffic to an OpenAI-compatible sglang router (miles starts and manages the router), so adjust `--openai-model-name` and `--max-concurrent-requests` as needed for your deployment.* ## Running Evaluation The example scripts are located in `examples/eval/scripts`. Here is an example workflow for training Qwen3-4B with delegated evaluation. -### Prepare Miles Container +### Prepare miles Container -Enter the **Miles container** and install the package. +Enter the **miles container** and install the package. ```bash cd /root/miles git pull -pip install -e . +pip install -e . --no-deps ``` ### Download Model and Data diff --git a/examples/low_precision/README.md b/examples/low_precision/README.md index 520ea5d5c..5bb90442a 100644 --- a/examples/low_precision/README.md +++ b/examples/low_precision/README.md @@ -89,7 +89,7 @@ This guide provides examples for INT4 STE (Straight-Through Estimator) training First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace: [https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1) -Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. 
Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. Miles will automatically utilize INT4 quantization during weight updates. +Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. miles will automatically utilize INT4 quantization during weight updates. ```bash python tools/convert_hf_to_hf_int4.py \ diff --git a/examples/on_policy_distillation/README.md b/examples/on_policy_distillation/README.md index ff7a8207b..a6b22c5b1 100644 --- a/examples/on_policy_distillation/README.md +++ b/examples/on_policy_distillation/README.md @@ -1,6 +1,6 @@ # On-Policy Distillation Example -This example shows how to run **on-policy distillation** using Miles. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities. +This example shows how to run **on-policy distillation** using miles. A small student (Qwen3-8B) is aligned to imitate a larger teacher (Qwen3-32B) by training only on the student's own rollouts and matching the teacher's token-level log-probabilities. In this example, the teacher model acts as a reward model (RM) by providing teacher log probabilities as the supervision signal. @@ -50,7 +50,7 @@ Using Qwen3-8B-Base model sfted on part of the [OpenThoughts3-1.2M](https://hugg # FAQ 1. **Why are teacher logits computed via a sglang server instead of inside the training backend?** -The teacher runs on an independent SGLang server that Miles treats as a reward model. Hosting it inside Megatron/FSDP would require maintaining a second, fully configured training stack for the teacher. +The teacher runs on an independent SGLang server that miles treats as a reward model. 
Hosting it inside Megatron/FSDP would require maintaining a second, fully configured training stack for the teacher. # References diff --git a/examples/retool/README.md b/examples/retool/README.md index b4e3f71eb..bd9af717b 100644 --- a/examples/retool/README.md +++ b/examples/retool/README.md @@ -21,7 +21,7 @@ The retool example provides: 1. Setup and download datasets: ```bash cd miles -pip install -e . +pip install -e . --no-deps # For the SFT part: you can skip SFT and use the model below for RL directly. hf download --repo-type dataset JoeYing/ReTool-SFT --local-dir /root/JoeYing/ReTool-SFT hf download Qwen/Qwen3-4B-Instruct-2507 --local-dir /root/Qwen/Qwen3-4B-Instruct-2507 diff --git a/examples/search-r1/README.md b/examples/search-r1/README.md index b9c426f8b..867ca504b 100644 --- a/examples/search-r1/README.md +++ b/examples/search-r1/README.md @@ -9,7 +9,7 @@ Use the `radixark/miles:latest` image and initialize the environment required fo ```bash cd /root/ git clone https://github.com/radixark/miles.git -pip install -e . +pip install -e . --no-deps # for Search R1 pip install chardet ``` diff --git a/examples/strands_sglang/README.md b/examples/strands_sglang/README.md index 09c4111af..ad310238a 100644 --- a/examples/strands_sglang/README.md +++ b/examples/strands_sglang/README.md @@ -1,4 +1,4 @@ -# Miles x Strands-SGLang +# miles x Strands-SGLang This example connects `miles` with [`strands-sglang`](https://github.com/horizon-rl/strands-sglang) (SGLang extension for the agentic scaffolding [`strands`](https://github.com/strands-agents/sdk-python)) for agentic RL training. @@ -20,7 +20,7 @@ This example connects `miles` with [`strands-sglang`](https://github.com/horizon 1. Pull the `radixark/miles:latest` image and enter it 2. Go to miles folder: `cd /root/miles` -3. Install Miles: `pip install -e .` +3. Install miles: `pip install -e . --no-deps` 4. Go to the example folder: `cd /root/miles/examples/strands_sglang` 5. 
Install other dependencies: `pip install -r requirements.txt` diff --git a/examples/tau-bench/README.md b/examples/tau-bench/README.md index 524157b75..417275c1b 100644 --- a/examples/tau-bench/README.md +++ b/examples/tau-bench/README.md @@ -9,13 +9,13 @@ Use the `zhuzilin/miles:latest` image and initialize the environment required fo cd /root/ git clone https://github.com/radixark/miles.git cd miles -pip install -e . +pip install -e . --no-deps # for tau bench cd /root/ git clone https://github.com/JD-ETH/tau-bench.git cd tau-bench git checkout feature/litellm-retry -pip install -e . +pip install -e . --no-deps ``` Use the following script to generate mock data for miles training.
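The `--calculate-per-token-loss` flag described in the `docs/en/get_started/usage.md` hunk above switches between two loss reductions. The sketch below illustrates the arithmetic difference between them; the helper names are hypothetical and not part of miles' codebase.

```python
# Illustrative sketch of the two reductions behind --calculate-per-token-loss.
# `samples` is a list of per-token loss lists, one list per sample.

def per_sample_loss(samples):
    # mean(sum(sample_i) / len(sample_i)): each sample contributes equally,
    # regardless of how many tokens it has.
    return sum(sum(s) / len(s) for s in samples) / len(samples)

def per_token_loss(samples):
    # sum(sum(sample_i)) / sum(len(sample_i)): each token contributes equally,
    # so longer samples dominate the average.
    return sum(sum(s) for s in samples) / sum(len(s) for s in samples)

# A short high-loss sample next to a long zero-loss sample shows the gap.
short = [1.0, 1.0]   # 2 tokens, average token loss 1.0
long = [0.0] * 8     # 8 tokens, average token loss 0.0
print(per_sample_loss([short, long]))  # 0.5: each sample weighted equally
print(per_token_loss([short, long]))   # 0.2: averaged over all 10 tokens
```

In short, the per-sample reduction (the default) keeps short and long rollouts on equal footing, while the per-token reduction lets token count set each sample's weight.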