diff --git a/docs/llm/mbridge/optimized/index.md b/docs/llm/mbridge/optimized/index.md
index 88630a37d5..90429b532d 100644
--- a/docs/llm/mbridge/optimized/index.md
+++ b/docs/llm/mbridge/optimized/index.md
@@ -1,4 +1,13 @@
 # Deploy Megatron-Bridge LLMs by Exporting to Inference Optimized Libraries
 
-**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM and vLLM is coming soon. Please check back for updates.
+Export-Deploy supports optimizing and deploying Megatron-Bridge checkpoints using inference-optimized libraries such as vLLM and TensorRT-LLM.
+
+```{toctree}
+:maxdepth: 1
+:titlesonly:
+
+vLLM
+```
+
+**Note:** Support for exporting and deploying Megatron-Bridge models with TensorRT-LLM is coming soon. Please check back for updates.
diff --git a/docs/llm/mbridge/optimized/vllm.md b/docs/llm/mbridge/optimized/vllm.md
new file mode 100644
index 0000000000..78c8645f14
--- /dev/null
+++ b/docs/llm/mbridge/optimized/vllm.md
@@ -0,0 +1,190 @@
# Deploy Megatron-Bridge LLMs with vLLM and Triton Inference Server

This section shows how to use scripts and APIs to export a Megatron-Bridge LLM to vLLM and deploy it with the NVIDIA Triton Inference Server.

## Quick Example

1. Follow the steps in the [Generate a Megatron-Bridge Checkpoint](../gen_mbridge_ckpt.md) guide to generate a Megatron-Bridge Llama checkpoint.

2. In a terminal, go to the folder where the ``hf_llama31_8B_mbridge`` checkpoint is located. Pull down and run the Docker container image using the commands shown below. Change the ``:vr`` tag to the version of the container you want to use:

    ```shell
    docker pull nvcr.io/nvidia/nemo:vr

    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
      -v ${PWD}/hf_llama31_8B_mbridge:/opt/checkpoints/hf_llama31_8B_mbridge/ \
      -w /opt/Export-Deploy \
      --name nemo-fw \
      nvcr.io/nvidia/nemo:vr
    ```

3. If vLLM is not already available in the container, install it by executing the following commands inside the container:

    ```shell
    cd /opt/Export-Deploy
    uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
    ```

4. Run the following deployment script to verify that everything is working correctly. The script exports the Llama Megatron-Bridge checkpoint to vLLM and subsequently serves it on the Triton server:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
        --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
        --model_format megatron_bridge \
        --triton_model_name llama \
        --tensor_parallelism_size 1
    ```

5. If the test yields a shared memory-related error, restart the container with a larger ``--shm-size`` value (increasing it gradually by 50%, for example).

6. In a separate terminal, access the running container as follows:

    ```shell
    docker exec -it nemo-fw bash
    ```

7. To send a query to the Triton server, run the following script (an optional Python readiness check is shown after these steps):

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py -mn llama -p "The capital of Canada is" -mat 50
    ```
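If the query in the last step fails or times out, it can help to confirm from Python that the Triton server and the deployed model are actually up. The following is a minimal sketch that uses Triton's standard KServe v2 HTTP health endpoints; it assumes the defaults used above (HTTP port 8000 and model name ``llama``) and that the ``requests`` package is available in the container:

```python
import requests

BASE_URL = "http://localhost:8000"  # matches the --triton_port default used by the deploy script
MODEL_NAME = "llama"                # matches --triton_model_name from the Quick Example

# Triton exposes standard KServe v2 health endpoints over HTTP.
server_ready = requests.get(f"{BASE_URL}/v2/health/ready", timeout=5)
model_ready = requests.get(f"{BASE_URL}/v2/models/{MODEL_NAME}/ready", timeout=5)

print("Server ready:", server_ready.status_code == 200)
print("Model ready:", model_ready.status_code == 200)
```

A 200 response from both endpoints means the server has started and the ``llama`` model is loaded; anything else usually means the deployment script is still starting up or has failed.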
## Use a Script to Deploy Megatron-Bridge LLMs on a Triton Server

You can deploy an LLM from a Megatron-Bridge checkpoint on Triton using the provided script.

### Export and Deploy a Megatron-Bridge LLM

When executed, the script first exports the model to vLLM and then starts the service on Triton.

1. Start the container using the steps described in the **Quick Example** section.

2. To begin serving the model, run the following script:

    ```shell
    python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
        --model_path_id /opt/checkpoints/hf_llama31_8B_mbridge \
        --triton_model_name llama \
        --tensor_parallelism_size 1
    ```

    The following parameters are defined in the ``deploy_vllm_triton.py`` script:

    - ``--model_path_id``: Path of a Megatron-Bridge checkpoint, or Hugging Face model ID or path. (Required)
    - ``--tokenizer``: Tokenizer file if it is not provided in the checkpoint. (Optional)
    - ``--lora_ckpt``: List of LoRA checkpoints in HF format. (Optional, can specify multiple)
    - ``--tensor_parallelism_size``: Number of GPUs to use for tensor parallelism. Default is 1.
    - ``--dtype``: Data type for the model in vLLM. Choices: "auto", "bfloat16", "float16", "float32". Default is "auto".
    - ``--quantization``: Quantization method for vLLM. Choices: "awq", "gptq", "fp8". Default is None.
    - ``--seed``: Random seed for reproducibility. Default is 0.
    - ``--gpu_memory_utilization``: Fraction of GPU memory for vLLM to use. Default is 0.9.
    - ``--swap_space``: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
    - ``--cpu_offload_gb``: Size (GiB) of CPU memory to use for offloading model weights. Default is 0.
    - ``--enforce_eager``: Whether to enforce eager execution. Default is False.
    - ``--max_seq_len_to_capture``: Maximum sequence length covered by CUDA graphs. Default is 8192.
    - ``--triton_model_name``: Name for the service/model on Triton. (Required)
    - ``--triton_model_version``: Version for the service/model. Default is 1.
    - ``--triton_port``: Port for the Triton server to listen for requests. Default is 8000.
    - ``--triton_http_address``: HTTP address for the Triton server. Default is 0.0.0.0.
    - ``--max_batch_size``: Maximum batch size of the model. Default is 8.
    - ``--debug_mode``: Enable debug/verbose output. Default is False.

3. Access the models with a Hugging Face token.

    If you want to run inference using the StarCoder1, StarCoder2, or Llama3 models, you'll need to generate a Hugging Face token that has access to these models. Visit [Hugging Face](https://huggingface.co/) for more information. After you have the token, perform one of the following steps.

    - Log in to Hugging Face:

      ```shell
      huggingface-cli login
      ```

    - Or, set the ``HF_TOKEN`` environment variable:

      ```shell
      export HF_TOKEN=your_token_here
      ```

## Supported LLMs

Megatron-Bridge models are supported for export and deployment if they are listed as compatible in the [vLLM supported models list](https://docs.vllm.ai/en/v0.9.2/models/supported_models.html).

## Use NeMo Export and Deploy APIs to Export

Up until now, we have used scripts for exporting and deploying LLMs. However, NeMo's deploy and export modules offer straightforward APIs for deploying models to Triton and for exporting Megatron-Bridge checkpoints to vLLM.

### Export Megatron-Bridge LLMs

You can use the APIs in the export module to export a Megatron-Bridge checkpoint to vLLM. The code example below assumes the ``hf_llama31_8B_mbridge`` checkpoint has already been generated at the ``/opt/checkpoints/`` path.
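Before running the export, you can optionally verify that the checkpoint directory is mounted where the exporter expects it. The following is a minimal sketch that uses only the standard library and the path conventions from this page; it fails early with a clear message instead of a more opaque export error:

```python
import os

# Checkpoint path used throughout this page; adjust if yours lives elsewhere.
checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge"

# Fail early if the mounted checkpoint directory is missing or empty.
if not os.path.isdir(checkpoint_file) or not os.listdir(checkpoint_file):
    raise FileNotFoundError(f"Checkpoint directory not found or empty: {checkpoint_file}")
```

With the checkpoint in place, the export and a short test generation look like this: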
+ +```python +from nemo_export.vllm_exporter import vLLMExporter + +checkpoint_file = "/opt/checkpoints/hf_llama31_8B_mbridge" + +exporter = vLLMExporter() +exporter.export( + model_path_id=checkpoint_file, + tensor_parallel_size=1, +) + +# The correct argument for output length is 'max_tokens', not 'max_output_len' +output = exporter.forward( + ["What is the best city in the world?"], + max_tokens=50, + top_k=1, + top_p=0.0, + temperature=1.0, +) +print("output: ", output) +``` + +Be sure to check the vLLMExporter class docstrings for details. + + +## How To Send a Query + +### Send a Query using the Script + +You can send queries to your deployed Megatron-Bridge LLM using the provided query script. This script allows you to interact with the model via HTTP requests, sending prompts and receiving generated responses directly from the Triton server. + +The example below demonstrates how to use the query script to send a prompt to your deployed model. You can customize the request with various parameters to control generation behavior, such as output length, sampling strategy, and more. For a full list of supported parameters, see below. + +```shell +python /opt/Export-Deploy/scripts/deploy/nlp/query_vllm.py --url "http://localhost:8000" --model_name llama --prompt "What is the capital of United States?" +``` + +**Additional parameters:** +- `--prompt_file`: Read prompt from a file instead of the command line +- `--max_tokens`: Maximum number of tokens to generate (default: 16) +- `--min_tokens`: Minimum number of tokens to generate (default: 0) +- `--n_log_probs`: Number of log probabilities to return per output token +- `--n_prompt_log_probs`: Number of log probabilities to return per prompt token +- `--seed`: Random seed for generation +- `--top_k`: Top-k sampling (default: 1) +- `--top_p`: Top-p sampling (default: 0.1) +- `--temperature`: Sampling temperature (default: 1.0) +- `--lora_task_uids`: List of LoRA task UIDs for LoRA-enabled models (use -1 to disable) +- `--init_timeout`: Init timeout for the Triton server in seconds (default: 60.0) + + +### Send a Query using the NeMo APIs + +Please see the below if you would like to use APIs to send a query. + +```python +from nemo_deploy.nlp import NemoQueryvLLM + +nq = NemoQueryvLLM(url="localhost:8000", model_name="llama") +output = nq.query_llm( + prompts=["What is the capital of United States? "], + max_tokens=100, + top_k=1, + top_p=0.8, + temperature=1.0, +) +print("output: ", output) +```