APEX+ is an extensible and dynamism-aware simulator for Automated Parallel EXecution in LLM serving. Given an LLM, a device cluster, and input requests with varying context/generation lengths, APEX+ generates an optimal parallel execution plan for LLM serving. APEX+ performs dynamism-aware simulation to model iteration-level batching, and leverages LLMs' repetitive structure to reduce design space, scaling efficiently to trillion-scale models. APEX+ finds plans up to 3.37x faster than heuristics, and also plans that reduce energy consumption by up to 45% compared to latency-optimal plans. APEX+ performs comprehensive evaluations, reporting key system metrics like time per output token (TPOT) and time to first token (TTFT), which can help service providers meet SLOs. APEX+ identifies an optimal plan within 15 minutes on a CPU, making it 71x faster and 1234x more cost-effective than cloud-based GPU deployment.
Currently, APEX+ includes op-level profiling data for the following backends: V100-16GB, H100-SXM-80GB, and H200-SXM-141GB.
To experiment with a different cluster backend, one needs to create a folder named after the backend (e.g., H100-SXM-80GB) and store the profiling files in profile/comp/{Backend} and profile/comm/{Backend}. Note that APEX+ also supports hardware backends other than GPUs, as long as their op-level profiling data is provided. Detailed instructions for obtaining the profiling data are in the README under the profile folder.
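For illustration, here is a minimal Python sketch (the backend name is just an example) that creates the expected directory layout:

import os

backend = "H100-SXM-80GB"  # folder named after the backend
for kind in ("comp", "comm"):
    os.makedirs(os.path.join("profile", kind, backend), exist_ok=True)
# Copy the op-level profiling files into these two folders;
# see the README under the profile folder for how to generate them.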
Install the dependencies by running:
pip install -r requirements.txt

Run main.py in the root directory of the repository. We can simulate various models:
# Simulate the decoder-only model llama3-70b; the --all flag prints the simulation results of all execution plans; otherwise, only the latency-optimal plan is printed
python main.py --model llama3-70b --num-gpus-per-node 2 --prompt-len 128 --output-len 2048 --all
# Simulate the encoder-decoder model Whisper
python main.py --model whisper --num-gpus-per-node 2 --prompt-len 128 --output-len 2048
# Simulate with a trace file
python main.py --model llama3-70b --num-gpus-per-node 2 --trace-file ./traces/llama/creation_05.jsonl
# Simulate with quantization mode W8A16 and KV-cache in FP8
python main.py --model llama3-70b --num-gpus-per-node 2 --trace-file ./traces/llama/creation_05.jsonl --kv-dtype float8 --weight-dtype float8 --activation-dtype half
# Simulate with a MoE Model
python main.py --model mixtral-8x7b-local --num-gpus-per-node 4 --num-experts 8 --trace-file ./traces/mistral/lmsys_05.jsonl

Note: APEX+ supports simulation on real request traces. Traces should be stored in .jsonl format, with each line containing the following fields: StartTimeOffset (offset from the first request, in ns), ContextTokens (input sequence length), and GeneratedTokens (output sequence length).
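As an illustration, the following sketch writes a minimal trace in the expected format (the request values here are made up):

import json

# Each line is one request: offset from the first request (ns),
# input length (tokens), and output length (tokens).
requests = [
    {"StartTimeOffset": 0, "ContextTokens": 128, "GeneratedTokens": 2048},
    {"StartTimeOffset": 500_000_000, "ContextTokens": 256, "GeneratedTokens": 512},
]
with open("my_trace.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")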
Note 2: For a full list of supported LLMs, please see main.py.
To compare the simulation results with actual LLM serving behavior, we use vLLM to serve dense LLMs and SGLang to serve MoE models. Instructions to set up and run the vLLM/SGLang experiments are in the README.md files of their respective folders. We also include the experimental results of serving several LLMs on a cluster with eight H100-SXM-80GB GPUs in a folder named results.
Using Llama3-70b as an example, the output will look like:
Namespace(model='./apex_plus/models/llama3_70b_config.json', num_experts=None, topk=2, capacity_factor=1.0, num_nodes=1, num_gpus_per_node=2, gpu='H100-SXM-80GB', frequency=1980, trace_file='./traces/simplified_irregular_trace.jsonl', prompt_len=2048, output_len=128, num_requests=1024, disable_ray=False, kv_dtype='half', weight_dtype='half', activation_dtype='half', all=True, request_percentiles=[], token_percentiles=[])
<bound method LLaMA3.__repr__ of LLaMA3(vocab_size=128256, num_layers=80, num_heads=64, num_kv_heads=8, hidden_size=8192, intermediate_size=28672)>
Generated 3 decoder candidate plans.
================================================================================
* Parallel schedule 0 for decoder:
# Model replicas: 1
# Stages: 1
# Blocks per stage: 80
--------------------------------------------------------------------------------
Name Value
--------------------------------------------------------------------------------
MQA 1 replicas, TaskMapping(tasks_per_device={'MQAHead': 32})x2
AllReduce 2 devices
SwiGLU 1 replicas, TaskMapping(tasks_per_device={'SwiGLUFilter':
14336})x2
AllReduce 2 devices
AllGather 2 devices
--------------------------------------------------------------------------------
* Statistics:
--------------------------------------------------------
Name Value
--------------------------------------------------------
Parameter size per device (GB) 63.8
Activation memory per device (GB) 16.2
Avg. requests per iteration (per microbatch) 14.4
Avg. tokens per iteration (per microbatch) 96.8
--------------------------------------------------------
* Performance Metrics:
--------------------------------------------------------------------------------
Name Value Units
--------------------------------------------------------------------------------
Throughput: Avg. Tokens generated per second 336.752 tokens/sec
Throughput: Avg. Tokens processed per second 2266.710 tokens/sec
Throughput: Requests per second 1.611 requests/sec
Latency: Avg. Time to first token (TTFT in msec) 200.707 msec
Latency: Avg. Time per output token (TPOT in msec) 44.064 msec
Avg. TBT Percentile: P50 42.578 msec
Avg. TBT Percentile: P95 54.281 msec
Request Completion Latency: 50th percentile 7.749 sec
Request Completion Latency: 95th percentile 20.692 sec
Avg. MFU Per iteration 6.591 %
MBU 43.646 %
--------------------------------------------------------------------------------
* Time breakdown:
---------------------------------------
Name Time (sec) Ratio (%)
---------------------------------------
MQA 17.23 ± 0.0 27.8
AllReduce 3.56 ± 0.0 5.7
SwiGLU 39.43 ± 0.0 63.5
AllGather 1.87 ± 0.0 3.0
Total 62.06 100.0
---------------------------------------
Energy Consumption: 75.69 KJ
================================================================================
* Parallel schedule 1 for decoder:
# Model replicas: 1
# Stages: 1
# Blocks per stage: 80
------------------------------------------------------------------------------------
Name Value
------------------------------------------------------------------------------------
MQA 2 replicas, TaskMapping(tasks_per_device={'MQAHead': 64})x1
ReduceScatter 1 devices
AllGather 2 devices
SwiGLU 1 replicas, TaskMapping(tasks_per_device={'SwiGLUFilter':
14336})x2
AllReduce 2 devices
AllToAll 2 devices
AllGather 1 devices
------------------------------------------------------------------------------------
* Statistics:
--------------------------------------------------------
Name Value
--------------------------------------------------------
Parameter size per device (GB) 75.0
Activation memory per device (GB) 5.0
Avg. requests per iteration (per microbatch) 15.4
Avg. tokens per iteration (per microbatch) 104.1
--------------------------------------------------------
* Performance Metrics:
--------------------------------------------------------------------------------
Name Value Units
--------------------------------------------------------------------------------
Throughput: Avg. Tokens generated per second 243.242 tokens/sec
Throughput: Avg. Tokens processed per second 1637.286 tokens/sec
Throughput: Requests per second 1.164 requests/sec
Latency: Avg. Time to first token (TTFT in msec) 348.100 msec
Latency: Avg. Time per output token (TPOT in msec) 50.885 msec
Avg. TBT Percentile: P50 50.873 msec
Avg. TBT Percentile: P95 55.442 msec
Request Completion Latency: 50th percentile 9.138 sec
Request Completion Latency: 95th percentile 19.982 sec
Avg. MFU Per iteration 4.761 %
MBU 37.795 %
--------------------------------------------------------------------------------
* Time breakdown:
---------------------------------------
Name Time (sec) Ratio (%)
---------------------------------------
MQA 25.80 ± 1.9 30.0
AllGather 1.32 ± 0.1 1.5
SwiGLU 38.05 ± 2.6 44.3
AllReduce 1.67 ± 0.1 1.9
AllToAll 1.68 ± 0.1 2.0
Idle 17.39 ± 4.8 20.2
Total 85.92 100.0
---------------------------------------
Energy Consumption: 85.91 KJ
================================================================================
* Parallel schedule 2 for decoder:
# Model replicas: 1
# Stages: 2
# Blocks per stage: 40
--------------------------------------------------------------------------------
Name Value
--------------------------------------------------------------------------------
MQA 1 replicas, TaskMapping(tasks_per_device={'MQAHead': 64})x1
AllReduce 1 devices
SwiGLU 1 replicas, TaskMapping(tasks_per_device={'SwiGLUFilter':
28672})x1
AllReduce 1 devices
AllGather 1 devices
--------------------------------------------------------------------------------
* Statistics:
--------------------------------------------------------
Name Value
--------------------------------------------------------
Parameter size per device (GB) 63.8
Activation memory per device (GB) 16.2
Avg. requests per iteration (per microbatch) 5.6
Avg. tokens per iteration (per microbatch) 37.7
--------------------------------------------------------
* Performance Metrics:
--------------------------------------------------------------------------------
Name Value Units
--------------------------------------------------------------------------------
Throughput: Avg. Tokens generated per second 163.343 tokens/sec
Throughput: Avg. Tokens processed per second 1099.480 tokens/sec
Throughput: Requests per second 0.782 requests/sec
Latency: Avg. Time to first token (TTFT in msec) 279.278 msec
Latency: Avg. Time per output token (TPOT in msec) 62.767 msec
Avg. TBT Percentile: P50 61.840 msec
Avg. TBT Percentile: P95 66.089 msec
Request Completion Latency: 50th percentile 10.981 sec
Request Completion Latency: 95th percentile 24.774 sec
Avg. MFU Per iteration 3.193 %
MBU 30.640 %
--------------------------------------------------------------------------------
* Time breakdown:
--------------------------------------
Name Time (sec) Ratio (%)
--------------------------------------
MQA 34.17 ± 1.0 26.7
SwiGLU 83.28 ± 2.6 65.1
SendRecv 0.04 ± 0.0 0.0
Idle 10.46 ± 3.6 8.2
Total 127.95 100.0
--------------------------------------
Energy Consumption: 158.52 KJ
================================================================================
The result indicates that the best execution plan for this specific case (parallel schedule 0) is to use Megatron-style 2-way tensor parallelism for both the MQA and SwiGLU layers. APEX+ also shows that the third-best plan (parallel schedule 2) uses 2-way pipeline parallelism; it generates tokens at roughly half the rate of schedule 0 (163.343 vs. 336.752 tokens/sec).
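As a small illustration of how one might post-process these results to trade off latency against energy (the numbers are copied from the output above; this snippet is not part of APEX+ itself):

# Per-plan TPOT (msec) and energy (KJ), copied from the simulation output above.
plans = {
    0: {"tpot_msec": 44.064, "energy_kj": 75.69},
    1: {"tpot_msec": 50.885, "energy_kj": 85.91},
    2: {"tpot_msec": 62.767, "energy_kj": 158.52},
}
latency_optimal = min(plans, key=lambda p: plans[p]["tpot_msec"])
energy_optimal = min(plans, key=lambda p: plans[p]["energy_kj"])
print(latency_optimal, energy_optimal)  # 0 0

Here the latency-optimal and energy-optimal plans happen to coincide; in general they can differ, which is why APEX+ also reports plans that reduce energy consumption relative to the latency-optimal plan.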
Please refer to the README under the profile folder for instructions on profiling.
APEX+ can report the percentage of requests that meet a target SLO. By default, the TTFT and TPOT SLOs are both set to 10 msec.
A maximum batch size can also be set. Both functionalities are configured with the three flags shown below:
python main.py --num-gpus 8 --num-nodes 1 --model llama3-70b --num-requests 10 --ttft-slo 100 --tpot-slo 30 --max-batch 10

The APEX+ output will include an SLO Metrics section for each candidate plan.
--------------------------------------------------------------------------------
* Latency SLO Metrics:
* Batch Size: 10
-----------------------------------------
Name Target SLO(msec) SLOs Met(%)
-----------------------------------------
TTFT 100 0.000 %
TPOT 30 100.000 %
-----------------------------------------
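The SLOs Met(%) figures correspond to the fraction of requests whose latency falls within the target. A minimal sketch of that computation (the per-request latencies here are hypothetical):

# Percentage of requests meeting a latency target.
def slo_met(latencies_msec, target_msec):
    met = sum(1 for lat in latencies_msec if lat <= target_msec)
    return 100.0 * met / len(latencies_msec)

ttft_msec = [150.0, 220.0, 310.0]  # hypothetical per-request TTFTs
print(f"TTFT SLOs met: {slo_met(ttft_msec, 100):.3f} %")  # 0.000 %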
The paper of this work can be accessed at https://arxiv.org/abs/2411.17651.
DOI: 10.5281/zenodo.15300595
@misc{apex,
title={APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving},
author={Yi-Chien Lin and Woosuk Kwon and Ronald Pineda and Fanny Nina Paravecino},
year={2024},
eprint={2411.17651},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2411.17651},
}