Aegis automates launching configurable numbers of vLLM inference instances on HPC clusters. It handles PBS job generation, model weight staging via MPI broadcast, and per-instance orchestration.
Currently targets Aurora (PBS). Frontier/Slurm support planned.
```bash
pip install .
```

For development:

```bash
pip install -e .
```

Create a standalone conda environment for Aegis:

```bash
conda create -n aegis python=3.11 -y
conda activate aegis
pip install .
```

Aegis is a launcher — it does not require vLLM itself. vLLM only needs to be available on the compute nodes, either via a system module (e.g., `module load frameworks` on Aurora) or a separate conda environment distributed with `--conda-env`. See *Staging a conda environment* below.
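A quick way to confirm that vLLM is importable in a given environment (e.g., after loading the module on a compute node) is a sketch like the following; the helper name is illustrative, not part of Aegis:

```python
import importlib.util

def vllm_available() -> bool:
    """Return True if a vllm package can be imported in this environment."""
    return importlib.util.find_spec("vllm") is not None

print(vllm_available())
```

Running this inside the launcher environment is expected to print `False`; it should print `True` only in the environment the compute nodes actually use.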
Generate and submit a batch job to the PBS queue:
```bash
aegis submit --config config.yaml
```

Preview the generated PBS script without submitting:

```bash
aegis submit --config config.yaml --dry-run
```

If you already have a PBS allocation (e.g., via `qsub -I`), launch instances directly:

```bash
aegis launch --config config.yaml
```

All config values can be overridden via CLI flags; CLI flags take precedence over the config file.
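The precedence rule amounts to a shallow merge where unset CLI flags fall through to the file value. A minimal sketch (the helper and its signature are illustrative, not Aegis's actual implementation):

```python
def merge_config(file_config: dict, cli_args: dict) -> dict:
    """CLI flags override config-file values; flags left unset (None) fall through."""
    merged = dict(file_config)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

cfg = merge_config({"instances": 2, "walltime": "01:00:00"},
                   {"instances": 4, "walltime": None})
print(cfg)
# {'instances': 4, 'walltime': '01:00:00'}
```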
```bash
aegis submit \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --instances 2 \
  --tensor-parallel-size 6 \
  --account MyProject \
  --walltime 01:00:00 \
  --model-source /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct \
  --dry-run
```

The equivalent config file:

```yaml
model: meta-llama/Llama-3.3-70B-Instruct
instances: 2
tensor_parallel_size: 6
port_start: 8000
hf_home: /tmp/hf_home
model_source: /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct
walltime: "01:00:00"
account: MyProject
queue: debug
filesystems: flare:home
extra_vllm_args:
  - --max-model-len
  - "32768"
```

Launch different models within a single job allocation. Each model can have its own instance count, tensor-parallel size, weight source, and vLLM arguments. Ports are assigned per node starting from `port_start`, incrementing only for additional instances on the same node (e.g., two instances on node1 get ports 8000–8001, while a single instance on node2 gets 8000).
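The per-node port numbering can be sketched as follows (the function and the node-list representation are illustrative, not Aegis's internal code):

```python
from collections import defaultdict

def assign_ports(node_of_instance, port_start=8000):
    """Assign a port to each instance: ports start at port_start on every
    node and increment only for additional instances on the same node."""
    next_port = defaultdict(lambda: port_start)
    ports = []
    for node in node_of_instance:
        ports.append((node, next_port[node]))
        next_port[node] += 1
    return ports

# Two instances on node1, one on node2 (the example above):
print(assign_ports(["node1", "node1", "node2"]))
# [('node1', 8000), ('node1', 8001), ('node2', 8000)]
```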
```yaml
port_start: 8000
hf_home: /tmp/hf_home
walltime: "01:00:00"
account: MyProject
filesystems: flare:home
models:
  - model: meta-llama/Llama-3.3-70B-Instruct
    instances: 2
    tensor_parallel_size: 6
    model_source: /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct
    extra_vllm_args:
      - --max-model-len
      - "32768"
  - model: meta-llama/Llama-3.1-8B-Instruct
    instances: 1
    tensor_parallel_size: 1
```

To distribute a custom conda environment (e.g., one containing vLLM) to all compute nodes, create a conda-pack tarball and pass it via `--conda-env`:
```bash
# Create the tarball from an existing conda environment
conda pack -n my_vllm_env -o my_vllm_env.tar.gz
```

Then reference it in your config file:
```yaml
model: meta-llama/Llama-3.3-70B-Instruct
instances: 2
tensor_parallel_size: 6
conda_env: /path/to/my_vllm_env.tar.gz
walltime: "01:00:00"
account: MyProject
filesystems: flare:home
```

Or pass it as a CLI flag:

```bash
aegis submit --config config.yaml --conda-env /path/to/my_vllm_env.tar.gz
```

Aegis will broadcast the tarball to all nodes and activate the environment before launching vLLM instances.
```text
src/aegis/
├── cli.py          # CLI entry point (argparse)
├── config.py       # Config file loading + merging with CLI args
├── scheduler.py    # PBS job generation and submission
├── launcher.py     # Core orchestration: stage weights, launch instances
└── templates/
    ├── pbs_job.sh.j2     # Jinja2 template for PBS batch script
    └── instance.sh.j2    # Jinja2 template for per-node vLLM launch script
```
`aegis submit` renders a PBS batch script from a Jinja2 template and submits it via `qsub`. The generated job script calls `aegis launch` inside the allocation.

`aegis launch` runs inside a PBS allocation. It optionally stages model weights to local storage using the MPI broadcast tool (`tools/bcast.c`), then launches one `vllm serve` process per instance on the assigned nodes.
Reference documentation for manual setup on each platform:
`tools/bcast.c` — MPI-based tool for efficiently broadcasting model weights to all compute nodes via `tar` streaming.