Aegis automates launching configurable numbers of vLLM inference instances on HPC clusters. It handles PBS job generation, model weight staging via MPI broadcast, and per-instance orchestration.
Currently targets Aurora (PBS). Frontier/Slurm support planned.
```bash
pip install .
```

For development:

```bash
pip install -e .
```

Create a standalone conda environment for Aegis:

```bash
conda create -n aegis python=3.11 -y
conda activate aegis
pip install .
```

Aegis is a launcher — it does not require vLLM itself. vLLM only needs to be available on the compute nodes, either via a system module (e.g., `module load frameworks` on Aurora) or a separate conda environment distributed with `--conda-env`. See *Staging a conda environment* below.
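A quick way to confirm that vLLM is importable in a given environment (e.g., after loading the module on a compute node) is a sketch like the following; the helper name is illustrative, not part of Aegis:

```python
import importlib.util

def vllm_available() -> bool:
    """Return True if a vllm package can be imported in this environment."""
    return importlib.util.find_spec("vllm") is not None

print(vllm_available())
```

Running this inside the launcher environment is expected to print `False`; it should print `True` only in the environment the compute nodes actually use.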
Generate and submit a batch job to the PBS queue:
```bash
aegis submit --config config.yaml
```

Preview the generated PBS script without submitting:

```bash
aegis submit --config config.yaml --dry-run
```

If you already have a PBS allocation (e.g., via `qsub -I`), launch instances directly:

```bash
aegis launch --config config.yaml
```

All config values can be overridden via CLI flags; CLI flags take precedence over the config file.
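The precedence rule amounts to a shallow merge where unset CLI flags fall through to the file value. A minimal sketch (the helper and its signature are illustrative, not Aegis's actual implementation):

```python
def merge_config(file_config: dict, cli_args: dict) -> dict:
    """CLI flags override config-file values; flags left unset (None) fall through."""
    merged = dict(file_config)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

cfg = merge_config({"instances": 2, "walltime": "01:00:00"},
                   {"instances": 4, "walltime": None})
print(cfg)
# {'instances': 4, 'walltime': '01:00:00'}
```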
```bash
aegis submit \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --instances 2 \
  --tensor-parallel-size 6 \
  --account MyProject \
  --walltime 01:00:00 \
  --model-source /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct \
  --dry-run
```

The equivalent config file:

```yaml
model: meta-llama/Llama-3.3-70B-Instruct
instances: 2
tensor_parallel_size: 6
port_start: 8000
hf_home: /tmp/hf_home
model_source: /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct
walltime: "01:00:00"
account: MyProject
queue: debug
filesystems: flare:home
extra_vllm_args:
  - --max-model-len
  - "32768"
```

Launch different models within a single job allocation. Each model can have its own instance count, tensor-parallel size, weight source, and vLLM arguments. Ports are assigned per node starting from `port_start`, incrementing only for additional instances on the same node (e.g., two instances on node1 get ports 8000–8001, while a single instance on node2 gets 8000).
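The per-node port numbering can be sketched as follows (the function and the node-list representation are illustrative, not Aegis's internal code):

```python
from collections import defaultdict

def assign_ports(node_of_instance, port_start=8000):
    """Assign a port to each instance: ports start at port_start on every
    node and increment only for additional instances on the same node."""
    next_port = defaultdict(lambda: port_start)
    ports = []
    for node in node_of_instance:
        ports.append((node, next_port[node]))
        next_port[node] += 1
    return ports

# Two instances on node1, one on node2 (the example above):
print(assign_ports(["node1", "node1", "node2"]))
# [('node1', 8000), ('node1', 8001), ('node2', 8000)]
```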
```yaml
port_start: 8000
hf_home: /tmp/hf_home
walltime: "01:00:00"
account: MyProject
filesystems: flare:home
models:
  - model: meta-llama/Llama-3.3-70B-Instruct
    instances: 2
    tensor_parallel_size: 6
    model_source: /flare/datasets/model-weights/hub/models--meta-llama--Llama-3.3-70B-Instruct
    extra_vllm_args:
      - --max-model-len
      - "32768"
  - model: meta-llama/Llama-3.1-8B-Instruct
    instances: 1
    tensor_parallel_size: 1
```

To distribute a custom conda environment (e.g., one containing vLLM) to all compute nodes, create a conda-pack tarball and pass it via `--conda-env`:
```bash
# Create the tarball from an existing conda environment
conda pack -n my_vllm_env -o my_vllm_env.tar.gz
```

Then reference it in your config file:
```yaml
model: meta-llama/Llama-3.3-70B-Instruct
instances: 2
tensor_parallel_size: 6
conda_env: /path/to/my_vllm_env.tar.gz
walltime: "01:00:00"
account: MyProject
filesystems: flare:home
```

Or pass it as a CLI flag:

```bash
aegis submit --config config.yaml --conda-env /path/to/my_vllm_env.tar.gz
```

Aegis will broadcast the tarball to all nodes and activate the environment before launching vLLM instances.
```text
src/aegis/
├── cli.py          # CLI entry point (argparse)
├── config.py       # Config file loading + merging with CLI args
├── scheduler.py    # PBS job generation and submission
├── launcher.py     # Core orchestration: stage weights, launch instances
└── templates/
    ├── pbs_job.sh.j2     # Jinja2 template for PBS batch script
    └── instance.sh.j2    # Jinja2 template for per-node vLLM launch script
```
`aegis submit` renders a PBS batch script from a Jinja2 template and submits it via `qsub`. The generated job script calls `aegis launch` inside the allocation.

`aegis launch` runs inside a PBS allocation. It optionally stages model weights to local storage using the MPI broadcast tool (`tools/bcast.c`), then launches one `vllm serve` process per instance on the assigned nodes.
Reference documentation for manual setup on each platform:
`tools/bcast.c` — MPI-based tool for efficiently broadcasting model weights to all compute nodes via `tar` streaming.