A comprehensive framework for scalable multi-node, multi-GPU LLM inference on HPC systems using vLLM and Ollama. It includes distributed deployment templates, benchmarking workflows, and chatbot/RAG pipelines for high-throughput, production-grade AI services.
This repository focuses on:
- Multi-node vLLM inference on HPC clusters (see the inference sketch after this list)
- GPU-accelerated LLM inference
- Kubernetes- and Slurm-based distributed inference
- HPC-scale RAG and chatbot pipelines (see the client sketch below)
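
As a quick orientation, the snippet below is a minimal sketch of GPU-parallel vLLM inference on a single node. The model ID, GPU count, and prompt are placeholder assumptions; multi-node runs would additionally require launching vLLM on a Ray cluster via the Slurm or Kubernetes templates in this repository.

```python
"""Minimal sketch of multi-GPU vLLM inference (assumptions noted in comments)."""
from vllm import LLM, SamplingParams

# Shard the model across the GPUs of one node via tensor parallelism.
# Setting pipeline_parallel_size > 1 on a Ray cluster would extend this
# across nodes; both values here are assumed, not prescriptive.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    tensor_parallel_size=4,                    # assumed GPUs per node
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Summarize why HPC clusters are well suited to LLM inference."],
    sampling,
)

for out in outputs:
    print(out.outputs[0].text)
```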
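
For the RAG/chatbot side, requests typically go to vLLM's OpenAI-compatible server. The sketch below assumes such a server is already running on the cluster (the host, port, model name, and retrieved context are placeholders) and that document retrieval happens elsewhere in the pipeline.

```python
"""Minimal sketch of a RAG-style chatbot call against a vLLM server."""
from openai import OpenAI

client = OpenAI(
    base_url="http://head-node:8000/v1",  # assumed vLLM server address
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

# Placeholder for a passage returned by the retrieval stage of the pipeline.
context = "Retrieved passage about Slurm GPU scheduling goes here."
question = "How do I request 4 GPUs per node in Slurm?"

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    messages=[
        {"role": "system", "content": f"Answer using this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```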