A simplified and optimized transformer model architecture based on OLMo (Open Language Model), specifically designed for AMD MI300X accelerators with FP8 quantization support and vLLM integration.
- Base Architecture: Simplified OLMo transformer decoder (see the sketch after this list)
- Target Hardware: AMD Instinct MI300X
- Precision: FP8 quantization for inference
- Framework: PyTorch with ROCm optimizations
- Inference Engine: vLLM integration
- Custom Kernels: Hugging Face kernel-builder integration
- Optimized kernels for MI300X compute units
- FP8 quantization for memory efficiency
- Flash Attention 3.0 integration
- Transformer Engine optimizations
- Multi-dataset training support (up to 6 datasets)
- Custom HIP kernels via kernel-builder
- Docker containerization for easy deployment
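For orientation, here is a minimal sketch of what a simplified OLMo-style decoder block can look like in PyTorch: pre-norm attention plus a SwiGLU MLP. This is illustrative only; the project's actual block lives in `src/model/architecture.py` and may differ in details such as the normalization type and rotary embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Pre-norm decoder block with a SwiGLU MLP (illustrative only)."""

    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.heads = heads
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)
        self.proj = nn.Linear(hidden, hidden, bias=False)
        # SwiGLU: gate and up projections, then a down projection
        self.w_gate = nn.Linear(hidden, 4 * hidden, bias=False)
        self.w_up = nn.Linear(hidden, 4 * hidden, bias=False)
        self.w_down = nn.Linear(4 * hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim) for SDPA
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(attn.transpose(1, 2).reshape(b, t, d))
        h = self.norm2(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
```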
kernel_tonic/
├── src/
│   ├── model/
│   │   ├── __init__.py
│   │   ├── architecture.py        # Simplified OLMo architecture
│   │   ├── layers.py              # Custom layers and kernels
│   │   ├── config.py              # Model configuration
│   │   └── kernel_integration.py  # Custom kernel integration
│   ├── kernels/
│   │   ├── __init__.py
│   │   ├── attention.py           # Optimized attention kernels
│   │   ├── linear.py              # Optimized linear kernels
│   │   └── activation.py          # Optimized activation kernels
│   ├── quantization/
│   │   ├── __init__.py
│   │   ├── fp8.py                 # FP8 quantization
│   │   └── kernels.py             # Quantized kernels
│   └── training/
│       ├── __init__.py
│       ├── trainer.py             # Training loop
│       └── datasets.py            # Multi-dataset handling
├── kernels/
│   └── monolithic/
│       ├── kernel_builder.yaml    # Kernel builder config
│       ├── kernel.hip             # Custom HIP kernels
│       └── __init__.py            # Python interface
├── configs/
│   ├── model_config.yaml          # Model configuration
│   └── training_config.yaml       # Training parameters
├── scripts/
│   ├── train.py                   # Training script
│   ├── export_vllm.py             # vLLM export script
│   ├── run_training.sh            # Complete training pipeline
│   └── test_inference.py          # Inference test script
├── tests/
│   ├── test_model.py
│   ├── test_kernels.py
│   └── test_quantization.py
├── Dockerfile                     # Docker image for MI300X
├── docker-compose.yml             # Docker Compose setup
├── requirements.txt
├── setup.py
└── README.md
- Compute Units: Optimized for MI300X CDNA3 architecture
- Memory Hierarchy: Leverages HBM3 memory bandwidth
- Matrix Cores: FP8 matrix operations (CDNA3 Matrix Cores)
- Multi-GPU: Support for multi-MI300X configurations
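As a rough illustration of the FP8 path, the sketch below shows per-tensor symmetric quantization to the E4M3 format using PyTorch's native `torch.float8_e4m3fn` dtype (available in recent PyTorch builds). The function names are illustrative, not the actual API of `src/quantization/fp8.py`.

```python
import torch

def quantize_fp8(x: torch.Tensor):
    """Per-tensor symmetric quantization to FP8 E4M3 (illustrative)."""
    # Scale so the largest magnitude maps to E4M3's max representable value.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Upcast to a wider dtype before rescaling.
    return x_fp8.to(torch.float32) * scale

x = torch.randn(4, 8)
x_fp8, scale = quantize_fp8(x)
print(dequantize_fp8(x_fp8, scale).sub(x).abs().max())  # quantization error
```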
- Custom attention kernels using HIP
- Optimized linear layer implementations
- Flash Attention 3.0 integration
- Memory-efficient activation functions
- Monolithic kernel-builder integration
- AMD MI300X GPU
- ROCm 5.7.3+
- Docker and Docker Compose
- Hugging Face token for dataset access
export HF_TOKEN=your_huggingface_token_here

# Make script executable
chmod +x scripts/run_training.sh
# Run complete pipeline
./scripts/run_training.sh

This will:
- Build the Docker image
- Train the model on Colossal OSCAR 1.0 (all languages)
- Export the model for vLLM
- Start the vLLM inference server
python scripts/test_inference.py

# Build Docker image
docker build -t kernel-tonic:latest .
# Run training
docker run --gpus all -e HF_TOKEN=$HF_TOKEN \
-v $(pwd)/data:/workspace/data \
-v $(pwd)/models:/workspace/models \
-v $(pwd)/logs:/workspace/logs \
kernel-tonic:latest train --config small --batch-size 4

docker run --gpus all \
-v $(pwd)/checkpoints:/workspace/checkpoints \
-v $(pwd)/models:/workspace/models \
kernel-tonic:latest export \
--checkpoint /workspace/checkpoints/best_model.pt \
--output-dir /workspace/models/kernel-tonic \
--config small

docker run --gpus all -p 8000:8000 \
-v $(pwd)/models:/workspace/models \
kernel-tonic:latest vllm

# Run training via Docker Compose
docker-compose up kernel-tonic-train

# Start the vLLM server in the background
docker-compose up -d kernel-tonic-vllm

# Export the model for vLLM
docker-compose up kernel-tonic-export

The project includes a monolithic kernel-builder setup in kernels/monolithic/:
# Kernel builder configuration
kernels/monolithic/kernel_builder.yaml

# Custom HIP kernels
kernels/monolithic/kernel.hip

# Python interface
kernels/monolithic/__init__.py

To add a new kernel:
- Add the kernel function to kernel.hip
- Update kernel_builder.yaml
- Add a Python interface in __init__.py
- Integrate it in src/model/kernel_integration.py
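A hypothetical sketch of the Python side of such an integration, using `torch.utils.cpp_extension.load` to JIT-compile a HIP source (on ROCm builds of PyTorch, `.hip` files are compiled with the HIP toolchain). The module and function names below are assumptions; the real interface generated by kernel-builder may look different.

```python
# kernels/monolithic/__init__.py -- illustrative sketch only
import os

import torch
from torch.utils.cpp_extension import load

_here = os.path.dirname(__file__)

# JIT-compile the HIP source; on ROCm builds of PyTorch this invokes
# the HIP compiler. Assumes kernel.hip defines pybind11 bindings.
_ops = load(
    name="kernel_tonic_monolithic",
    sources=[os.path.join(_here, "kernel.hip")],
    verbose=True,
)

def fused_silu(x: torch.Tensor) -> torch.Tensor:
    # `fused_silu` is a hypothetical entry point exported by kernel.hip.
    return _ops.fused_silu(x)
```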
The training uses the Colossal OSCAR 1.0 dataset with all languages. The dataset is streamed automatically from the Hugging Face Hub during training rather than downloaded in full.
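A minimal sketch of how the streaming can be done with the `datasets` library; the dataset ID and whether a per-language or snapshot config is required are assumptions, so check the dataset card for the exact arguments.

```python
import os

from datasets import load_dataset

# Stream Colossal OSCAR 1.0 rather than downloading it in full.
# The dataset is gated, so HF_TOKEN must be set. A snapshot/language
# config may be required; see the dataset card for exact names.
ds = load_dataset(
    "oscar-corpus/colossal-oscar-1.0",
    split="train",
    streaming=True,
    token=os.environ["HF_TOKEN"],
)

# Peek at a few documents without materializing the dataset.
for example in ds.take(3):
    print(example["text"][:200])
```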
- Small: 125M parameters (768 hidden, 12 layers)
- Medium: 1.3B parameters (1536 hidden, 24 layers)
- Large: 7B parameters (4096 hidden, 32 layers)
- XLarge: 13B parameters (5120 hidden, 40 layers)
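For reference, the small preset might be expressed as a config like the sketch below. The field names, head count, vocabulary size, and sequence length are assumptions; the authoritative values live in configs/model_config.yaml.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # "Small" preset: ~125M parameters (768 hidden, 12 layers).
    hidden_size: int = 768
    num_layers: int = 12
    num_heads: int = 12       # assumed; not stated in the size table
    vocab_size: int = 50304   # assumed; not stated in the size table
    max_seq_len: int = 2048   # assumed; not stated in the size table

small = ModelConfig()
medium = ModelConfig(hidden_size=1536, num_layers=24)  # ~1.3B parameters
```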
Expected performance characteristics on MI300X, based on AMD ROCm documentation:
- Throughput: Optimized for high token generation rates
- Latency: Low inference latency for real-time applications
- Memory Efficiency: FP8 quantization for reduced memory footprint
- Scalability: Multi-GPU support for larger models
Once the vLLM server is running, you can make requests:
import requests
# Text completion
response = requests.post("http://localhost:8000/v1/completions", json={
    "model": "kernel-tonic",
    "prompt": "The future of AI is",
    "max_tokens": 100,
    "temperature": 0.7
})
print(response.json()["choices"][0]["text"])
# Chat completion
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "kernel-tonic",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
})
print(response.json()["choices"][0]["message"]["content"])

Common issues and fixes:

- HF_TOKEN not set:
  export HF_TOKEN=your_token_here
- GPU not detected:
  # Check ROCm installation
  rocm-smi
- vLLM server not starting:
  # Check logs
  docker-compose logs kernel-tonic-vllm
- Out of memory:
  - Reduce the batch size
  - Use a smaller model configuration
  - Enable gradient checkpointing
[Add your license information here]
[Add contribution guidelines here]
- Launch an MI300X instance on RunPod.io.
- SSH into your instance and clone your repo:
  git clone <your-repo-url>
  cd kernel_tonic
- Set your Hugging Face token:
  export HF_TOKEN=your_hf_token_here
- Run the training pipeline:
  chmod +x scripts/run_training.sh
  ./scripts/run_training.sh
Push Model:
python scripts/push_model_to_hub.py

Push Kernel:
python scripts/push_kernel_to_hub.py
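A hedged sketch of what scripts/push_model_to_hub.py might do with the `huggingface_hub` library; the repo ID and folder path are placeholders, not values taken from the scripts.

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment

# Repo ID and folder path are placeholders for illustration.
api.create_repo("your-username/kernel-tonic", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="models/kernel-tonic",
    repo_id="your-username/kernel-tonic",
    repo_type="model",
    commit_message="Upload kernel-tonic model exported for vLLM",
)
```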
- Build the Docker image:
  docker build -t kernel-tonic:latest .
- Run the vLLM server:
  docker run --gpus all -p 8000:8000 \
    -v $(pwd)/models:/workspace/models \
    kernel-tonic:latest vllm
- Test inference:
  python scripts/test_inference.py