MLMF (Machine Learning Model Files) is a Rust crate for working with ML model files. It provides loading, saving, conversion, and dynamic tensor-name mapping for transformer models across multiple formats, including SafeTensors, GGUF, ONNX, PyTorch, and AWQ. Instead of per-format boilerplate, it exposes one unified, efficient API for model file operations.
- 🏗️ Architecture Detection: Automatically detects model architecture (LLaMA, GPT-2, GPT-NeoX) from tensor names (see the sketch after this list)
- 📦 Multiple Formats: Comprehensive support for SafeTensors, GGUF, ONNX, PyTorch, and AWQ formats
- 🗺️ Name Mapping: Intelligent tensor name mapping between HuggingFace and custom formats
- 💾 Memory Efficient: Memory-mapped loading for large models (30GB+)
- ⚡ Quantization: Advanced post-training quantization with multiple schemes (INT8, INT4, Mixed)
- 🔧 Device Management: Automatic CUDA detection with CPU fallback
- 📊 Progress Reporting: Optional progress callbacks for long-running operations
- 🛡️ Type Safety: Comprehensive error handling with detailed context
- 🔄 Model Conversion: Direct format conversion with batch processing and progress tracking
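For intuition, detection can key off the tensor-name prefixes each family uses on HuggingFace (`model.layers.` for LLaMA-style models, `transformer.h.` for GPT-2, `gpt_neox.layers.` for GPT-NeoX). The following is a hypothetical sketch of the idea, not MLMF's actual implementation:

```rust
/// Hypothetical illustration of prefix-based architecture detection;
/// MLMF's real detector is more thorough than this.
#[derive(Debug, PartialEq)]
enum Arch {
    LLaMA,
    Gpt2,
    GptNeoX,
    Unknown,
}

fn detect_arch(tensor_names: &[String]) -> Arch {
    // Standard HuggingFace prefixes for each family.
    if tensor_names.iter().any(|n| n.starts_with("model.layers.")) {
        Arch::LLaMA
    } else if tensor_names.iter().any(|n| n.starts_with("transformer.h.")) {
        Arch::Gpt2
    } else if tensor_names.iter().any(|n| n.starts_with("gpt_neox.layers.")) {
        Arch::GptNeoX
    } else {
        Arch::Unknown
    }
}
```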
Add `mlmf` to your `Cargo.toml`:

```toml
[dependencies]
mlmf = { git = "https://github.com/CireSnave/mlmf", tag = "v0.2.1" }
```

Then load a model:

```rust
use mlmf::{LoadOptions, loader};
use candlelight::{Device, DType};
// Load a LLaMA model from SafeTensors
let device = Device::cuda_if_available(0).unwrap_or(Device::Cpu);
let options = LoadOptions {
device: device.clone(),
dtype: DType::F16,
use_mmap: true,
validate_cuda: false,
progress: Some(mlmf::progress::default_progress()),
};
let loaded_model = loader::load_safetensors("./models/llama-7b", options)?;
// Access components
let var_builder = loaded_model.var_builder;
let config = loaded_model.config;
let name_mapper = loaded_model.name_mapper;
// Use name mapper to convert HF names to your format
if let Some(mapped_name) = name_mapper.map_name("model.layers.0.self_attn.q_proj.weight") {
println!("Mapped name: {}", mapped_name);
}
```

MLMF can detect a model's architecture directly from its tensor names:

```rust
use mlmf::name_mapping::{TensorNameMapper, Architecture};
let tensor_names = vec![
"model.embed_tokens.weight".to_string(),
"model.layers.0.self_attn.q_proj.weight".to_string(),
"model.norm.weight".to_string(),
];
let mapper = TensorNameMapper::from_tensor_names(&tensor_names)?;
assert_eq!(mapper.architecture(), Architecture::LLaMA);
```
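Under the hood, a name mapping is essentially a suffix table keyed by layer index. As a hypothetical illustration (the target names here follow the GGUF/llama.cpp convention, not necessarily MLMF's own tables):

```rust
/// Hypothetical HF -> GGUF-style name mapping for a few LLaMA tensors.
fn map_hf_to_gguf(name: &str) -> Option<String> {
    if name == "model.embed_tokens.weight" {
        return Some("token_embd.weight".to_string());
    }
    if name == "model.norm.weight" {
        return Some("output_norm.weight".to_string());
    }
    // "model.layers.{i}.<suffix>" -> "blk.{i}.<mapped suffix>"
    let rest = name.strip_prefix("model.layers.")?;
    let (idx, suffix) = rest.split_once('.')?;
    let mapped = match suffix {
        "self_attn.q_proj.weight" => "attn_q.weight",
        "self_attn.k_proj.weight" => "attn_k.weight",
        "self_attn.v_proj.weight" => "attn_v.weight",
        "self_attn.o_proj.weight" => "attn_output.weight",
        "mlp.gate_proj.weight" => "ffn_gate.weight",
        "mlp.up_proj.weight" => "ffn_up.weight",
        "mlp.down_proj.weight" => "ffn_down.weight",
        _ => return None,
    };
    Some(format!("blk.{idx}.{mapped}"))
}
```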
Models can be converted directly between formats:

```rust
use mlmf::conversion::{convert_model, ConversionFormat, ConversionOptions};
use std::path::Path;
// Convert from SafeTensors to ONNX
let options = ConversionOptions::default();
let result = convert_model(
Path::new("model.safetensors"),
Path::new("model.onnx"),
ConversionFormat::ONNX,
options,
)?;
println!("Conversion completed in {:.2}s", result.duration.as_secs_f64());use mlmf::quantization::{QuantizationConfig, QuantizationEngine, QuantizationType, CalibrationMethod};
For post-training quantization, configure a `QuantizationEngine` (reusing the `device` from the loading example above):

```rust
use mlmf::quantization::{QuantizationConfig, QuantizationEngine, QuantizationType, CalibrationMethod};

// Configure quantization
let config = QuantizationConfig {
quantization_type: QuantizationType::Int8,
calibration_method: CalibrationMethod::KlDivergence,
calibration_samples: 256,
block_wise: true,
symmetric: true,
..Default::default()
};
// Create quantization engine
let engine = QuantizationEngine::new(config, device)?;
// Quantize a loaded model (placeholder - requires actual model)
// let quantized_model = engine.quantize_model(&loaded_model, progress_callback)?;
```
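For intuition: with `symmetric: true` and `block_wise: true`, symmetric INT8 quantization stores each block of weights as `i8` values plus one `f32` scale, where `scale = max|w| / 127` and `q = round(w / scale)`. A minimal, self-contained sketch of the math (independent of MLMF's engine):

```rust
/// Symmetric per-block INT8 quantization:
/// scale = max|w| / 127, q = round(w / scale) clamped to [-127, 127].
fn quantize_block_int8(block: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = block.iter().fold(0f32, |m, &w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = block
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Reconstruct approximate f32 weights from the quantized block.
fn dequantize_block(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```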
MLMF provides a modular architecture with the following components:

- `loader`: High-level loading API
- `conversion`: Direct model format conversion with batch processing
- `name_mapping`: Architecture detection and tensor name mapping
- `config`: HuggingFace config parsing with field aliases
- `formats`: Format-specific loaders and exporters (SafeTensors, GGUF, ONNX, PyTorch, AWQ)
- `validation`: CUDA validation and dtype checking
- `progress`: Progress reporting utilities
It recognizes the following model families:

- LLaMA Family: LLaMA 2/3, TinyLlama, Qwen, Mistral
- GPT Family: GPT-2, GPT-J
- GPT-NeoX Family: GPT-NeoX, Pythia, StableLM
See the examples/ directory for complete working examples:
- `load_llama.rs` - Loading LLaMA models from SafeTensors
- `advanced_quantization.rs` - Advanced quantization API usage
- `test_gguf_loading.rs` - Loading quantized GGUF models
- `pytorch_support_example.rs` - Loading PyTorch models
- `onnx_export_example.rs` - Exporting models to ONNX format
- `multimodal_demo.rs` - Multi-modal model handling
MLMF is optimized for performance:
- Memory-mapped loading: Loads 70B models (130GB) in ~10 seconds (see the sketch after this list)
- Architecture detection: Typically completes in <100ms
- Zero-copy: Direct tensor access without unnecessary copying
- Incremental builds: Changes compile in <10 seconds
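The mmap numbers follow from the file being mapped rather than read eagerly: only the pages actually touched are faulted in from disk. A minimal sketch using the `memmap2` crate (an assumption about the mechanism, not necessarily MLMF's code) that maps a SafeTensors file and reads its header length, which the SafeTensors spec stores in the first 8 bytes as a little-endian u64:

```rust
use std::fs::File;
use memmap2::Mmap;

/// Map a SafeTensors file and return its JSON header length
/// without reading the (possibly huge) tensor data.
fn peek_safetensors_header_len(path: &str) -> std::io::Result<u64> {
    let file = File::open(path)?;
    // SAFETY: the underlying file must not be truncated while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    let bytes: [u8; 8] = mmap[..8].try_into().expect("file shorter than 8 bytes");
    Ok(u64::from_le_bytes(bytes))
}
```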
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.