This library provides efficient CPU inference for BitMamba-2 models with 255M and 1B parameters, with support for quantization and BitNet-optimized architectures.
Supported: Intel Haswell (4th Gen) or newer, AMD Ryzen.
Not currently supported: ARM devices (Raspberry Pi, Apple Silicon, Android); these would require a port to NEON intrinsics.
Note: the architecture's efficiency (around 250 MB of RAM) makes it theoretically ideal for edge devices, but this specific demo code is optimized for x86 desktops.
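If you are unsure whether your CPU qualifies, note that the Haswell/Ryzen requirement suggests the x86 kernels rely on AVX2; the snippet below is a quick Linux-only check for that flag (an illustrative sketch, not part of this repository).

```python
# Illustrative sketch, not part of this repo: check /proc/cpuinfo for AVX2.
# Assumption: the Haswell/Ryzen requirement above implies AVX2 is needed.
with open("/proc/cpuinfo") as f:
    flags = f.read()

print("AVX2 available" if "avx2" in flags else "AVX2 not found; this demo may not run")
```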
Use the scripts/export_bin.py script to convert your PyTorch/JAX checkpoints to the optimized C++ binary format.
Arguments:
- `--version`: Model version to export (`1b` or `250m`).
- `--ckpt_path`: Path to the checkpoint file (`.msgpack`).
- `--output_name`: Output binary filename.
```bash
python3 scripts/export_bin.py --version 1b --ckpt_path ./bitmamba_1b.msgpack --output_name bitmamba_1b.bin
python3 scripts/export_bin.py --version 250m --ckpt_path ./bitmamba_250m.msgpack --output_name bitmamba_250m.bin
```

Ensure you have CMake installed (`sudo apt install cmake` or equivalent).
```bash
cmake -B build
cmake --build build
```

The executable will be located at `build/bitmamba`.
If you prefer g++:
```bash
g++ -O3 -march=native -fopenmp -Iinclude -Isrc -o bitmamba examples/main.cpp src/*.cpp
```
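Because the code is compiled with OpenMP (`-fopenmp`), the number of inference threads can usually be controlled with the standard `OMP_NUM_THREADS` environment variable when launching the executable.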
Download the pre-built binary models from Hugging Face.

BitMamba-2 1B:

```bash
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/bitmamba_cpp/bitmamba_1b.bin
```

BitMamba-2 0.25B:
```bash
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/bitmamba_cpp/bitmamba_255m.bin
```

Once you have a binary model (`.bin`) and the compiled executable, you can run inference.
Usage:

```bash
./build/bitmamba <model.bin> "<prompt_tokens>" <mode> <temp> <repeat_penalty> <top_p> <top_k> <max_tokens>
```

Tokenizer mode:
./build/bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200Raw mode:
./build/bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200Tokenizer mode:
./bitmamba bitmamba_1b.bin "Hello, I am" tokenizer 0.7 1.1 0.05 0.9 40 200Raw mode:
./bitmamba bitmamba_1b.bin "15496 11 314 716" raw 0.7 1.1 0.05 0.9 40 200This command runs the bitmamba_1b.bin model with a tokenized prompt, temperature 0.7, repetition penalty 1.1, generating 200 tokens.
If you use raw mode, you can use the scripts/decoder.py script to convert token IDs back into text.
Usage:
python scripts/decoder.py "tokens"Example:
python scripts/decoder.py "15496 11 314 716"- Future Work: Add ARM/NEON support for Raspberry Pi deployment.
Use the scripts/fast_inference.py script to evaluate the models:
Weights for 250M version:
```bash
wget https://huggingface.co/Zhayr1/BitMamba-2-0.25B/resolve/main/jax_weights/bitmamba_255m.msgpack
```

Weights for 1B version:
```bash
wget https://huggingface.co/Zhayr1/BitMamba-2-1B/resolve/main/jax_weights/bit_mamba_1b.msgpack
```

Then run the evaluation:

```bash
python scripts/fast_inference.py --ckpt bitmamba_255m.msgpack --version 250m --eval
python scripts/fast_inference.py --ckpt bit_mamba_1b.msgpack --version 1b --eval
```
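The `.msgpack` checkpoints are most likely Flax-serialized parameter trees, so these scripts presumably require a working JAX/Flax environment. To inspect a checkpoint directly, a minimal sketch (assuming standard Flax serialization; not part of the repo's scripts) is:

```python
# Illustrative sketch: restore a Flax-serialized checkpoint and list its
# top-level keys. Assumption: the .msgpack files are plain Flax parameter trees.
from flax import serialization

with open("bitmamba_255m.msgpack", "rb") as f:
    params = serialization.msgpack_restore(f.read())

print(list(params.keys()))
```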