Feature Request: bitsandbytes 4-bit quantization for LM to enable 4B model on 16GB GPUs
Hardware Setup
GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
RAM: 32GB DDR5
OS: Pop!_OS (Ubuntu-based)
CUDA: 13.0, Driver 580.119.02
Current Situation
With --offload_to_cpu true, the 1.7B LM + turbo DiT works perfectly. The 1.7B LM + SFT DiT also works, but requires a restart between generations because memory fragmentation causes the VAE decode to OOM by ~44-50 MB.
The 4B LM fails on both turbo and SFT, OOMing during model loading or generation. The 4B model weights are ~8.4GB in bf16/fp16, which, alongside the DiT (~4.8GB) and VAE (~337MB), exceeds 16GB even with CPU offloading once runtime overhead (attention caches, latent tensors, CUDA graphs, etc.) is accounted for.
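For reference, a minimal sketch (plain torch.cuda calls, nothing project-specific) of the allocated-vs-reserved check that shows how close the reserved pool sits to the ceiling before the second generation's VAE decode:

```python
import torch

def report_vram(tag: str) -> None:
    """Print allocated vs. reserved VRAM; a large gap suggests fragmentation."""
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    total = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"[{tag}] allocated={alloc:.2f} GiB  reserved={reserved:.2f} GiB  total={total:.2f} GiB")

# Between generations: release cached-but-unused blocks back to the driver.
# This does not help when the memory is genuinely in use, but it shows how
# little headroom remains before the VAE decode step.
report_vram("before cleanup")
torch.cuda.empty_cache()
report_vram("after empty_cache")
```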
The Opportunity
The 4B LM quantized to 4-bit via bitsandbytes would be ~3-4GB — comfortably fitting on 16GB alongside the DiT and VAE. Quality loss from 4-bit quantization on a 4B parameter model is typically negligible.
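Back-of-the-envelope numbers behind the 3-4GB estimate (the per-parameter costs and the share of layers left unquantized are assumptions, not measurements):

```python
# Rough VRAM estimate for a 4B-parameter LM under NF4 4-bit quantization.
# Approximations: NF4 packs ~0.5 bytes/param, plus quantization constants
# (one fp32 absmax per 64-param block), plus some layers typically kept in
# bf16 (embeddings / lm_head), assumed here to be ~0.5B params.
params = 4.0e9
nf4_weights_gib  = params * 0.5 / 1024**3      # ~1.9 GiB packed 4-bit weights
quant_consts_gib = params / 64 * 4 / 1024**3   # ~0.2 GiB absmax constants
unquantized_gib  = 0.5e9 * 2 / 1024**3         # ~0.9 GiB assumed bf16 layers
total_gib = nf4_weights_gib + quant_consts_gib + unquantized_gib
print(f"~{total_gib:.1f} GiB quantized weights vs ~{params * 2 / 1024**3:.1f} GiB in bf16")
# => roughly 3 GiB vs ~7.5 GiB, before KV cache and activations
```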
16GB GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti, RTX A4000, etc.) are a large segment of the consumer market. Enabling the 4B LM on these cards would be a significant quality uplift for many users.
Suggested Implementation
The simplest path would be adding a --lm_quantize 4bit flag that loads the LM via bitsandbytes:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bf16 compute, the standard bitsandbytes recipe
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    trust_remote_code=True,
)
```
This would apply to the PyTorch fallback path in llm_inference.py. The nanovllm/vLLM path could remain unchanged for users with larger GPUs.
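For concreteness, a minimal sketch of how the flag could be wired into that fallback path; the --lm_quantize argument, build_quantization_config helper, and load_lm wrapper are hypothetical names for illustration, not existing ACE-Step code:

```python
import argparse
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def build_quantization_config(mode: str | None) -> BitsAndBytesConfig | None:
    """Map the hypothetical --lm_quantize value to a bitsandbytes config."""
    if mode == "4bit":
        return BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    if mode == "8bit":
        return BitsAndBytesConfig(load_in_8bit=True)
    return None  # default: unquantized bf16, i.e. current behaviour

def load_lm(model_path: str, mode: str | None):
    """PyTorch-fallback loader; the nanovllm/vLLM path would be untouched."""
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=build_quantization_config(mode),
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lm_model_path", default="acestep-5Hz-lm-4B")
    parser.add_argument("--lm_quantize", choices=["4bit", "8bit"], default=None)
    args = parser.parse_args()
    lm = load_lm(args.lm_model_path, args.lm_quantize)
```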
Alternative approaches
GGUF support via llama.cpp — more work but would also enable CPU-only LM inference
Smarter sequential offloading — fully unload the DiT before VAE decode, and vice versa (a rough sketch follows this list)
CPU-based VAE decode — the VAE is only ~337MB; decoding on CPU would be slower but viable for users who prefer quality over speed (also covered in the sketch below)
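A minimal sketch of what the second and third options could look like, assuming the pipeline exposes the DiT and VAE as module attributes; pipeline.dit and pipeline.vae are placeholder names, not the project's actual API:

```python
import torch

def decode_with_offload(pipeline, latents, vae_on_cpu: bool = False):
    """Free the DiT's VRAM before VAE decode; optionally decode entirely on CPU.

    pipeline.dit / pipeline.vae are placeholder attribute names.
    """
    # 1. Fully unload the DiT before touching the VAE.
    pipeline.dit.to("cpu")
    torch.cuda.empty_cache()

    if vae_on_cpu:
        # 2a. CPU decode: slow, but the ~337 MB VAE plus its decode
        #     activations never touch VRAM at all.
        vae = pipeline.vae.to("cpu", dtype=torch.float32)
        with torch.no_grad():
            audio = vae.decode(latents.to("cpu", dtype=torch.float32))
    else:
        # 2b. GPU decode in the VRAM just released by the DiT.
        vae = pipeline.vae.to("cuda")
        with torch.no_grad():
            audio = vae.decode(latents.to("cuda"))
        vae.to("cpu")
        torch.cuda.empty_cache()

    # 3. The DiT is reloaded lazily, only when the next generation needs it.
    return audio
```

Even the GPU-decode branch might be enough to close the ~44 MB gap in the SFT trace below, since the DiT's memory is released before the VAE allocates its own.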
OOM Traces for Reference
4B LM + turbo DiT (OOM during generation)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 190.25 MiB is free.
Including non-PyTorch memory, this process has 14.50 GiB memory in use.
1.7B LM + SFT DiT (OOM during VAE decode — only 44MB short)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 50.06 MiB is free.
Including non-PyTorch memory, this process has 14.64 GiB memory in use.
Environment
ACE-Step 1.5 (latest main branch as of 2026-02-10)
PyTorch 2.10.0+cu128
Python 3.11
Launched with: `PYTORCH_ALLOC_CONF=expandable_segments:True uv run acestep --init_service true --config_path acestep-v15-turbo --lm_model_path acestep-5Hz-lm-4B --offload_to_cpu true`
Thank you for this incredible project — running commercial-grade music generation locally is genuinely exciting. This optimization would make the best quality config accessible to the large 16GB GPU user base.