Feature Request: bitsandbytes 4-bit quantization for LM to enable 4B model on 16GB GPUs #428

@Arjuna66671

Description

Hardware Setup

GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
RAM: 32GB DDR5
OS: Pop!_OS (Ubuntu-based)
CUDA: 13.0, Driver 580.119.02

Current Situation
With --offload_to_cpu true, the 1.7B LM + turbo DiT works perfectly. The 1.7B LM + SFT DiT also works, but requires a restart between generations because memory fragmentation makes the VAE decode OOM by only ~44-50MB.
The 4B LM fails with both the turbo and SFT DiT, OOMing during model loading or generation. Its weights alone are ~8.4GB in bf16/fp16, which, together with the DiT (~4.8GB), the VAE (~337MB), and runtime overhead (attention caches, latent tensors, CUDA graphs, etc.), exceeds 16GB even with CPU offloading.
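For the fragmentation case (not the 4B OOM), one possible stopgap is explicitly releasing cached allocator blocks between pipeline stages; a minimal sketch, assuming a hook point right before VAE decode is available (the helper name below is illustrative, not ACE-Step code):

```python
import gc
import torch

def release_cached_vram() -> None:
    """Return cached-but-unused allocator blocks to the driver.

    This does not compact live tensors, but it can sometimes recover enough
    headroom for a small (~44-50 MiB) allocation like the one failing in
    VAE decode.
    """
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

# e.g. call release_cached_vram() after LM generation and after DiT sampling,
# right before the VAE decode step
```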
The Opportunity
Quantized to 4-bit via bitsandbytes, the 4B LM would shrink to roughly 3-4GB, fitting comfortably on 16GB alongside the DiT and VAE. Quality loss from 4-bit quantization on a 4B-parameter model is typically negligible.
16GB GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti, RTX A4000, etc.) are a large segment of the consumer market. Enabling the 4B LM on these cards would be a significant quality uplift for many users.
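Rough arithmetic behind the ~3-4GB figure (an estimate, not a measurement): NF4 stores ~4 bits per weight plus a small per-block overhead for quantization constants, while embeddings and norm layers typically stay in higher precision.

```python
# Back-of-envelope NF4 footprint for a ~4B-parameter LM (rough estimate only)
params = 4.0e9
bf16_gb = params * 2.0 / 1024**3          # 2 bytes per parameter
nf4_gb = params * (0.5 + 0.05) / 1024**3  # 4-bit weights + quantization constants
print(f"bf16: ~{bf16_gb:.1f} GB, NF4: ~{nf4_gb:.1f} GB")
# bf16: ~7.5 GB, NF4: ~2.0 GB; non-quantized layers push the real figure toward 3 GB
```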
Suggested Implementation
The simplest path would be adding a --lm_quantize 4bit flag that loads the LM via bitsandbytes:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights, with bf16 compute for the non-quantized operations
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    trust_remote_code=True,
)
```
This would apply to the PyTorch fallback path in llm_inference.py. The nanovllm/vLLM path could remain unchanged for users with larger GPUs.
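A minimal sketch of how that could look in the fallback loader, gating the config above behind the proposed flag (the function and parameter names here are illustrative, not actual llm_inference.py identifiers):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_lm_fallback(model_path: str, lm_quantize: str = "none"):
    # Hypothetical helper: "lm_quantize" mirrors the proposed --lm_quantize flag
    quantization_config = None
    if lm_quantize == "4bit":
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
        quantization_config=quantization_config,
        trust_remote_code=True,
    )
```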
Alternative approaches

GGUF support via llama.cpp: more work, but it would also enable CPU-only LM inference
Smarter sequential offloading: fully unload the DiT before VAE decode, and vice versa
CPU-based VAE decode: the VAE is only ~337MB, so decoding on CPU would be slower but viable for users who prefer quality over speed (see the sketch after this list)
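For the CPU-decode option specifically, the change is conceptually small; a hedged sketch, assuming the pipeline exposes the VAE module and the latent tensor under some names (the ones below are illustrative):

```python
import torch

def decode_on_cpu(vae: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    # Illustrative only: assumes the VAE exposes a decode() method, as most
    # diffusers-style autoencoders do.
    vae_cpu = vae.to(device="cpu", dtype=torch.float32)  # fp32 is usually fastest on CPU
    with torch.no_grad():
        return vae_cpu.decode(latents.to(device="cpu", dtype=torch.float32))
```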

OOM Traces for Reference
4B LM + turbo DiT (OOM during generation)
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 190.25 MiB is free.
Including non-PyTorch memory, this process has 14.50 GiB memory in use.
```
1.7B LM + SFT DiT (OOM during VAE decode — only 44MB short)
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 50.06 MiB is free.
Including non-PyTorch memory, this process has 14.64 GiB memory in use.
```
Environment

ACE-Step 1.5 (latest main branch as of 2026-02-10)
PyTorch 2.10.0+cu128
Python 3.11
Launched with: PYTORCH_ALLOC_CONF=expandable_segments:True uv run acestep --init_service true --config_path acestep-v15-turbo --lm_model_path acestep-5Hz-lm-4B --offload_to_cpu true

Thank you for this incredible project — running commercial-grade music generation locally is genuinely exciting. This optimization would make the best quality config accessible to the large 16GB GPU user base.
