Feature Request: bitsandbytes 4-bit quantization for LM to enable 4B model on 16GB GPUs #428

@Arjuna66671

Description

Hardware Setup

GPU: NVIDIA GeForce RTX 5060 Ti (16GB VRAM)
RAM: 32GB DDR5
OS: Pop!_OS (Ubuntu-based)
CUDA: 13.0, Driver 580.119.02

Current Situation
With --offload_to_cpu true, the 1.7B LM + turbo DiT works perfectly. The 1.7B LM + SFT DiT also works, but requires a restart between generations because memory fragmentation makes the VAE decode OOM by only ~44-50MB.
The 4B LM fails with both the turbo and SFT DiT, OOMing during model loading or generation. Its weights alone are ~8.4GB in bf16/fp16, which, together with the DiT (~4.8GB), the VAE (~337MB), and runtime overhead (attention caches, latent tensors, CUDA graphs, etc.), exceeds 16GB even with CPU offloading.
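For the fragmentation case (not the 4B OOM), one possible stopgap is explicitly releasing cached allocator blocks between pipeline stages; a minimal sketch, assuming a hook point right before VAE decode is available (the helper name below is illustrative, not ACE-Step code):

```python
import gc
import torch

def release_cached_vram() -> None:
    """Return cached-but-unused allocator blocks to the driver.

    This does not compact live tensors, but it can sometimes recover enough
    headroom for a small (~44-50 MiB) allocation like the one failing in
    VAE decode.
    """
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()

# e.g. call release_cached_vram() after LM generation and after DiT sampling,
# right before the VAE decode step
```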
The Opportunity
Quantized to 4-bit via bitsandbytes, the 4B LM would shrink to roughly 3-4GB, fitting comfortably on 16GB alongside the DiT and VAE. Quality loss from 4-bit quantization on a 4B-parameter model is typically negligible.
16GB GPUs (RTX 4060 Ti 16GB, RTX 5060 Ti, RTX A4000, etc.) are a large segment of the consumer market. Enabling the 4B LM on these cards would be a significant quality uplift for many users.
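Rough arithmetic behind the ~3-4GB figure (an estimate, not a measurement): NF4 stores ~4 bits per weight plus a small per-block overhead for quantization constants, while embeddings and norm layers typically stay in higher precision.

```python
# Back-of-envelope NF4 footprint for a ~4B-parameter LM (rough estimate only)
params = 4.0e9
bf16_gb = params * 2.0 / 1024**3          # 2 bytes per parameter
nf4_gb = params * (0.5 + 0.05) / 1024**3  # 4-bit weights + quantization constants
print(f"bf16: ~{bf16_gb:.1f} GB, NF4: ~{nf4_gb:.1f} GB")
# bf16: ~7.5 GB, NF4: ~2.0 GB; non-quantized layers push the real figure toward 3 GB
```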
Suggested Implementation
The simplest path would be adding a --lm_quantize 4bit flag that loads the LM via bitsandbytes:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights, with bf16 compute for the non-quantized operations
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    trust_remote_code=True,
)
```
This would apply to the PyTorch fallback path in llm_inference.py. The nanovllm/vLLM path could remain unchanged for users with larger GPUs.
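A minimal sketch of how that could look in the fallback loader, gating the config above behind the proposed flag (the function and parameter names here are illustrative, not actual llm_inference.py identifiers):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_lm_fallback(model_path: str, lm_quantize: str = "none"):
    # Hypothetical helper: "lm_quantize" mirrors the proposed --lm_quantize flag
    quantization_config = None
    if lm_quantize == "4bit":
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
        quantization_config=quantization_config,
        trust_remote_code=True,
    )
```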
Alternative approaches

GGUF support via llama.cpp: more work, but it would also enable CPU-only LM inference
Smarter sequential offloading: fully unload the DiT before VAE decode, and vice versa
CPU-based VAE decode: the VAE is only ~337MB, so decoding on CPU would be slower but viable for users who prefer quality over speed (see the sketch after this list)
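For the CPU-decode option specifically, the change is conceptually small; a hedged sketch, assuming the pipeline exposes the VAE module and the latent tensor under some names (the ones below are illustrative):

```python
import torch

def decode_on_cpu(vae: torch.nn.Module, latents: torch.Tensor) -> torch.Tensor:
    # Illustrative only: assumes the VAE exposes a decode() method, as most
    # diffusers-style autoencoders do.
    vae_cpu = vae.to(device="cpu", dtype=torch.float32)  # fp32 is usually fastest on CPU
    with torch.no_grad():
        return vae_cpu.decode(latents.to(device="cpu", dtype=torch.float32))
```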

OOM Traces for Reference
4B LM + turbo DiT (OOM during generation)
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 190.25 MiB is free.
Including non-PyTorch memory, this process has 14.50 GiB memory in use.
```
1.7B LM + SFT DiT (OOM during VAE decode — only 44MB short)
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB.
GPU 0 has a total capacity of 15.45 GiB of which 50.06 MiB is free.
Including non-PyTorch memory, this process has 14.64 GiB memory in use.
```
Environment

ACE-Step 1.5 (latest main branch as of 2026-02-10)
PyTorch 2.10.0+cu128
Python 3.11
Launched with: PYTORCH_ALLOC_CONF=expandable_segments:True uv run acestep --init_service true --config_path acestep-v15-turbo --lm_model_path acestep-5Hz-lm-4B --offload_to_cpu true

Thank you for this incredible project — running commercial-grade music generation locally is genuinely exciting. This optimization would make the best quality config accessible to the large 16GB GPU user base.
