Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Weiming Hu, Zhipeng Zhang
*This work was completed during Hanshi’s remote internship at AutoLab, SJTU.
[2025.9.18] AutoPrune is accepted by NeurIPS 2025.
The established redundancy of visual tokens in large vision-language models (LVLMs) allows pruning to effectively reduce their substantial computational demands. Empirical evidence from previous works indicates that visual tokens in later decoder layers receive less attention than those in shallow layers. Previous methods therefore typically employ heuristic, layer-specific pruning strategies in which, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in LVLMs. This observation strongly suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, and then projects this signal onto a budget-constrained logistic retention curve. Each such curve, defined by its unique shape, is shown to correspond to the specific complexity of different tasks while guaranteeing adherence to a pre-defined computational constraint. We evaluate AutoPrune not only on standard vision-language tasks but also on Vision-Language-Action (VLA) models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while still retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop (CVPR 2025), demonstrating its effectiveness.
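As a rough illustration (consistent with the hyperparameters documented in the usage section below, not necessarily the exact implementation), the retained fraction of visual tokens at decoder layer $l$ can be pictured as a logistic curve whose slope adapts to the measured mutual information (MI):

$$
r(l) = \frac{1}{1 + e^{\,k(\mathrm{MI})\,(l - x_0)}}, \qquad k(\mathrm{MI}) = \max\big(k_0 - \gamma \cdot \mathrm{MI},\ 0\big),
$$

with the curve then constrained so that the total number of retained tokens meets the pre-defined budget.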
- Clone this repository.

```bash
git clone https://github.com/AutoLab-SAI-SJTU/AutoPrune.git
cd AutoPrune
```

- Install necessary packages.

```bash
conda create -n AutoPrune python=3.10 -y
conda activate AutoPrune
pip install -e .
```

- (Optional) Install FlashAttention for further inference acceleration.

```bash
pip install flash-attn --no-build-isolation
```

Download the corresponding LLaVA checkpoints from Hugging Face 🤗 (an example download snippet follows the table):
| Version | LLM | Checkpoint |
|---|---|---|
| LLaVA-1.5 | Vicuna-7B | liuhaotian/llava-v1.5-7b |
| LLaVA-1.5 | Vicuna-13B | liuhaotian/llava-v1.5-13b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-7B | liuhaotian/llava-v1.6-vicuna-7b |
| LLaVA-1.6 (LLaVA-NeXT) | Vicuna-13B | liuhaotian/llava-v1.6-vicuna-13b |
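For reference, a minimal way to fetch a checkpoint programmatically, assuming you keep weights under `./models/` as in the evaluation command below (any other Hugging Face download method works just as well):

```python
from huggingface_hub import snapshot_download

# Download LLaVA-1.5-7B into ./models/ (path assumption matching the
# --model-path used in the evaluation example below).
snapshot_download(repo_id="liuhaotian/llava-v1.5-7b",
                  local_dir="./models/llava-v1.5-7b")
```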
Download each dataset according to EVAL.md.
Using TextVQA as an example (`scripts/v1_5/eval/textvqa.sh`), inference is controlled by a few hyperparameters that shape the visual-token retention curve:

- `--visual-token-num`: Initial number of visual tokens produced by the vision tower (LLaVA-1.5: 576; LLaVA-1.6: 2880). This is an upper bound; pruning will dynamically reduce it.
- `--target-token-num`: Target visual-token budget. In the scripts, the first positional argument `TOKEN` is passed here. Smaller values prune more aggressively; larger values keep more tokens.
- `--x0`: Horizontal shift of the logistic retention curve. Increasing `x0` delays strong pruning to later layers (keeping more tokens early); decreasing it starts shrinking earlier.
- `--k0` and `--gamma`: Control the MI-adaptive slope of the curve (see the sketch after this list).
  - Internally we compute `dynamic_k = max(-gamma * MI + k0, 0)` and use it as the slope of the logistic curve.
  - Intuition: `k0` sets the base steepness (larger → sharper), while `gamma` controls sensitivity to sample complexity (mutual information; larger → more sensitive).
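A minimal, self-contained sketch of how such a schedule could be built from these quantities. This is an illustration only, not the repository's implementation; in particular, the budget constraint is assumed here to be a simple rescaling of the curve:

```python
import numpy as np

def retention_schedule(num_layers=32, visual_token_num=576, target_token_num=64,
                       mi=0.5, x0=14.9, k0=0.4, gamma=0.2):
    """Illustrative per-layer visual-token retention schedule (sketch only)."""
    # MI-adaptive slope, as described in the bullet list above.
    dynamic_k = max(-gamma * mi + k0, 0.0)
    layers = np.arange(num_layers)
    # Logistic decay over decoder layers: retain most tokens before x0,
    # shrink afterwards; a larger dynamic_k makes the transition sharper.
    frac = 1.0 / (1.0 + np.exp(dynamic_k * (layers - x0)))
    tokens = frac * visual_token_num
    # Budget constraint (assumed: rescale so the mean retained tokens per
    # layer match the target budget).
    tokens = tokens * (target_token_num / tokens.mean())
    return np.clip(tokens, 0, visual_token_num).round().astype(int)

print(retention_schedule())  # per-layer visual-token counts
```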
How to run (direct Python invocation is equivalent to the shell script):
```bash
python -W ignore -m llava.eval.model_vqa_loader \
    --model-path ./models/llava-v1.5-7b \
    --question-file ./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl \
    --image-folder ./playground/data/eval/textvqa/train_images \
    --answers-file "${OUT_JSONL}" \
    --visual-token-num 576 \
    --temperature 0 \
    --conv-mode vicuna_v1 \
    --x0 14.9 \
    --k0 0.4 \
    --gamma 0.2 \
    --target-token-num ${TOKEN}
```

Or run the shell script directly:

```bash
CUDA_VISIBLE_DEVICES=2 bash scripts/v1_5/eval/textvqa.sh 64
```

- The trailing `64` is the `TOKEN` argument; the script forwards it to `--target-token-num` as your visual-token budget (smaller → more pruning).
- v1_5 scripts fix `--visual-token-num 576`; v1_6 scripts fix `--visual-token-num 2880`.
Tuning tip: Optimal settings may vary by dataset/task. You can tune `--x0` / `--k0` / `--gamma` per dataset for the best results. We did not perform fine-grained hyperparameter tuning in order to demonstrate robustness, so with proper tuning AutoPrune is likely to surpass the results reported in our paper.
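If you do want to sweep these settings, one simple (hypothetical) approach is to call the direct Python invocation shown above with a small grid of curve parameters and compare the resulting accuracies offline; the candidate values and the `./answers/` output directory below are placeholders:

```python
import itertools
import os
import subprocess

os.makedirs("./answers", exist_ok=True)

# Placeholder grid around the defaults used in the example above.
for x0, k0, gamma in itertools.product([12.0, 14.9, 17.0], [0.3, 0.4], [0.1, 0.2]):
    out = f"./answers/textvqa_x0{x0}_k0{k0}_g{gamma}.jsonl"
    subprocess.run([
        "python", "-W", "ignore", "-m", "llava.eval.model_vqa_loader",
        "--model-path", "./models/llava-v1.5-7b",
        "--question-file", "./playground/data/eval/textvqa/llava_textvqa_val_v051_ocr.jsonl",
        "--image-folder", "./playground/data/eval/textvqa/train_images",
        "--answers-file", out,
        "--visual-token-num", "576",
        "--temperature", "0",
        "--conv-mode", "vicuna_v1",
        "--x0", str(x0), "--k0", str(k0), "--gamma", str(gamma),
        "--target-token-num", "64",
    ], check=True)
```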
If you find AutoPrune useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{wang2025autoprune,
  title={Each Complexity Deserves a Pruning Policy},
  author={Hanshi Wang and Yuhao Xu and Zekun Xu and Jin Gao and Yufan Liu and Weiming Hu and Ke Wang and Zhipeng Zhang},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}
```

This project is released under the Apache 2.0 license.
AutoPrune uses code from several open-source repositories. Without the efforts of these folks (and their willingness to release their implementations), AutoPrune would not be possible. We thank these authors for their efforts!