SparseZip is a training-free, plug-and-play vision token compression method for Vision-Language Models (VLMs). It reduces the number of visual tokens processed by the LLM while preserving semantic fidelity through text-aware token selection and hierarchical merging.
- POPE Accuracy: 83.7% (vs. 85.7% for the LLaVA baseline, a drop of only 2 points)
- Latency Reductions:
- POPE: 19.2% reduction (479.987ms vs 594.238ms)
- MME: 36.2% reduction (199.748ms vs 313.143ms)
- DocVQA: 11.0% reduction (928.452ms vs 1043.561ms)
- Efficiency: Adds ~0.73 GFLOPs of overhead (~1/10 of a CLIP layer) and saves 244.5 MB of GPU memory per image
- Trade-off: Achieves the best latency-accuracy balance of the compared methods across all three benchmarks
- Hybrid Scoring: Combines global attention patterns, local entropy signals, and text-conditioned mutual-information cues into a single token-importance measure, enabling finer discrimination of salient versus redundant regions without relying solely on vision-side attention.
- Dynamic-K Budgeting: Adapts token retention per image based on information complexity (K = round(log(Var(scores) + eps) + c), clamped to [k_min, k_max]) rather than a fixed quota, allowing complex images to preserve more tokens and simple images to preserve fewer; a short sketch follows this list.
- Hierarchical Merging: Compresses non-dominant visual tokens through a k-means++ initialization followed by agglomerative clustering, preserving CLIP’s manifold structure while consolidating redundant regions into contextually consistent representations.
- Training-Free: Avoids gradient updates or fine-tuning entirely, functioning as a plug-and-play module on top of existing pretrained multimodal models.
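The Dynamic-K rule above can be sketched in a few lines of PyTorch. This is a minimal illustration using the default c, eps, k_min, and k_max values from the shipped configs; the function name and tensor shapes are illustrative, not the repository's API.

```python
import math
import torch

def dynamic_k(scores: torch.Tensor, c: float = 12.0, eps: float = 1e-3,
              k_min: int = 32, k_max: int = 64) -> int:
    """Pick a per-image token budget from the spread of importance scores.

    scores: (N,) importance score per vision token.
    Higher variance (more heterogeneous image content) -> larger K.
    """
    k = round(math.log(scores.var().item() + eps) + c)
    return max(k_min, min(k_max, k))

# Example: 576 CLIP patch tokens with random scores
print(dynamic_k(torch.rand(576)))  # an integer clamped to [32, 64]
```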
# Sync git submodules
git submodule update --init --recursive
# Install dependencies
pip install -r requirements_A40.txt # or requirements_A40_visionzip.txt
# Login to HuggingFace if needed
huggingface-cli login

Test SparseZip compression on a single image:
python scripts/quick_start/qs_sparsezip_smoke.py \
--cfg config/sparsezip_mme.yaml \
--image reference/owl.JPEG \
--prompt "Describe the owl briefly." \
--clip_only # Remove this flag for full LLaVA generation

Run the POPE benchmark:

python tools/pope_run_all.py \
--cfg config/sparsezip_pope.yaml \
--pope_root /path/to/pope/annotations \
--img_root /path/to/COCO/val2014 \
--out_root ./runs/pope_sparsezip \
--model_path liuhaotian/llava-v1.5-7b

Run the MME benchmark:

python tools/mme_run_all.py \
--mme_root /path/to/MME_Benchmark \
--out_root ./runs/mme_sparsezip \
--cfg config/sparsezip_mme.yaml \
--only OCR

SparseZip reduces visual token count through three main steps:
- Hybrid Scoring: Each vision token is scored using a weighted combination of the following signals (see the sketch after this list):
- Self-attention scores (global salience)
- Feature entropy (local discriminative cues)
- Mutual information proxy (redundancy detection)
- Dynamic-K Selection: The number of dominant tokens (K) is adapted per image:
- High-complexity images → higher K
- Low-complexity images → lower K
- Formula: K = round(log(Var(scores) + eps) + c), bounded by k_min/k_max
- Contextual Merging: Non-dominant tokens are merged into fewer contextual tokens using the following (a sketch appears after the configuration example below):
- k-means++ initialization
- Optional agglomerative hierarchical clustering
- Attention-weighted aggregation
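To make the scoring step concrete, here is a minimal sketch of one plausible way to combine the three signals with the alpha weights and temperatures exposed in the config (alphas, tau_feat, tau_sim). The proxies below (min-max normalization, entropy over a softened feature distribution, incoming-similarity mass as the redundancy cue) are assumptions for illustration, and the text-conditioned cross-attention term (cross_beta) is omitted; the actual implementation lives in utils/sparsezip.py.

```python
import torch
import torch.nn.functional as F

def hybrid_scores(feats: torch.Tensor, attn: torch.Tensor,
                  a_attn: float = 1.0, a_ent: float = 0.8, a_mut: float = 1.0,
                  tau_feat: float = 0.15, tau_sim: float = 0.08) -> torch.Tensor:
    """Combine global attention, local entropy, and a redundancy proxy.

    feats: (N, D) vision token features from the CLIP encoder.
    attn:  (N,)   attention mass each token receives (e.g. from the [CLS] query).
    Returns (N,) importance scores; higher = more worth keeping.
    """
    def minmax(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-6)

    # 1) Global salience: attention each token receives.
    s_attn = minmax(attn)

    # 2) Local discriminative cue: entropy of a softened feature distribution.
    p = F.softmax(feats / tau_feat, dim=-1)
    s_ent = minmax(-(p * p.clamp_min(1e-9).log()).sum(-1))

    # 3) Redundancy proxy: tokens that many other tokens resemble score lower.
    cos = F.normalize(feats, dim=-1) @ F.normalize(feats, dim=-1).T
    cos.fill_diagonal_(float("-inf"))                      # ignore self-similarity
    incoming = F.softmax(cos / tau_sim, dim=-1).mean(dim=0)
    s_mut = 1.0 - minmax(incoming)

    return a_attn * s_attn + a_ent * s_ent + a_mut * s_mut

# Dominant-token selection for a fixed budget of 64
feats, attn = torch.randn(576, 1024), torch.rand(576)
keep = hybrid_scores(feats, attn).topk(k=64).indices
```

With Dynamic-K enabled, the fixed k=64 in the last line would be replaced by the adaptive budget sketched earlier.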
SparseZip is configured via YAML files. Example config:
model:
model_type: sparsezip
model_path: liuhaotian/llava-v1.5-7b
temperature: 0.0
max_new_tokens: 16
sparsezip:
dynamic_k: true
k_min: 32
k_max: 64
dynk:
c: 12.0
eps: 1.0e-3
# Hybrid scoring weights
alphas:
attn: 1.0
entropy: 0.8
mutual: 1.0
tau_feat: 0.15
tau_sim: 0.08
cross_beta: 0.3 # Text-aware cross-attention weight
# Contextual merging
merging:
contextual_num: 32
kmeans_init_factor: 2.0
kmeans_iters: 1
agglomerative: false

See config/sparsezip_mme.yaml, config/sparsezip_pope.yaml, and config/sparsezip_docvqa.yaml for dataset-specific examples.
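The merging parameters above map onto a procedure like the following minimal sketch: k-means++ seeding, kmeans_iters rounds of reassignment, and attention-weighted aggregation within each cluster. This is an illustration under stated assumptions, not the repository's implementation; the optional agglomerative refinement and the role of kmeans_init_factor are omitted, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def kmeanspp_init(x: torch.Tensor, k: int) -> torch.Tensor:
    """k-means++ seeding: pick k mutually distant rows of x (returns indices)."""
    n = x.size(0)
    idx = [torch.randint(n, (1,)).item()]
    d2 = ((x - x[idx[0]]) ** 2).sum(-1)
    for _ in range(k - 1):
        probs = (d2 + 1e-12) / (d2 + 1e-12).sum()
        nxt = torch.multinomial(probs, 1).item()
        idx.append(nxt)
        d2 = torch.minimum(d2, ((x - x[nxt]) ** 2).sum(-1))
    return torch.tensor(idx)

def contextual_merge(x: torch.Tensor, attn: torch.Tensor,
                     contextual_num: int = 32, kmeans_iters: int = 1) -> torch.Tensor:
    """Merge non-dominant tokens x (M, D) into contextual_num contextual tokens.

    attn: (M,) importance scores used for attention-weighted aggregation.
    """
    centers = x[kmeanspp_init(x, contextual_num)]            # (contextual_num, D)
    for _ in range(kmeans_iters):
        assign = torch.cdist(x, centers).argmin(dim=-1)      # nearest center per token
        for c in range(contextual_num):
            mask = assign == c
            if mask.any():
                w = F.softmax(attn[mask], dim=0).unsqueeze(-1)
                centers[c] = (w * x[mask]).sum(dim=0)        # attention-weighted merge
    return centers

# e.g. merge 512 non-dominant CLIP tokens down to 32 contextual tokens
merged = contextual_merge(torch.randn(512, 1024), torch.rand(512))
```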
This repository provides a modular evaluation framework that supports:
- Multiple Models: VisionZip, SparseVLM, SparseZip, and baseline LLaVA
- Multiple Datasets: MME, POPE, DocVQA, COCO-Caption
- Modular Components: Swap datasets, metrics, or models independently
scripts/
dataset.py # Dataset loaders (MME, POPE, DocVQA)
metric.py # Evaluation metrics (accuracy, precision, recall, etc.)
model.py # Model wrappers (LlavaModel, LlavaSparseZipModel, etc.)
evalkit.py # Unified evaluation orchestrator
abstract.py # Base classes for modular components
tools/
mme_run_all.py # MME benchmark runner
pope_run_all.py # POPE benchmark runner
docvqa_run_all.py # DocVQA runner
config/
sparsezip_*.yaml # SparseZip configurations
sparsevlm_*.yaml # SparseVLM configurations
visionzip_*.yaml # VisionZip configurations
You can customize datasets, metrics, or models by implementing the base classes in scripts/abstract.py:
- BaseDataset: Implement `__len__` and `__iter__` returning `Sample` objects
- BaseMetric: Implement `update()` and `compute()` methods
- BaseModel: Implement `device()`, `prepare_inputs()`, and `generate()` methods
See the existing implementations in scripts/dataset.py, scripts/metric.py, and scripts/model.py for examples.
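As an illustration, a new metric could be plugged in roughly as follows. The import path and the `update()` signature are assumptions made for this sketch; the authoritative base-class definitions are in scripts/abstract.py.

```python
# Hypothetical custom metric; check scripts/abstract.py for the real
# BaseMetric signature and the fields that update() actually receives.
from scripts.abstract import BaseMetric

class ExactMatch(BaseMetric):
    """Fraction of predictions that exactly match the reference answer."""

    def __init__(self):
        self.hits = 0
        self.total = 0

    def update(self, prediction: str, reference: str) -> None:
        self.hits += int(prediction.strip().lower() == reference.strip().lower())
        self.total += 1

    def compute(self) -> float:
        return self.hits / max(self.total, 1)
```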
# Download MME dataset
python scripts/dataset_download.py --dataset mme --output_dir ./datasets
# Run evaluation
# Note: --mme_root should point to the directory containing subtask folders (existence/, color/, OCR/, etc.)
python tools/mme_run_all.py \
--mme_root ./datasets/mme/MME_Benchmark_release_version/MME_Benchmark \
--out_root ./eval_results/mme_eval \
--cfg config/sparsezip_mme.yaml \
--only OCR

POPE annotations are typically generated from COCO val2014. See tools/pope_run_all.py for setup details.
| Model | DocVQA Latency (ms) ↓ | DocVQA ANLS ↑ | DocVQA Exact Match ↑ | POPE Latency (ms) ↓ | POPE Acc. ↑ | POPE F1 ↑ | POPE Prec. ↑ | POPE Rec. ↑ | MME Latency (ms) ↓ | MME-Acc ↑ | MME-Acc+ ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA | 1043.561 | 0.011 | 0.002 | 594.238 | 0.857 | 0.846 | 0.918 | 0.785 | 313.143 | 0.743 | 0.533 |
| SparseZip | 928.452 | 0.010 | 0.002 | 479.987 | 0.837 | 0.816 | 0.940 | 0.721 | 199.748 | 0.722 | 0.492 |
| SparseVLM | 1053.229 | 0.011 | 0.002 | 618.729 | 0.849 | 0.836 | 0.918 | 0.768 | 203.252 | 0.736 | 0.529 |
| VisionZip | 873.286 | 0.009 | 0.001 | 429.646 | 0.798 | 0.758 | 0.943 | 0.634 | 158.775 | 0.705 | 0.462 |
Key Findings:
- POPE: SparseZip achieves 83.7% accuracy, only 2 points below the LLaVA baseline (85.7%), while reducing latency by 19.2% (479.987ms vs 594.238ms). SparseZip outperforms VisionZip (79.8%) and is competitive with SparseVLM (84.9%).
- MME: SparseZip achieves 36.2% latency reduction (199.748ms vs 313.143ms) while maintaining 72.2% accuracy (vs 74.3% baseline).
- DocVQA: SparseZip reduces latency by 11.0% (928.452ms vs 1043.561ms) while maintaining comparable ANLS and exact match scores.
- Overall: SparseZip demonstrates strong latency-accuracy trade-offs across all three benchmarks, delivering 11-36% latency reductions while preserving most of the baseline accuracy.
SparseZip includes several components that can be enabled or disabled:
- Hybrid Attention: Multi-signal scoring (attention + entropy + mutual information)
- Text-Aware Cross-Attention: Text-conditioned token selection (via MM projector transpose)
- Dynamic-K: Adaptive token budgeting (can be disabled for fixed-K)
- Hierarchical Merging: Contextual token aggregation (can be simplified or disabled)
Configuration flags: `skip_hybrid_attn`, `skip_dynamic_k`, and `skip_ctx_merge` in the YAML config.
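For example, an ablation that keeps scoring and budgeting but disables merging might set the flags like this (the nesting under sparsezip: is an assumption based on the config example above; see docs/README_SPARSEZIP.md for the authoritative layout):

```yaml
sparsezip:
  skip_hybrid_attn: false  # keep multi-signal scoring
  skip_dynamic_k: false    # keep adaptive token budgeting
  skip_ctx_merge: true     # assumed flag placement; disables contextual merging
```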
For detailed implementation information, see:
- `docs/README_SPARSEZIP.md` - Detailed SparseZip documentation
- `utils/sparsezip.py` - Core compression implementation
- `scripts/model.py` - Model integration (LlavaSparseZipModel class)
- Python 3.10+
- PyTorch 2.1.2+
- Transformers library
- CUDA-capable GPU (recommended) or CPU with sufficient RAM
- ~14 GB VRAM for LLaVA-1.5-7B in float16
- HuggingFace authentication errors: Run `huggingface-cli login`
- Out of memory: Use CPU offloading or reduce the batch size
- macOS issues: 4/8-bit quantization is not available; install `protobuf` for the tokenizer
- Slow inference: Ensure CUDA is available and the model is loaded on the GPU (quick check below)
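For the last point, a quick sanity check with standard PyTorch calls (independent of this repository) confirms that a GPU is visible before digging further:

```python
import torch

print(torch.cuda.is_available())           # should print True on a CUDA-capable machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA A40"
```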
If you use this work, please cite:
A. Geppert, Q. Chen, S. Mittal, N. Maheen Aboobacker (2025). SparseZip: Text-Aware Visual Token Selection and Compression for Efficient Vision-Language Model Inference. Version 1. URL: https://github.com/W1nd55/VisionZip-exp
BibTeX:
@misc{sparsezip2025,
author = {A. Geppert and Q. Chen and S. Mittal and N. Maheen Aboobacker},
title = {SparseZip: Text-Aware Visual Token Selection and Compression for Efficient Vision-Language Model Inference},
year = {2025},
howpublished = {\url{https://github.com/W1nd55/VisionZip-exp}},
note = {Version 1}
}
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Built on top of LLaVA and integrates with VisionZip and SparseVLM methodologies.
