Hi @coder4nlp, here are my test results.

Test Env:

Hardware:

H20 single card

DashInfer

the latest source code.

Dashinfer_vlm

from latest source code.

Benchmark data:

Basically, download the data listed in dash-infer/multimodal/README.md and put it in the tests/data folder.
Script:

wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/opensource/docvqa_train_10k.jsonl
wget https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data/resolve/main/data/share_textvqa.zip
unzip share_textvqa.zip

benchmark command

With the server started, run:

python tests/benchmark_openai_api.py --prompt-file tests/data/docvqa_train_10k.jsonl --image-folder tests/data/share_textvqa/images/ --req-nums 100 \
        --batch-size 32 \
        --image-nums-mean 3 \
        --image-nums-range 1  \
        --response-mean 120 \
        --response-len-range 64

This runs the single-image test with a batch size (concurrency) of 32.
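To make the "batch size (concurrency) 32" setting concrete, here is a minimal sketch of a client that keeps up to 32 requests in flight at once. It is only an illustration under assumptions: `send_request` is a hypothetical stand-in that sleeps instead of calling the server, and the real tests/benchmark_openai_api.py may be implemented differently.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_request(request_id):
    """Hypothetical stand-in for one request to the server (no network call)."""
    time.sleep(0.01)  # simulate request latency
    return request_id


def run_benchmark(num_requests=100, concurrency=32):
    """Issue num_requests requests with at most `concurrency` in flight."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order in its results
        results = list(pool.map(send_request, range(num_requests)))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_benchmark()
print(f"{len(results)} requests in {elapsed:.2f} sec")
```

With 100 requests and a concurrency of 32, the requests are processed in roughly four waves, which is why total time rather than per-request latency is the figure reported in the results.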

Server command:

Run the server from this folder:
dash-infer/multimodal

Because the data is stored in local files, the related paths must be accessible to the server. My directory layout is shown here for your reference:
dash-infer/multimodal# tree -L 3

Qwen2-VL-2B-Instruct

Start command (ViT uses transformers):

dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine transformers

Result:

1st time :

Total time: 38.10 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 12.78 (average) / 1278 (total) --- 
QPS:  2.62 requests/sec, TPS: 33.54 tokens/sec
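For reading the result blocks: the QPS and TPS lines appear to be derived from the total time, the 100 requests set by `--req-nums`, and the total number of output tokens. The sketch below reproduces the first result under that assumption (in particular, that TPS counts output tokens only); the function name is mine, not from the benchmark script.

```python
def throughput(total_time_s, num_requests, output_tokens):
    """Return (requests per second, output tokens per second)."""
    qps = num_requests / total_time_s
    tps = output_tokens / total_time_s
    return qps, tps


# First run, ViT on transformers: 38.10 sec, 100 requests, 1278 output tokens
qps, tps = throughput(38.10, 100, 1278)
print(f"QPS: {qps:.2f} requests/sec, TPS: {tps:.2f} tokens/sec")
```

Small deviations in other blocks are expected, because the printed total time is itself rounded to two decimals.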

2nd time (with vit cache):

Total time: 8.72 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 10.72 (average) / 1072 (total) --- 
QPS:  11.47 requests/sec, TPS: 122.97 tokens/sec

Start command (ViT uses TRT):

dashinfer_vlm_serve --model /model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt

Result:

1st time :

Total time: 32.99 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.33 (average) / 1133 (total) ---
QPS:  3.03 requests/sec, TPS: 34.35 tokens/sec

2nd time (with vit cache):

Total time: 8.76 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 9.37 (average) / 937 (total) --- 
QPS:  11.42 requests/sec, TPS: 106.97 tokens/sec

Start command (ViT uses TRT + FP8 dynamic quant):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --quant-type fp8

Result:

1st time :

Total time: 29.88 sec
input token lens: 2783.60 (average) / 278360 (total) ---
output token lens: 11.52 (average) / 1152 (total) ---
QPS:  3.35 requests/sec, TPS: 38.55 tokens/sec

2nd time (with vit cache):

Total time: 6.84 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 9.30 (average) / 930 (total) ---
QPS:  14.61 requests/sec, TPS: 135.88 tokens/sec

Start command (ViT uses TRT + FP8 dynamic quant + prefix cache):

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2-VL-2B-Instruct/ --host 127.0.0.1 --vision_engine tensorrt --enable-prefix-cache --quant-type fp8

Result:

1st time :

Total time: 29.53 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 10.95 (average) / 1095 (total) --- 
QPS:  3.39 requests/sec, TPS: 37.09 tokens/sec

2nd time (with vit cache):

Total time: 2.81 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 11.25 (average) / 1125 (total) ---
QPS:  35.64 requests/sec, TPS: 400.96 tokens/sec

vLLM

Version: v0.9.2rc2 + (82b8027be6e8f15603cea823e044069cd10c9c62)
Start command:

uv run vllm serve /model/Qwen2-VL-2B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `my_image_paths`

1st time :

Total time: 33.14 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 18.24 (average) / 1824 (total) --- 
QPS:  3.02 requests/sec, TPS: 55.03 tokens/sec

2nd time (with vit cache):

Total time: 6.10 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 18.80 (average) / 1880 (total) --- 
QPS:  16.38 requests/sec, TPS: 307.96 tokens/sec

# Qwen2.5-VL-3B-Instruct

dashinfer_vlm_serve --model /cfs_cloud_code/asherszhang/model/Qwen2.5-VL-3B-Instruct/  --host 127.0.0.1 --vision_engine transformers

DashInfer (ViT uses transformers)

1st time :

Total time: 93.34 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 19.42 (average) / 1942 (total) --- 
QPS:  1.07 requests/sec, TPS: 20.81 tokens/sec

2nd time (with vit cache):

Total time: 16.74 sec
input token lens: 2783.60 (average) / 278360 (total) --- 
output token lens: 14.86 (average) / 1486 (total) --- 
QPS:  5.97 requests/sec, TPS: 88.75 tokens/sec

vLLM started with mm cache

1st time:

Total time: 36.99 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.55 (average) / 2955 (total) --- 
QPS:  2.70 requests/sec, TPS: 79.89 tokens/sec

2nd time (with vit cache)

Total time: 5.64 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 28.81 (average) / 2881 (total) --- 
QPS:  17.73 requests/sec, TPS: 510.89 tokens/sec

A generation rate of 510 tokens/s must come from some kind of cache. Without one, it would take about 3 TB/s of memory bandwidth, which is already greater than the H20's hardware limit.

So I tested again with the caches disabled.
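The bandwidth figure can be reproduced with a rough back-of-the-envelope estimate. This is only a sketch under assumptions that are not stated in the results: about 3 billion parameters, 2 bytes per parameter (16-bit weights), and every generated token re-reading all weights once. Batched decoding lets concurrent requests share weight reads, so the estimate is a worst case rather than a measurement.

```python
def required_bandwidth_tb_per_s(params_billion, bytes_per_param, tokens_per_s):
    """Weight bytes read per second if each token reads all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes * tokens_per_s / 1e12  # bytes/s -> TB/s


# Assumed: ~3B parameters, 2 bytes each; 510.89 tokens/sec is the measured TPS
estimate = required_bandwidth_tb_per_s(3, 2, 510.89)
print(f"{estimate:.2f} TB/s")
```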

vLLM started without mm cache

Command:

uv run vllm serve /model/Qwen2.5-VL-3B-Instruct/ --limit-mm-per-prompt '{"image":4}' --allowed-local-media-path `path-to-data` --no-enable-prefix-caching --disable-mm-preprocessor-cache

1st time:

Total time: 37.25 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.09 (average) / 2909 (total) --- 
QPS:  2.68 requests/sec, TPS: 78.09 tokens/sec

2nd time:

Total time: 31.88 sec
input token lens: 2794.60 (average) / 279460 (total) --- 
output token lens: 29.03 (average) / 2903 (total) --- 
QPS:  3.14 requests/sec, TPS: 91.05 tokens/sec

After disabling the caches, the numbers return to normal.

The first run gives 2.70 QPS with the cache versus 2.68 QPS without it, so there is not much difference on a first request in the current vLLM version.

@coder4nlp, for your test, I think you are using the same prompt and the same image for every request, which causes a lot of caching; in particular, the mm cache takes effect.
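The comparison can be put in numbers with a small sketch that uses only the vLLM figures reported above for Qwen2.5-VL-3B-Instruct (first-run QPS and second-run total time, with and without the caches). The helper name is mine.

```python
def relative_diff(a, b):
    """Relative difference of a with respect to b."""
    return (a - b) / b


# First run: 2.70 QPS with caches vs 2.68 QPS without
first_run_gap = relative_diff(2.70, 2.68)

# Second run: 5.64 sec with caches vs 31.88 sec without
second_run_ratio = 31.88 / 5.64

print(f"first-run QPS gap: {first_run_gap:.2%}")
print(f"second-run time ratio: {second_run_ratio:.2f}x")
```

The first-run gap is under one percent, while the second run is several times faster with the caches on, which is consistent with the speedup coming from caching on repeated inputs rather than from raw inference speed.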

The inference time of qwen2.5-vl is very slow. #89
