I was able to run float16 and int4 TRT-LLM benchmarks with Mistral 7B on an L4 GPU (GCP). The reported performance is 40.96 ± 0.37 t/s in float16 and 166.02 ± 0.52 t/s with int4, which is significantly faster than both exllamav2 and vllm with batch size 1 on Llama 3 8B (also int4), and also 2x higher than is theoretically possible given the available memory bandwidth.
I did some debugging and believe the reported results are incorrect in terms of the number of generated tokens. E.g., after this line:
benchmarks/bench_tensorrtllm/bench.py, line 101 in 0710037:
output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
if I compute num_output_tokens as output_tokens.index(2) (which is obviously not a general solution, but works for now for Mistral, whose EOS token id is 2), then I get values much closer to vllm, and the generation speed in the speed test matches the one in the subsequent quality test.
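For illustration, a minimal sketch of what I mean (not the repo's code; eos_token_id=2 is assumed for Mistral and should really be read from the tokenizer):

```python
def count_output_tokens(output_tokens: list[int], eos_token_id: int = 2) -> int:
    """Number of actually generated tokens, truncated at the first EOS.

    Counting the full decoded buffer (padding included) inflates tokens/s.
    """
    try:
        return output_tokens.index(eos_token_id)
    except ValueError:
        # No EOS found: the model used the whole generation budget.
        return len(output_tokens)


# Hypothetical usage mirroring the variables around line 101 of bench.py:
# output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
# num_output_tokens = count_output_tokens(output_tokens, eos_token_id=2)
# tokens_per_second = num_output_tokens / elapsed_seconds
```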