Use exllamav2's smart 4-bit KV cache for memory benchmark

See: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

exllamav2 has a 4-bit KV cache whose perplexity is similar to that of the unquantized cache, according to turboderp's testing. As a result, in practice I find that exllamav2 uses less VRAM than llama.cpp for a given context size. I noticed that the exllamav2 benchmark code uses the unquantized cache. Would it be possible to use the 4-bit KV cache again for the memory usage benchmark? Thanks.
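To give a rough sense of why the cache type matters for a memory benchmark, here is a back-of-the-envelope estimate of KV cache size. This is only a sketch: the model shape is a hypothetical 7B Llama-style configuration, the unquantized cache is assumed to be FP16, and the 4-bit figure ignores the small overhead of quantization scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bits_per_element):
    """Estimate raw KV cache size in bytes for one sequence (keys plus values)."""
    # Factor of 2 accounts for storing both keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * context_len
    return elements * bits_per_element // 8


# Hypothetical model shape (assumption, not taken from the benchmark).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, context_len=4096)

fp16_bytes = kv_cache_bytes(**shape, bits_per_element=16)  # assumed unquantized format
q4_bytes = kv_cache_bytes(**shape, bits_per_element=4)     # excludes scale overhead

print(f"FP16 cache:  {fp16_bytes / 2**30:.2f} GiB")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB")
```

Under these assumptions the 4-bit cache takes about a quarter of the space of the FP16 cache, and the gap grows linearly with context size.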

For reference, here's the class to use instead: https://github.com/turboderp/exllamav2/blob/009424a6d42d39efceeecd5562450180bd34a7fb/exllamav2/cache.py#L309
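A minimal sketch of what the swap could look like, assuming the linked class is `ExLlamaV2Cache_Q4` and that the benchmark currently constructs `ExLlamaV2Cache`. The helper name `build_cache` is hypothetical and not part of exllamav2 or the benchmark code.

```python
def build_cache(model, quantized=True, max_seq_len=-1):
    """Return a KV cache for an already-constructed ExLlamaV2 model.

    Hypothetical helper: picks the 4-bit cache when quantized=True,
    otherwise the unquantized cache.
    """
    # Imported lazily so the helper can be defined without exllamav2 installed.
    from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4

    cache_cls = ExLlamaV2Cache_Q4 if quantized else ExLlamaV2Cache
    # lazy=True defers allocation until the model weights are loaded.
    return cache_cls(model, max_seq_len=max_seq_len, lazy=True)
```

With a model object in hand, usage would be along the lines of `cache = build_cache(model)` followed by `model.load_autosplit(cache)`, though the exact loading path in the benchmark may differ.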

Use exllamav2's smart 4-bit KV cache for memory benchmark #185
