See: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md
exllamav2 offers a 4-bit KV cache that, per turboderp's testing linked above, shows perplexity nearly identical to the unquantized cache. In practice, I find that exllamav2 uses less VRAM than llama.cpp for a given context size as a result. I noticed the exllamav2 benchmark code uses the unquantized cache. Would it be possible to use the 4-bit KV cache for the memory usage benchmark instead? Thanks.
For reference, here's the class to use instead: https://github.com/turboderp/exllamav2/blob/009424a6d42d39efceeecd5562450180bd34a7fb/exllamav2/cache.py#L309
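In case it helps, here's a minimal sketch of what the swap might look like, assuming the benchmark uses the usual lazy-load/autosplit pattern; the linked class appears to be `ExLlamaV2Cache_Q4`, and the model path below is just a placeholder:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"  # placeholder path
config.prepare()

model = ExLlamaV2(config)

# Swap the default ExLlamaV2Cache for the 4-bit quantized cache;
# lazy = True defers allocation until the model is split across GPUs
cache = ExLlamaV2Cache_Q4(model, lazy = True)
model.load_autosplit(cache)
```

Since the Q4 cache class shares the constructor interface of the default cache, this should be close to a drop-in change for the benchmark script.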