Hi, I'm curious about the efficiency of applying online Hadamard operations to the attention outputs, notated as 'hadamard heads' in Figure 6. I believe the purpose of this operation is to match the rotation size to the quantization granularity, since that is the most accurate in most cases. However, I find that the online operation you mention doesn't seem to improve model accuracy compared to skipping it and instead applying the same Hadamard matrix to both Wv and Wout offline. I mean, the Wout input already seems to have a quantization-friendly distribution, so what was the reason for applying the online Hadamard operation after the softmax(QK^T)V operation?
| Perplexity | W8A8 | W4A8 | W4A4 |
|---|---|---|---|
| QuaRot | 5.481 | 6.701 | 8.097 |
| Single Block | 5.475 | 6.757 | 7.953 |
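
To make the comparison concrete, here is a minimal sketch (not the repository's code; the shapes and names are illustrative) of the equivalence I have in mind: because softmax(QK^T) mixes only the sequence dimension, rotating the attention output online with a per-head Hadamard matrix H feeds the same tensor into Wout as folding H into Wv offline.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    # Sylvester construction; n must be a power of two
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5  # normalized so that H @ H.T == I

torch.manual_seed(0)
seq_len, head_dim = 16, 64          # illustrative sizes for a single head
H = hadamard(head_dim)

A = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)  # softmax(QK^T)
V = torch.randn(seq_len, head_dim)                        # value activations

online  = (A @ V) @ H   # online Hadamard on the attention output
offline = A @ (V @ H)   # the same H folded into Wv (V rotated ahead of time)

print(torch.allclose(online, offline, atol=1e-5))  # True: identical input to Wout
```

In full precision the two placements are identical, so I assume any difference would only appear once the intermediate tensors are quantized.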
May I ask whether you found accuracy improvements from applying the online operation?