Hello, I noticed that in your code the projection for q, k, v is defined as
`self.W_q = nn.Linear(d_model, 2 * self.d_head * num_heads, bias=False)`
However, in another repository I found that q, k, v are computed as
`self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)`
code from this link
This shape difference leads to differences in the subsequent differential attention calculations.
So I wonder which version matches the method in the paper, or whether the two are just different ways of writing the same thing?
Thanks.
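
For concreteness, here is a minimal shape check of the two projection styles. This is purely illustrative: the values of `d_model`, `num_heads`, and the convention `d_head = d_model // num_heads // 2` are my assumptions, not taken from either repo. Under that convention the two layers produce the same output width, since `2 * d_head * num_heads == d_model`; with a different `d_head` definition they would not.

```python
import torch
import torch.nn as nn

# Assumptions for this sketch (not from either repo):
# d_model == embed_dim, and each differential-attention head uses
# half-size sub-heads, i.e. d_head = d_model // num_heads // 2.
d_model = embed_dim = 512
num_heads = 8
d_head = d_model // num_heads // 2  # 32

# Style from this repo: project to 2 * d_head * num_heads.
W_q = nn.Linear(d_model, 2 * d_head * num_heads, bias=False)
# Style from the other repo: project to embed_dim.
q_proj = nn.Linear(embed_dim, embed_dim, bias=False)

x = torch.randn(2, 16, d_model)  # (batch, seq_len, d_model)
print(W_q(x).shape)     # torch.Size([2, 16, 512])
print(q_proj(x).shape)  # torch.Size([2, 16, 512])
```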