Hello, I noticed that in your code the projection for q, k, v is defined as
`self.W_q = nn.Linear(d_model, 2 * self.d_head * num_heads, bias=False)`
However, in another repository I found that q, k, v are computed as
`self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)`
code from this link
This shape difference leads to differences in the subsequent differential attention calculations.
So I wonder which version matches the method in the paper, or whether the two are just different ways of writing the same thing?
Thanks.
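
For concreteness, here is a minimal shape check of the two projection styles. This is purely illustrative: the values of `d_model`, `num_heads`, and the convention `d_head = d_model // num_heads // 2` are my assumptions, not taken from either repo. Under that convention the two layers produce the same output width, since `2 * d_head * num_heads == d_model`; with a different `d_head` definition they would not.

```python
import torch
import torch.nn as nn

# Assumptions for this sketch (not from either repo):
# d_model == embed_dim, and each differential-attention head uses
# half-size sub-heads, i.e. d_head = d_model // num_heads // 2.
d_model = embed_dim = 512
num_heads = 8
d_head = d_model // num_heads // 2  # 32

# Style from this repo: project to 2 * d_head * num_heads.
W_q = nn.Linear(d_model, 2 * d_head * num_heads, bias=False)
# Style from the other repo: project to embed_dim.
q_proj = nn.Linear(embed_dim, embed_dim, bias=False)

x = torch.randn(2, 16, d_model)  # (batch, seq_len, d_model)
print(W_q(x).shape)     # torch.Size([2, 16, 512])
print(q_proj(x).shape)  # torch.Size([2, 16, 512])
```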