I implemented the ablation described in the text as "retaining only the simple fusion operation of input audio, video, and question features", but my results are much lower than those reported in the table. This is the forward pass I used:
def forward(self, audio, visual, question):
    audio_feat = self.input_a(audio)                 # [B, T, C]
    visual_feat = self.input_v(visual)               # [B, T, C]
    qst_feat = self.input_qst(question).squeeze(-2)  # [B, C]

    ### 2. Fusion
    av_feat = torch.cat((audio_feat, visual_feat), dim=1)  # concatenated along time: [B, 2T, C]
    av_feat = self.av_fusion_tanh(av_feat)
    av_feat = self.av_fusion_fc(av_feat)
    av_feat = av_feat.mean(dim=-2)                   # average over time: [B, C]
    avq_feat = torch.mul(av_feat, qst_feat)          # [batch_size, embed_size]
    avq_feat = self.avq_fusion_tanh(avq_feat)

    ### 3. Answer prediction module
    answer_pred = self.answer_pred_fc(avq_feat)      # [batch_size, ans_vocab_size=42]
    return answer_pred
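
To make the report easier to reproduce, here is a minimal self-contained sketch that wraps the same forward pass in a module and runs it on random tensors. The class name `SimpleFusionNet`, the helper `check_shapes`, the use of `nn.Linear` for the input projections, and all feature dimensions (128, 512, 512) are assumptions made for illustration, not the repository's actual definitions. Only the answer vocabulary size of 42 comes from the comment in the code above.

```python
import torch
import torch.nn as nn


class SimpleFusionNet(nn.Module):
    """Hypothetical reconstruction; layer types and sizes are assumptions."""

    def __init__(self, audio_dim=128, visual_dim=512, qst_dim=512,
                 embed_size=512, ans_vocab_size=42):
        super().__init__()
        # Assumed: plain linear projections into a shared embedding size C.
        self.input_a = nn.Linear(audio_dim, embed_size)
        self.input_v = nn.Linear(visual_dim, embed_size)
        self.input_qst = nn.Linear(qst_dim, embed_size)
        self.av_fusion_tanh = nn.Tanh()
        self.av_fusion_fc = nn.Linear(embed_size, embed_size)
        self.avq_fusion_tanh = nn.Tanh()
        self.answer_pred_fc = nn.Linear(embed_size, ans_vocab_size)

    def forward(self, audio, visual, question):
        audio_feat = self.input_a(audio)                 # [B, T, C]
        visual_feat = self.input_v(visual)               # [B, T, C]
        qst_feat = self.input_qst(question).squeeze(-2)  # [B, 1, C] -> [B, C]

        # Concatenate along the time axis, then pool over time.
        av_feat = torch.cat((audio_feat, visual_feat), dim=1)  # [B, 2T, C]
        av_feat = self.av_fusion_fc(self.av_fusion_tanh(av_feat))
        av_feat = av_feat.mean(dim=-2)                   # [B, C]

        # Element-wise product with the question embedding.
        avq_feat = self.avq_fusion_tanh(torch.mul(av_feat, qst_feat))
        return self.answer_pred_fc(avq_feat)             # [B, ans_vocab_size]


def check_shapes(batch=2, steps=10):
    """Run one forward pass on random inputs and return the output shape."""
    model = SimpleFusionNet()
    audio = torch.randn(batch, steps, 128)
    visual = torch.randn(batch, steps, 512)
    question = torch.randn(batch, 1, 512)
    with torch.no_grad():
        return model(audio, visual, question).shape
```

Calling `check_shapes()` only confirms that the tensors line up as the shape comments claim. It says nothing about accuracy, so a gap against the reported numbers would have to come from elsewhere, for example the training setup or how the real input layers are defined.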