Skip to content

Feature/mlm#22

Open
bobbycxy wants to merge 3 commits intomainfrom
feature/mlm
Open

Feature/mlm#22
bobbycxy wants to merge 3 commits intomainfrom
feature/mlm

Conversation

@bobbycxy
Copy link
Collaborator

@bobbycxy bobbycxy commented Jun 5, 2024

  1. Added: masked language modelling (mlm) dataloader - called MLMDataloader - that masks inputs (80% masked, 10% randomised, 10% untouched),
  2. updated the cross_entropy to ignore ignore_index=-1, because -1 is the masked token id we'll use in the target (y) labels.
  3. updated the build_trainer.py to have MLMDataloader in the dataloader dictionary.
  4. created a mlm specific config.yaml file.

@DylanASHillier
Copy link
Collaborator

@LeonGuertler this is a beautful example of how a PR should be --> list of changes

logits = logits.view(-1, logits.size(-1))
y = y.view(-1)
loss = torch.nn.functional.cross_entropy(logits, y, reduction="none")
loss = torch.nn.functional.cross_entropy(logits, y, reduction="none", ignore_index=-1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are 100% sure this has no impacts elsewhere? should be okay but..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Ignore_index's default value is -100. So, changing it from -100 to -1 would only impact existing label (y) assignments of -1 or -100.

  1. I searched our code base for the assignment of -100, there was none.
  2. I searched our code base for the assignment of -1. Apart from the new MLMDataLoader using label[~mask]=-1, other assignments were used in arguments such as 'dim=-1', or just as index values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

embedder:
tokenizer_type: gpt2
embedding_model_type: generic
dataset_name: simple_en_wiki
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this to stlm

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, can I clarify what should be changed to stlm?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the dataset_name, but its okay, this will all be reworked later

@DylanASHillier
Copy link
Collaborator

I propose closing this, as we decided not to do MLM for now... feel free to rework this if you like, should be pretty simple to update, and happy to merge in since it doesn't add much code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants