LanguageModel

Language modeling is the task of predicting the next word in a sequence of words. In this project, I used the IMDB dataset, preprocessed the text, built a vocabulary, and trained a language model.

Tasks (minimal code sketches of these steps follow the list):

Organized the IMDB dataset to fit a language modeling (LM) task by formatting each sample as a sequence of words for next-word prediction.

Split the dataset into training, validation, and test sets so the model can be evaluated on held-out data and overfitting can be detected.

Used the TreebankWordTokenizer from NLTK to tokenize the raw text into sequences of tokens suitable for language modeling.

Created a vocabulary from the training data, mapping each unique token to a numerical index.

Defined a custom Dataset class along with a DataLoader to efficiently load and batch the tokenized data during training.

Defined a language model using a recurrent neural network.
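
The sketches below show how each step might look in Python; they are minimal illustrations under stated assumptions, not the repository's actual code. First, a possible train/validation/test split: the `split_dataset` helper, the 80/10/10 ratio, and the fixed random seed are assumptions.

```python
import random

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle the samples and split them into train/validation/test lists.

    The 80/10/10 split and the seed are illustrative choices, not the
    repository's confirmed configuration.
    """
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```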
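
Next, tokenization with NLTK's TreebankWordTokenizer (named in the task list) and a vocabulary built from the training split. The lowercasing, the `<pad>`/`<unk>` entries, and the `min_freq` cutoff are illustrative assumptions.

```python
from collections import Counter
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

def tokenize_texts(texts):
    # Lowercasing is an assumed preprocessing step.
    return [tokenizer.tokenize(text.lower()) for text in texts]

def build_vocab(tokenized_texts, min_freq=2):
    # Map each token seen at least min_freq times to a unique index;
    # reserve 0 for padding and 1 for unknown words.
    counts = Counter(tok for seq in tokenized_texts for tok in seq)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.items():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab
```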
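
A hypothetical `Dataset` that concatenates the token IDs of a split into one stream and yields fixed-length (input, target) pairs for next-word prediction. The class name, the sequence length of 32, and the batch size are assumptions; `vocab` and the tokenized texts come from the sketches above.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LMDataset(Dataset):
    def __init__(self, tokenized_texts, vocab, seq_len=32):
        # Concatenate all reviews into a single stream of token IDs,
        # mapping out-of-vocabulary tokens to <unk>.
        ids = [vocab.get(tok, vocab["<unk>"])
               for seq in tokenized_texts for tok in seq]
        self.ids = torch.tensor(ids, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.ids) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.ids[start : start + self.seq_len + 1]
        # Inputs are the first seq_len tokens, targets are shifted by one.
        return chunk[:-1], chunk[1:]

# Example usage (assumes `train_tokens` and `vocab` from the previous sketch):
# train_loader = DataLoader(LMDataset(train_tokens, vocab),
#                           batch_size=64, shuffle=True)
```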
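
Finally, one way to define the recurrent language model. The README only states that an RNN is used; the LSTM cell and the layer sizes below are assumptions.

```python
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embedding(x)              # (batch, seq_len, embed_dim)
        out, hidden = self.rnn(emb, hidden)  # (batch, seq_len, hidden_dim)
        logits = self.fc(out)                # per-position next-word scores
        return logits, hidden
```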
