Genre classification using BERT pretrained embedding

BERT - Revision 2025

GenreClassifcation_BERT_v2 Open In Colab

Key differences

  • Download the dataset directly from Kaggle
  • Fully connected classifier on the [CLS] token (see the sketch after this list)
  • Undersampling / Oversampling / Class Weights for the extremely imbalanced data
  • Improved validation and visualization
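
A minimal sketch of the [CLS]-token classifier head, assuming PyTorch and Hugging Face transformers (class and variable names are illustrative, not necessarily those used in the notebook):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class GenreClassifier(nn.Module):
    # Fully connected head on top of the BERT [CLS] embedding.
    def __init__(self, num_genres: int, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        # The [CLS] embedding has hidden_size dimensions (768 for bert-base)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_genres)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] is the first token
        return self.fc(self.dropout(cls_embedding))
```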

Metrics

  • Scenario 1: Undersampling
  • Scenario 2: Oversampling
  • Scenario 3: Class Weights

Summary

  • Undersampling proved to be the most effective and efficient strategy for improving performance on minority classes in such an extremely imbalanced dataset.
  • Class weights provided efficient improvements for the majority classes (a minimal setup is sketched below).
  • Oversampling was neither effective nor efficient: it gave no meaningful benefit over class weights while wasting compute on duplicated samples.
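
For reference, the class-weight strategy (Scenario 3) can be set up along these lines, assuming scikit-learn and PyTorch (the label array below is illustrative; in practice it would be the training-set genre ids):

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label array; in practice these are the training-set genre ids.
train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])

# "balanced" weights are inversely proportional to class frequency,
# so rare genres contribute more to the loss.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(train_labels),
                               y=train_labels)
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))
```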

Total running time (T4 GPU): preprocessing (3 min) / Scenario 1 (1h) / Scenario 2 (2h) / Scenario 3 (1.5h)


BERT

Genre classification by lyrics with BERT pretrained embedding.

Code

  • Genre classification with BERT embedding and an LSTM classifier (a rough sketch appears after this list). Open In Colab

  • Test demo of genre classification with BERT embedding and an LSTM classifier. Open In Colab

  • Simple implementation of a CNN classifier. Open In Colab

  • utils.py (for saving the model and downsampling the dataset)
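
A rough sketch of the BERT-embedding + LSTM classifier, assuming PyTorch and Hugging Face transformers with BERT used as a frozen embedding layer (class names and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTMClassifier(nn.Module):
    # LSTM classifier over frozen BERT token embeddings.
    def __init__(self, num_genres: int, hidden_size: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():   # use BERT as a fixed embedding layer
            p.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_genres)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            embeddings = self.bert(input_ids=input_ids,
                                   attention_mask=attention_mask).last_hidden_state
        _, (h, _) = self.lstm(embeddings)
        # Concatenate the final forward and backward hidden states
        return self.fc(torch.cat([h[-2], h[-1]], dim=1))
```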

Reference

  • Hugging Face main: Huggingface
  • Hugging Face Hub usage: Open In Colab
