Key differences
- Download dataset directly from Kaggle
- Fully connected classifier using CLS token
- Undersampling / Oversampling / Class Weights for extremely imbalanced data
- Improved validation and visualization
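
As a rough illustration of the "fully connected classifier using CLS token" item, here is a minimal PyTorch sketch. The class name `ClsTokenClassifier`, the hidden size of 768, the 10 output classes, and the layer layout are all assumptions made for illustration, not the project's actual architecture. A random tensor stands in for the encoder's last hidden state.

```python
import torch
from torch import nn


class ClsTokenClassifier(nn.Module):
    """Hypothetical fully connected head applied to the first ([CLS]) token."""

    def __init__(self, hidden_size=768, num_classes=10, dropout=0.1):
        super().__init__()
        # Layer sizes are assumptions, not taken from the project.
        self.head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, last_hidden_state):
        # last_hidden_state: (batch, seq_len, hidden_size).
        # In BERT-style encoders, position 0 holds the [CLS] token.
        cls_embedding = last_hidden_state[:, 0, :]
        return self.head(cls_embedding)  # (batch, num_classes) logits


# Stand-in for encoder output: batch of 4, 128 tokens, 768 features.
dummy_hidden = torch.randn(4, 128, 768)
model = ClsTokenClassifier()
logits = model(dummy_hidden)
print(logits.shape)
```

Only the first token's vector reaches the classifier, so the head's cost does not depend on sequence length.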
| Scenario 1: Undersampling | Scenario 2: Oversampling | Scenario 3: Class Weights |
|---|---|---|
| [result figures] | [result figures] | [result figures] |
- Undersampling proved to be the most effective and efficient strategy for improving performance on minority classes in this extremely imbalanced dataset.
- Class weights provided efficient improvements for majority classes.
- Oversampling was neither effective nor efficient: it gave no meaningful benefit over class weights while wasting computational resources on oversampled duplicates.
Total running time (T4 GPU): preprocessing (3 min) / Scenario 1 (1h) / Scenario 2 (2h) / Scenario 3 (1.5h)
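
The three scenarios compared above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names (`undersample`, `oversample`, `balanced_class_weights`) and the toy label distribution are hypothetical, and the weight formula is the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`.

```python
import random
from collections import Counter, defaultdict


def group_by_label(labels):
    """Map each label to the list of sample indices carrying it."""
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)
    return groups


def undersample(labels, seed=0):
    """Randomly keep only as many samples per class as the rarest class has."""
    rng = random.Random(seed)
    groups = group_by_label(labels)
    target = min(len(idxs) for idxs in groups.values())
    kept = []
    for idxs in groups.values():
        kept.extend(rng.sample(idxs, target))
    return sorted(kept)


def oversample(labels, seed=0):
    """Duplicate minority samples (with replacement) up to the largest class size."""
    rng = random.Random(seed)
    groups = group_by_label(labels)
    target = max(len(idxs) for idxs in groups.values())
    kept = []
    for idxs in groups.values():
        kept.extend(idxs)
        kept.extend(rng.choices(idxs, k=target - len(idxs)))
    return sorted(kept)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy, made-up label distribution (not the project's data).
labels = ["pop"] * 90 + ["rock"] * 9 + ["jazz"] * 1

under_idx = undersample(labels)
over_idx = oversample(labels)
weights = balanced_class_weights(labels)

print(Counter(labels[i] for i in under_idx))
print(Counter(labels[i] for i in over_idx))
print(weights)
```

In this toy case undersampling shrinks the training set to 3 samples and oversampling grows it to 270, while class weights leave the data untouched and only rescale each class's contribution to the loss. That size difference is consistent with oversampling being the slowest scenario in the running times above.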
- Genre classification by lyrics with word2vec embeddings.
- Genre classification with BERT embeddings and an LSTM classifier.
- Test demo of genre classification with BERT embeddings and an LSTM classifier.
- `utils.py` (for saving models and downsampling the dataset)
- Hugging Face main: Hugging Face
- Hugging Face Hub usage:








