26 changes: 26 additions & 0 deletions experiments/dutch/README.md
@@ -0,0 +1,26 @@
# ULMFiT experiments for Dutch

````
Author: Benjamin van der Burgh
Email: b.van.der.burgh@liacs.leidenuniv.nl
Affiliation: LIACS, Universiteit Leiden, Leiden, The Netherlands
````


## Description

This folder contains experiments that were done with ULMFiT and compared against various baselines. The dataset that was used is [110kDBRD](https://github.com/benjaminvdb/110kDBRD).

## LM weights

LM trained on Dutch Wikipedia: http://bit.ly/2trOhzq

## Results

````
ULMFiT pre-trained: 93.84%
ULMFiT no pre-train: 92.55%
SVM: 89.16%
Flair with fastText: 88.48%
fastText: 80.90%
````
52 changes: 52 additions & 0 deletions experiments/dutch/fastText/README.md
@@ -0,0 +1,52 @@
# fastText classifier baseline

## Description

This folder contains scripts that were used to obtain a baseline for the sentiment polarity classification task.

## fastText

### Install

We'll be using the command-line tool, which supports using pre-trained word embeddings. Instructions for downloading and building fastText can be found here: https://github.com/facebookresearch/fastText

### Word embeddings

Pre-trained word embeddings for Dutch can be downloaded from: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.nl.zip

Extract them to the current directory: `unzip wiki.nl.zip`
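Before training, it can help to confirm the dimension of the extracted vectors, since fastText expects the `-dim` value (300 in `train.sh`) to match the pretrained vectors. The sketch below relies on the text `.vec` format, whose first line holds the vocabulary size and the vector dimension. The helper names `parse_vec_header` and `read_vec_header` are illustrative and not part of this folder.

```python
import io
import os


def parse_vec_header(line):
    """Parse the first line of a .vec file: '<vocabulary size> <dimension>'."""
    count, dim = line.split()
    return int(count), int(dim)


def read_vec_header(path):
    """Read only the header line, so the large file is never loaded fully."""
    with io.open(path, 'r', encoding='utf-8') as f:
        return parse_vec_header(f.readline())


# Assumes the archive was extracted to the current directory.
if os.path.exists('wiki.nl.vec'):
    count, dim = read_vec_header('wiki.nl.vec')
    print('{} words, {} dimensions'.format(count, dim))
```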

## Dataset

### Download

The experiments were run on the 110kDBRD dataset, which can be downloaded from here: https://github.com/benjaminvdb/110kDBRD

### Convert

The 110kDBRD dataset stores each review in a separate file, whereas fastText expects one labelled example per line, so the dataset needs to be converted first. Run `prepare.py` on the *extracted* dataset; it writes `train.txt` and `test.txt` to the current directory.

````
python ./prepare.py /path/to/110kDBRD
````
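As a rough illustration of the target format, the sketch below applies the same per-review transformation that `prepare.py` uses: whitespace runs (including newlines) are collapsed to single spaces and the line is prefixed with `__label__<target>`. The function name `to_fasttext_line` and the sample review text are made up for this example. The integer target is the class index assigned by scikit-learn's `load_files`, which numbers class folders in alphabetical order.

```python
import re

_WHITESPACE = re.compile(r'\s+')


def to_fasttext_line(target, text):
    """Format one review as a single fastText training line."""
    return u'__label__{} {}'.format(target, _WHITESPACE.sub(' ', text).strip())


# Hypothetical review text, used only to show the output format.
print(to_fasttext_line(1, u'Prachtig boek!\n\nEen aanrader.'))
# prints: __label__1 Prachtig boek! Een aanrader.
```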

## Modelling

### Train

````
./train.sh train.txt ./wiki.nl.vec
Read 26M words
Number of words: 665350
Number of labels: 2
Progress: 100.0% words/sec/thread: 337040 lr: 0.000000 loss: 0.074446 ETA: 0h 0m
````

### Test

````
./predict.sh test.txt
N 10972
P@1 0.809
R@1 0.809
````
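Because every review carries exactly one label and the model is evaluated on its single top prediction, P@1 and R@1 coincide and both equal plain accuracy, which is why the two numbers above are identical. A minimal sketch of that relationship, using made-up labels rather than the actual test set:

```python
def precision_recall_at_1(gold, predicted):
    """Compute P@1 and R@1 for single-label data with one prediction per example."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    precision = correct / float(len(predicted))  # one prediction per example
    recall = correct / float(len(gold))          # one gold label per example
    return precision, recall


# Toy labels for illustration only.
gold = ['__label__1', '__label__0', '__label__1', '__label__0']
predicted = ['__label__1', '__label__0', '__label__0', '__label__0']
print(precision_recall_at_1(gold, predicted))  # (0.75, 0.75)
```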
7 changes: 7 additions & 0 deletions experiments/dutch/fastText/predict.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env sh

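# Path to the test set in fastText format (e.g. test.txt)
TEST_FILE=$1

# Report precision and recall at k=1 for the trained model
./fasttext test \
    model.bin \
    "$TEST_FILE" 1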
36 changes: 36 additions & 0 deletions experiments/dutch/fastText/prepare.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python2

import os
import sys
import codecs
import re
import tarfile

from sklearn.datasets import load_files


def convert(input_dir, output_file):
    """
    Convert 110kDBRD dataset into fastText compatible format.
    """
    regex = re.compile(r'\s+')
    dataset = load_files(input_dir, encoding='utf-8')
    with codecs.open(output_file, 'w', encoding='utf-8') as f:
        buff = u'\n'.join([u'__label__{} {}'.format(target, regex.sub(' ', text).strip()) for target, text in zip(dataset.target, dataset.data)])
        f.write(buff)


def main():
    """
    Expects the root of 110kDBRD as input argument. Converts and saves to ./train.txt and ./test.txt
    """
    base_dir = sys.argv[1]
    train_dir = os.path.join(base_dir, 'train')
    test_dir = os.path.join(base_dir, 'test')

    convert(train_dir, 'train.txt')
    convert(test_dir, 'test.txt')


if __name__ == '__main__':
    main()
1 change: 1 addition & 0 deletions experiments/dutch/fastText/requirements.txt
@@ -0,0 +1 @@
scikit-learn==0.20.1
17 changes: 17 additions & 0 deletions experiments/dutch/fastText/train.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env sh

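# Training set in fastText format (e.g. train.txt)
TRAIN_FILE=$1
# Pre-trained word vectors in .vec format (e.g. wiki.nl.vec)
PRETRAINED=$2

./fasttext supervised \
    -input "$TRAIN_FILE" \
    -output model \
    -epoch 25 \
    -wordNgrams 4 \
    -dim 300 \
    -loss hs \
    -thread 7 \
    -minCount 1 \
    -lr 1.0 \
    -verbose 2 \
    -pretrainedVectors "$PRETRAINED"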
7 changes: 7 additions & 0 deletions experiments/dutch/flair/README.md
@@ -0,0 +1,7 @@
# Flair classifier baseline

## Description

[Flair](https://github.com/zalandoresearch/flair) with [fastText](https://github.com/facebookresearch/fastText) Dutch word embeddings was used to obtain a baseline for the sentiment polarity classification task.

The folder simply includes a notebook that shows the experiments and their results.