n-waves · benjaminvdb · Feb 20, 2019 · Feb 20, 2019
diff --git a/experiments/dutch/README.md b/experiments/dutch/README.md
@@ -0,0 +1,26 @@
+# ULMFiT experiments for Dutch
+
+````
+Author:      Benjamin van der Burgh
+Email:       b.van.der.burgh@liacs.leidenuniv.nl
+Affiliation: LIACS, Universiteit Leiden, Leiden, The Netherlands
+````
+
+
+## Description
+
+This folder contains experiments that were done with ULMFiT and compared against various baselines. The dataset used is [110kDBRD](https://github.com/benjaminvdb/110kDBRD).
+
+## LM weights
+
+LM trained on Dutch Wikipedia: http://bit.ly/2trOhzq
+
+## Results
+
+````
+ULMFiT pre-trained:     93.84%
+ULMFiT no pre-train:    92.55%
+SVM:                    89.16%
+Flair with fastText:    88.48%
+fastText:               80.90%
+````
diff --git a/experiments/dutch/fastText/README.md b/experiments/dutch/fastText/README.md
@@ -0,0 +1,52 @@
+# fastText classifier baseline
+
+## Description
+
+This folder contains scripts that were used to obtain a baseline for the sentiment polarity classification task.
+
+## fastText
+
+### Install
+
+We'll be using the command-line tool, which supports using pre-trained word embeddings. Instructions for downloading and building fastText can be found here: https://github.com/facebookresearch/fastText
+
+### Word embeddings
+
+Pre-trained word embeddings for Dutch can be downloaded from: https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.nl.zip
+
+Extract them to the current directory: `unzip wiki.nl.zip`
+
+## Dataset
+
+### Download
+
+The experiments were run on the 110kDBRD dataset, which can be downloaded from here: https://github.com/benjaminvdb/110kDBRD
+
+### Convert
+
+The 110kDBRD dataset is not in the format fastText expects (one example per line, prefixed with `__label__<class>`) and needs to be converted first. Run `prepare.py` to convert the *extracted* dataset and save `train.txt` and `test.txt` to the current directory.
+
+````
+python ./prepare.py /path/to/110kDBRD
+````
+
+## Modelling
+
+### Train
+
+````
+./train.sh train.txt ./wiki.nl.vec
+Read 26M words
+Number of words:  665350
+Number of labels: 2
+Progress: 100.0% words/sec/thread:  337040 lr:  0.000000 loss:  0.074446 ETA:   0h 0m
+````
+
+### Test
+
+````
+./predict.sh test.txt
+N	10972
+P@1	0.809
+R@1	0.809
+````
diff --git a/experiments/dutch/fastText/predict.sh b/experiments/dutch/fastText/predict.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env sh
+
+TEST_FILE=$1
+
+./fasttext test \
+	model.bin \
+	"$TEST_FILE" 1
diff --git a/experiments/dutch/fastText/prepare.py b/experiments/dutch/fastText/prepare.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python2
+
+import os
+import sys
+import codecs
+import re
+import tarfile
+
+from sklearn.datasets import load_files
+
+
+def convert(input_dir, output_file):
+    """
+    Convert 110kDBRD dataset into fastText compatible format.
+    """
+    regex = re.compile(r'\s+')
+    dataset = load_files(input_dir, encoding='utf-8')
+    with codecs.open(output_file, 'w', encoding='utf-8') as f:
+        buff = u'\n'.join([u'__label__{} {}'.format(target, regex.sub(' ', text).strip()) for target, text in zip(dataset.target, dataset.data)])
+        f.write(buff)
+
+
+def main():
+    """
+    Expects the root of 110kDBRD as input argument. Converts and saves to ./train.txt and ./test.txt
+    """
+    base_dir = sys.argv[1]
+    train_dir = os.path.join(base_dir, 'train')
+    test_dir = os.path.join(base_dir, 'test')
+
+    convert(train_dir, 'train.txt')
+    convert(test_dir, 'test.txt')
+
+
+if __name__ == '__main__':
+    main()
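
A minimal sketch of the line format that `convert` writes, assuming Python 3 and using only the standard library. The helper name `to_fasttext_line` and the two sample reviews are hypothetical and not part of this change. Each example becomes one line of the form `__label__<class> <text>`, with runs of whitespace collapsed to single spaces so that a review never spans more than one line.

```python
import re

# Matches any run of whitespace, including newlines and tabs.
_WHITESPACE = re.compile(r"\s+")


def to_fasttext_line(label, text):
    """Format one labelled example as a single fastText training line."""
    flattened = _WHITESPACE.sub(" ", text).strip()
    return "__label__{} {}".format(label, flattened)


# Hypothetical sample data: (class index, raw review text).
reviews = [
    (1, "Prachtig boek,\n  een aanrader!"),
    (0, "Saai en\tlangdradig. "),
]

lines = [to_fasttext_line(label, text) for label, text in reviews]
print("\n".join(lines))
```

The numeric labels mirror the class indices that `load_files` assigns to the dataset's subfolders, so the output contains `__label__0` and `__label__1` rather than textual class names.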
diff --git a/experiments/dutch/fastText/requirements.txt b/experiments/dutch/fastText/requirements.txt
@@ -0,0 +1 @@
+scikit-learn==0.20.1
diff --git a/experiments/dutch/fastText/train.sh b/experiments/dutch/fastText/train.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env sh
+
+TRAIN_FILE=$1
+PRETRAINED=$2
+
+./fasttext supervised \
+	-input "$TRAIN_FILE" \
+	-output model \
+	-epoch 25 \
+	-wordNgrams 4 \
+	-dim 300 \
+	-loss hs \
+	-thread 7 \
+	-minCount 1 \
+	-lr 1.0 \
+	-verbose 2 \
+	-pretrainedVectors "$PRETRAINED"
diff --git a/experiments/dutch/flair/README.md b/experiments/dutch/flair/README.md
@@ -0,0 +1,7 @@
+# Flair classifier baseline
+
+## Description
+
+[Flair](https://github.com/zalandoresearch/flair) with [fastText](https://github.com/facebookresearch/fastText) Dutch word embeddings was used to obtain a baseline for the sentiment polarity classification task.
+
+The folder simply includes a notebook that shows the experiments and their results.