123 commits
b60c782
Add files via upload
anya-bel Jan 29, 2019
fc42683
Delete dialog2016_bank.csv
anya-bel Jan 30, 2019
63113fc
Delete dialog2016_tkk.csv
anya-bel Jan 30, 2019
ec8191b
Sentiment analysis from Dialog-2016 with 3 classes
anya-bel Jan 30, 2019
4152574
Delete dialog2016_bank.csv
anya-bel Jan 30, 2019
2af422e
Delete dialog2016_tkk.csv
anya-bel Jan 30, 2019
8db5d32
Sentiment analysis from Dialog-2016 with 3 classes
anya-bel Jan 30, 2019
52947fa
Delete dialog2016_bank.csv
anya-bel Jan 30, 2019
c47d3ae
Delete dialog2016_tkk.csv
anya-bel Jan 30, 2019
40da665
Create init.py
anya-bel Jan 30, 2019
d57c1cd
Sentiment analysis from Dialogue-16 with 3 classes
anya-bel Jan 30, 2019
867970e
Delete init.py
anya-bel Jan 30, 2019
46dda0f
Merge pull request #4 from comptechml/dialog_data
anya-bel Jan 30, 2019
1c2539d
Add datasets for rubric, readability and tag classifiers
anya-bel Jan 31, 2019
0c6d041
Merge pull request #5 from comptechml/taiga-rubric
anya-bel Jan 31, 2019
a0511a2
Added .csv for paraphrase detection with class 0 (data from dialog2016)
lbdlbdlbdl Jan 31, 2019
54251a1
Add example for new embedding
AlekseyPauls Jan 29, 2019
b051d12
Merge branch 'SentEvalRu'
AlekseyPauls Jan 31, 2019
3ebb59c
Merge remote-tracking branch 'origin/dialog_data'
AlekseyPauls Jan 31, 2019
f089443
Add files via upload
mishasweetpie Feb 1, 2019
c42dc02
Add fasttext-idf and dataset for paraphrases
AlekseyPauls Feb 1, 2019
308b364
Add dataset for sst5
AlekseyPauls Feb 1, 2019
2bb5a66
Add files via upload
comptechml Feb 1, 2019
1d01287
Merge pull request #7 from comptechml/kinopoisk
comptechml Feb 1, 2019
d5e296c
Delete kinopoisk.csv
comptechml Feb 1, 2019
cc12b1f
Delete kinopoisk.csv
comptechml Feb 1, 2019
d2d63bc
Create Genre classification
comptechml Feb 1, 2019
824e8f3
Delete Genre classification
mishasweetpie Feb 1, 2019
307056d
add dataset from kinopoisk.ru for genre classification
mishasweetpie Feb 1, 2019
972987d
Merge pull request #9 from comptechml/kinopoisk
mishasweetpie Feb 2, 2019
1c939df
So much changes, so lazy to commit
AlekseyPauls Feb 2, 2019
52380b2
new dataset from kinopoisk.ru for genre classification
mishasweetpie Feb 2, 2019
d8c8f2c
Merge pull request #10 from MikhailGarila/kinopoisk
mishasweetpie Feb 2, 2019
be57f1c
Update documentation
AlekseyPauls Feb 2, 2019
d24dfea
Merge remote-tracking branch 'origin/master'
Feb 2, 2019
4dc0318
baf fixed at data\readme
lbdlbdlbdl Feb 2, 2019
89007ef
bag fixed at data\README
lbdlbdlbdl Feb 2, 2019
5eb9c98
Add example for new embedding(bert)
nshugalevkaia Feb 2, 2019
ede0835
Add example for new embedding(bert)
nshugalevkaia Feb 2, 2019
8051feb
Merge branch 'master' of https://github.com/comptechml/russian_benchmark
nshugalevkaia Feb 2, 2019
8e481ff
Universal Sentence Encoder with translator
KaterinaTimasova Feb 2, 2019
ad7b25b
Merge branch 'master' of https://github.com/comptechml/russian_benchmark
KaterinaTimasova Feb 2, 2019
9769002
open and use USE with translator
KaterinaTimasova Feb 2, 2019
24b3e22
Add tests and new datasets
AlekseyPauls Feb 2, 2019
cc69b1c
Parsers for taiga and dialogue 2016
anya-bel Feb 3, 2019
2a67bcb
Update readme, add demo
AlekseyPauls Feb 3, 2019
60c5106
edited the list of current tasks in readme
lbdlbdlbdl Feb 3, 2019
f59369c
Added parsers for koziev, rubtsova
lbdlbdlbdl Feb 3, 2019
4cb2fb3
Merge pull request #14 from comptechml/helpers
lbdlbdlbdl Feb 3, 2019
a02bff8
del not completed doc
lbdlbdlbdl Feb 3, 2019
4c7a876
Update README.md
lbdlbdlbdl Feb 3, 2019
dc0320b
Update README.md
lbdlbdlbdl Feb 3, 2019
c907238
Update README.md
lbdlbdlbdl Feb 3, 2019
8956eaa
Update README.md
lbdlbdlbdl Feb 3, 2019
5bb7bd6
Delete README.ru.md
lbdlbdlbdl Feb 3, 2019
46db3cf
Updates after USE and Bert fixing.
bond005 Feb 5, 2019
18aa471
Merge pull request #15 from comptechml/skip_thougths_fixes
AlekseyPauls Feb 5, 2019
643a5e9
Requirements.txt have been added.
bond005 Feb 5, 2019
e769acd
Extra files have been removed.
bond005 Feb 5, 2019
782aa84
Requirements.txt have been modified.
bond005 Feb 5, 2019
470a096
All examples have been standartized.
bond005 Feb 6, 2019
159c8ed
Data directories for examples have been added.
bond005 Feb 6, 2019
e1d058c
Automatic downloading of the SkipThought model has been added.
bond005 Feb 6, 2019
25cd166
Automatic downloading of the SkipThought model has been added.
bond005 Feb 6, 2019
0ede1fb
tensorflow-gpu has been added into requirements.txt.
bond005 Feb 6, 2019
1a6cbfd
Automatic downloading of the FastText model has been added.
bond005 Feb 6, 2019
8165c5d
Some fixes in skip_thoughts.py.
bond005 Feb 8, 2019
d96a593
Pathes in all example scripts has been standartized.
bond005 Feb 8, 2019
dbb0875
Data directory for the Bert models has been added.
bond005 Feb 8, 2019
5b21c95
Some bugs in working with Bert have been fixed.
bond005 Feb 8, 2019
ad546a7
Using of Bert embeddings has been improved.
bond005 Feb 10, 2019
35c6dea
Using of Bert embeddings has been improved.
bond005 Feb 10, 2019
b9f67f0
Using of Bert embeddings has been improved.
bond005 Feb 10, 2019
a5106fd
Using of Bert embeddings has been improved.
bond005 Feb 10, 2019
ecffe2d
Using of Bert embeddings has been improved.
bond005 Feb 10, 2019
ceb11d1
Working with Bert has been based on the keras-bert library.
bond005 Feb 10, 2019
8886709
Working with Bert has been based on the keras-bert library.
bond005 Feb 10, 2019
12cf05b
Working with Bert has been based on the keras-bert library.
bond005 Feb 10, 2019
35dd6b2
Working with Bert has been based on the keras-bert library.
bond005 Feb 10, 2019
2c8a3e2
Working with Bert has been based on the keras-bert library.
bond005 Feb 10, 2019
5cc0427
Limit to GPU memory utilization has been added.
bond005 Feb 11, 2019
62d238b
FullTokenizer from the Google's Bert has been added.
bond005 Feb 11, 2019
f21f98a
FullTokenizer from the Google's Bert has been added.
bond005 Feb 11, 2019
2db67ea
Fix tasks list, remove unusing files, rework fasttext-idf (not succes…
AlekseyPauls Feb 12, 2019
55bc30f
Time tracking
AlekseyPauls Feb 13, 2019
7b691f3
Bert embeddings have been fixed.
bond005 Feb 15, 2019
020517a
Merge pull request #16 from comptechml/Refactoring
bond005 Feb 17, 2019
151b34b
The USE has been removed.
bond005 Feb 19, 2019
6304b61
Merge pull request #17 from comptechml/Refactoring
bond005 Feb 19, 2019
c4e705f
Update README.md
anya-bel Feb 20, 2019
771427e
Update README.md
anya-bel Feb 20, 2019
6c47ff6
Update README.md
anya-bel Feb 20, 2019
0da3c2f
Update README.md
anya-bel Feb 20, 2019
b521a68
Update README.md
anya-bel Feb 20, 2019
ff3fe33
Merge pull request #18 from comptechml/documentation2.0
anya-bel Feb 20, 2019
952411a
Update README.md
anya-bel Feb 20, 2019
778ee82
Create READMEen.md
anya-bel Feb 20, 2019
29ce900
Update README.md
anya-bel Feb 20, 2019
44c3456
Update READMEen.md
anya-bel Feb 20, 2019
45bd136
Update READMEen.md
anya-bel Feb 20, 2019
7025c2e
Update READMEen.md
anya-bel Feb 20, 2019
985b6d2
Update READMEen.md
anya-bel Feb 20, 2019
815b66d
Update READMEen.md
anya-bel Feb 20, 2019
7570e0d
Update READMEen.md
anya-bel Feb 20, 2019
531cdd6
Update READMEen.md
anya-bel Feb 20, 2019
838c06e
Update READMEen.md
anya-bel Feb 20, 2019
fbfbb5e
Update READMEen.md
anya-bel Feb 20, 2019
8ebbee9
Update README.md
anya-bel Feb 20, 2019
5c07955
Update READMEen.md
anya-bel Feb 20, 2019
20fa974
Update READMEen.md
anya-bel Feb 20, 2019
2ec0ba2
Update README.md
anya-bel Feb 20, 2019
3c359d6
Update READMEen.md
anya-bel Feb 20, 2019
96b733d
Update READMEen.md
anya-bel Feb 20, 2019
809ef68
Update README.md
anya-bel Feb 20, 2019
266d927
Update READMEen.md
anya-bel Feb 20, 2019
8ebc252
Update READMEen.md
anya-bel Feb 20, 2019
276ba6c
Update READMEen.md
anya-bel Feb 20, 2019
1d26f8b
Merge branch 'master' into documentation2.0
anya-bel Feb 20, 2019
204d2d3
Merge pull request #19 from comptechml/documentation2.0
anya-bel Feb 20, 2019
a31f2bc
Update README.md
anya-bel Feb 20, 2019
2d9d7a4
Update READMEen.md
anya-bel Feb 20, 2019
c59dd17
Merge pull request #20 from comptechml/documentation2.0
anya-bel Feb 20, 2019
e96f712
Update README.md
bond005 Mar 21, 2019
28 changes: 28 additions & 0 deletions .gitignore
@@ -0,0 +1,28 @@

.idea
senteval/__pycache__
senteval/tools/__pycache__
examples/fasttext
examples/glove
examples/skip-thoughts-files
data/MRPC/paraphrases.csv
data/MRPC/paraphrases_label_0.csv
data/MRPC/paraphrases_label_0_corrected.csv
data/MRPC/partition.py
data/Readability classifier/clean.csv
data/Readability classifier/nplus_readability.csv
data/Readability classifier/script.py
data/Tags classifier/interfax_tag.csv
data/Tags classifier/script.py
data/Readability classifier/script.py
data/SST/binary/convert.py
data/SST/binary/negatives.csv
data/SST/binary/positives.csv
data/SST/dialog-2016/script.py
data/SST/dialog-2016/dialog2016.csv
data/Rubric classifier
data/translated
data/Poems classifier/script.py
data/Poems classifier/poems_genre.csv
data/Proza classifier/proza_rubric.csv
data/Proza classifier/script.py
291 changes: 117 additions & 174 deletions README.md

Large diffs are not rendered by default.

191 changes: 191 additions & 0 deletions READMEen.md
@@ -0,0 +1,191 @@
# SentEvalRu

[Russian](https://github.com/comptechml/SentEvalRu/blob/master/README.md)|[English](https://github.com/comptechml/SentEvalRu/blob/master/READMEen.md)
-
This project is dedicated to creating a library for evaluating the quality of [sentence embeddings](https://en.wikipedia.org/wiki/Sentence_embedding) for the Russian language. We assess their generalization power by using them as features on a broad and diverse set of tasks. SentEvalRu currently includes 17 NLP tasks.

**Our goal** is to evaluate different text representation algorithms on Russian datasets. This is the first attempt at such a task for the Russian language.
We were inspired to create this library by [SentEval](https://arxiv.org/abs/1803.05449) [1].

This project was carried out during the winter school [ComptechNsk'19](http://comptech.nsk.su/). The idea of creating SentEvalRu belongs to the [MIPT](https://mipt.ru/english/) Neural Networks and Deep Learning Lab, which develops the artificial intelligence system [iPavlov](https://ipavlov.ai/).

Project participants:
- [Mosolova Anna](https://github.com/anya-bel) (project manager)
- [Obukhova Alisa](https://github.com/lbdlbdlbdl) (technical writer)
- [Pauls Aleksey](https://github.com/AlekseyPauls) (engineer)
- [Stroganov Mikhail](https://github.com/MikhailStroganov) (engineer)
- [Timasova Ekaterina](https://github.com/KaterinaTimasova) (researcher)
- [Shugalevskaya Natalya](https://github.com/nshugalevkaia) (researcher)

### What tasks does it help to solve?
*Sentence embeddings* are used in a wide range of tasks where NLP systems are required. For example:
- intent classifier;
- [QA systems](https://en.wikipedia.org/wiki/Question_answering);
- [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis);
- [machine translation](https://en.wikipedia.org/wiki/Machine_translation);
- [document clustering](https://en.wikipedia.org/wiki/Document_clustering).

Our tool helps to evaluate sentence embeddings, which can be useful for anyone who solves these tasks or studies the quality of embeddings for Russian.

### Available models of text representation

Our project currently includes the following text representation models for the Russian language:
- [Bert](https://arxiv.org/pdf/1810.04805.pdf) [2]
- [FastText](https://fasttext.cc/) [4]
- [FastText](https://fasttext.cc/)+[IDF](https://en.wikipedia.org/wiki/Tf–idf) [4] [5]
- [Skip-Thought](https://arxiv.org/abs/1506.06726) [6]

## Evaluation and tasks
There is no way to evaluate the quality of embeddings directly, so we can only solve some NLP tasks using these embeddings and judge them by the results of the resulting systems.

For example, we can use the following tasks:
- sentiment analysis;
- named-entity recognition;
- topic modelling;
etc.

We suggest evaluating embeddings by means of these tasks:

|Tag| Task | Type | Description |
|----------------|-------------|---------------------------|--------------------------------|
|MRPC| [MRPC](https://github.com/Koziev/NLP_Datasets/tree/master/ParaphraseDetection/Data) | paraphrase detection | Detect whether one sentence is the paraphrase of another one |
|SST-3| [SST/dialog-2016](http://www.dialog-21.ru/evaluation/2016/sentiment/) | ternary sentiment analysis | Detect a text sentiment (positive (1), neutral (0), negative (-1))|
|SST-2| [SST/binary](http://study.mokoron.com/) | binary sentiment analysis | Detect a text sentiment (positive (1), negative (-1)) |
|TagCl| [Tags classifier](https://tatianashavrina.github.io/taiga_site/downloads) | tag classifier | Detect a tag of news from Interfax Corpus |
|ReadabilityCl| [Readability classifier](https://tatianashavrina.github.io/taiga_site/downloads) | readability classifier | Detect a readability grade of a text (1-10) |
|PoemsCl| [Poems classifier](https://tatianashavrina.github.io/taiga_site/) | genre classifier | Detect poem's genre |
|ProzaCl| [Proza classifier](https://tatianashavrina.github.io/taiga_site/) | genre classifier | Detect prose's genre |
|TREC| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) (translated) | question-type classification | Detect a type of a question (about entity, human, description, location etc.) |
|SICK| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) (translated) | natural language inference | Detect whether the second sentence entails, contradicts, or is neutral with respect to the first one |
|STS| [STS](https://www.cs.york.ac.uk/semeval-2012/task6/) (translated) | semantic textual similarity | Detect a semantic similarity grade of two texts |

Further information is available in **/data**

---

## Prerequisites

Install all required modules before you start:

* Python 3 with [NumPy](http://www.numpy.org/)/[SciPy](http://www.scipy.org/)
* [Pytorch](http://pytorch.org/)>=0.4
* [scikit-learn](http://scikit-learn.org/stable/index.html)>=0.18.0
* [TensorFlow](https://www.tensorflow.org/) >=1.12.0
* [Keras](https://keras.io/) >=2.2.4
* ...

We recommend using the [Anaconda](https://www.anaconda.com/distribution/) distribution, or you can just run the following command:
```
pip3 install -r requirements.txt
```

## Setup
```
git clone https://github.com/comptechml/SentEvalRu.git
cd SentEvalRu
```
Store your datasets in */data*; you can add your own examples (new embeddings) to */examples*.

## Examples

The available tasks for Russian are located in */examples*.

## How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

1. **prepare** (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
2. **batcher** (transforms a batch of text sentences into sentence embeddings)


### 1.) prepare(params, samples) (optional)

*batcher* only sees one batch at a time while the *samples* argument of *prepare* contains all the sentences of a task.

```
prepare(params, samples)
```
* *params*: senteval parameters.
* *samples*: list of all sentences from the transfer task.
* *output*: No output. Arguments stored in "params" can further be used by *batcher*.

*Example*: in bow.py, *prepare* is used to build the vocabulary of words and construct the *params.word_vec* dictionary of word vectors.
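
The *prepare* step described above can be sketched roughly as follows. This is a minimal illustration, not the code from bow.py: it assumes each sample is a list of tokens, uses a hypothetical `TOY_VECTORS` table in place of real pretrained vectors, and uses `types.SimpleNamespace` as a stand-in for the senteval parameters object. The `wvec_dim` attribute is likewise an illustrative name.

```python
from types import SimpleNamespace

# Hypothetical pretrained vectors; a real setup would load them from a file.
TOY_VECTORS = {
    "кошка": [1.0, 0.0],
    "собака": [0.0, 1.0],
    "спит": [0.5, 0.5],
}


def prepare(params, samples):
    """Collect the task vocabulary and keep only the vectors it needs."""
    vocabulary = {word for sentence in samples for word in sentence}
    params.word_vec = {
        word: TOY_VECTORS[word] for word in vocabulary if word in TOY_VECTORS
    }
    # Remember the vector size so that batcher can build fallback vectors.
    params.wvec_dim = len(next(iter(TOY_VECTORS.values())))


# Stand-in for the parameters object passed by the evaluation engine.
params = SimpleNamespace()
samples = [["кошка", "спит"], ["собака", "лает"]]
prepare(params, samples)
print(sorted(params.word_vec))
```

Words without a vector (here "лает") are simply left out of `params.word_vec`, so *batcher* has to decide how to handle them.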


### 2.) batcher(params, batch)
```
batcher(params, batch)
```
* *params*: senteval parameters.
* *batch*: numpy array of text sentences (of size params.batch_size)
* *output*: numpy array of sentence embeddings (of size params.batch_size)

*Example*: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.
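
A bag-of-words *batcher* of the kind described above might look like the sketch below. It is an illustration under stated assumptions rather than the bow.py implementation: sentences are assumed to be lists of tokens, `params` is mocked with `types.SimpleNamespace`, and `wvec_dim` is a hypothetical attribute holding the vector size.

```python
from types import SimpleNamespace

import numpy as np


def batcher(params, batch):
    """Embed each sentence as the mean of its known word vectors."""
    embeddings = []
    for sentence in batch:
        vectors = [params.word_vec[w] for w in sentence if w in params.word_vec]
        if vectors:
            embeddings.append(np.mean(vectors, axis=0))
        else:
            # No known words: fall back to a zero vector of the right size.
            embeddings.append(np.zeros(params.wvec_dim))
    # One row per sentence, one column per embedding dimension.
    return np.vstack(embeddings)


# Stand-in for the parameters object that prepare would have filled in.
params = SimpleNamespace(
    word_vec={"кошка": np.array([1.0, 0.0]), "спит": np.array([0.0, 1.0])},
    wvec_dim=2,
)
embeddings = batcher(params, [["кошка", "спит"], ["неизвестно"]])
print(embeddings.shape)
```

To evaluate another encoder, the body of the loop would be replaced by a call to that encoder, keeping the one-row-per-sentence output shape.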

### 3.) evaluation on transfer tasks

After implementing the *batcher* and *prepare* functions for your own sentence encoder, run the evaluation as follows:

1) import senteval and set its parameters:
```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
```

2) (optional) set the parameters of the classifier (when applicable):
```python
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
'tenacity': 5, 'epoch_size': 4}
```
You can choose **nhid=0** (Logistic Regression) or **nhid>0** (MLP) and define the parameters for training.

3) Create an instance of the class SE:
```python
se = senteval.engine.SE(params, batcher, prepare)
```

4) define the set of transfer tasks and run the evaluation:
```python
transfer_tasks = ['SST2', 'MRPC', 'SICK', 'STS']
results = se.eval(transfer_tasks)
```
The current list of available tasks is:
```python
['SST2', 'SST3', 'MRPC', 'ReadabilityCl', 'TagCl', 'PoemsCl', 'ProzaCl', 'TREC', 'STS', 'SICK']
```

## SentEval parameters
Global parameters of SentEval:
```bash
# senteval parameters
task_path # path to SentEval datasets (required)
seed # seed
usepytorch # use cuda-pytorch (else scikit-learn) where possible
kfold # k-fold validation for MR/CR/SUB/MPQA.
```
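
To make the `kfold` parameter more concrete, here is a rough sketch of how k-fold validation partitions a dataset. It is a plain-Python illustration under simple assumptions (no shuffling, no stratification) and is not taken from the library; `kfold_indices` is a hypothetical helper name.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(10, 3))
for train, test in folds:
    print(len(train), test)
```

Each sample ends up in the test part of exactly one fold, and the reported score is typically an average over the folds.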

Parameters of the classifier:
```bash
nhid: # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim: # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity: # how many times dev acc does not increase before training stops
epoch_size: # each epoch corresponds to epoch_size passes over the train set
max_epoch: # max number of epochs
dropout: # dropout for MLP
```
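
The `tenacity` parameter can be read as an early-stopping patience. The sketch below shows one plausible interpretation of that rule on a precomputed list of dev accuracies; the function name and the exact stopping condition are assumptions for illustration, and the library's own training loop may differ in detail.

```python
def train_with_tenacity(dev_accuracies, tenacity=5):
    """Return (best_accuracy, epochs_run) under a tenacity-style stopping rule.

    dev_accuracies is a hypothetical sequence of dev-set accuracies, one per
    epoch; a real loop would train the classifier and measure them instead.
    """
    best_accuracy = float("-inf")
    epochs_without_gain = 0
    epochs_run = 0
    for accuracy in dev_accuracies:
        epochs_run += 1
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= tenacity:
                break  # dev accuracy has stalled for `tenacity` epochs
    return best_accuracy, epochs_run


best, epochs = train_with_tenacity([0.60, 0.65, 0.64, 0.65, 0.63, 0.70], tenacity=3)
print(best, epochs)
```

With a small `tenacity`, training stops early and can miss a later improvement (the final 0.70 here is never reached); a larger value trades training time for a better chance of finding it.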


## References

[1] A. Conneau, D. Kiela, [*SentEval: An Evaluation Toolkit for Universal Sentence Representations*](https://arxiv.org/abs/1803.05449)

[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](https://arxiv.org/abs/1810.04805)

[3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, ... [*Universal Sentence Encoder*](https://arxiv.org/abs/1803.11175)

[4] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

[5] Martin Klein, Michael L. Nelson, [*Approximating Document Frequency with Term Count Values*](https://arxiv.org/abs/0807.3755)

[6] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler, [*Skip-Thought Vectors*](https://arxiv.org/abs/1506.06726)
28 changes: 28 additions & 0 deletions data/Genre classification/genre_numeration.csv
@@ -0,0 +1,28 @@
фэнтези 0
ужасы 1
музыка 2
мюзикл 3
драма 4
документальный 5
вестерн 6
детектив 7
история 8
мелодрама 9
криминал 10
семейный 11
комедия 12
длявзрослых 13
спорт 14
фильм-нуар 15
концерт 16
аниме 17
приключения 18
новости 19
военный 20
фантастика 21
боевик 22
короткометражка 23
биография 24
триллер 25
детский 26
мультфильм 27