Skip to content
/ Probing Public

This repository contains the code for a Universal Dependencies parser to create probing datasets, based on SentEval.

Notifications You must be signed in to change notification settings

RohinV/Probing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 

Repository files navigation

CoNLL-U UD Parser

This repository contains the code for a Universal Dependencies parser to create probing datasets, based on SentEval. The code script titled "Connlu_UD_parser" takes the languages treebanks from Universal Dependenciies and converts them from the conllu format to SentEval format for efficient probing. The script consists the code to convert any UD CoNLL-U language dataset to create probing datasets for the tasks mention in Conneau et al. (2018). The tasks include: Sentence length, Word content, Bigram shift, Tree depth, Tense, Subject number, Object number, Semantic odd man out, and Coordination inversion.


πŸ“š Features

This parser supports:

  • βœ… Sentence length binning
  • βœ… Dependency tree depth computation
  • βœ… Subject and object number extraction
  • βœ… Tense extraction
  • βœ… Bigram shift generation
  • βœ… Word content extraction
  • βœ… Odd-man-out (SOMO) task generation
  • βœ… Coordination inversion task generation
  • βœ… Automatic data splitting (train, test, validation)
  • βœ… Output writing for SentEval-ready format

Requirements

Make sure you have the following Python packages installed:

pip install numpy conllu tqdm

To use the parser, you need a mapping of languages and their CoNLL-U file paths.

from Conllu_UD_Parser import Conllu_UD_Parser

file_mapping = {
    "{name of language}": [
        "{path to UD language train data}",
        "{path to UD language dev data}"
    ],
    # add subsequent languages, if necessaary
    }
parser = Conllu_UD_Parser(file_mapping=file_mapping)
output_paths = parser.process()
print(output_paths)

About

This repository contains the code for a Universal Dependencies parser to create probing datasets, based on SentEval.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages