## Morphological disambiguation
### Part-of-Speech Tagger comparison
I've decided to compare three part-of-speech taggers on the UD Bulgarian corpus. First, I used UDPipe, as the task instructs, and achieved these results:
~~~~
Metrics | Precision | Recall | F1 Score | AlignedAcc
-----------+-----------+-----------+-----------+-----------
UPOS | 97.81 | 97.81 | 97.81 | 97.81
~~~~

### Constraint Grammar
I've tried to write rules to disambiguate the example sentence. The only readings I couldn't get rid of were the ones that weren't POS tags, since they have a different structure I couldn't work around.
Here are the rules that worked:

~~~~
DELIMITERS = "." ;

LIST DET = DET ;
LIST PUNCT = PUNCT ;
LIST NOUN = NOUN ;
LIST VERB = VERB ;
LIST PRON = PRON ;
LIST ADP = ADP ;
LIST PART = PART ;

SECTION

REMOVE DET IF (1C PUNCT) ;
REMOVE PRON IF (1 VERB OR NOUN) ;
REMOVE DET IF (-1C ADP) ;
REMOVE PART ;
~~~~
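To make the mechanics of these rules concrete, here is a minimal Python sketch (a toy illustration, not CG-3 itself) of how a rule like `REMOVE DET IF (1C PUNCT)` operates: each token carries a list of candidate readings, and the rule deletes a reading when the right-hand context is unambiguous (`1C` = "the next token is *certainly* of that class").

```python
def remove_if_next(tokens, target_pos, next_pos):
    """REMOVE target_pos IF (1C next_pos): drop a reading when the
    following token is unambiguously next_pos."""
    out = []
    for i, readings in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else []
        next_is_certain = len(nxt) == 1 and nxt[0] == next_pos
        if next_is_certain and len(readings) > 1:
            # never strip the last remaining reading (CG's safety guarantee)
            readings = [r for r in readings if r != target_pos] or readings
        out.append(readings)
    return out

# "том" is PRON/DET ambiguous; the next token is unambiguously PUNCT,
# so REMOVE DET IF (1C PUNCT) keeps only the PRON reading.
sentence = [["ADP"], ["PRON", "DET"], ["PUNCT"]]
print(remove_if_next(sentence, "DET", "PUNCT"))  # → [['ADP'], ['PRON'], ['PUNCT']]
```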

And the ones that didn't:
~~~~
LIST Gen = Gen ;
LIST Acc = Acc ;
LIST Nom = Nom ;

REMOVE Acc IF (-1C NOUN) ;
REMOVE Nom IF (-1C ADP) ;
~~~~
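The case-based rules presumably fail because the case values live inside the feature string rather than in the POS slot. As a toy Python sketch (an assumed representation, not CG-3), giving each reading an explicit Case feature lets the same kind of rule apply:

```python
def remove_case_after(tokens, case, prev_pos):
    """REMOVE <case> IF (-1C <prev_pos>): drop a reading with the given
    case when the preceding token is unambiguously prev_pos."""
    out = []
    for i, readings in enumerate(tokens):
        # -1C checks the already-disambiguated left context
        prev = out[i - 1] if i > 0 else []
        prev_certain = len(prev) == 1 and prev[0]["pos"] == prev_pos
        if prev_certain and len(readings) > 1:
            kept = [r for r in readings if r.get("case") != case]
            readings = kept or readings  # keep at least one reading
        out.append(readings)
    return out

# "работы" after an unambiguous noun: REMOVE Acc IF (-1C NOUN)
tokens = [
    [{"pos": "NOUN", "case": "Nom"}],
    [{"pos": "NOUN", "case": "Gen"}, {"pos": "NOUN", "case": "Acc"}],
]
result = remove_case_after(tokens, "Acc", "NOUN")
print(result[1])  # → [{'pos': 'NOUN', 'case': 'Gen'}]
```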

In the end, I was left with this:
~~~~
"<Однако>"
"однако" ADV Degree=Pos
"<стиль>"
"стиль" NOUN Animacy=Inan Case=Nom Gender=Masc Number=Sing
"стиль" NOUN Animacy=Inan Case=Acc Gender=Masc Number=Sing
"<работы>"
"работа" NOUN Animacy=Inan Case=Gen Gender=Fem Number=Sing
"работа" NOUN Animacy=Inan Case=Nom Gender=Fem Number=Plur
"работа" NOUN Animacy=Inan Case=Acc Gender=Fem Number=Plur
"<Семена>"
"Семен" PROPN Animacy=Anim Case=Gen Gender=Masc Number=Sing
"Семен" PROPN Animacy=Anim Case=Acc Gender=Masc Number=Sing
"<Еремеевича>"
"Еремеевич" PROPN Animacy=Anim Case=Gen Gender=Masc Number=Sing
"<заключался>"
"заключаться" VERB Aspect=Imp Gender=Masc Mood=Ind Number=Sing Tense=Past VerbForm=Fin Voice=Mid
"<в>"
"в" ADP
"<том>"
"то" PRON Animacy=Inan Case=Loc Gender=Neut Number=Sing
; "тот" DET Case=Loc Gender=Neut Number=Sing REMOVE:16
; "тот" DET Case=Loc Gender=Masc Number=Sing REMOVE:16
"<,>"
"," PUNCT
"<чтобы>"
"чтобы" SCONJ Mood=Cnd
"<принимать>"
"принимать" VERB Aspect=Imp VerbForm=Inf Voice=Act
"<всех>"
"весь" DET Case=Gen Number=Plur
"весь" DET Case=Loc Number=Plur
"весь" DET Case=Acc Number=Plur
; "все" PRON Animacy=Anim Case=Acc Number=Plur REMOVE:20
; "все" PRON Animacy=Anim Case=Gen Number=Plur REMOVE:20
"<желающих>"
"желать" VERB Aspect=Imp Case=Gen Number=Plur Tense=Pres VerbForm=Part Voice=Act
"<и>"
"и" CCONJ
; "и" PART REMOVE:26
"<лично>"
"лично" ADV Degree=Pos
"<вникать>"
"*вникать"
"<в>"
"в" ADP
"<дело>"
"дело" NOUN Animacy=Inan Case=Nom Gender=Neut Number=Sing
"дело" NOUN Animacy=Inan Case=Acc Gender=Neut Number=Sing
"<.>"
"." PUNCT
~~~~
### Improving perceptron tagger
Here are the original results of the perceptron tagger on Spanish UD:
~~~~
Metrics | Precision | Recall | F1 Score | AlignedAcc
-----------+-----------+-----------+-----------+-----------
UPOS | 94.72 | 94.72 | 94.72 | 94.72
~~~~
Here, I've managed to slightly improve the perceptron tagger on *UD Spanish* by changing the suffix and pref1 feature parameters.
I changed pref1 to use the first three characters, *add('i pref1', word[0:3])*, and got the following results:
~~~~
Metrics | Precision | Recall | F1 Score | AlignedAcc
-----------+-----------+-----------+-----------+-----------
UPOS | 95.58 | 95.58 | 95.58 | 95.58
~~~~
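For reference, the feature change above can be sketched as follows (modelled on the standard averaged-perceptron tagger's feature function; the names here are approximate, not the exact source):

```python
def get_features(word, prev_tag, prev2_tag):
    """Sparse binary features for one token, as in an averaged-perceptron
    POS tagger (simplified: the real function also uses context words)."""
    features = {}
    def add(name, *args):
        features[' '.join((name,) + args)] = 1
    add('bias')
    add('i suffix', word[-3:])    # last three characters
    add('i pref1', word[0:3])     # changed: first three characters instead of word[0]
    add('i-1 tag', prev_tag)
    add('i-2 tag', prev2_tag)
    add('i word', word)
    return features

feats = get_features('hablando', '-START-', '-START2-')
print('i pref1 hab' in feats)   # → True
print('i suffix ndo' in feats)  # → True
```

The richer prefix feature gives the model a handle on derivational prefixes, which is plausibly why it helps on Spanish.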
Then, I played a bit with the suffix features, changing all of the suffix instances from the last three characters (word[-3:]) to the last two (word[-2:]), which scored slightly lower than the prefix change:
~~~~
Metrics | Precision | Recall | F1 Score | AlignedAcc
-----------+-----------+-----------+-----------+-----------
UPOS | 95.21 | 95.21 | 95.21 | 95.21
~~~~
from io import open
from conllu import parse_incr

def main():
    print('Extract sentences from the file:')
    test_file = str(input())

    print('Write sentences to file:')
    res_file = str(input())

    print('Write gold sentences to file:')
    gold_file = str(input())

    data = open(test_file, 'r', encoding='utf-8')
    all_sent = []
    gold_sent = []

    for tokenlist in parse_incr(data):
        # raw sentence from the '# text = ...' comment line
        all_sent.append(tokenlist.metadata['text'])
        sent = []
        for token in tokenlist:
            sent.append(token['form'])
        gold_sent.append(' '.join(sent))

    res = open(res_file, 'w', encoding='utf-8')
    for line in all_sent:
        res.write(line + '\n')
    res.close()
    print('{} sentences were extracted and written to {}'.format(len(all_sent), res_file))

    gd = open(gold_file, 'w', encoding='utf-8')
    for line in gold_sent:
        gd.write(line + '\n')
    gd.close()
    print('{} sentences were extracted and written to {}'.format(len(gold_sent), gold_file))

if __name__ == '__main__':
    main()
2018-komp-ling/practicals/segmentation-tokenization/create_dict.py
from io import open
from conllu import parse_incr

def main():
    print('Name of conllu file to parse:')
    train = str(input())
    print('Name of dictionary to save:')
    name = str(input())

    all_sent = []
    data_file = open(train, 'r', encoding='utf-8')
    for tokenlist in parse_incr(data_file):
        for token in tokenlist:
            all_sent.append(token['form'])
    data_file.close()

    # unique word forms, longest first
    k = list(set(all_sent))
    k.sort(key=lambda s: len(s), reverse=True)

    diction = open(name, 'w', encoding='utf-8')
    for line in k:
        diction.write(line + '\n')
    diction.close()
    print('Dictionary successfully created. Number of words is {}'.format(len(k)))

if __name__ == '__main__':
    main()
2018-komp-ling/practicals/segmentation-tokenization/evaluation.py
def main():
    print('Insert file name with original segmentation')
    f = str(input())
    print('Insert file name with your segmentation')
    p = str(input())

    all_sent = open(f, 'r', encoding='utf-8').read().splitlines()
    parsed_s = open(p, 'r', encoding='utf-8').read().splitlines()

    tp, tn, fp, fn = 0, 0, 0, 0

    # Compare the two segmentations character by character:
    # a shared space is a true positive, a shared non-space a true negative,
    # a space only in the gold line a false negative, only in ours a false positive.
    for ind in range(min(len(all_sent), len(parsed_s))):
        first = list(all_sent[ind])
        second = list(parsed_s[ind])
        i, j = 0, 0
        while i < len(first) and j < len(second):
            g_s_l = first[i]
            p_s_l = second[j]
            if g_s_l == ' ' and p_s_l == ' ':
                tp += 1
                i += 1
                j += 1
            elif g_s_l == ' ':
                fn += 1
                i += 1
            elif p_s_l == ' ':
                fp += 1
                j += 1
            else:
                tn += 1
                i += 1
                j += 1

    print('TruePositive: {0}, TrueNegative: {1}, FalsePositive: {2}, FalseNegative: {3}'.format(tp, tn, fp, fn))
    acc = (tp + tn) / float(tp + tn + fp + fn)
    prec = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    fscore = 2 * prec * recall / float(prec + recall)
    print('Accuracy: {}, F-score: {}'.format(round(acc, 2), round(fscore, 2)))

if __name__ == '__main__':
    main()
2018-komp-ling/practicals/segmentation-tokenization/maxmatch.py
parsed_sent = []

def maxmatch(sentence, dictionary):
    """Greedy longest-match segmentation: repeatedly take the longest
    dictionary word found at the start of the remaining sentence."""
    global parsed_sent
    if len(sentence) == 0:
        return
    for i in range(len(sentence), 0, -1):
        firstword = sentence[0:i]
        remainder = sentence[i:]
        if firstword in dictionary:
            parsed_sent.append(firstword)
            return maxmatch(remainder, dictionary)
        if i == 1:
            # no dictionary match: emit the first character and continue
            parsed_sent.append(firstword)
            return maxmatch(remainder, dictionary)

def main():
    global parsed_sent
    print('Insert name of dict:')
    n_dict = str(input())
    # a set makes the membership test fast
    used_dict = set(open(n_dict, 'r', encoding='utf-8').read().splitlines())

    print('Sentences to parse:')
    n_test_sent = str(input())
    sentences = open(n_test_sent, 'r', encoding='utf-8').read().splitlines()

    print('Save to:')
    s_file = str(input())

    res = []
    for sent in sentences:
        maxmatch(sent, used_dict)
        res.append(parsed_sent)
        parsed_sent = []

    save = open(s_file, 'w', encoding='utf-8')
    for i in res:
        save.write(' '.join(i) + '\n')
    save.close()

if __name__ == '__main__':
    main()