Closed
Description
Hi,
fstalign 1.6.1 does not load FST symbol tables properly and modifies the hypothesis FST so that it is completely broken.
Here is the output of the following command in the current Docker image. It happens both with a txt reference (one gold transcript word per line) and with a ctm reference (time-aligned gold transcript):
/fstalign/bin/fstalign wer --ref /data/customer/ref.txt --hyp /data/customer/hyp.nlp --symbols /data/customer/hyp.sym --output-sbs /data/customer/res.sbs --log /data/customer/res.log
[2023-02-16 10:51:17.854] [console] [info] loggers initialized
[+++] [10:51:17] [console] fstalign version is 1.6.1
[+++] [10:51:17] [console] reading reference plain text from /data/customer/ref.txt
[+++] [10:51:17] [console] reading hypothesis fst from /data/customer/hyp.fst
[+++] [10:51:17] [fstalign] starting conversion to int vector
[+++] [10:51:17] [fstalign] converting ref to int vector
[+++] [10:51:17] [OneBestFstLoader] creating std::vector<int> for OneBestFstLoader for 27 tokens
[+++] [10:51:17] [fstalign] converting hyp to int vector
[+++] [10:51:17] [FstFileLoader] convertToIntVector isn't implemented for FST inputs
[+++] [10:51:17] [fstalign] Either ref or hyp is really small, skipping over the levenstein distance, ref size: 27, hyp size: 0
[+++] [10:51:17] [FstFileLoader] Total FST has 27 states.
[+++] [10:51:17] [fstalign] generating ref synonyms from symbol table
[+++] [10:51:17] [fstalign] applying ref synonyms on ref fst
[+++] [10:51:17] [SynonymEngine] we have 0 registered first word rules label id
[+++] [10:51:17] [fstalign] printing ref fst
[+++] [10:51:17] [fstalign] 0 1 8/hello 8/hello 0.0
[+++] [10:51:17] [fstalign] 1 2 9/i'm 9/i'm 0.0
[+++] [10:51:17] [fstalign] 2 3 10/fine 10/fine 0.0
[+++] [10:51:17] [fstalign] 3 4 11/suzana 11/suzana 0.0
[+++] [10:51:17] [fstalign] 4 5 12/how 12/how 0.0
[+++] [10:51:17] [fstalign] 5 6 13/are 13/are 0.0
[+++] [10:51:17] [fstalign] 6 7 14/you 14/you 0.0
[+++] [10:51:17] [fstalign] 7 8 15/mhm 15/mhm 0.0
[+++] [10:51:17] [fstalign] 8 9 16/sure 16/sure 0.0
[+++] [10:51:17] [fstalign] 9 10 17/yes 17/yes 0.0
[+++] [10:51:17] [fstalign] 10 11 18/okay 18/okay 0.0
[+++] [10:51:17] [fstalign] 11 12 19/ah 19/ah 0.0
[+++] [10:51:17] [fstalign] 12 13 20/just 20/just 0.0
[+++] [10:51:17] [fstalign] 13 14 21/a 21/a 0.0
[+++] [10:51:17] [fstalign] 14 15 22/couple 22/couple 0.0
[+++] [10:51:17] [fstalign] 15 16 23/of 23/of 0.0
[+++] [10:51:17] [fstalign] 16 17 24/minutes 24/minutes 0.0
[+++] [10:51:17] [fstalign] 17 18 9/i'm 9/i'm 0.0
[+++] [10:51:17] [fstalign] 18 19 25/on 25/on 0.0
[+++] [10:51:17] [fstalign] 19 20 26/my 26/my 0.0
[+++] [10:51:17] [fstalign] 20 21 27/way 27/way 0.0
[+++] [10:51:17] [fstalign] 21 22 28/into 28/into 0.0
[+++] [10:51:17] [fstalign] 22 23 21/a 21/a 0.0
[+++] [10:51:17] [fstalign] 23 24 29/doctor's 29/doctor's 0.0
[+++] [10:51:17] [fstalign] 24 25 30/appointment 30/appointment 0.0
[+++] [10:51:17] [fstalign] 25 26 18/okay 18/okay 0.0
[+++] [10:51:17] [fstalign] 26 27 18/okay 18/okay 0.0
[+++] [10:51:17] [fstalign] 27 28 0/<eps> 0/<eps> 0.0
[+++] [10:51:17] [fstalign] printing hyp fst
[+++] [10:51:17] [fstalign] 0 1 23/of 23/of 0.99158907
[+++] [10:51:17] [fstalign] 0 1 24/minutes 24/minutes 0.008410932
[+++] [10:51:17] [fstalign] 1 2 25/on 25/on 0.8511446
[+++] [10:51:17] [fstalign] 1 2 2/<ins> 2/<ins> 0.05130972
[+++] [10:51:17] [fstalign] 1 2 1/<oov> 1/<oov> 0.047072377
[+++] [10:51:17] [fstalign] 1 2 26/my 26/my 0.0155186895
[+++] [10:51:17] [fstalign] 1 2 27/way 27/way 0.005192776
[+++] [10:51:17] [fstalign] 1 2 28/into 28/into 0.0033090003
[+++] [10:51:17] [fstalign] 1 2 23/of 23/of 0.0015563301
[+++] [10:51:17] [fstalign] 2 3 29/doctor's 29/doctor's 0.3886394
[+++] [10:51:17] [fstalign] 2 3 30/appointment 30/appointment 0.19079825
[+++] [10:51:17] [fstalign] 2 3 31/ 31/ 0.10623746
[+++] [10:51:17] [fstalign] 2 3 3/<del> 3/<del> 0.04679449
[+++] [10:51:17] [fstalign] 2 3 2/<ins> 2/<ins> 0.034456342
[+++] [10:51:17] [fstalign] 2 3 25/on 25/on 0.0081548225
[+++] [10:51:17] [fstalign] 2 3 32/ 32/ 0.004374613
[+++] [10:51:17] [fstalign] 2 3 26/my 26/my 0.002221351
[+++] [10:51:17] [fstalign] 2 3 1/<oov> 1/<oov> 0.002075816
[+++] [10:51:17] [fstalign] 2 3 33/ 33/ 0.0019667444
[+++] [10:51:17] [fstalign] 2 3 34/ 34/ 0.0005650157
[+++] [10:51:17] [fstalign] 3 4 4/<sub> 4/<sub> 0.69604653
[+++] [10:51:17] [fstalign] 3 4 3/<del> 3/<del> 0.15710989
[+++] [10:51:17] [fstalign] 3 4 30/appointment 30/appointment 0.014679594
[+++] [10:51:17] [fstalign] 3 4 35/ 35/ 0.011887497
[+++] [10:51:17] [fstalign] 3 4 36/ 36/ 0.0065641715
[+++] [10:51:17] [fstalign] 3 4 2/<ins> 2/<ins> 0.0056938045
[+++] [10:51:17] [fstalign] 3 4 25/on 25/on 0.0021069725
[+++] [10:51:17] [fstalign] 3 4 26/my 26/my 0.001466121
[+++] [10:51:17] [fstalign] 3 4 29/doctor's 29/doctor's 0.0013138258
[+++] [10:51:17] [fstalign] 3 4 1/<oov> 1/<oov> 0.0012702389
[+++] [10:51:17] [fstalign] 3 4 17/yes 17/yes 0.0009687771
[+++] [10:51:17] [fstalign] 4 5 5/<inaudible> 5/<inaudible> 0.5213346
[+++] [10:51:17] [fstalign] 4 5 37/ 37/ 0.17481348
[+++] [10:51:17] [fstalign] 4 5 38/ 38/ 0.14042015
[+++] [10:51:17] [fstalign] 4 5 36/ 36/ 0.053299483
[+++] [10:51:17] [fstalign] 4 5 3/<del> 3/<del> 0.042188246
[+++] [10:51:17] [fstalign] 4 5 39/ 39/ 0.011979131
[+++] [10:51:17] [fstalign] 4 5 2/<ins> 2/<ins> 0.0036785
[+++] [10:51:17] [fstalign] 4 5 35/ 35/ 0.002629472
[+++] [10:51:17] [fstalign] 4 5 30/appointment 30/appointment 0.0022271618
[+++] [10:51:17] [fstalign] 4 5 40/ 40/ 0.002156982
[+++] [10:51:17] [fstalign] 4 5 41/ 41/ 0.0016741576
[+++] [10:51:17] [fstalign] 5 6 6/<silence> 6/<silence> 0.6163309
[+++] [10:51:17] [fstalign] 5 6 42/ 42/ 0.18521181
[+++] [10:51:17] [fstalign] 5 6 36/ 36/ 0.056179322
[+++] [10:51:17] [fstalign] 5 6 43/ 43/ 0.038890716
[+++] [10:51:17] [fstalign] 5 6 44/ 44/ 0.0326784
[+++] [10:51:17] [fstalign] 5 6 3/<del> 3/<del> 0.026641503
[+++] [10:51:17] [fstalign] 5 6 45/ 45/ 0.020304155
[+++] [10:51:17] [fstalign] 5 6 46/ 46/ 0.006060596
[+++] [10:51:17] [fstalign] 5 6 5/<inaudible> 5/<inaudible> 0.0041220332
[+++] [10:51:17] [fstalign] 5 6 30/appointment 30/appointment 0.0033203475
[+++] [10:51:17] [fstalign] 5 6 47/ 47/ 0.0029174143
[+++] [10:51:17] [fstalign] 6 7 48/ 48/ 0.34014535
[+++] [10:51:17] [fstalign] 6 7 24/minutes 24/minutes 0.2984986
[+++] [10:51:17] [fstalign] 6 7 49/ 49/ 0.19404508
[+++] [10:51:17] [fstalign] 6 7 10/fine 10/fine 0.016649699
[+++] [10:51:17] [fstalign] 6 7 0/<eps> 0/<eps> 0.013604832
[+++] [10:51:17] [fstalign] 6 7 50/ 50/ 0.0062978663
[+++] [10:51:17] [fstalign] 6 7 51/ 51/ 0.0049225087
[+++] [10:51:17] [fstalign] 6 7 52/ 52/ 0.0039882683
[+++] [10:51:17] [fstalign] 6 7 23/of 23/of 0.0033028126
[+++] [10:51:17] [fstalign] 6 7 53/ 53/ 0.0029480883
[+++] [10:51:17] [fstalign] 6 7 54/ 54/ 0.002575561
[+++] [10:51:17] [fstalign] 7 8 55/ 55/ 0.43735883
[+++] [10:51:17] [fstalign] 7 8 8/hello 8/hello 0.40650827
[+++] [10:51:17] [fstalign] 7 8 56/ 56/ 0.038571022
[+++] [10:51:17] [fstalign] 7 8 57/ 57/ 0.010218942
[+++] [10:51:17] [fstalign] 7 8 58/ 58/ 0.009362684
[+++] [10:51:17] [fstalign] 8 9 59/ 59/ 0.81295073
[+++] [10:51:17] [fstalign] 8 9 60/ 60/ 0.05420902
[+++] [10:51:17] [fstalign] 8 9 61/ 61/ 0.02045335
[+++] [10:51:17] [fstalign] 8 9 24/minutes 24/minutes 0.018062603
[+++] [10:51:17] [fstalign] 8 9 54/ 54/ 0.012383968
[+++] [10:51:17] [fstalign] 8 9 62/ 62/ 0.007653652
[+++] [10:51:17] [fstalign] 9 10 10/fine 10/fine 1.0
[+++] [10:51:17] [fstalign] 10 11 11/suzana 11/suzana 0.7825733
[+++] [10:51:17] [fstalign] 10 11 35/ 35/ 0.10785312
[+++] [10:51:17] [fstalign] 10 11 63/ 63/ 0.09928422
[+++] [10:51:17] [fstalign] 10 11 13/are 13/are 0.010289361
[+++] [10:51:17] [fstalign] 11 12 12/how 12/how 1.0
[+++] [10:51:17] [fstalign] 12 13 13/are 13/are 1.0
[+++] [10:51:17] [fstalign] 13 14 14/you 14/you 1.0
[+++] [10:51:17] [fstalign] 14 15 15/mhm 15/mhm 1.0
[+++] [10:51:17] [fstalign] 15 16 16/sure 16/sure 1.0
[+++] [10:51:17] [fstalign] 16 17 1/<oov> 1/<oov> 0.9930773
[+++] [10:51:17] [fstalign] 16 17 35/ 35/ 0.006035227
[+++] [10:51:17] [fstalign] 16 17 64/ 64/ 0.0008874871
[+++] [10:51:17] [fstalign] 17 18 17/yes 17/yes 0.99649245
[+++] [10:51:17] [fstalign] 17 18 65/ 65/ 0.003507566
[+++] [10:51:17] [fstalign] 18 19 18/okay 18/okay 1.0
[+++] [10:51:17] [fstalign] 19 20 19/ah 19/ah 1.0
[+++] [10:51:17] [fstalign] 20 21 20/just 20/just 0.8747955
[+++] [10:51:17] [fstalign] 20 21 66/ 66/ 0.08273692
[+++] [10:51:17] [fstalign] 21 22 13/are 13/are 0.879016
[+++] [10:51:17] [fstalign] 21 22 39/ 39/ 0.052434582
[+++] [10:51:17] [fstalign] 21 22 66/ 66/ 0.029785942
[+++] [10:51:17] [fstalign] 21 22 67/ 67/ 0.0126816565
[+++] [10:51:17] [fstalign] 21 22 68/ 68/ 0.00951749
[+++] [10:51:17] [fstalign] 21 22 69/ 69/ 0.007649917
[+++] [10:51:17] [fstalign] 21 22 35/ 35/ 0.0052877315
[+++] [10:51:17] [fstalign] 21 22 70/ 70/ 0.0021538541
[+++] [10:51:17] [fstalign] 21 22 71/ 71/ 0.0014728603
[+++] [10:51:17] [fstalign] 22 23 21/a 21/a 0.9477601
[+++] [10:51:17] [fstalign] 22 23 72/ 72/ 0.052239873
[+++] [10:51:17] [fstalign] 23 24 22/couple 22/couple 1.0
[+++] [10:51:17] [fstalign] 24 25 10/fine 10/fine 0.9680188
[+++] [10:51:17] [fstalign] 24 25 54/ 54/ 0.02020042
[+++] [10:51:17] [fstalign] 24 25 56/ 56/ 0.011780825
[+++] [10:51:17] [fstalign] 25 26 10/fine 10/fine 1.0
[+++] [10:51:17] [walker] starting a walk in the park
[+++] [10:51:17] [walker] we have 0 candidates after 28 loops
[+++] [10:51:17] [fstalign] done walking the graph
terminate called after throwing an instance of 'std::runtime_error'
what(): no alignment produced
Aborted (core dumped)
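To pull the arcs with an empty symbol (such as 31/ 31/) out of a log like the one above, something like the following Python sketch could be used. The regular expression is only inferred from the log lines shown here, and arcs_without_symbol is a hypothetical helper, not part of fstalign.

```python
import re

# Matches arc lines such as
#   "[+++] [10:51:17] [fstalign] 2 3 31/ 31/ 0.10623746"
# The "id/symbol" layout is inferred from the log; the symbol part may be empty.
ARC_RE = re.compile(
    r"\[fstalign\]\s+(\d+)\s+(\d+)\s+(\d+)/(\S*)\s+(\d+)/(\S*)\s+(\S+)\s*$"
)


def arcs_without_symbol(log_lines):
    """Return (src, dst, label_id, weight) for arcs whose input symbol is empty."""
    missing = []
    for line in log_lines:
        m = ARC_RE.search(line)
        if m is None:
            continue  # not an arc line (e.g. "done walking the graph")
        src, dst, in_id, in_sym, _out_id, _out_sym, weight = m.groups()
        if in_sym == "":
            missing.append((int(src), int(dst), int(in_id), float(weight)))
    return missing


sample = [
    "[+++] [10:51:17] [fstalign] 2 3 29/doctor's 29/doctor's 0.3886394",
    "[+++] [10:51:17] [fstalign] 2 3 31/ 31/ 0.10623746",
]
print(arcs_without_symbol(sample))  # [(2, 3, 31, 0.10623746)]
```

Feeding it the lines of res.log instead of the two sample lines would list every arc that was printed without a symbol.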
However, the original hypothesis FST is:
0 1 0 0 0.991589
0 1 1 1 0.00841093
1 2 2 2 0.851145
1 2 3 3 0.0513097
1 2 4 4 0.0470724
1 2 5 5 0.0155187
2 3 6 6 0.388639
2 3 7 7 0.190798
2 3 8 8 0.106237
2 3 9 9 0.0467945
3 4 10 10 0.696047
3 4 9 9 0.15711
3 4 7 7 0.0146796
3 4 11 11 0.0118875
4 5 12 12 0.521335
4 5 13 13 0.174813
4 5 14 14 0.14042
4 5 15 15 0.0532995
5 6 16 16 0.616331
5 6 17 17 0.185212
5 6 15 15 0.0561793
5 6 18 18 0.0388907
6 7 19 19 0.340145
6 7 1 1 0.298499
6 7 20 20 0.194045
6 7 21 21 0.0166497
7 8 22 22 0.437359
7 8 23 23 0.406508
7 8 24 24 0.038571
7 8 25 25 0.0102189
8 9 26 26 0.812951
8 9 27 27 0.054209
8 9 28 28 0.0204534
8 9 1 1 0.0180626
9 10 21 21 1
10 11 29 29 0.782573
10 11 11 11 0.107853
10 11 30 30 0.0992842
10 11 31 31 0.0102894
11 12 32 32 1
12 13 31 31 1
13 14 33 33 1
14 15 34 34 1
15 16 35 35 1
16 17 4 4 0.993077
16 17 11 11 0.00603523
16 17 36 36 0.000887487
17 18 37 37 0.996492
17 18 38 38 0.00350757
18 19 39 39 1
19 20 40 40 1
20 21 41 41 0.874795
20 21 42 42 0.0827369
21 22 31 31 0.879016
21 22 43 43 0.0524346
21 22 42 42 0.0297859
21 22 44 44 0.0126817
22 23 45 45 0.94776
22 23 46 46 0.0522399
23 24 47 47 1
24 25 21 21 0.968019
24 25 48 48 0.0202004
24 25 24 24 0.0117808
25 26 21 21 1
26
with a symbol table:
hello 0
i'm 1
fine 2
suzana 3
how 4
are 5
you 6
mhm 7
sure 8
yes 9
okay 10
ah 11
just 12
a 13
couple 14
of 15
minutes 16
on 17
my 18
way 19
into 20
doctor's 21
appointment 22
oh 23
ooh 24
foreign 25
i 26
foreigners 27
foreigner 28
shawna 29
sean 30
shaun 31
sharon 32
showing 33
show 34
or 35
howard 36
it 37
how're 38
our 39
hard 40
hour 41
is 42
here 43
there 44
ya 45
today 46
avenue 47
hum 48
huh 49
hm 50
wow 51
yeah 52
hey 53
right 54
sir 55
sorry 56
share 57
star 58
no 59
nope 60
know 61
most 62
uh 63
um 64
more 65
enjoy 66
enjoyed 67
er 68
your 69
we're 70
her 71
doctors 72
The bug is thus:
- loading reports hyp size: 0 when it is not 0
- the symbol table is ignored and the symbols in the FST are completely botched; the first two lines should be:
0 1 23/oh 23/oh 0.99158907
0 1 24/ooh 24/ooh 0.008410932
but were
[+++] [23:45:44] [fstalign] 0 1 23/of 23/of 0.99158907
[+++] [23:45:44] [fstalign] 0 1 24/minutes 24/minutes 0.008410932
i.e. the ids were mistakenly shifted by -8 when mapping to symbols.
- strange elements appear in the hyp FST after loading that never occur in the original FST; in fact, the loaded FST looks quite different from the original one!
- arcs appear that are not in the hyp FST, like:
[+++] [23:45:44] [fstalign] 5 6 6/ 6/ 0.6163309
- no alignment is produced
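The -8 shift on the two quoted arcs can be reproduced with a small Python sketch. It assumes, purely from what the log shows and not from the fstalign source, that the first 8 ids of the table used for printing are reserved, so that id N resolves to the word stored at N - 8 in hyp.sym. RESERVED_SLOTS, load_symbols and shifted_lookup are hypothetical names used only for illustration.

```python
RESERVED_SLOTS = 8  # assumption: ids 0..7 are reserved (<eps>, <oov>, ...) as the log suggests


def load_symbols(lines):
    """Parse 'word id' lines (the hyp.sym layout shown above) into an id -> word dict."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, idx = parts
        table[int(idx)] = word
    return table


def shifted_lookup(user_table, label_id):
    """Word a table with RESERVED_SLOTS ids prepended would return for label_id.

    Ids below RESERVED_SLOTS are not modelled here and come back as "".
    """
    return user_table.get(label_id - RESERVED_SLOTS, "")


# Only the four entries needed for the two arcs quoted above.
user_table = load_symbols(["of 15", "minutes 16", "oh 23", "ooh 24"])
for label in (23, 24):
    print(label, "expected:", user_table[label],
          "observed:", shifted_lookup(user_table, label))
# 23 expected: oh observed: of
# 24 expected: ooh observed: minutes
```

This only mirrors the two lines quoted above. It does not account for ids 31 and higher being printed with an empty symbol in the log, so the real cause may be something other than a plain offset.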
qmac