Case: fstalign ignores symbols table and does not find alignment in a simple transcript #40

@niedakh

Hi,

fstalign 1.6.1 does not load FST symbol tables properly and modifies the hypothesis FST so that it is completely borked.

Here is the output of the following command in the current Docker image:

/fstalign/bin/fstalign wer --ref /data/customer/ref.txt --hyp /data/customer/hyp.nlp --symbols /data/customer/hyp.sym --output-sbs /data/customer/res.sbs --log /data/customer/res.log

It happens both with a txt file (one gold transcript word per line) and with a CTM file containing the time-aligned gold transcript.

[2023-02-16 10:51:17.854] [console] [info] loggers initialized
[+++] [10:51:17] [console] fstalign version is 1.6.1
[+++] [10:51:17] [console] reading reference plain text from /data/customer/ref.txt
[+++] [10:51:17] [console] reading hypothesis fst from /data/customer/hyp.fst
[+++] [10:51:17] [fstalign] starting conversion to int vector
[+++] [10:51:17] [fstalign] converting ref to int vector
[+++] [10:51:17] [OneBestFstLoader] creating std::vector<int> for OneBestFstLoader for 27 tokens
[+++] [10:51:17] [fstalign] converting hyp to int vector
[+++] [10:51:17] [FstFileLoader] convertToIntVector isn't implemented for FST inputs
[+++] [10:51:17] [fstalign] Either ref or hyp is really small, skipping over the levenstein distance,  ref size: 27, hyp size: 0
[+++] [10:51:17] [FstFileLoader] Total FST has 27 states.
[+++] [10:51:17] [fstalign] generating ref synonyms from symbol table
[+++] [10:51:17] [fstalign] applying ref synonyms on ref fst
[+++] [10:51:17] [SynonymEngine] we have 0 registered first word rules label id
[+++] [10:51:17] [fstalign] printing ref fst
[+++] [10:51:17] [fstalign] 0	1	8/hello	8/hello	0.0
[+++] [10:51:17] [fstalign] 1	2	9/i'm	9/i'm	0.0
[+++] [10:51:17] [fstalign] 2	3	10/fine	10/fine	0.0
[+++] [10:51:17] [fstalign] 3	4	11/suzana	11/suzana	0.0
[+++] [10:51:17] [fstalign] 4	5	12/how	12/how	0.0
[+++] [10:51:17] [fstalign] 5	6	13/are	13/are	0.0
[+++] [10:51:17] [fstalign] 6	7	14/you	14/you	0.0
[+++] [10:51:17] [fstalign] 7	8	15/mhm	15/mhm	0.0
[+++] [10:51:17] [fstalign] 8	9	16/sure	16/sure	0.0
[+++] [10:51:17] [fstalign] 9	10	17/yes	17/yes	0.0
[+++] [10:51:17] [fstalign] 10	11	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 11	12	19/ah	19/ah	0.0
[+++] [10:51:17] [fstalign] 12	13	20/just	20/just	0.0
[+++] [10:51:17] [fstalign] 13	14	21/a	21/a	0.0
[+++] [10:51:17] [fstalign] 14	15	22/couple	22/couple	0.0
[+++] [10:51:17] [fstalign] 15	16	23/of	23/of	0.0
[+++] [10:51:17] [fstalign] 16	17	24/minutes	24/minutes	0.0
[+++] [10:51:17] [fstalign] 17	18	9/i'm	9/i'm	0.0
[+++] [10:51:17] [fstalign] 18	19	25/on	25/on	0.0
[+++] [10:51:17] [fstalign] 19	20	26/my	26/my	0.0
[+++] [10:51:17] [fstalign] 20	21	27/way	27/way	0.0
[+++] [10:51:17] [fstalign] 21	22	28/into	28/into	0.0
[+++] [10:51:17] [fstalign] 22	23	21/a	21/a	0.0
[+++] [10:51:17] [fstalign] 23	24	29/doctor's	29/doctor's	0.0
[+++] [10:51:17] [fstalign] 24	25	30/appointment	30/appointment	0.0
[+++] [10:51:17] [fstalign] 25	26	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 26	27	18/okay	18/okay	0.0
[+++] [10:51:17] [fstalign] 27	28	0/<eps>	0/<eps>	0.0
[+++] [10:51:17] [fstalign] printing hyp fst
[+++] [10:51:17] [fstalign] 0	1	23/of	23/of	0.99158907
[+++] [10:51:17] [fstalign] 0	1	24/minutes	24/minutes	0.008410932
[+++] [10:51:17] [fstalign] 1	2	25/on	25/on	0.8511446
[+++] [10:51:17] [fstalign] 1	2	2/<ins>	2/<ins>	0.05130972
[+++] [10:51:17] [fstalign] 1	2	1/<oov>	1/<oov>	0.047072377
[+++] [10:51:17] [fstalign] 1	2	26/my	26/my	0.0155186895
[+++] [10:51:17] [fstalign] 1	2	27/way	27/way	0.005192776
[+++] [10:51:17] [fstalign] 1	2	28/into	28/into	0.0033090003
[+++] [10:51:17] [fstalign] 1	2	23/of	23/of	0.0015563301
[+++] [10:51:17] [fstalign] 2	3	29/doctor's	29/doctor's	0.3886394
[+++] [10:51:17] [fstalign] 2	3	30/appointment	30/appointment	0.19079825
[+++] [10:51:17] [fstalign] 2	3	31/	31/	0.10623746
[+++] [10:51:17] [fstalign] 2	3	3/<del>	3/<del>	0.04679449
[+++] [10:51:17] [fstalign] 2	3	2/<ins>	2/<ins>	0.034456342
[+++] [10:51:17] [fstalign] 2	3	25/on	25/on	0.0081548225
[+++] [10:51:17] [fstalign] 2	3	32/	32/	0.004374613
[+++] [10:51:17] [fstalign] 2	3	26/my	26/my	0.002221351
[+++] [10:51:17] [fstalign] 2	3	1/<oov>	1/<oov>	0.002075816
[+++] [10:51:17] [fstalign] 2	3	33/	33/	0.0019667444
[+++] [10:51:17] [fstalign] 2	3	34/	34/	0.0005650157
[+++] [10:51:17] [fstalign] 3	4	4/<sub>	4/<sub>	0.69604653
[+++] [10:51:17] [fstalign] 3	4	3/<del>	3/<del>	0.15710989
[+++] [10:51:17] [fstalign] 3	4	30/appointment	30/appointment	0.014679594
[+++] [10:51:17] [fstalign] 3	4	35/	35/	0.011887497
[+++] [10:51:17] [fstalign] 3	4	36/	36/	0.0065641715
[+++] [10:51:17] [fstalign] 3	4	2/<ins>	2/<ins>	0.0056938045
[+++] [10:51:17] [fstalign] 3	4	25/on	25/on	0.0021069725
[+++] [10:51:17] [fstalign] 3	4	26/my	26/my	0.001466121
[+++] [10:51:17] [fstalign] 3	4	29/doctor's	29/doctor's	0.0013138258
[+++] [10:51:17] [fstalign] 3	4	1/<oov>	1/<oov>	0.0012702389
[+++] [10:51:17] [fstalign] 3	4	17/yes	17/yes	0.0009687771
[+++] [10:51:17] [fstalign] 4	5	5/<inaudible>	5/<inaudible>	0.5213346
[+++] [10:51:17] [fstalign] 4	5	37/	37/	0.17481348
[+++] [10:51:17] [fstalign] 4	5	38/	38/	0.14042015
[+++] [10:51:17] [fstalign] 4	5	36/	36/	0.053299483
[+++] [10:51:17] [fstalign] 4	5	3/<del>	3/<del>	0.042188246
[+++] [10:51:17] [fstalign] 4	5	39/	39/	0.011979131
[+++] [10:51:17] [fstalign] 4	5	2/<ins>	2/<ins>	0.0036785
[+++] [10:51:17] [fstalign] 4	5	35/	35/	0.002629472
[+++] [10:51:17] [fstalign] 4	5	30/appointment	30/appointment	0.0022271618
[+++] [10:51:17] [fstalign] 4	5	40/	40/	0.002156982
[+++] [10:51:17] [fstalign] 4	5	41/	41/	0.0016741576
[+++] [10:51:17] [fstalign] 5	6	6/<silence>	6/<silence>	0.6163309
[+++] [10:51:17] [fstalign] 5	6	42/	42/	0.18521181
[+++] [10:51:17] [fstalign] 5	6	36/	36/	0.056179322
[+++] [10:51:17] [fstalign] 5	6	43/	43/	0.038890716
[+++] [10:51:17] [fstalign] 5	6	44/	44/	0.0326784
[+++] [10:51:17] [fstalign] 5	6	3/<del>	3/<del>	0.026641503
[+++] [10:51:17] [fstalign] 5	6	45/	45/	0.020304155
[+++] [10:51:17] [fstalign] 5	6	46/	46/	0.006060596
[+++] [10:51:17] [fstalign] 5	6	5/<inaudible>	5/<inaudible>	0.0041220332
[+++] [10:51:17] [fstalign] 5	6	30/appointment	30/appointment	0.0033203475
[+++] [10:51:17] [fstalign] 5	6	47/	47/	0.0029174143
[+++] [10:51:17] [fstalign] 6	7	48/	48/	0.34014535
[+++] [10:51:17] [fstalign] 6	7	24/minutes	24/minutes	0.2984986
[+++] [10:51:17] [fstalign] 6	7	49/	49/	0.19404508
[+++] [10:51:17] [fstalign] 6	7	10/fine	10/fine	0.016649699
[+++] [10:51:17] [fstalign] 6	7	0/<eps>	0/<eps>	0.013604832
[+++] [10:51:17] [fstalign] 6	7	50/	50/	0.0062978663
[+++] [10:51:17] [fstalign] 6	7	51/	51/	0.0049225087
[+++] [10:51:17] [fstalign] 6	7	52/	52/	0.0039882683
[+++] [10:51:17] [fstalign] 6	7	23/of	23/of	0.0033028126
[+++] [10:51:17] [fstalign] 6	7	53/	53/	0.0029480883
[+++] [10:51:17] [fstalign] 6	7	54/	54/	0.002575561
[+++] [10:51:17] [fstalign] 7	8	55/	55/	0.43735883
[+++] [10:51:17] [fstalign] 7	8	8/hello	8/hello	0.40650827
[+++] [10:51:17] [fstalign] 7	8	56/	56/	0.038571022
[+++] [10:51:17] [fstalign] 7	8	57/	57/	0.010218942
[+++] [10:51:17] [fstalign] 7	8	58/	58/	0.009362684
[+++] [10:51:17] [fstalign] 8	9	59/	59/	0.81295073
[+++] [10:51:17] [fstalign] 8	9	60/	60/	0.05420902
[+++] [10:51:17] [fstalign] 8	9	61/	61/	0.02045335
[+++] [10:51:17] [fstalign] 8	9	24/minutes	24/minutes	0.018062603
[+++] [10:51:17] [fstalign] 8	9	54/	54/	0.012383968
[+++] [10:51:17] [fstalign] 8	9	62/	62/	0.007653652
[+++] [10:51:17] [fstalign] 9	10	10/fine	10/fine	1.0
[+++] [10:51:17] [fstalign] 10	11	11/suzana	11/suzana	0.7825733
[+++] [10:51:17] [fstalign] 10	11	35/	35/	0.10785312
[+++] [10:51:17] [fstalign] 10	11	63/	63/	0.09928422
[+++] [10:51:17] [fstalign] 10	11	13/are	13/are	0.010289361
[+++] [10:51:17] [fstalign] 11	12	12/how	12/how	1.0
[+++] [10:51:17] [fstalign] 12	13	13/are	13/are	1.0
[+++] [10:51:17] [fstalign] 13	14	14/you	14/you	1.0
[+++] [10:51:17] [fstalign] 14	15	15/mhm	15/mhm	1.0
[+++] [10:51:17] [fstalign] 15	16	16/sure	16/sure	1.0
[+++] [10:51:17] [fstalign] 16	17	1/<oov>	1/<oov>	0.9930773
[+++] [10:51:17] [fstalign] 16	17	35/	35/	0.006035227
[+++] [10:51:17] [fstalign] 16	17	64/	64/	0.0008874871
[+++] [10:51:17] [fstalign] 17	18	17/yes	17/yes	0.99649245
[+++] [10:51:17] [fstalign] 17	18	65/	65/	0.003507566
[+++] [10:51:17] [fstalign] 18	19	18/okay	18/okay	1.0
[+++] [10:51:17] [fstalign] 19	20	19/ah	19/ah	1.0
[+++] [10:51:17] [fstalign] 20	21	20/just	20/just	0.8747955
[+++] [10:51:17] [fstalign] 20	21	66/	66/	0.08273692
[+++] [10:51:17] [fstalign] 21	22	13/are	13/are	0.879016
[+++] [10:51:17] [fstalign] 21	22	39/	39/	0.052434582
[+++] [10:51:17] [fstalign] 21	22	66/	66/	0.029785942
[+++] [10:51:17] [fstalign] 21	22	67/	67/	0.0126816565
[+++] [10:51:17] [fstalign] 21	22	68/	68/	0.00951749
[+++] [10:51:17] [fstalign] 21	22	69/	69/	0.007649917
[+++] [10:51:17] [fstalign] 21	22	35/	35/	0.0052877315
[+++] [10:51:17] [fstalign] 21	22	70/	70/	0.0021538541
[+++] [10:51:17] [fstalign] 21	22	71/	71/	0.0014728603
[+++] [10:51:17] [fstalign] 22	23	21/a	21/a	0.9477601
[+++] [10:51:17] [fstalign] 22	23	72/	72/	0.052239873
[+++] [10:51:17] [fstalign] 23	24	22/couple	22/couple	1.0
[+++] [10:51:17] [fstalign] 24	25	10/fine	10/fine	0.9680188
[+++] [10:51:17] [fstalign] 24	25	54/	54/	0.02020042
[+++] [10:51:17] [fstalign] 24	25	56/	56/	0.011780825
[+++] [10:51:17] [fstalign] 25	26	10/fine	10/fine	1.0
[+++] [10:51:17] [walker] starting a walk in the park
[+++] [10:51:17] [walker] we have 0 candidates after 28 loops
[+++] [10:51:17] [fstalign] done walking the graph
terminate called after throwing an instance of 'std::runtime_error'
  what():  no alignment produced
Aborted                 (core dumped)

The proper FST, however, is:

0	1	0	0	0.991589
0	1	1	1	0.00841093
1	2	2	2	0.851145
1	2	3	3	0.0513097
1	2	4	4	0.0470724
1	2	5	5	0.0155187
2	3	6	6	0.388639
2	3	7	7	0.190798
2	3	8	8	0.106237
2	3	9	9	0.0467945
3	4	10	10	0.696047
3	4	9	9	0.15711
3	4	7	7	0.0146796
3	4	11	11	0.0118875
4	5	12	12	0.521335
4	5	13	13	0.174813
4	5	14	14	0.14042
4	5	15	15	0.0532995
5	6	16	16	0.616331
5	6	17	17	0.185212
5	6	15	15	0.0561793
5	6	18	18	0.0388907
6	7	19	19	0.340145
6	7	1	1	0.298499
6	7	20	20	0.194045
6	7	21	21	0.0166497
7	8	22	22	0.437359
7	8	23	23	0.406508
7	8	24	24	0.038571
7	8	25	25	0.0102189
8	9	26	26	0.812951
8	9	27	27	0.054209
8	9	28	28	0.0204534
8	9	1	1	0.0180626
9	10	21	21	1
10	11	29	29	0.782573
10	11	11	11	0.107853
10	11	30	30	0.0992842
10	11	31	31	0.0102894
11	12	32	32	1
12	13	31	31	1
13	14	33	33	1
14	15	34	34	1
15	16	35	35	1
16	17	4	4	0.993077
16	17	11	11	0.00603523
16	17	36	36	0.000887487
17	18	37	37	0.996492
17	18	38	38	0.00350757
18	19	39	39	1
19	20	40	40	1
20	21	41	41	0.874795
20	21	42	42	0.0827369
21	22	31	31	0.879016
21	22	43	43	0.0524346
21	22	42	42	0.0297859
21	22	44	44	0.0126817
22	23	45	45	0.94776
22	23	46	46	0.0522399
23	24	47	47	1
24	25	21	21	0.968019
24	25	48	48	0.0202004
24	25	24	24	0.0117808
25	26	21	21	1
26

with a symbol table:

hello	0
i'm	1
fine	2
suzana 3
how	4
are	5
you	6
mhm	7
sure	8
yes	9
okay	10
ah	11
just	12
a	13
couple	14
of	15
minutes	16
on	17
my	18
way	19
into	20
doctor's	21
appointment	22
oh	23
ooh	24
foreign	25
i	26
foreigners	27
foreigner	28
shawna	29
sean	30
shaun	31
sharon	32
showing	33
show	34
or	35
howard	36
it	37
how're	38
our	39
hard	40
hour	41
is	42
here	43
there	44
ya	45
today	46
avenue	47
hum	48
huh	49
hm	50
wow	51
yeah	52
hey	53
right	54
sir	55
sorry	56
share	57
star	58
no	59
nope	60
know	61
most	62
uh	63
um	64
more	65
enjoy	66
enjoyed	67
er	68
your	69
we're	70
her	71
doctors	72
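
As a quick sanity check on the inputs, the following Python sketch verifies that every arc label in the text FST has an entry in the symbol table. It assumes the FST text uses the `src dst ilabel olabel weight` layout shown above and that the symbol file holds one whitespace-separated `symbol id` pair per line. The helper names `parse_symbols` and `arc_labels` are hypothetical and are not part of fstalign; the data is a short excerpt of the files above.

```python
def parse_symbols(text):
    """Parse 'symbol<whitespace>id' lines into a {id: symbol} dict."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            table[int(parts[1])] = parts[0]
    return table


def arc_labels(fst_text):
    """Collect input/output label ids from 'src dst ilabel olabel weight' lines."""
    labels = set()
    for line in fst_text.splitlines():
        parts = line.split()
        if len(parts) >= 4:  # final-state lines carry fewer fields and are skipped
            labels.update((int(parts[2]), int(parts[3])))
    return labels


# Excerpts only: the first four symbol lines and the first four arcs from above.
SYMS = "hello\t0\ni'm\t1\nfine\t2\nsuzana 3\n"
FST = (
    "0\t1\t0\t0\t0.991589\n"
    "0\t1\t1\t1\t0.00841093\n"
    "1\t2\t2\t2\t0.851145\n"
    "1\t2\t3\t3\t0.0513097\n"
    "26\n"
)

# Labels used by arcs but absent from the symbol table (empty list means consistent).
missing = sorted(arc_labels(FST) - set(parse_symbols(SYMS)))
print("labels without a symbol:", missing)
```

Running the same check on the full files only requires reading them from disk instead of using the inline excerpts.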

The bugs are thus:

  1. The hyp is loaded with size 0 when it is not empty.
  2. The symbol table is ignored and the symbols in the FST are completely botched. The first two lines should be:
0	1	23/oh	23/oh	0.99158907
0	1	24/ooh	24/ooh	0.008410932

but they were:

[+++] [23:45:44] [fstalign] 0	1	23/of	23/of	0.99158907
[+++] [23:45:44] [fstalign] 0	1	24/minutes	24/minutes	0.008410932

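i.e. the symbol printed for each id is the one my table assigns to that id minus 8 (of is 15 and minutes is 16 in my table), so the ids were mistakenly shifted by -8 when mapping to symbols.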
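
The shift can be checked with a few lines of Python. This is only a sketch over an excerpt of my symbol table and the two logged arcs quoted above; the names `TABLE`, `LOGGED` and `offsets` are made up for illustration and have nothing to do with fstalign's internals.

```python
# Excerpt of the hyp symbol table from this report: symbol -> id.
TABLE = {"of": 15, "minutes": 16, "oh": 23, "ooh": 24}
ID_TO_SYM = {ident: sym for sym, ident in TABLE.items()}

# (id, symbol) pairs exactly as fstalign printed them for the first two hyp arcs.
LOGGED = [(23, "of"), (24, "minutes")]


def offsets(logged, table):
    """For each logged arc, return (id of the printed symbol in my table) - (printed id)."""
    return [table[sym] - label for label, sym in logged]


# What my table says those ids should map to.
expected = [ID_TO_SYM[label] for label, _ in LOGGED]

print("offsets:", offsets(LOGGED, TABLE))
print("expected symbols:", expected)
```

A constant offset across arcs points at a systematic id remapping rather than random corruption.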

  3. Strange elements appear in the hyp FST after loading that never occur in the original FST; in fact, the loaded FST looks quite different from the original one.
  4. There are arcs that are not in the hyp FST, like: [+++] [23:45:44] [fstalign] 5 6 6/<silence> 6/<silence> 0.6163309
  5. No alignment is produced.
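
To make the odd arcs easier to spot, here is a rough Python sketch that scans the "printing hyp fst" part of the log and lists the label ids that were printed with an empty symbol (for example `31/`). The regular expression is my own guess at the log line layout, based only on the lines pasted above, and `empty_symbol_ids` is a hypothetical helper, not an fstalign function.

```python
import re

# Matches lines like: [+++] [10:51:17] [fstalign] 2  3  31/  31/  0.10623746
ARC_RE = re.compile(
    r"\[fstalign\]\s+(\d+)\s+(\d+)\s+(\d+)/(\S*)\s+(\d+)/(\S*)\s+(\S+)"
)


def empty_symbol_ids(log_text):
    """Return sorted label ids whose printed input symbol is empty."""
    ids = set()
    for line in log_text.splitlines():
        match = ARC_RE.search(line)
        if match and match.group(4) == "":
            ids.add(int(match.group(3)))
    return sorted(ids)


# Three arcs copied from the log above (state 2 -> 3).
LOG = (
    "[+++] [10:51:17] [fstalign] 2\t3\t29/doctor's\t29/doctor's\t0.3886394\n"
    "[+++] [10:51:17] [fstalign] 2\t3\t31/\t31/\t0.10623746\n"
    "[+++] [10:51:17] [fstalign] 2\t3\t32/\t32/\t0.004374613\n"
)

unnamed = empty_symbol_ids(LOG)
print("ids printed without a symbol:", unnamed)
```

Feeding the whole res.log through `empty_symbol_ids` would give the full list of ids that fstalign could not name.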
