
Conversation

@HAKSOAT (Contributor) commented on Jun 12, 2020

Created lexicons for cjk and non-cjk texts

@HAKSOAT changed the title from "Created lexicons for cjk and non-cjk texts" to "Tokenizer lexicons" on Jun 12, 2020
@HAKSOAT force-pushed the tokenizer_scripts branch from 8e0c442 to e9bd355 on June 12, 2020 23:19

    wikitext_split = RegexTokenizer(LEXICON)
    LEXICON_LATIN = LEXICON.copy()
    LEXICON_LATIN.insert(-2, ('cjk', cjk))
Owner:

Why not insert this right after "word"?

Contributor (Author):

I can do that. My thinking was that since a regular Latin-dominant text won't contain much CJK, we don't need to handle it before tab_open, tab_close, etc.
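
To illustrate the ordering question, here is a minimal sketch of how an ordered lexicon drives matching; the trimmed-down LEXICON and the tokenize helper below are illustrative assumptions, not the real RegexTokenizer:

    import re

    # Minimal sketch (assumed names): a RegexTokenizer-style matcher that
    # joins the lexicon into one alternation, so earlier entries win ties.
    LEXICON = [
        ('word', r'[^\W\d\u3040-\u30ff\u4e00-\u9fff]+'),
        ('cjk', r'[\u3040-\u30ff\u4e00-\u9fff]'),
        ('tab_open', r'\{\|'),
        ('tab_close', r'\|\}'),
        ('whitespace', r'\s+'),
        ('etc', r'.'),
    ]

    pattern = re.compile('|'.join(
        '(?P<{0}>{1})'.format(name, regex) for name, regex in LEXICON))

    def tokenize(text):
        # lastgroup names the alternative that matched at each position
        return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

    print(tokenize('wiki 漢 {|'))
    # [('word', 'wiki'), ('whitespace', ' '), ('cjk', '漢'),
    #  ('whitespace', ' '), ('tab_open', '{|')]

Because the alternation tries entries left to right, moving ('cjk', cjk) earlier or later only matters where two patterns could match the same character; plain Latin input tokenizes the same either way.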

    combined_word = devangari_word + arabic_word + bengali_word + korean_word

    word = r'(?:[^\W\d]|[' + combined_word + r'])' + \
    cjk_re = r'\u3040-\u30ff' + r'\u4e00-\u9FFF'
Owner:

Does this still cover the full range?

Contributor (Author):

Yes, it does.
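
For reference, a quick check that the two ranges cover Hiragana (U+3040-U+309F), Katakana (U+30A0-U+30FF), and the CJK Unified Ideographs block (U+4E00-U+9FFF); the sample characters are my own illustration:

    import re

    # Mirrors cjk_re from the diff; samples: hiragana, katakana,
    # the prolonged sound mark, and two unified ideographs.
    cjk = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]')

    for ch in ('ぁ', 'ア', 'ー', '一', '龍'):
        print(ch, bool(cjk.match(ch)))  # all True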


    cjk = r'[' + cjk_re + ']'

    word = r'(?:[^\W\d' + cjk_re + r']|[' + combined_word + r'])' + \
Owner:

Do we need to explicitly exclude CJK here?

Contributor (Author):

Without doing that, some CJK characters get captured as "word".
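
A small standalone demonstration of that leak: Python 3's \w is Unicode-aware, so [^\W\d] matches CJK ideographs unless they are excluded from the character class (the snippet assumes nothing beyond the cjk_re definition above):

    import re

    # cjk_re as defined earlier in the diff.
    cjk_re = r'\u3040-\u30ff\u4e00-\u9fff'

    word_plain = re.compile(r'[^\W\d]')                 # no exclusion
    word_fixed = re.compile(r'[^\W\d' + cjk_re + r']')  # CJK excluded

    print(bool(word_plain.match('漢')))  # True  -> ideograph captured as word
    print(bool(word_fixed.match('漢')))  # False -> falls through to the cjk rule
    print(bool(word_fixed.match('a')))   # True  -> Latin letters unaffected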

@HAKSOAT force-pushed the tokenizer_scripts branch 6 times, most recently from f3489da to 1d59026, on June 22, 2020 20:02
@HAKSOAT force-pushed the tokenizer_scripts branch from 1d59026 to df6225b on June 22, 2020 20:17