Preserve all non-whitespace characters in the input sentence #12
kukas wants to merge 2 commits into lang-uk:master
Conversation
I spoke to @arysin on that matter. For example, the corresponding code for the regex you are altering is here: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/java/org/languagetool/tokenizers/uk/UkrainianWordTokenizer.java#L39 (tests are also updated: languagetool-org/languagetool@95b83d7). His idea is that making it all non-space is way too broad and might cause unwanted consequences on other samples (not covered by tests). I'll ask him to comment here too.
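To make the "too broad" concern concrete, here is a hedged sketch (a hypothetical simplified rule, not the project's actual `WORD_TOKENIZATION_RULES` nor LanguageTool's regex): a blanket non-whitespace rule like `\S+` does keep every symbol, but it also stops splitting punctuation off the tokens it touches.

```python
import re

# Hypothetical illustration only: a blanket non-whitespace rule
# preserves symbols, but fuses punctuation onto adjacent tokens.
TOO_BROAD = re.compile(r"\S+")

print(TOO_BROAD.findall("за ставкою € 1."))
# ['за', 'ставкою', '€', '1.']  -- '1.' is fused into one token
```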
I have an extensive set of tests (unit tests and multimillion-token tests), but they use LanguageTool, as mentioned above.
Hello, thank you for your comments and sorry for the late response. The manual enumeration has the not-so-nice property that any character not in the list will be missing from the tokenized output. This includes quite common characters such as the euro sign €, the degree sign °, or the multiplication sign ×, which were present in my own testing data.
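As a rough illustration of this failure mode (a simplified stand-in, not the package's real `WORD_TOKENIZATION_RULES`), a whitelist-style regex silently drops any character outside its enumerated classes:

```python
import re

# Simplified whitelist-style tokenizer: words (with apostrophes/hyphens)
# plus an enumerated set of punctuation. Anything else is silently dropped.
WHITELIST_RULES = re.compile(r"[\w'’-]+|[.,;:!?]")

print(WHITELIST_RULES.findall("за ставкою € 1."))
# ['за', 'ставкою', '1', '.']  -- the € sign never reaches the output
```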
Sorry, it's hard for me to comment on this change, as the original regexp in this code is very different from the tokenizer in LanguageTool that I maintain. I agree that symbols should not be dropped from tokenized text, and from what I tried, the regex in LanguageTool already passes all the new tests above.
BTW, there's a little Python wrapper for the LT/nlp_uk modules: https://github.com/brown-uk/nlp_uk/tree/master/src/main/python
The tokenizer omits some characters not covered by the `WORD_TOKENIZATION_RULES` regex. An example is the € character. The sentence "за ставкою € 1." gets tokenized as `["за", "ставкою", "1", "."]` and the € character is left out completely. This pull request fixes this problem by covering all non-whitespace characters in the `WORD_TOKENIZATION_RULES` regex.
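A minimal sketch of the idea behind the fix (again assuming a simplified rule set; the real `WORD_TOKENIZATION_RULES` is more elaborate): appending a single-character `\S` fallback to the alternation preserves symbols like € without loosening the earlier, more specific rules.

```python
import re

# Same simplified rules as before, plus a trailing \S fallback: any
# character the specific alternatives miss becomes its own token.
FIXED_RULES = re.compile(r"[\w'’-]+|[.,;:!?]|\S")

print(FIXED_RULES.findall("за ставкою € 1."))
# ['за', 'ставкою', '€', '1', '.']  -- € is now preserved
```

Because regex alternation tries branches left to right at each position, the catch-all only fires when none of the specific rules match, so existing tokenization behavior is unchanged.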