Conversation
|
Let's add more tests for tokenize_text/tokenize_sents one day. Also, the current sentence tokenization algorithm is very naive and is actually used the other way round. What is currently implemented is the reverse scheme: the whole text is tokenized first, and then the results of the tokenization are segmented into sentences. That has to be thoroughly tested (you might use the choppa tests to see whether the reverse scheme works sensibly). Another option is to use the old v1 implementation of the segmentor, which can be found here: https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokenize_uk.py#L57 The ideal solution is to finish the segmentor, of course, but that got stuck because of the differences between the Java and Python regex APIs. Anyway, thank you a ton for looking into this, hopefully we can get this baby shipped one day. |
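For reference, the "reverse scheme" described above can be sketched roughly like this. This is a minimal, hypothetical illustration, not the actual tokenize-uk implementation: the regex and the sentence-final punctuation set are assumptions, and real abbreviation/initial handling is deliberately omitted.

```python
import re

# Hypothetical sketch of the reverse scheme: tokenize the whole text
# first, then segment the flat token stream into sentences.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)  # assumed token pattern
SENT_END = {".", "!", "?", "\u2026"}  # assumed sentence-final punctuation

def tokenize_text(text):
    """Tokenize the whole text into a flat list of tokens."""
    return TOKEN_RE.findall(text)

def tokenize_sents(text):
    """Group the flat token stream into sentences.

    Naive on purpose: a sentence ends after any sentence-final
    punctuation token, so abbreviations and initials will over-split.
    """
    sents, current = [], []
    for token in tokenize_text(text):
        current.append(token)
        if token in SENT_END:
            sents.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sents.append(current)
    return sents
```

A naive scheme like this is exactly what the choppa-style tests would exercise, since every case where a period is not a sentence boundary will fail here.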
|
If you notice, I changed the tox configuration to exclude Python 2.x. With the 3.x branch it is not so clear to me what the proper minimum version should be. |
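A tox configuration excluding Python 2.x might look roughly like this; the env list below is an assumption for illustration, not the project's actual tox.ini:

```ini
# Hypothetical tox.ini sketch: only Python 3.x environments are listed,
# so Python 2.x is excluded from the test matrix.
[tox]
envlist = py36, py37, py38, py39

[testenv]
deps = pytest
commands = pytest
```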
|
No, it's not |
|
What about the minimum Python 3 version? I set 3.6 just in case, even though I would not use it myself. |
|
3.6 is fine. Feel free to drop it if it causes trouble. It was fun to learn that 3.6 is still by far the most popular version: https://w3techs.com/technologies/history_details/pl-python/3