Conversation
|
Let's add more tests for tokenize_text/tokenize_sents one day. Also, the current sentence tokenization algorithm is very naive and is actually used the other way round. What is currently implemented is the reverse scheme: the whole text is tokenized first, and then the results of the tokenization are segmented into sentences. That has to be thoroughly tested (you might use the choppa tests to see whether the reverse scheme works sensibly). Another option is to use the old v1 implementation of the segmentor, which can be found here: https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokenize_uk.py#L57 The ideal solution is to finish the segmentor, of course, but that got stuck because of the differences between the Java and Python regex APIs. Anyway, thank you a ton for looking into this, hopefully we can get this baby shipped one day. |
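For reference, the "reverse scheme" described above can be sketched roughly like this. This is a minimal, hypothetical illustration, not the actual tokenize-uk implementation: the regex and the sentence-final punctuation set are assumptions, and real abbreviation/initial handling is deliberately omitted.

```python
import re

# Hypothetical sketch of the reverse scheme: tokenize the whole text
# first, then segment the flat token stream into sentences.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)  # assumed token pattern
SENT_END = {".", "!", "?", "\u2026"}  # assumed sentence-final punctuation

def tokenize_text(text):
    """Tokenize the whole text into a flat list of tokens."""
    return TOKEN_RE.findall(text)

def tokenize_sents(text):
    """Group the flat token stream into sentences.

    Naive on purpose: a sentence ends after any sentence-final
    punctuation token, so abbreviations and initials will over-split.
    """
    sents, current = [], []
    for token in tokenize_text(text):
        current.append(token)
        if token in SENT_END:
            sents.append(current)
            current = []
    if current:  # trailing tokens without final punctuation
        sents.append(current)
    return sents
```

A naive scheme like this is exactly what the choppa-style tests would exercise, since every case where a period is not a sentence boundary will fail here.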
|
If you notice, I changed the tox configuration to exclude Python 2.x. With the 3.x branch it is not so clear to me what the proper minimum version should be. |
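A tox configuration excluding Python 2.x might look roughly like this; the env list below is an assumption for illustration, not the project's actual tox.ini:

```ini
# Hypothetical tox.ini sketch: only Python 3.x environments are listed,
# so Python 2.x is excluded from the test matrix.
[tox]
envlist = py36, py37, py38, py39

[testenv]
deps = pytest
commands = pytest
```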
|
No, it's not |
|
What about the minimum Python 3 version? I set 3.6 just in case, even though I would not use it myself. |
|
3.6 is fine. Feel free to drop it if it causes trouble. It was fun to learn that 3.6 is still by far the most popular version: https://w3techs.com/technologies/history_details/pl-python/3