Unicode-based split of words and graphemes #4

ichianr · 2022-06-09T08:33:42Z

What:

Inputs written in non-Latin alphabets are now processed correctly.

Why:

The original regex [\w\\']+ only matches Latin alphabets, so non-Latin inputs were not processed at all.
&str.len() returns the length in bytes, which is not always the same as the number of Unicode graphemes, so the indices in style_substr were calculated incorrectly for multibyte characters.
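To illustrate the second point, here is a minimal sketch using only the standard library. The helper names `lengths` and `prefix_by_bytes` are made up for this example and are not part of fsrx. It shows that the byte length and the character count diverge for non-ASCII words, and that slicing at a byte offset can land in the middle of a character.

```rust
/// Returns (byte length, number of Unicode scalar values) for `word`.
fn lengths(word: &str) -> (usize, usize) {
    (word.len(), word.chars().count())
}

/// Tries to take the first `n` bytes of `word`.
/// Returns `None` when `n` falls inside a multibyte character
/// or is past the end of the string.
fn prefix_by_bytes(word: &str, n: usize) -> Option<&str> {
    word.get(..n)
}

fn main() {
    for word in ["Quick", "cœur", "키스"] {
        let (bytes, chars) = lengths(word);
        println!(
            "{}: {} bytes, {} chars, first 2 bytes = {:?}",
            word,
            bytes,
            chars,
            prefix_by_bytes(word, 2)
        );
    }
}
```

For ASCII-only words the two lengths agree, which is presumably why the problem did not show up with English input.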

How:

Changed the word-split algorithm from regex to unicode_word_indices
Changed the character indexing from &str slice to UnicodeSegmentation::graphemes
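As a rough illustration of boundary-aware indexing, the sketch below splits a word after its first n characters without ever slicing inside a multibyte sequence. Note the substitution: this PR uses UnicodeSegmentation::graphemes from the unicode-segmentation crate, whereas the sketch uses the standard library's `char_indices`, which works on Unicode scalar values rather than grapheme clusters, so a base letter and a following combining mark could still be separated. The function name `split_after_chars` is hypothetical and does not appear in fsrx.

```rust
/// Splits `word` after its first `n` characters (Unicode scalar values).
/// The split point is always a valid UTF-8 boundary; if the word has
/// `n` characters or fewer, the second half is empty.
fn split_after_chars(word: &str, n: usize) -> (&str, &str) {
    // Byte offset at which the character with index `n` starts,
    // or the end of the string when there is no such character.
    let byte_idx = word
        .char_indices()
        .nth(n)
        .map(|(i, _)| i)
        .unwrap_or(word.len());
    word.split_at(byte_idx)
}

fn main() {
    for word in ["Quick", "cœur", "키스의"] {
        let (head, tail) = split_after_chars(word, 2);
        println!("{} -> [{}] [{}]", word, head, tail);
    }
}
```

Swapping `char_indices` for grapheme indices from the crate would give the grapheme-level behaviour described above; the rest of the logic would stay the same under that assumption.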

Tests:

English
echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx
French
echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx
Korean
echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx

Checklist:

Allow edits from maintainers option checked
Branch name is prefixed with [your_username]/ (ex. coloradocolby/featureX)
Documentation added
Tests added
No failing actions
Merge ready

Caveat:

I confirmed that languages that put spaces between words are processed much like English. However, it seems that fsrx's algorithm (and Bionic Reading) cannot be applied to some Asian languages that do not use spaces (such as Japanese and Chinese).
Some terminal emulators (e.g., Alacritty, if I remember correctly) may not properly support Unicode input / output. I tested my code with xfce4-terminal.
For non-Latin alphabets, I tested my code with the D2Coding font, but other fonts such as Noto should work as well.

jrnxf · 2022-06-09T17:30:07Z

damn @ichianr this looks amazing. I don't have time rn to look it over but will tonight!

Unicode-based split of words and graphemes

8395a72
