Unicode-based split of words and graphemes #4
What:

Why:
- `[\w\\']+` only matches Latin alphabets, so non-Latin inputs were not processed at all.
- `&str.len()` returns the length in bytes, which is not always the same as the number of Unicode graphemes, so indices in `style_substr` were calculated wrongly for multibyte characters.

How:
- Word splitting now uses `unicode_word_indices`.
- `&str` slice to `UnicodeSegmentation::graphemes`.

Tests:
- `echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx`
- `echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx`
- `echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx`

Checklist:
- `Allow edits from maintainers` option checked
- `[your_username]/` (ex. `coloradocolby/featureX`)

Caveat:
- `fsrx`'s algorithm (and Bionic Reading) is not applicable to some Asian languages that do not use spaces (such as Japanese and Chinese).
- `xfce4-terminal`.
- `D2Coding` font, but other fonts such as Noto will work as well.
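
For reviewers who want to see the byte-length problem from the Why section in isolation, here is a minimal sketch using only the standard library. It is not the code in this PR: `split_at_char` is a hypothetical helper name, and it counts `char`s (Unicode scalar values) rather than the grapheme clusters that `UnicodeSegmentation::graphemes` yields, so it only approximates the behavior for text with combining marks.

```rust
/// Hypothetical helper (not part of fsrx): split `word` after its first `n`
/// chars (Unicode scalar values) and return (head, tail).
/// `char_indices` yields byte offsets that are always valid char boundaries,
/// so `split_at` cannot panic here.
fn split_at_char(word: &str, n: usize) -> (&str, &str) {
    let byte_idx = word
        .char_indices()
        .nth(n)
        .map(|(i, _)| i)
        // Fewer than n + 1 chars: the whole word becomes the head.
        .unwrap_or(word.len());
    word.split_at(byte_idx)
}

fn main() {
    // "déçu" written with escapes so the precomposed code points are explicit.
    let word = "d\u{e9}\u{e7}u";

    // len() counts UTF-8 bytes, chars().count() counts scalar values.
    println!("bytes: {}, chars: {}", word.len(), word.chars().count());
    // prints "bytes: 6, chars: 4"

    // Splitting by character count instead of byte count keeps "é" intact.
    let (head, tail) = split_at_char(word, 2);
    println!("{}|{}", head, tail);
    // prints "dé|çu"
}
```

Using `word.len() / 2` as a slice index on the same input would give byte offset 3, which happens to be a boundary here, but other inputs can land inside a multibyte character and panic, which is why the index has to come from a character-aware iterator.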