fix: include Korean Hangul and Japanese kana in CJK token heuristic #2

Open
haosenwang1018 wants to merge 1 commit into aeromomo:main from haosenwang1018:fix/cjk-regex-korean-japanese

@haosenwang1018

Problem

The CJK-aware token estimation regex in lib/tokens.py only matches Chinese characters (U+4E00–U+9FFF, U+3400–U+4DBF) and fullwidth forms. It misses:

  • Korean Hangul syllables (U+AC00–U+D7AF) — e.g. 안녕하세요
  • Japanese Hiragana (U+3040–U+309F) — e.g. こんにちは
  • Japanese Katakana (U+30A0–U+30FF) — e.g. カタカナ

This causes Korean and Japanese text to fall through to the ASCII estimation rate (~4 chars/token) instead of the CJK rate (~1.5 chars/token). For text written entirely in Hangul or kana, the resulting token count is underestimated by a factor of roughly 2.7 (4 / 1.5).
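
To make the size of the gap concrete, here is a small worked example using the two rates quoted above. The `estimate` helper and the round-up rule are assumptions made for illustration; the actual rounding in lib/tokens.py may differ.

```python
import math

# Rates quoted in the description above.
ASCII_CHARS_PER_TOKEN = 4.0
CJK_CHARS_PER_TOKEN = 1.5


def estimate(n_chars: int, chars_per_token: float) -> int:
    """Hypothetical helper: character count divided by rate, rounded up."""
    return math.ceil(n_chars / chars_per_token)


text = "안녕하세요"  # 5 precomposed Hangul syllables

# Before the fix, Hangul is not matched and is counted at the ASCII rate.
ascii_rate_estimate = estimate(len(text), ASCII_CHARS_PER_TOKEN)  # ceil(1.25) = 2

# After the fix, the same text is counted at the CJK rate.
cjk_rate_estimate = estimate(len(text), CJK_CHARS_PER_TOKEN)  # ceil(3.33...) = 4

print(ascii_rate_estimate, cjk_rate_estimate)
```

With these assumptions the five-syllable greeting is estimated at 2 tokens before the fix and 4 tokens after it.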

Fix

Extended _CJK_RE to include Hangul Syllables, Hiragana, and Katakana Unicode ranges.

Tests

Added 4 new tests:

  • test_korean — basic Korean text estimation
  • test_japanese_hiragana — Hiragana text
  • test_japanese_katakana — Katakana text
  • test_cjk_heuristic_covers_all_scripts — validates Hangul gets CJK rates in heuristic mode
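
For readers without the diff at hand, the tests listed above could look roughly like this. The test names are taken from the list. Everything else is a stand-in written for this sketch: `estimate_tokens`, its rates, and its rounding are assumptions and not the project's actual API, and the assertions in the real test file may check different things.

```python
import math
import re

# Stand-in for lib/tokens.py so the sketch runs on its own (assumed behavior).
_CJK_RE = re.compile(
    r"[\u4e00-\u9fff\u3400-\u4dbf\uff00-\uffef"
    r"\uac00-\ud7af\u3040-\u309f\u30a0-\u30ff]"
)


def estimate_tokens(text: str) -> int:
    """Hypothetical heuristic: CJK chars at 1.5 per token, the rest at 4."""
    cjk = len(_CJK_RE.findall(text))
    other = len(text) - cjk
    return math.ceil(cjk / 1.5 + other / 4.0)


def test_korean():
    # Five Hangul syllables should cost more than five ASCII letters.
    assert estimate_tokens("안녕하세요") > estimate_tokens("hello")


def test_japanese_hiragana():
    assert estimate_tokens("こんにちは") > estimate_tokens("aaaaa")


def test_japanese_katakana():
    assert estimate_tokens("カタカナ") > estimate_tokens("abcd")


def test_cjk_heuristic_covers_all_scripts():
    # Every sample is pure CJK, so each should be charged at the CJK rate.
    for sample in ("你好", "안녕하세요", "こんにちは", "カタカナ"):
        assert estimate_tokens(sample) == math.ceil(len(sample) / 1.5)
```

The functions follow pytest naming conventions, so saving the block as a test module and running pytest on it would collect all four.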

The CJK-aware token estimation regex only matched Chinese characters and
fullwidth forms, missing Korean Hangul syllables (U+AC00-U+D7AF) and
Japanese Hiragana (U+3040-U+309F) / Katakana (U+30A0-U+30FF).

This caused Korean and Japanese text to be estimated at the ASCII rate
(~4 chars/token) instead of the CJK rate (~1.5 chars/token), producing
significantly lower token counts.

Added Hangul Syllables, Hiragana, and Katakana ranges to the regex and
added corresponding tests.