fix: include Korean Hangul and Japanese kana in CJK token heuristic by haosenwang1018 · Pull Request #2 · aeromomo/claw-compactor

haosenwang1018 · 2026-02-24T18:01:37Z

Problem

The CJK-aware token estimation regex in lib/tokens.py only matches Chinese characters (U+4E00–U+9FFF, U+3400–U+4DBF) and fullwidth forms. It misses:

Korean Hangul syllables (U+AC00–U+D7AF) — e.g. 안녕하세요
Japanese Hiragana (U+3040–U+309F) — e.g. こんにちは
Japanese Katakana (U+30A0–U+30FF) — e.g. カタカナ

This causes Korean and Japanese text to fall through to the ASCII estimation rate (~4 chars/token) instead of the CJK rate (~1.5 chars/token), so token counts for these languages are underestimated by roughly a factor of 2.7 (4 / 1.5).
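The gap can be reproduced with a small sketch. The pattern below is an assumed reconstruction of the pre-fix regex built from the ranges listed above. The fullwidth range (U+FF00–U+FFEF) and the name `OLD_CJK_RE` are guesses for illustration, not copied from lib/tokens.py.

```python
import re

# Assumed reconstruction of the pre-fix pattern: CJK Unified Ideographs,
# CJK Extension A, and fullwidth forms. The real _CJK_RE may differ.
OLD_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uff00-\uffef]")

samples = {
    "chinese": "你好世界",
    "korean": "안녕하세요",
    "hiragana": "こんにちは",
    "katakana": "カタカナ",
}

# Count how many characters of each sample the old pattern recognises as CJK.
matched = {name: len(OLD_CJK_RE.findall(text)) for name, text in samples.items()}

for name, count in matched.items():
    print(f"{name}: {count} of {len(samples[name])} characters matched")
```

Only the Chinese sample matches. The Korean and Japanese samples match zero characters, so every one of their characters would be counted at the ASCII rate.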

Fix

Extended _CJK_RE to include Hangul Syllables, Hiragana, and Katakana Unicode ranges.
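A minimal sketch of what the extended pattern might look like, assuming a simple per-character heuristic. The three added ranges come from this description. The fullwidth range, the `estimate_tokens` name, its parameters, and the rounding are illustrative assumptions, not the actual code in lib/tokens.py.

```python
import math
import re

# Hypothetical reconstruction of the extended pattern.
CJK_RE = re.compile(
    r"["
    r"\u4e00-\u9fff"  # CJK Unified Ideographs
    r"\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    r"\uff00-\uffef"  # fullwidth forms (assumed range)
    r"\uac00-\ud7af"  # Hangul Syllables (added)
    r"\u3040-\u309f"  # Hiragana (added)
    r"\u30a0-\u30ff"  # Katakana (added)
    r"]"
)


def estimate_tokens(text: str, cjk_rate: float = 1.5, ascii_rate: float = 4.0) -> int:
    """Estimate tokens: CJK characters at cjk_rate chars/token, the rest at ascii_rate."""
    cjk_chars = len(CJK_RE.findall(text))
    other_chars = len(text) - cjk_chars
    return math.ceil(cjk_chars / cjk_rate + other_chars / ascii_rate)


print(estimate_tokens("안녕하세요"))  # 5 Hangul characters at ~1.5 chars/token
print(estimate_tokens("hello"))       # 5 ASCII characters at ~4 chars/token
```

With these assumed rates, five Hangul characters come out at 4 tokens, while five ASCII characters come out at 2.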

Tests

Added 4 new tests:

test_korean — basic Korean text estimation
test_japanese_hiragana — Hiragana text
test_japanese_katakana — Katakana text
test_cjk_heuristic_covers_all_scripts — validates Hangul gets CJK rates in heuristic mode
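The test bodies are not shown here, so the following is only a sketch of what tests with these names could check. The `estimate_tokens` function is a hypothetical stand-in for the estimator in lib/tokens.py, and the assertions are assumptions about the intended behaviour rather than the PR's actual tests.

```python
import math
import re

# Stand-in estimator (hypothetical); the real function in lib/tokens.py may differ.
_CJK_RE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf\uac00-\ud7af\u3040-\u309f\u30a0-\u30ff]")


def estimate_tokens(text: str) -> int:
    cjk = len(_CJK_RE.findall(text))
    return math.ceil(cjk / 1.5 + (len(text) - cjk) / 4.0)


def test_korean():
    # Hangul should be estimated higher than ASCII text of the same length.
    assert estimate_tokens("안녕하세요") > estimate_tokens("hello")


def test_japanese_hiragana():
    assert estimate_tokens("こんにちは") > estimate_tokens("hello")


def test_japanese_katakana():
    assert estimate_tokens("カタカナ") > estimate_tokens("kata")


def test_cjk_heuristic_covers_all_scripts():
    # Four characters from each script should all get the same CJK-rate estimate.
    estimates = {estimate_tokens(s) for s in ("你好世界", "안녕하세", "こんにち", "カタカナ")}
    assert len(estimates) == 1
```

Comparing each script against same-length ASCII text, and against same-length Chinese text, avoids hard-coding a specific token count that would break if the rates were tuned.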

The CJK-aware token estimation regex only matched Chinese characters and fullwidth forms, missing Korean Hangul syllables (U+AC00-U+D7AF) and Japanese Hiragana (U+3040-U+309F) / Katakana (U+30A0-U+30FF). This caused Korean and Japanese text to be estimated at the ASCII rate (~4 chars/token) instead of the CJK rate (~1.5 chars/token), producing significantly underestimated token counts. Added Hangul Syllables, Hiragana, and Katakana ranges to the regex and added corresponding tests.

haosenwang1018 wants to merge 1 commit into aeromomo:main from haosenwang1018:fix/cjk-regex-korean-japanese
