Conversation

@clipperhouse
Owner

Behavior should be unchanged, except that text containing characters with the Indic_Conjunct_Break property may segment differently (and more correctly).

Copilot AI review requested due to automatic review settings January 25, 2026 18:24
Contributor

Copilot AI left a comment

Pull request overview

This PR upgrades the uax29 dependency from v2.3.0 to v2.4.0 to add Unicode 16 support for grapheme cluster segmentation. The PR description indicates behavior should remain unchanged except for text containing Indic_Conjunct_Break characters, which will be handled more correctly.

Changes:

  • Upgraded github.com/clipperhouse/uax29/v2 from v2.3.0 to v2.4.0 in the main module
  • Updated the same dependency in the comparison module for compatibility testing

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.

File                Description
go.mod              Updated uax29 dependency version to v2.4.0 in main module
go.sum              Updated checksums for uax29 v2.4.0 in main module
comparison/go.mod   Updated uax29 dependency version to v2.4.0 in comparison module
comparison/go.sum   Updated checksums for uax29 v2.4.0 in comparison module


@clipperhouse
Owner Author

From Claude Opus 4.5 “Is it correct that they should be width 1? What do text renderers and terminals do? Research.”

Based on my research, yes, Indic scripts should be width 1. Here's the evidence:

Unicode East Asian Width Standard

From the official Unicode 16.0 EastAsianWidth.txt file, all Indic scripts are classified as "N" (Neutral):

  • Devanagari (U+0900-U+097F): N (Neutral)
  • Bengali (U+0980-U+09FF): N
  • Tamil (U+0B80-U+0BFF): N
  • Telugu (U+0C00-U+0C7F): N
  • Kannada (U+0C80-U+0CFF): N
  • Malayalam (U+0D00-U+0D7F): N
  • Gujarati (U+0A80-U+0AFF): N

wcwidth Specification

Per the wcwidth library documentation:

  • Width 2: Only for East Asian Fullwidth ('F') or Wide ('W') properties - primarily CJK and emoji
  • Width 1: Everything else (Neutral, Narrow, Halfwidth)
  • Width 0: Combining marks, control characters, zero-width chars
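
The three width buckets above can be sketched in Go using only the standard library. This is a rough illustration, not the wcwidth or uax29 implementation: `cellWidth` and `wideSample` are hypothetical names, the wide table is a tiny hand-picked sample instead of the full EastAsianWidth.txt data, and treating only categories Mn and Me as zero-width is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

// wideSample is a deliberately small, hand-picked stand-in for the East Asian
// Wide/Fullwidth data. A real implementation would generate its table from
// EastAsianWidth.txt.
var wideSample = []struct{ lo, hi rune }{
	{0x1100, 0x115F}, // Hangul Jamo leading consonants
	{0x4E00, 0x9FFF}, // CJK Unified Ideographs
	{0xAC00, 0xD7A3}, // Hangul syllables
	{0xFF01, 0xFF60}, // Fullwidth ASCII variants
}

// cellWidth returns a per-rune width following the three buckets in the text:
// 0 for control, nonspacing/enclosing marks and a few zero-width characters,
// 2 for the sampled wide ranges, and 1 for everything else.
func cellWidth(r rune) int {
	switch {
	case unicode.IsControl(r):
		return 0
	case unicode.Is(unicode.Mn, r), unicode.Is(unicode.Me, r):
		return 0
	case r >= 0x200B && r <= 0x200D: // ZWSP, ZWNJ, ZWJ
		return 0
	}
	for _, w := range wideSample {
		if r >= w.lo && r <= w.hi {
			return 2
		}
	}
	return 1
}

func main() {
	// 'a', DEVANAGARI LETTER KA, DEVANAGARI SIGN VIRAMA, CJK U+4E2D, newline
	for _, r := range []rune{'a', 0x0915, 0x094D, 0x4E2D, '\n'} {
		fmt.Printf("%U width %d\n", r, cellWidth(r))
	}
}
```

Under these assumptions the Devanagari letter lands in the default bucket (width 1), while the virama, a nonspacing mark, contributes 0.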

Terminal Behavior

The go-runewidth library has a documented issue where Devanagari string "खा" (kha) returns width 1, and this is actually correct per the standard. The confusion arose because:

  • Each grapheme cluster is width 1 (correct)
  • The issue was about grapheme segmentation, not the width value itself

Conclusion

The tests I wrote have correct expected widths. Indic scripts are:

  • Width 1 per grapheme cluster (per Unicode East Asian Width)
  • The key test is that Indic_Conjunct_Break ensures proper grapheme clustering (e.g., क्ष forms 1 cluster, not 3 separate runes)

The tests correctly verify:

  • Grapheme cluster formation (the Indic_Conjunct_Break behavior)
  • Width 1 per cluster (per Unicode standard)
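
To make the "1 cluster, not 3 separate runes" point concrete, here is a toy segmenter using only the standard library. It is not the UAX #29 algorithm and not what uax29 does internally: `toyClusters` and `attaches` are hypothetical helpers that only attach marks to the preceding rune and join a Devanagari letter that follows the Devanagari virama (U+094D). Real segmentation has many more rules.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// devanagariVirama is the only conjunct linker this toy knows about.
const devanagariVirama = '\u094D'

// attaches reports whether r should stay in the same cluster as prev.
// Assumption: marks always attach, and a Devanagari letter attaches when it
// directly follows the virama (a crude stand-in for conjunct joining).
func attaches(prev, r rune) bool {
	if unicode.IsMark(r) {
		return true
	}
	return prev == devanagariVirama &&
		unicode.Is(unicode.Devanagari, r) && unicode.IsLetter(r)
}

// toyClusters splits s into clusters using the simplified rule above.
func toyClusters(s string) []string {
	var clusters []string
	start := 0
	var prev rune
	for i, r := range s {
		if i > 0 && !attaches(prev, r) {
			clusters = append(clusters, s[start:i])
			start = i
		}
		prev = r
	}
	if start < len(s) {
		clusters = append(clusters, s[start:])
	}
	return clusters
}

func main() {
	ksha := "\u0915\u094D\u0937"                // KA + VIRAMA + SSA
	namaste := "\u0928\u092E\u0938\u094D\u0924\u0947" // NA, MA, SA+VIRAMA+TA+E

	fmt.Println(utf8.RuneCountInString(ksha), len(toyClusters(ksha)))
	fmt.Println(utf8.RuneCountInString(namaste), len(toyClusters(namaste)))
}
```

With this rule the conjunct is three runes but one cluster, so a width of 1 per cluster gives the whole conjunct width 1.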

@clipperhouse
Owner Author

Further from Claude Opus 4.5

“What does Neutral mean, in terms of guidance for width?”

Here's the key guidance from Unicode UAX #11 on Neutral characters:

Neutral (N) - "Not East Asian"

From the official Unicode specification:

ED7. Neutral (Not East Asian): All other characters. Neutral characters do not occur in legacy East Asian character sets. By extension, they also do not tend to occur in East Asian typography. For example, there is no traditional Japanese way of typesetting Devanagari.

The critical guidance:

Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na (Narrow), they are treated as narrow characters (the same as Na) under the recommendations.

Recommendations for Neutral Characters

From Section 5 "Recommendations":

Context                    Treatment
Processing/display         Narrow characters (including Neutral) behave like Western characters. In fixed-pitch East Asian fonts, they take up 1/2 Em of space.
Non-East Asian encodings   Neutral Unicode characters always map to regular (narrow) characters
East Asian encodings       Neutral Unicode characters always map to halfwidth characters

Summary

Neutral = Width 1 in terminal/display contexts because:

  • Neutral characters "behave like Na" (Narrow)
  • They take up 1/2 Em in fixed-pitch fonts (same as width 1)
  • They don't occur in East Asian legacy character sets
  • Devanagari is explicitly cited as a script with no traditional Japanese typesetting, and the same reasoning extends to Bengali, Tamil, etc.

So the tests are correct: Indic scripts are Neutral → treated as width 1.
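
The category-to-width mapping described above can be written as a small lookup. This is a sketch under stated assumptions: the `EastAsianWidth` type, its constants, and `cells` are hypothetical names for illustration, and the handling of Ambiguous (wide only in an East Asian context) follows the general UAX #11 recommendation, not anything confirmed about this repository's code.

```go
package main

import "fmt"

// EastAsianWidth is a hypothetical enum for the UAX #11 property values.
type EastAsianWidth int

const (
	Neutral   EastAsianWidth = iota // N
	Narrow                          // Na
	Halfwidth                       // H
	Ambiguous                       // A
	Wide                            // W
	Fullwidth                       // F
)

// cells maps a property value to terminal cells. Neutral is treated the same
// as Narrow. Ambiguous resolves to wide only when eastAsianContext is true.
func cells(w EastAsianWidth, eastAsianContext bool) int {
	switch w {
	case Wide, Fullwidth:
		return 2
	case Ambiguous:
		if eastAsianContext {
			return 2
		}
		return 1
	default: // Neutral, Narrow, Halfwidth
		return 1
	}
}

func main() {
	fmt.Println("Neutral:", cells(Neutral, false))
	fmt.Println("Wide:", cells(Wide, false))
	fmt.Println("Ambiguous (East Asian context):", cells(Ambiguous, true))
}
```

Because Indic letters are Neutral, the context flag never affects them in this sketch: they resolve to 1 either way.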

@clipperhouse
Owner Author

clipperhouse commented Jan 25, 2026

PR Review (Claude Opus 4.5)

Verdict: Approve

This is a clean dependency upgrade with excellent test coverage.

Summary

  • Upgrades uax29 from v2.3.0 to v2.4.0 for Unicode 16.0 support
  • Adds 267 lines of tests covering the Indic_Conjunct_Break property across 7 Indic scripts (Devanagari, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam)
  • All tests pass

Strengths

  1. Comprehensive coverage: 23 test cases including conjuncts, words, mixed scripts, and edge cases (virama, vowel signs)
  2. Correct width expectations: Indic scripts correctly treated as width 1 per UAX #11 (Neutral → Narrow)
  3. Script-aware handling: Appropriately accounts for Tamil/Kannada virama differences
  4. Multi-layer verification: Tests validate String/Bytes width, cluster count, width sums, and string reconstruction

No bugs found

The research in the PR comments correctly explains the Unicode specification. The implementation is sound.

@clipperhouse clipperhouse merged commit 862a6d1 into main Jan 25, 2026
9 checks passed
@clipperhouse clipperhouse deleted the unicode-16 branch January 25, 2026 18:48