Upgrade uax29 for Unicode 16 support #15

clipperhouse · 2026-01-25T18:24:22Z

Behavior should be unchanged, except that text containing characters with the Indic_Conjunct_Break property may segment differently (and more correctly).

Copilot

Pull request overview

This PR upgrades the uax29 dependency from v2.3.0 to v2.4.0 to add Unicode 16 support for grapheme cluster segmentation. The PR description indicates behavior should remain unchanged except for text containing Indic_Conjunct_Break characters, which will be handled more correctly.

Changes:

Upgraded github.com/clipperhouse/uax29/v2 from v2.3.0 to v2.4.0 in the main module
Updated the same dependency in the comparison module for compatibility testing

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated no comments.

File	Description
go.mod	Updated uax29 dependency version to v2.4.0 in main module
go.sum	Updated checksums for uax29 v2.4.0 in main module
comparison/go.mod	Updated uax29 dependency version to v2.4.0 in comparison module
comparison/go.sum	Updated checksums for uax29 v2.4.0 in comparison module

clipperhouse · 2026-01-25T18:37:46Z

From Claude Opus 4.5 “Is it correct that they should be width 1? What do text renderers and terminals do? Research.”

Based on my research, yes, Indic scripts should be width 1. Here's the evidence:

Unicode East Asian Width Standard

In the official Unicode 16.0 EastAsianWidth.txt file, the following Indic script blocks are classified as "N" (Neutral):

Devanagari (U+0900-U+097F): N (Neutral)
Bengali (U+0980-U+09FF): N
Tamil (U+0B80-U+0BFF): N
Telugu (U+0C00-U+0C7F): N
Kannada (U+0C80-U+0CFF): N
Malayalam (U+0D00-U+0D7F): N
Gujarati (U+0A80-U+0AFF): N

wcwidth Specification

Per the wcwidth library documentation:

Width 2: Only for East Asian Fullwidth ('F') or Wide ('W') properties - primarily CJK and emoji
Width 1: Everything else (Neutral, Narrow, Halfwidth)
Width 0: Combining marks, control characters, zero-width chars

Terminal Behavior

The go-runewidth library has a documented issue where the Devanagari string "खा" (khā) returns width 1, and this is actually correct per the standard. The confusion arose because:

Each grapheme cluster is width 1 (correct)
The issue was about grapheme segmentation, not the width value itself

Conclusion

The tests I wrote have correct expected widths. Indic scripts are width 1 per grapheme cluster (per Unicode East Asian Width). The key test is that Indic_Conjunct_Break ensures proper grapheme clustering (e.g., the 3 runes of क्ष form 1 cluster rather than being split).

The tests correctly verify:

Grapheme cluster formation (the Indic_Conjunct_Break behavior)
Width 1 per cluster (per Unicode standard)

clipperhouse · 2026-01-25T18:41:13Z

Further from Claude Opus 4.5

“What does Neutral mean, in terms of guidance for width?”

Here's the key guidance from Unicode UAX #11 on Neutral characters:

Neutral (N) - "Not East Asian"

From the official Unicode specification:

ED7. Neutral (Not East Asian): All other characters. Neutral characters do not occur in legacy East Asian character sets. By extension, they also do not tend to occur in East Asian typography. For example, there is no traditional Japanese way of typesetting Devanagari.

The critical guidance:

Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but because for all practical purposes they behave like Na (Narrow), they are treated as narrow characters (the same as Na) under the recommendations.

Recommendations for Neutral Characters

From Section 5 "Recommendations":

Context	Treatment
Processing/display	Narrow characters (including Neutral) behave like Western characters. In fixed-pitch East Asian fonts, they take up 1/2 Em of space.
Non-East Asian encodings	Neutral Unicode characters always map to regular (narrow) characters
East Asian encodings	Neutral Unicode characters always map to halfwidth characters

Summary

Neutral = Width 1 in terminal/display contexts because:

Neutral characters "behave like Na" (Narrow)
They take up 1/2 Em in fixed-pitch fonts (same as width 1)
They don't occur in East Asian legacy character sets
Devanagari is explicitly given as an example of a script with no traditional East Asian typesetting; the same reasoning applies to Bengali, Tamil, etc.

So the tests are correct: Indic scripts are Neutral → treated as width 1.

clipperhouse · 2026-01-25T18:47:32Z

PR Review (Claude Opus 4.5)

Verdict: Approve ✓

This is a clean dependency upgrade with excellent test coverage.

Summary

Upgrades uax29 from v2.3.0 to v2.4.0 for Unicode 16.0 support
Adds 267 lines of tests covering Indic_Conjunct_Break property across 7 Indic scripts (Devanagari, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam)
All tests pass

Strengths

Comprehensive coverage: 23 test cases including conjuncts, words, mixed scripts, and edge cases (virama, vowel signs)
Correct width expectations: Indic scripts correctly treated as width 1 per UAX #11 (Neutral → Narrow)
Script-aware handling: Appropriately accounts for Tamil/Kannada virama differences
Multi-layer verification: Tests validate String/Bytes width, cluster count, width sums, and string reconstruction

No bugs found

The research in the PR comments correctly explains the Unicode specification. The implementation is sound.
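
The multi-layer verification mentioned under Strengths (cluster count, width sums, string reconstruction) can be sketched as one invariant check in Go. This is a hypothetical helper, not the PR's test code: `checkInvariants` and its parameters are assumptions, and in a real test the clusters and widths would come from the library under test.

```go
package main

import (
	"fmt"
	"strings"
)

// checkInvariants verifies that a segmentation result is self-consistent:
// the cluster count matches, the clusters rebuild the input exactly, and
// the per-cluster widths sum to the width reported for the whole string.
func checkInvariants(input string, clusters []string, widths []int, wantCount, totalWidth int) error {
	if len(clusters) != wantCount {
		return fmt.Errorf("%q: got %d clusters, want %d", input, len(clusters), wantCount)
	}
	if len(widths) != len(clusters) {
		return fmt.Errorf("%q: %d widths for %d clusters", input, len(widths), len(clusters))
	}
	if joined := strings.Join(clusters, ""); joined != input {
		return fmt.Errorf("%q: clusters reconstruct to %q", input, joined)
	}
	sum := 0
	for _, w := range widths {
		sum += w
	}
	if sum != totalWidth {
		return fmt.Errorf("%q: widths sum to %d, string width is %d", input, sum, totalWidth)
	}
	return nil
}

func main() {
	// क्ष followed by क: two clusters of width 1 each (illustrative data).
	input := "\u0915\u094D\u0937\u0915"
	clusters := []string{"\u0915\u094D\u0937", "\u0915"}
	if err := checkInvariants(input, clusters, []int{1, 1}, 2, 2); err != nil {
		fmt.Println("invariant violated:", err)
		return
	}
	fmt.Println("invariants hold")
}
```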

Upgrade uax29 for Unicode 16

7d1d4bd

Copilot AI review requested due to automatic review settings January 25, 2026 18:24

Copilot started reviewing on behalf of clipperhouse January 25, 2026 18:24

Copilot AI reviewed Jan 25, 2026

Add tests

3606136

clipperhouse merged commit 862a6d1 into main Jan 25, 2026
9 checks passed

clipperhouse deleted the unicode-16 branch January 25, 2026 18:48
