Fix off-by-one excluding U+10FFFF from valid Unicode range in emitter #918

Open
bysiber wants to merge 1 commit into yaml:main from bysiber:fix/emitter-unicode-range-off-by-one

bysiber commented Feb 20, 2026

Summary

The emitter's analyze_scalar method has an off-by-one error that excludes U+10FFFF from the valid Unicode range, causing it to be treated as a "special character" and forcing double-quoted scalar style.

Problem

The supplementary plane check uses strict less-than:

or '\U00010000' <= ch < '\U0010ffff'

This excludes U+10FFFF, which is a valid Unicode code point per the YAML spec's c-printable production (YAML 1.1 section 4.1):

c-printable ::= ... | [#x10000-#x10FFFF]

Meanwhile, reader.py correctly uses an inclusive upper bound in its character validation regex:

'\U00010000-\U0010ffff'

This creates an inconsistency: the reader accepts U+10FFFF, but the emitter treats it as special, unnecessarily forcing double-quoted style when the scalar could use plain, single-quoted, or block styles.
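
The mismatch can be reproduced without PyYAML. The sketch below is only an illustration: it reuses the two range expressions quoted above, and the helper names `reader_style_accepts` and `emitter_style_accepts` are hypothetical, not functions in the library.

```python
import re

# Reader-style check: a character class with an inclusive upper bound,
# built from the range quoted from reader.py above.
SUPPLEMENTARY_RE = re.compile('[\U00010000-\U0010ffff]')


def reader_style_accepts(ch: str) -> bool:
    """Hypothetical helper: True if ch matches the inclusive regex range."""
    return SUPPLEMENTARY_RE.fullmatch(ch) is not None


def emitter_style_accepts(ch: str) -> bool:
    """Hypothetical helper: the comparison quoted from analyze_scalar,
    with its strict upper bound left as-is."""
    return '\U00010000' <= ch < '\U0010ffff'


# Compare both checks at the edges of the supplementary planes.
for ch in ('\U00010000', '\U0010fffe', '\U0010ffff'):
    print(f"U+{ord(ch):06X}",
          "reader:", reader_style_accepts(ch),
          "emitter:", emitter_style_accepts(ch))
# The two checks agree everywhere except on the last line, U+10FFFF.
```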

Demonstration

import io
from yaml.emitter import Emitter

e = Emitter(io.StringIO(), allow_unicode=True)

# U+10FFFE: correctly treated as valid unicode
a1 = e.analyze_scalar('\U0010fffe')
print(a1.allow_single_quoted)  # True

# U+10FFFF: incorrectly treated as special character
a2 = e.analyze_scalar('\U0010ffff')
print(a2.allow_single_quoted)  # False, but should be True

Fix

Change < to <= to include U+10FFFF in the valid range, matching both the YAML spec and the reader's validation.
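
To gauge the scope of this change, the following sketch compares the strict and inclusive forms of the range check over every code point. It is a stand-alone approximation that isolates only the supplementary-plane clause, not the full condition in analyze_scalar, and the function names are hypothetical.

```python
import sys


def strict_check(ch: str) -> bool:
    """Supplementary-plane clause before the fix (strict upper bound)."""
    return '\U00010000' <= ch < '\U0010ffff'


def inclusive_check(ch: str) -> bool:
    """Supplementary-plane clause after the fix (inclusive upper bound)."""
    return '\U00010000' <= ch <= '\U0010ffff'


# Collect every code point on which the two versions disagree.
changed = [
    cp for cp in range(sys.maxunicode + 1)
    if strict_check(chr(cp)) != inclusive_check(chr(cp))
]

print([f"U+{cp:X}" for cp in changed])  # ['U+10FFFF']
```

Under these assumptions, the classification changes for exactly one code point, so scalars that do not contain U+10FFFF should be emitted as before.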

The supplementary plane range check in analyze_scalar uses strict
less-than (< U+10FFFF) instead of less-than-or-equal (<= U+10FFFF).
This excludes U+10FFFF, a valid Unicode code point per the YAML spec's
c-printable production, causing it to be treated as a special character
and unnecessarily forcing double-quoted style.

The reader module correctly includes U+10FFFF in its acceptance range.