Parse XML-style <?target data?> processing instructions#12118

Open
foolip wants to merge 20 commits into main from foolip/pi-parsing
@foolip
Member

foolip commented Jan 31, 2026

Notable behaviors and their rationale:

  • While <? opens a PI, a > always closes it. This is to match the bogus
    comment tokenizer behavior, so that exactly the same characters are
    consumed as bogus comments in existing parsers. Any deviation would
    make this a much riskier parser change.
  • The target must start with ASCII alpha, but after that almost anything
    goes, matching tag names. This doesn't have to be this way, but
    there's no obvious other precedent to follow.
  • The target is ASCII-lowercased by the tokenizer, just like tag and
    attribute names. This is for consistency, and there's precedent from
    the SVG-in-HTML parser behavior.
  • If the target is a case-insensitive match for "xml" or "xml-stylesheet",
    it's instead treated as a bogus comment. <?xml?> is not a valid PI
    in XML, and <?xml-stylesheet href="style.css"?> would otherwise
    start loading stylesheets in HTML documents, which we don't want.
  • A ? that's not followed by a > becomes part of the data (never
    the target). This is to match XML for a <?t ???> where data is "??".
  • For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
    behavior results from handling ? the same way wherever it's
    encountered. The example would serialize back as <?t ?d?>.
  • <?> and <? followed by EOF are treated as bogus comments, because
    this is the most conservative choice, and also avoids empty target.
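
The rules listed above can be read as a small splitting procedure. The following is a minimal sketch of that reading, not the tokenizer states from the PR. `splitProcessingInstruction`, `asciiLowercase`, and the returned object shape are hypothetical names; the function works on a whole string instead of a character stream; and the leading `?` in the bogus comment data reflects how existing parsers already treat `<?`. EOF after a target has started is not covered by the list, so the sketch refuses that input.

```javascript
// Hypothetical helper: classify "<?...>" text per the rules listed above.
// A whole-string sketch, not the tokenizer state machine from the spec PR.
const ASCII_WHITESPACE = /[\t\n\f\r ]/;

// Only A-Z are folded, as for tag and attribute names.
function asciiLowercase(s) {
  return s.replace(/[A-Z]/g, (c) => c.toLowerCase());
}

function splitProcessingInstruction(input) {
  if (!input.startsWith("<?")) throw new Error("input must start with <?");
  const close = input.indexOf(">");
  // Everything between "<?" and the first ">" (or EOF when there is no ">").
  const raw = close === -1 ? input.slice(2) : input.slice(2, close);
  // Existing parsers keep the "?" as the first character of the comment data.
  const bogusComment = { type: "comment", data: "?" + raw };

  // The target must start with ASCII alpha; this also covers "<?>" and "<?" + EOF.
  if (!/^[A-Za-z]/.test(raw)) return bogusComment;
  if (close === -1) throw new Error("EOF inside a PI is not covered by this sketch");

  // A "?" immediately before ">" closes the PI and is not part of the data.
  const body = raw.endsWith("?") ? raw.slice(0, -1) : raw;

  // The target runs up to the first ASCII whitespace or "?".
  let end = 0;
  while (end < body.length && body[end] !== "?" && !ASCII_WHITESPACE.test(body[end])) end++;
  const target = asciiLowercase(body.slice(0, end));
  if (target === "xml" || target === "xml-stylesheet") return bogusComment;

  // Whitespace between target and data is skipped; the rest is data.
  let start = end;
  while (start < body.length && ASCII_WHITESPACE.test(body[start])) start++;
  return { type: "pi", target, data: body.slice(start) };
}

splitProcessingInstruction("<?t ???>");   // { type: "pi", target: "t", data: "??" }
splitProcessingInstruction("<?t?d?>");    // { type: "pi", target: "t", data: "?d" }
splitProcessingInstruction("<?xml?>");    // { type: "comment", data: "?xml?" }
```

Because the first `>` always ends the construct, an input such as `<?t name="a>b"?>` is split at the `>` inside the quotes under this reading.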

  • At least two implementers are interested (and none opposed):
  • Tests are written and can be reviewed and commented upon at:
  • Implementation bugs are filed:
    • Chromium: …
    • Gecko: …
    • WebKit: …
    • Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
    • Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
  • Corresponding HTML AAM & ARIA in HTML issues & PRs:
  • MDN issue is filed: …
  • The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)


/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/syntax.html ( diff )

Notable behaviors and their rationale:

- While `<?` opens a PI, a `>` always closes it. This is to match the bogus
  comment tokenizer behavior, so that exactly the same characters are
  consumed as bogus comments in existing parsers. Any deviation would
  make this a much riskier parser change.
- The target must start with ASCII alpha, but after that almost anything
  goes, matching tag names. This doesn't have to be this way, but
  there's no obvious precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and
  attribute names. This is for consistency, and there's precedent from
  the SVG-in-HTML parser behavior.
- A `?` that isn't followed by `>` is treated like whitespace, similar
  to `/` in opening tags. This is also the simplest, with a single
  tokenizer state used whenever `?` is encountered.
- `<?>` and `<?` followed by EOF are treated as bogus comments, because
  this is the most conservative choice, and also avoids empty target.
foolip force-pushed the foolip/pi-parsing branch from f806ae3 to a26fe99 on January 31, 2026 08:09
foolip marked this pull request as ready for review on January 31, 2026 21:15
@foolip
Member Author

foolip commented Jan 31, 2026

I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

@noamr
Collaborator

noamr commented Jan 31, 2026

I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

Nice! I can have tests ready within a few days.

Member

hsivonen left a comment

LGTM. Thanks.

@zcorpan
Member

zcorpan commented Feb 2, 2026

  • For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
    behavior results from handling ? the same way wherever it's
    encountered. The example would serialize back as <?t ?d?>.

Nit: As specified in https://html.spec.whatwg.org/#serialising-html-fragments it would serialize as <?t ?d> (without trailing ?)
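
The two serializations being compared can be written out as a small sketch. `serializeProcessingInstruction` and its `closeWithQuestionMark` option are hypothetical names used only to contrast the outputs; the `>`-only form follows the fragment serialization algorithm linked above.

```javascript
// Hypothetical helper contrasting the two ways of closing a serialized PI.
function serializeProcessingInstruction(target, data, { closeWithQuestionMark = false } = {}) {
  // The fragment serialization algorithm appends "<?", the target,
  // a single space, the data, and then ">".
  const end = closeWithQuestionMark ? "?>" : ">";
  return "<?" + target + " " + data + end;
}

serializeProcessingInstruction("t", "?d");                                  // "<?t ?d>"
serializeProcessingInstruction("t", "?d", { closeWithQuestionMark: true }); // "<?t ?d?>"
```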

@foolip
Member Author

foolip commented Feb 3, 2026

@zcorpan you're absolutely right, I was testing in Chrome and didn't realize the spec says otherwise.

@foolip
Member Author

foolip commented Feb 3, 2026

Test for serialization: web-platform-tests/wpt#57486

@zcorpan
Member

zcorpan commented Feb 3, 2026

Ah, I did not know Chromium and WebKit serialize with ?>... After discussing with @hsivonen a bit, we think it makes sense to change Gecko to serialize with ?> also.

This opens the door to making the ? required for document conformance.

Pros to making it required:

  • It's a clear deliberate ending of the PI, so that an accidental > can be highlighted as a syntax error.
  • Matches XML (except HTML PI can't contain >).

Cons:

  • It's an extra character that's not strictly necessary.

@zcorpan
Member

zcorpan commented Feb 3, 2026

Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486

I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

@noamr
Collaborator

noamr commented Feb 3, 2026

Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486

I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

Yea we could probably also parse that PI for <?xml> without that doing too much damage as nobody is looking at it.

@foolip
Member Author

foolip commented Feb 4, 2026

On serializing PIs with just >, the current behavior was spec'd in 5319a9a and https://lists.whatwg.org/pipermail/whatwg-whatwg.org/2008-February/056354.html (same day) doesn't give any more details.

@zcorpan, changing it to ?> and updating web-platform-tests/wpt#57486 would be OK, but I already prepared https://chromium-review.googlesource.com/c/chromium/src/+/7531667, so that door is open.

Regarding conformance, do you mean that a case like <?t name="a>b"?> can be highlighted as a syntax error? That's a good point, so even if you assume that you can put > inside of PIs without escaping, it will be an error. Note that <?t name="a>b"> could also be a syntax error however. It's only with > in an unquoted attribute value like <?t name=a>b> that we can't make it an error, but that case looks like regular element syntax. Am I misunderstanding the argument?

@justinfagnani

On the DOM side, does this still create Comment nodes? Otherwise this is certainly not web compatible.

@noamr
Collaborator

noamr commented Feb 4, 2026

On the DOM side, does this still create Comment nodes? Otherwise this is certainly not web compatible.

It creates ProcessingInstruction nodes which already exist in DOM. What specifically makes this web incompatible in your view? HTTP archive shows very little use of PI-like syntax and we can exclude problematic targets if we need to.

@justinfagnani

In lit-html we use PI syntax to create comments that are parsed inside templates: https://github.com/lit/lit/blob/f243134b226735320b61466cebdaf0c1e574bfa7/packages/lit-html/src/lit-html.ts#L354

These comments are then retrieved with a TreeWalker that shows Comment nodes.

The targets aren't fixed, but have the form lit$$xxxxxxxxx$ where xxxxxxxxx is a 9-digit random number.
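
To illustrate why the node type matters for code like this: a TreeWalker's `whatToShow` is a bitmask with one bit per node type, so a walker created with `NodeFilter.SHOW_COMMENT` skips ProcessingInstruction nodes. The sketch below reproduces that check with the DOM's numeric constants so that it runs outside a browser. `isShown` is a hypothetical helper, not a DOM API.

```javascript
// Numeric values as defined by the DOM Standard.
const COMMENT_NODE = 8;
const PROCESSING_INSTRUCTION_NODE = 7;
const SHOW_COMMENT = 0x80;
const SHOW_PROCESSING_INSTRUCTION = 0x40;

// Hypothetical helper mirroring the whatToShow test:
// bit (nodeType - 1) must be set for the node to be visited.
function isShown(whatToShow, nodeType) {
  return ((1 << (nodeType - 1)) & whatToShow) !== 0;
}

isShown(SHOW_COMMENT, COMMENT_NODE);                // true
isShown(SHOW_COMMENT, PROCESSING_INSTRUCTION_NODE); // false
isShown(SHOW_COMMENT | SHOW_PROCESSING_INSTRUCTION, PROCESSING_INSTRUCTION_NODE); // true
```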

@noamr
Collaborator

noamr commented Feb 4, 2026

In lit-html we use PI syntax to create comments that are parsed inside templates: https://github.com/lit/lit/blob/f243134b226735320b61466cebdaf0c1e574bfa7/packages/lit-html/src/lit-html.ts#L354

These comments are then retrieved with a TreeWalker that shows Comment nodes.

The targets aren't fixed, but have the form lit$$xxxxxxxxx$ where xxxxxxxxx is a 9-digit random number.

We can easily exclude targets that start with 'lit$' from this, or any other string that we find to be problematic in terms of compat.

@noamr
Collaborator

noamr commented Feb 4, 2026

In lit-html we use PI syntax to create comments that are parsed inside templates: https://github.com/lit/lit/blob/f243134b226735320b61466cebdaf0c1e574bfa7/packages/lit-html/src/lit-html.ts#L354

These comments are then retrieved with a TreeWalker that shows Comment nodes.

The targets aren't fixed, but have the form lit$$xxxxxxxxx$ where xxxxxxxxx is a 9-digit random number.

@foolip suggested that we constrain the syntax instead.
Perhaps something like this would work:

^[a-zA-Z_][\w_\.\-]*$

I think treating invalid targets as bogus comments would work better than block-listing lit$ and would fix this issue.

@noamr
Collaborator

noamr commented Feb 4, 2026

In lit-html we use PI syntax to create comments that are parsed inside templates: https://github.com/lit/lit/blob/f243134b226735320b61466cebdaf0c1e574bfa7/packages/lit-html/src/lit-html.ts#L354
These comments are then retrieved with a TreeWalker that shows Comment nodes.
The targets aren't fixed, but have the form lit$$xxxxxxxxx$ where xxxxxxxxx is a 9-digit random number.

@foolip suggested that we constrain the syntax instead. Perhaps something like this would work:

^[a-zA-Z_][\w_\.\-]*$

I think treating invalid targets as bogus comments would work better than block-listing lit$ and would fix this issue.

Or even pithier, /^\p{ID_Start}[\p{ID_Continue}-]*$/uv - allow the unicode range for ID_start, followed by zero or more "ID_Continue or dash" (see https://www.unicode.org/reports/tr31/)

@zcorpan
Member

zcorpan commented Feb 4, 2026

Or even pithier, /^\p{ID_Start}[\p{ID_Continue}-]*$/uv - allow the unicode range for ID_start, followed by zero or more "ID_Continue or dash" (see https://www.unicode.org/reports/tr31/)

That includes non-ASCII, right? I think the HTML parser intentionally doesn't branch for non-ASCII anywhere.
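
A quick check of the quoted pattern, as a sketch. JavaScript does not accept the `u` and `v` flags together, so this uses `u` alone, which is enough for the property escapes here.

```javascript
// The pattern quoted above, with the "u" flag only.
const unicodeTarget = /^\p{ID_Start}[\p{ID_Continue}-]*$/u;

unicodeTarget.test("target"); // true
unicodeTarget.test("é");      // true: U+00E9 is ID_Start, so non-ASCII is accepted
unicodeTarget.test("_x");     // false: "_" is ID_Continue but not ID_Start
```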

@noamr
Collaborator

noamr commented Feb 4, 2026

Or even pithier, /^\p{ID_Start}[\p{ID_Continue}-]*$/uv - allow the unicode range for ID_start, followed by zero or more "ID_Continue or dash" (see https://www.unicode.org/reports/tr31/)

That includes non-ASCII, right? I think the HTML parser intentionally doesn't branch for non-ASCII anywhere.

Good point, it can be something like /^[A-Za-z][-A-Za-z0-9]*$/.
We can also allow . and/or _ if we really want to. But perhaps less is more.

@foolip
Member Author

foolip commented Feb 4, 2026

Testing XML parser behavior for the first character and second character of a PI target in Chrome/Firefox/Safari, it looks like:

  • First character can be [A-Za-z_] (ASCII alpha or underscore) and a lot of non-ASCII characters.
  • Second character can be those plus 0-9, U+002D (-), U+002E (.) and a lot of non-ASCII characters.

It looks consistent across browsers for ASCII, but I see differences beyond that.

So it matches https://www.w3.org/TR/xml/#NT-Name in the ASCII range, except that ":" isn't allowed because of namespaces.

To fix the <?lit$$123456789> issue the smallest possible change is to specifically look for $ while expecting a target and treat that as a bogus comment.

But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen.

@hsivonen
Member

hsivonen commented Feb 4, 2026

But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen.

Starts with ASCII alpha and continues with ASCII alphanumeric or hyphen seems OK.

@foolip
Member Author

foolip commented Feb 4, 2026

But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen.

Starts with ASCII alpha and continues with ASCII alphanumeric or hyphen seems OK.

I've made this change in 64f133a.
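
For reference, a sketch of what the agreed constraint accepts and rejects. `PI_TARGET` and `isAllowedTarget` are hypothetical names, and the separate handling of `xml` and `xml-stylesheet` is not covered here.

```javascript
// Constraint discussed above: ASCII alpha, then ASCII alphanumerics or hyphens.
const PI_TARGET = /^[A-Za-z][A-Za-z0-9-]*$/;

// Hypothetical helper: would this target be accepted as a PI target,
// or would the construct fall back to a bogus comment?
function isAllowedTarget(target) {
  return PI_TARGET.test(target);
}

isAllowedTarget("my-target1");      // true
isAllowedTarget("lit$$123456789$"); // false: "$" is not allowed
isAllowedTarget("_private");        // false: accepted by XML parsers, not here
isAllowedTarget("a.b");             // false: "." is not allowed
```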

data-x="syntax-pi-data">data</span> in the next step, there must be one or more <span>ASCII
whitespace</span>.</li>

<li>Optionally, <span data-x="syntax-pi-data">data</span>.</li>
Member

Maybe we could do

Optionally, one ASCII whitespace followed by data.

so we don't need the awkward conditional in the previous step.

Member

"one or more ASCII whitespace" since otherwise the data could start with whitespace.

Member

I'm not sure I follow what you're saying. Wouldn't the previous step take care of "more"? How would it impact data?

Member Author

I copied the awkward conditional from https://html.spec.whatwg.org/#start-tags step 3.

The problem with combining these steps is that <?target > with unnecessary space should be conforming, so whitespace is always allowed but only required when data follows. With start tags this isn't an issue, because more whitespace is allowed after attributes. There is no equivalent after PI data: trailing whitespace is considered part of the PI data.

Member

Ah right. Okay.

"<code data-x="">?&gt;</code>". This is for compatibility with previously specified behavior of
the "<code data-x="">&lt;?</code>" syntax, which was treated as a <span
data-x="bogus comment state">bogus comment</span>, ending with
"<code data-x="">&gt;</code>".</p></li>
Member

That we make it conforming is not for compatibility.

Member Author

I see what you're saying that this is the syntax section and not the parser, but if we make the ? optional, it really is because of the previous parser behavior and content that depends on it, right?

Do you think this would be better left without explanation, or perhaps only with reference to existing content?

Member

The syntax section describes what is conforming. It doesn't have to concern itself with what ends up being parsed. We could make ? mandatory to write and validators would complain if it was missing, yet browsers would still end up with PIs in the node tree.

We could even not end up with a PI if it was missing if we wanted to.

I think we should make it optional however as it's convenient. That this is possible is because comments already parsed this way, but it's not a compatibility decision.

</div>

<div algorithm>
<p>To <dfn>replace processing instruction with comment</dfn>:</p>
Member

I think you either have to pluralize or use "a PI" and "a comment".

Member Author

It's the current processing instruction token, so I can make it "replace current processing instruction with comment" or "replace the current processing instruction with a comment", do you have a preference?

Note the above "flush code points consumed as a character reference" which is a little similar in purpose. I couldn't come up with something very similar though, in this case we're not "flushing" because the characters aren't emitted as character tokens.

Member

I guess it's okay to leave as-is. I don't like how much global state we have here that isn't really explained or linked, but that's very much a pre-existing issue.


<dt>EOF</dt>
<dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span>
<span>parse error</span>. Emit an end-of-file token.</dd>
Member

It seems weird to drop the PI completely on the floor. Seems cleaner to emit a bogus comment?

Member

I disagree. Start tags are dropped on the floor. To emit a comment we'd have to construct one from the PI or build both a PI and a comment in case of EOF. Why emit a comment?

Member

That's what we did before. A PI is more akin to a comment than a start tag.

Collaborator

How did we do that before? We never went back all the way to the start of the PI to reconstruct it as a comment when it was this far along (only when it was at its first 1 or 2 characters). We'd have to keep all the raw data like whitespaces to make this accurate.

Member Author

The proposed model is to only emit a token when we're sure what it was supposed to be, and if we've only seen <?x we don't know yet if it was going to be <?xyz> (a PI), <?xml> (a special case bogus comment) or <?x$> (a bogus comment).

If we want to preserve the behavior of EOF after <? the parser had previously, we will have to use the temporary buffer with unlimited buffering, since we have to reach the > before we know if it should be a PI or a comment. (We couldn't construct it from the current processing instruction token, for that we'd need to also buffer the whitespace between target and data.)

I think @hsivonen will have opinions on this point.

Member Author

See 2cfc613 for what it looked like before I changed it to just emit EOF.

Collaborator

@noamr noamr Feb 10, 2026

The proposed model is to only emit a token when we're sure what it was supposed to be, and if we've only seen <?x we don't know yet if it was going to be <?xyz> (a PI), <?xml> (a special case bogus comment) or <?x$> (a bogus comment).

Yea but this is for the target only. If we want to support emitting comment on EOF after that we have to keep the temporary buffer or some such also for the data/attributes and maintain information about the whitespace between them. It essentially forks the tokenizer to be "PI or comment" until we either see >, EOF, or one of the forbidden characters during the target state. Perhaps that's OK but then let's spec it that way?

Member Author

If buffering until EOF or > was the only alternative due to compat issues we'd just do that, it's not hard to spec. But I do think dropping the PI token on EOF is likely to be web compatible, and then we don't need to buffer.

Collaborator

OK, either is fine with me.

Member

I'm okay with dropping if we can get away with it. It will make certain scenarios harder to debug, but that's probably okay.


<dt>EOF</dt>
<dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span>
<span>parse error</span>. Emit an end-of-file token.</dd>
Member

Same here, shouldn't we emit a bogus comment at least?


<dt>EOF</dt>
<dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span>
<span>parse error</span>. Emit an end-of-file token.</dd>
Member

I would expect something here as well.


<dt>EOF</dt>
<dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span>
<span>parse error</span>. Emit an end-of-file token.</dd>
Member

Ditto.

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request Feb 11, 2026
The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
  We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118

I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
ajperel pushed a commit to chromium/chromium that referenced this pull request Feb 11, 2026
The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
  We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118

I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}
@foolip
Member Author

foolip commented Feb 13, 2026

We need to decide what to do when we encounter EOF between <? and >. That will determine whether and how much we need to use the temporary buffer, and in my opinion these are the most reasonable options:

Drop PIs on EOF: This is what we do with tags, suggested by @zcorpan in #12118 (comment). This combined with making the target case-preserving alphanumeric+hyphens means we will know at the first non-alphanumeric-or-hyphen if this will be a PI or comment. The characters up to that point will become either the PI target or the start of a bogus comment.

Emit a comment on EOF: This preserves existing behavior in every case that doesn't produce a PI. This will require the temporary buffer for at least the whitespace between target and data, but would be easiest to explain as buffering everything up to >. If we do this, we could easily lowercase the target as well, but they'd still need to be alphanumeric+hyphens so that <?lit$$123456789$> creates a comment, not a PI.

I'll be OOO next week, but it would be great to make a decision. My preference is to drop PIs on EOF, and the others with opinions are probably @annevk, @hsivonen, and @zcorpan.
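
The two options can be compared side by side in a rough sketch. This is not spec text: `tokenizeAfterOpen` and the `eofPolicy` values are hypothetical, the function works on a whole string instead of tokenizer states, and it assumes that the xml/xml-stylesheet blocklist stays in place and that a target already rejected before EOF still yields a bogus comment under either option.

```javascript
const ASCII_WS = /[\t\n\f\r ]/;

// Hypothetical sketch: `rest` is the input after "<?", up to the end of the file.
// eofPolicy is "drop" or "comment". Returns a token, or null if nothing is emitted.
function tokenizeAfterOpen(rest, eofPolicy) {
  const close = rest.indexOf(">");
  const hitEof = close === -1;
  const raw = hitEof ? rest : rest.slice(0, close);
  const comment = { type: "comment", data: "?" + raw };

  // No ASCII alpha first (including "<?>" and "<?" + EOF): bogus comment as before.
  if (!/^[A-Za-z]/.test(raw)) return comment;

  let i = 0;
  while (i < raw.length && /[A-Za-z0-9-]/.test(raw[i])) i++;
  const target = raw.slice(0, i); // case-preserving
  const next = raw.charAt(i);     // "" if the target ran up to ">" or EOF

  // The first non-target character decides between PI and bogus comment.
  const decided = !(hitEof && next === "");
  if (decided) {
    const endsTarget = next === "" || next === "?" || ASCII_WS.test(next);
    const blocked = /^(xml|xml-stylesheet)$/i.test(target);
    if (!endsTarget || blocked) return comment;
  }

  if (hitEof) {
    // "drop" discards the unfinished PI; "comment" needs the raw text buffered.
    return eofPolicy === "comment" ? comment : null;
  }

  // A "?" right before ">" is ignored; leading whitespace is not part of the data.
  const body = raw.slice(i).replace(/\?$/, "");
  return { type: "pi", target, data: body.replace(/^[\t\n\f\r ]+/, "") };
}

tokenizeAfterOpen("target data", "drop");      // null
tokenizeAfterOpen("target data", "comment");   // { type: "comment", data: "?target data" }
tokenizeAfterOpen("lit$$123456789$>", "drop"); // { type: "comment", data: "?lit$$123456789$" }
```

The only place the two policies differ in this sketch is the `hitEof` branch, which is also the only place the "comment" option needs `raw` to have been kept around.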

@zcorpan
Member

zcorpan commented Feb 13, 2026

For start tags, it was changed in be72d87

Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html

Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.

I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

@noamr
Collaborator

noamr commented Feb 13, 2026

For start tags, it was changed in be72d87

Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html

Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.

I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

Yea creating a comment here makes the arguments from the original email not too relevant because the PI itself is never created. Creating a comment here is slightly more compatible with current behavior and makes debugging slightly easier. The buffering complexity is manageable IMO.

(Though it's also not a huge deal to drop it for simplicity)

6 participants