Parse XML-style <?target data?> processing instructions#12118
Parse XML-style <?target data?> processing instructions#12118
Conversation
Notably behavior and their rationale: - While `<?` opens a PI, a `>` always closes it. This to match the bogus comment tokenizer behavior, so that exactly the same characters are consumed as bogus comments in existing parsers. Any deviation would make this a much riskier parser change. - The target must start with ASCII alpha, but after that almost anything goes, matching tag names. This doesn't have to be this way, but there's no obvious precedent to follow. - The target is ASCII-lowercased by the tokenizer, just like tag and attribute names. This is for consistency, and there's precedent from the SVG-in-HTML parser behavior. - A `?` that isn't followed by `>` is treated like whitespace, similar to `/` in opening tags. This is also the simplest, with a single tokenizer state used whenever `?` is encountered. - `<?>` and `<?` followed by EOF are treated as bogus comments, because this is the most conservative choice, and also avoids empty target.
f806ae3 to
a26fe99
Compare
|
I've polished some more, this is close to ready now. Needs impl interest and tests, of course. |
Nice! I can have tests ready within a few days. |
Nit: As specified in https://html.spec.whatwg.org/#serialising-html-fragments it would serialize as |
|
@zcorpan you're absolutely right, I was testing in Chrome and didn't realize the spec says otherwise. |
|
Test for serialization: web-platform-tests/wpt#57486 |
|
Ah, I did not know Chromium and WebKit serialize with This opens up for making the Pros to making it required:
Cons:
|
|
Also, maybe we can just support I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty. |
Yea we could probably also parse that PI for |
|
On serializing PIs with just @zcorpan, changing it to Regarding conformance, do you mean that a case like |
|
On the DOM side, does this still create Comment nodes? Otherwise this is certainly not web compatible. |
It creates ProcessingInstruction nodes which already exist in DOM. What specifically makes this web incompatible in your view? HTTP archive shows very little use of PI-like syntax and we can exclude problematic targets if we need to. |
|
In lit-html we use PI syntax to create comments that are parsed inside templates: https://github.com/lit/lit/blob/f243134b226735320b61466cebdaf0c1e574bfa7/packages/lit-html/src/lit-html.ts#L354 These comments are then retrieved with a TreeWalker that shows Comment nodes. The targets aren't fixed, but have the form |
We can easily exclude targets that start with ’lit$’ from this, or any other string that we find to be problematic in terms of compat. |
@foolip suggested that we constrain the syntax instead. ^[a-zA-Z_][\w_\.\-]*$I think treating invalid targets as bogus comments would work better than block-listing |
Or even pithier, |
That includes non-ASCII, right? I think the HTML parser intentionally doesn't branch for non-ASCII anywhere. |
Good point, it can be something like |
|
Testing XML parser behavior for the first character and second character of a PI target in Chrome/Firefox/Safari, it looks like:
It looks consistent across browsers for ASCII, but I see differences beyond that. So it matches https://www.w3.org/TR/xml/#NT-Name in the ASCII range, except that ":" isn't allowed because of namespaces. To fix the But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen. |
Starts with ASCII alpha and continues with ASCII alphanumeric or hyphen seems OK. |
| data-x="syntax-pi-data">data</span> in the next step, there must be one or more <span>ASCII | ||
| whitespace</span>.</li> | ||
|
|
||
| <li>Optionally, <span data-x="syntax-pi-data">data</span>.</li> |
There was a problem hiding this comment.
Maybe we could do
Optionally, one ASCII whitespace followed by data.
so we don't need the awkward conditional in the previous step.
There was a problem hiding this comment.
"one or more ASCII whitespace" since otherwise the data could start with whitespace.
There was a problem hiding this comment.
I'm not sure I follow what you're saying. Wouldn't the previous step take care of "more"? How would it impact data?
There was a problem hiding this comment.
I copied the awkward conditional from https://html.spec.whatwg.org/#start-tags step 3.
The problem with combining these steps is that <?target > with unnecessary space should be conforming, so whitespace is always allowed, but required if followed by data. With start tags, this isn't an issue because there's more whitespace allowed after attributes, but there is no after PI data, trailing whitespace is considered part of the PI data.
| "<code data-x="">?></code>". This is for compatibility with previously specified behavior of | ||
| the "<code data-x=""><?</code>" syntax, which was treated as a <span | ||
| data-x="bogus comment state">bogus comment</span>, ending with | ||
| "<code data-x="">></code>".</p></li> |
There was a problem hiding this comment.
That we make it conforming is not for compatibility.
There was a problem hiding this comment.
I see what you're saying that this is the syntax section and not the parser, but if we make the ? optional, it really is because of the previous parser behavior and content that depends on it, right?
Do you think this would be better left without explanation, or perhaps only with reference to existing content?
There was a problem hiding this comment.
The syntax section describes what is conforming. It doesn't have to concern itself with what ends up being parsed. We could make ? mandatory to write and validators would complain if it was missing, yet browsers would still end up with PIs in the node tree.
We could even not end up with a PI if it was missing if we wanted to.
I think we should make it optional however as it's convenient. That this is possible is because comments already parsed this way, but it's not a compatibility decision.
| </div> | ||
|
|
||
| <div algorithm> | ||
| <p>To <dfn>replace processing instruction with comment</dfn>:</p> |
There was a problem hiding this comment.
I think you either have to pluralize or use "a PI" and "a comment".
There was a problem hiding this comment.
It's the current processing instruction token, so I can make it "replace current processing instruction with comment" or "replace the current processing instruction with a comment", do you have a preference?
Note the above "flush code points consumed as a character reference" which is a little similar in purpose. I couldn't come up with something very similar though, in this case we're not "flushing" because the characters aren't emitted as character tokens.
There was a problem hiding this comment.
I guess it's okay to leave as-is. I don't like how much global state we have here that isn't really explained or linked, but that's very much a pre-existing issue.
|
|
||
| <dt>EOF</dt> | ||
| <dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span> | ||
| <span>parse error</span>. Emit an end-of-file token.</dd> |
There was a problem hiding this comment.
It seems weird to drop the PI completely on the floor. Seems cleaner to emit a bogus comment?
There was a problem hiding this comment.
I disagree. Start tags are dropped on the floor. To emit a comment we'd have to construct one from the PI or build both a PI and a comment in case of EOF. Why emit a comment?
There was a problem hiding this comment.
That's what we did before. A PI is more akin to a comment than a start tag.
There was a problem hiding this comment.
How did we do that before? We never went back all the way to the start of the PI to reconstruct it as a comment when it was this far along (only when it was at its first 1 or 2 characters). We'd have to keep all the raw data like whitespaces to make this accurate.
There was a problem hiding this comment.
The proposed model is to only emit a token when we're sure what it was supposed to be, and if we've only seen <?x we don't know yet if it was going to be <?xyz> (a PI), <?xml> (a special case bogus comment) or <?x$> (a bogus comment).
If we want to preserve the behavior of EOF after <? the parser had previously, we will have to use the temporary buffer with unlimited buffering, since we have to reach the > before we know if it should be a PI or a comment. (We couldn't construct it from the current processing instruction token, for that we'd need to also buffer the whitespace between target and data.)
I think @hsivonen will have opinions on this point.
There was a problem hiding this comment.
See 2cfc613 for what it looked like before I changed it to just emit EOF.
There was a problem hiding this comment.
The proposed model is to only emit a token when we're sure what it was supposed to be, and if we've only seen
<?xwe don't know yet if it was going to be<?xyz>(a PI),<?xml>(a special case bogus comment) or<?x$>(a bogus comment).
Yea but this is for the target only. If we want to support emitting comment on EOF after that we have to keep the temporary buffer or some such also for the data/attributes and maintain information about the whitespace between them. It essentially forks the tokenizer to be "PI or comment" until we either see >, EOF, or one of the forbidden characters during the target state. Perhaps that's OK but then let's spec it that way?
There was a problem hiding this comment.
If buffering until EOF or > was the only alternative due to compat issues we'd just do that, it's not hard to spec. But I do think it's likely that dropping the PI token on EOF is likely to be web compatible, and then we don't need to buffer.
There was a problem hiding this comment.
OK, either is fine with me.
There was a problem hiding this comment.
I'm okay with dropping if we can get away with it. It will make certain scenarios harder to debug, but that's probably okay.
|
|
||
| <dt>EOF</dt> | ||
| <dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span> | ||
| <span>parse error</span>. Emit an end-of-file token.</dd> |
There was a problem hiding this comment.
Same here, shouldn't we emit a bogus comment at least?
|
|
||
| <dt>EOF</dt> | ||
| <dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span> | ||
| <span>parse error</span>. Emit an end-of-file token.</dd> |
There was a problem hiding this comment.
I would expect something here as well.
|
|
||
| <dt>EOF</dt> | ||
| <dd>This is an <span data-x="parse-error-eof-in-processing-instruction">eof-in-processing-instruction</span> | ||
| <span>parse error</span>. Emit an end-of-file token.</dd> |
The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment. As per spec PR: - xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues. - A PI can appear wherever a comment appears. - ?> at the end ignores the ? Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/. Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development. See spec PR: whatwg/html#12118 I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com Bug: 481087638 Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment. As per spec PR: - xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues. - A PI can appear wherever a comment appears. - ?> at the end ignores the ? Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/. Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development. See spec PR: whatwg/html#12118 I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com Bug: 481087638 Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085 Commit-Queue: Noam Rosenthal <nrosenthal@google.com> Reviewed-by: Philip Jägenstedt <foolip@chromium.org> Reviewed-by: Dominic Farolino <dom@chromium.org> Cr-Commit-Position: refs/heads/main@{#1583351}
The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment. As per spec PR: - xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues. - A PI can appear wherever a comment appears. - ?> at the end ignores the ? Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/. Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development. See spec PR: whatwg/html#12118 I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com Bug: 481087638 Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085 Commit-Queue: Noam Rosenthal <nrosenthal@google.com> Reviewed-by: Philip Jägenstedt <foolip@chromium.org> Reviewed-by: Dominic Farolino <dom@chromium.org> Cr-Commit-Position: refs/heads/main@{#1583351}
The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment. As per spec PR: - xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues. - A PI can appear wherever a comment appears. - ?> at the end ignores the ? Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/. Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development. See spec PR: whatwg/html#12118 I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com Bug: 481087638 Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085 Commit-Queue: Noam Rosenthal <nrosenthal@google.com> Reviewed-by: Philip Jägenstedt <foolip@chromium.org> Reviewed-by: Dominic Farolino <dom@chromium.org> Cr-Commit-Position: refs/heads/main@{#1583351}
|
We need to decide what to do when we encounter EOF between Drop PIs on EOF: This is what we do with tags, suggested by @zcorpan in #12118 (comment). This combined with making the target case-preserving alphanumeric+hyphens means we will know at the first non-alphanumeric-or-hyphen if this will be a PI or comment. The characters up to that point will become either the PI target or the start of a bogus comment. Emit a comment on EOF: This preserves existing behavior in every case that doesn't produce a PI. This will require the temporary buffer for at least the whitespace between target and data, but would be easiest to explain as buffering everything up to I'll be OOO next week, but it would be great to make a decision. My preference is to drop PIs on EOF, and the others with opinions are probably @annevk, @hsivonen, and @zcorpan. |
|
For start tags, it was changed in be72d87 Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply. I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat. |
Yea creating a comment here makes the arguments from the original email not too relevant because the PI itself is never created. Creating a comment here is slightly more compatible with current behavior and makes debugging slightly easier. The buffering complexity is manageable IMO. (Though it's also not a huge deal to drop it for simplicity) |
Notably behavior and their rationale:
<?opens a PI, a>always closes it. This to match the boguscomment tokenizer behavior, so that exactly the same characters are
consumed as bogus comments in existing parsers. Any deviation would
make this a much riskier parser change.
goes, matching tag names. This doesn't have to be this way, but
there's no obvious other precedent to follow.
attribute names. This is for consistency, and there's precedent from
the SVG-in-HTML parser behavior.
it's instead treated as a bogus comment.
<?xml?>is not a valid PIin XML, and
<?xml-stylesheet href="style.css"?>would otherwisestart loading stylesheets in HTML documents, which we don't want.
?that's not followed by a>becomes part of the data (neverthe target). This is to match XML for a
<?t ???>where data is "??".<?t?d?>(invalid XML) the target is "t" and data is "?d". Thisbehavior results from handling
?the same way wherever it'sencountered. The example would serialize back as
<?t ?d?>.<?>and<?followed by EOF are treated as bogus comments, becausethis is the most conservative choice, and also avoids empty target.
(See WHATWG Working Mode: Changes for more details.)
/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/syntax.html ( diff )