Improve error handling in parsing #11

xeniagda · 2023-04-19T20:02:38Z

This pull request does two things: it introduces an error enum for the failures that can occur during parsing, and it tracks the source location of every token so that errors can report where they occurred. A few tests have been added to check that the error tracking works.
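As a rough illustration of the idea, an error type that carries a location could look like the sketch below. The names ParseErrorKind, ParseError and SourceLocation and the three variants are hypothetical and chosen for this example only; the enum actually added in this PR may have different variants and fields.

```rust
use std::fmt;

/// Hypothetical location type: a char index into the original source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub char_index: usize,
}

/// Hypothetical failure kinds; the variants in the PR may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ParseErrorKind {
    EmptyTagName,
    UnclosedTag(String),
    UnexpectedClosingTag(String),
}

/// An error kind paired with the place in the source where it was detected.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ParseError {
    pub kind: ParseErrorKind,
    pub location: SourceLocation,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let at = self.location.char_index;
        match &self.kind {
            ParseErrorKind::EmptyTagName => write!(f, "empty tag name at char {at}"),
            ParseErrorKind::UnclosedTag(name) => {
                write!(f, "unclosed tag <{name}> at char {at}")
            }
            ParseErrorKind::UnexpectedClosingTag(name) => {
                write!(f, "unexpected closing tag </{name}> at char {at}")
            }
        }
    }
}

impl std::error::Error for ParseError {}
```

Implementing std::error::Error lets callers propagate the error with `?` into a `Box<dyn Error>` while still being able to match on the kind.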

I decided to track locations as char indices rather than byte offsets in the source. This is mainly because it makes the tracking easier to write: we can simply call .enumerate() on the HTML in the html_to_stack function. I have tried to ensure that location gathering will never be O(n²), which could happen if we had to "count backwards" to find out how long, in chars, something we have kept in memory is.
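The single-pass approach can be sketched as follows. This is not the html_to_stack function from the PR; tokenize_with_locations and the simplified Token enum are hypothetical stand-ins that only show how the start index of each token can be remembered as it begins, so nothing has to be counted backwards afterwards.

```rust
/// Simplified token type for this sketch; the crate's real Token differs.
#[derive(Debug, PartialEq, Eq)]
pub enum Token {
    Tag(String),
    Text(String),
}

/// Splits `html` into raw tag and text tokens, each paired with the char
/// index at which it starts. The input is walked exactly once.
pub fn tokenize_with_locations(html: &str) -> Vec<(usize, Token)> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut start = 0; // char index where `current` began
    let mut in_tag = false;

    for (i, c) in html.chars().enumerate() {
        match c {
            '<' if !in_tag => {
                // Flush any pending text before the tag starts.
                if !current.is_empty() {
                    tokens.push((start, Token::Text(std::mem::take(&mut current))));
                }
                start = i;
                in_tag = true;
                current.push(c);
            }
            '>' if in_tag => {
                current.push(c);
                tokens.push((start, Token::Tag(std::mem::take(&mut current))));
                in_tag = false;
                start = i + 1; // the next token begins right after '>'
            }
            _ => current.push(c),
        }
    }

    // Flush whatever is left; an unterminated tag is kept as a Tag here,
    // where a real parser would report an error instead.
    if !current.is_empty() {
        let token = if in_tag { Token::Tag(current) } else { Token::Text(current) };
        tokens.push((start, token));
    }
    tokens
}
```

With the input "é<b>x</b>" the first tag is reported at char index 1, whereas its byte offset would be 2 because "é" occupies two bytes in UTF-8.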


xeniagda · 2023-04-19T20:03:52Z

1e2aa7d also fixes a slight bug in the Token::from(tag) function: a tag consisting only of spaces was not caught by the empty-name check, because the trailing > or /> of the tag was counted as part of the tag name.
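The fix can be illustrated with a small stand-alone function. tag_name below is a hypothetical helper written for this example, not the crate's Token::from, and it only handles opening and self-closing tags. The point is the order of operations: the closing delimiter is stripped before the name is looked up, so "<   >" yields no name rather than the name ">".

```rust
/// Extracts the tag name from a raw opening tag such as `<div class="a">`
/// or `<br/>`. Returns None when the input does not start with '<' or when
/// the name is empty, for example `<   >`.
pub fn tag_name(raw: &str) -> Option<&str> {
    let inner = raw.strip_prefix('<')?;
    // Strip the closing delimiter first so it cannot be mistaken for
    // (part of) the tag name.
    let inner = inner
        .strip_suffix("/>")
        .or_else(|| inner.strip_suffix('>'))
        .unwrap_or(inner);
    // The name is the first whitespace-separated word, if there is one.
    inner.split_whitespace().next()
}
```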

xeniagda · 2023-04-19T20:05:14Z

We could trivially forward the location information to the DOM itself. I think this could be useful for any application which manipulates HTML provided by the user. Whenever the user provides a document which parses but which the application considers invalid, the application would want to tell the user which node caused the problem. Being able to point back into the original file would be quite useful here. However, for applications where this isn't the case (such as programs generating or modifying HTML), this would be added complexity which might be inconvenient for the user.

To store the location for each node type, we would probably have to change the Node type to a struct containing an Option<SourceLocation> as well as an instance of the old enum. This extra layer of types could be slightly inconvenient to manipulate, especially if the SourceLocation isn't needed.
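A minimal sketch of this first approach, assuming a much simplified node model: NodeKind plays the role of the old enum, and the constructors parsed and generated are hypothetical helpers, not part of the crate.

```rust
/// Hypothetical location type: a char index into the original source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub char_index: usize,
}

/// Stand-in for the old Node enum (simplified for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum NodeKind {
    Text(String),
    Comment(String),
    Element { name: String, children: Vec<Node> },
}

/// The new Node: every kind of node can carry a location.
#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub location: Option<SourceLocation>,
    pub kind: NodeKind,
}

impl Node {
    /// A node produced by the parser, with a known position.
    pub fn parsed(kind: NodeKind, char_index: usize) -> Self {
        Node { location: Some(SourceLocation { char_index }), kind }
    }

    /// A node built by the application, with no source position.
    pub fn generated(kind: NodeKind) -> Self {
        Node { location: None, kind }
    }
}
```

The cost mentioned above shows up in pattern matching: code that used to match on the node directly now has to match on node.kind.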

Another, more convenient way would be to add the Option<SourceLocation> to the Element struct. This already has a few fields and could easily be ignored for applications which do not need the information. However, this would mean things like text nodes, comments and doctypes would lose the information.
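Sketched under the same simplifying assumptions, the second approach only touches the element type. The field name source_location and the shape of Element here are illustrative; the crate's Element struct has its own set of fields.

```rust
/// Hypothetical location type: a char index into the original source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub char_index: usize,
}

/// Simplified element with one extra, optional field.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Element {
    pub name: String,
    pub attributes: Vec<(String, String)>,
    pub children: Vec<Node>,
    pub source_location: Option<SourceLocation>,
}

/// The Node enum keeps its shape; only elements know where they came from.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Comment(String),
    Element(Element),
}
```

Code that builds elements by hand can write `Element { name: "p".to_string(), ..Default::default() }` and never mention the location.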

A third, slightly more radical way would be to change all instances of String in the Node type and its descendants to a struct containing the String as well as the Option<SourceLocation>. This type could maybe Deref to a String. This would mean every single aspect of a Node, such as attribute keys and values, would be tracked.
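The wrapper type for this third approach could look roughly like this. LocatedString is a hypothetical name; the sketch only shows that a Deref implementation lets existing String and str methods keep working on the wrapped value.

```rust
use std::ops::Deref;

/// Hypothetical location type: a char index into the original source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceLocation {
    pub char_index: usize,
}

/// A String together with the place it was read from, if known.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LocatedString {
    pub value: String,
    pub location: Option<SourceLocation>,
}

impl Deref for LocatedString {
    type Target = String;

    fn deref(&self) -> &String {
        &self.value
    }
}

/// Strings created by the application carry no location.
impl From<&str> for LocatedString {
    fn from(value: &str) -> Self {
        LocatedString { value: value.to_string(), location: None }
    }
}
```

Because of auto-deref, calls such as key.len() or key.starts_with("data-") compile unchanged when key becomes a LocatedString, although code that constructs or moves out Strings would still need updating.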

I will not make any choice here, but I might implement one of these approaches in my fork to use in an application I'm developing.

loovjo added 3 commits April 19, 2023 21:48

Introduce (Inner)HTMLParseError, track locations in source, make parsing (probably) infallilable

d3d3050

Don't consider the > and /> part of tag name

1e2aa7d

Test parse errors

9326e1d
