[WIP] [Tinkering] A lossless parser #199
Draft
Currently there are two ways to parse text in jomini:
- `text::TokenReader`: an implementation focused on performant, streaming deserialization of save files. Very little interpretation of the data is performed at this level. Analogous to a lexer.
- `TextTape`: requires all the input up front, and while it parses into a linear tape, it attempts to assign semantic meaning: object/array detection and their bounds, skipping empty objects, etc. These semantics form the base of the text mid-level API (`ObjectReader`), where one can traverse and recurse into objects and arrays. This mid-level API is even responsible for converting the text into JSON.

`TextTape` came first, and much time was spent making it performant for save files. But now that all the save file parsers (or at least the performance-sensitive ones) are based solely on lexing the input rather than parsing it, `TextTape` occupies an odd spot. `TextTape` is still lossy, with an API that makes mutation either cumbersome or impossible.

What if a new parser was created with the intent of lossless parsing, one that offered a more ergonomic API for transformation and was resilient in the face of errors? What would it look like? And importantly, can it still be made fast enough to supplant all of `TextTape`'s use cases?

Starting with the lexer, we need to capture trivia such as whitespace and comments, such that one can print out the stream of tokens and receive the exact input back. I imagine it looking like something from logos.
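To make the round-trip property concrete, here is a minimal hand-rolled sketch of such a lexer (the real thing would likely use logos derive macros; the token names and the tolerant handling of unterminated strings are my guesses, not the PR's design):

```rust
// Hypothetical token set for a lossless lexer. Every byte of input belongs
// to exactly one token, so concatenating the token slices reproduces the
// input verbatim -- including trivia.
#[derive(Debug, PartialEq)]
enum Token {
    Whitespace, // runs of spaces, tabs, newlines (trivia)
    Comment,    // `#` to end of line (trivia)
    Open,       // `{`
    Close,      // `}`
    Operator,   // `=`, `<`, `>=`, `!=`, ...
    Quoted,     // `"..."`, quotes kept in the slice
    Unquoted,   // bare identifiers, numbers, dates
}

fn lex(input: &str) -> Vec<(Token, &str)> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let tok = match bytes[i] {
            b' ' | b'\t' | b'\r' | b'\n' => {
                while i < bytes.len() && bytes[i].is_ascii_whitespace() { i += 1; }
                Token::Whitespace
            }
            b'#' => {
                while i < bytes.len() && bytes[i] != b'\n' { i += 1; }
                Token::Comment
            }
            b'{' => { i += 1; Token::Open }
            b'}' => { i += 1; Token::Close }
            b'=' | b'<' | b'>' | b'!' | b'?' => {
                i += 1;
                if i < bytes.len() && bytes[i] == b'=' { i += 1; }
                Token::Operator
            }
            b'"' => {
                i += 1;
                while i < bytes.len() && bytes[i] != b'"' { i += 1; }
                i = (i + 1).min(bytes.len()); // tolerate an unterminated string
                Token::Quoted
            }
            _ => {
                while i < bytes.len()
                    && !bytes[i].is_ascii_whitespace()
                    && !matches!(bytes[i], b'{' | b'}' | b'=' | b'<' | b'>' | b'#' | b'"')
                { i += 1; }
                Token::Unquoted
            }
        };
        tokens.push((tok, &input[start..i]));
    }
    tokens
}

fn main() {
    let input = "core = { a=1 } # comment\n";
    // Lossless: the token slices concatenate back to the exact input.
    let round_trip: String = lex(input).iter().map(|(_, s)| *s).collect();
    assert_eq!(round_trip, input);
}
```

The key invariant is that trivia is a first-class token rather than something skipped between tokens.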
Then, from the stream of tokens, we construct a concrete syntax tree with rowan.
I'm imagining the non-terminal nodes to be:

- `{ ... }`
- `[ ... ]`

What other nodes am I missing?
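One plausible (purely speculative) `SyntaxKind` set, sketched below: terminals mirror the lexer tokens, and beyond the two non-terminals above a tree likely also wants a root node, a per-entry node, and an error node. The names are my guesses, not the PR's:

```rust
// rowan identifies nodes and tokens by a raw u16 kind, so a fieldless enum
// with a u16 conversion is the usual pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u16)]
enum SyntaxKind {
    // terminals (trivia included -- the tree keeps it)
    Whitespace,
    Comment,
    OpenBrace,
    CloseBrace,
    Operator,
    Quoted,
    Unquoted,
    // non-terminals
    Root,     // the whole file
    Object,   // `{ key = value ... }`
    Array,    // `{ value value ... }`
    Property, // a single `key <op> value` entry
    Error,    // an unparseable span, kept verbatim for resilience
}

impl From<SyntaxKind> for u16 {
    fn from(kind: SyntaxKind) -> u16 {
        kind as u16
    }
}

fn main() {
    assert_eq!(u16::from(SyntaxKind::Whitespace), 0);
    assert_eq!(u16::from(SyntaxKind::Error), 11);
}
```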
From here I'm murky on implementations, but I want to satisfy some notion of backwards compatibility with more features:
Use cases like JSON output and serde deserialization (where the current test suite should be maintained) do not care about trivia like whitespace and comments. Should there be an API that exposes a trivia-less structure? (It can still be rooted in trivia, just exposing a filter on top.)
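The "filter on top" idea could be as simple as a predicate over kinds, sketched here with stand-in types (names hypothetical):

```rust
// The underlying stream stays lossless; semantic consumers (JSON, serde)
// iterate a filtered view that hides trivia without discarding it.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Whitespace, Comment, Scalar, Operator, Open, Close }

fn is_trivia(kind: Kind) -> bool {
    matches!(kind, Kind::Whitespace | Kind::Comment)
}

fn main() {
    let stream = [
        Kind::Scalar, Kind::Whitespace, Kind::Operator,
        Kind::Comment, Kind::Scalar,
    ];
    // Same underlying data, trivia invisible to this consumer.
    let semantic: Vec<Kind> = stream.iter().copied().filter(|k| !is_trivia(*k)).collect();
    assert_eq!(semantic, vec![Kind::Scalar, Kind::Operator, Kind::Scalar]);
}
```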
Conversion to a DOM. Something equivalent to `serde_json::Value`, but more focused on lossless data (e.g., it is important to distinguish quoted from unquoted values).
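A hypothetical shape for such a DOM, to illustrate the quoted/unquoted distinction (the variant names and the operator-carrying object entries are my speculation):

```rust
// In the spirit of serde_json::Value, but preserving distinctions JSON erases.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Quoted(String),   // `"1444.11.11"`: the quotes are significant
    Unquoted(String), // `1444.11.11`: raw text, not eagerly typed
    Array(Vec<Value>),
    // (key, operator, value) triples; duplicate keys are legal and kept in order
    Object(Vec<(String, String, Value)>),
}

fn main() {
    // The same bytes, quoted vs unquoted, must not compare equal.
    assert_ne!(
        Value::Quoted("1444.11.11".into()),
        Value::Unquoted("1444.11.11".into())
    );
    let obj = Value::Object(vec![
        ("name".into(), "=".into(), Value::Quoted("Rome".into())),
        ("id".into(), "=".into(), Value::Unquoted("42".into())),
    ]);
    if let Value::Object(entries) = &obj {
        assert_eq!(entries.len(), 2);
    }
}
```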
Transformations. Two example transformation use cases should be easy to code up, and they should be lossless with respect to trivia. One is replacing the exists operator with the equals operator. The other is interpolation, which takes all the variable references found in assignments and expressions, resolves them to their values, and computes the expression.
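For concreteness, a hypothetical before/after (the syntax and variable names here are illustrative only, not taken from the original snippet):

```
# before: a variable definition and an expression referencing it
@half = 0.5
cost = @[half * 100]
```

```
# after interpolation: the reference is resolved and the expression computed
@half = 0.5
cost = 50
```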
These transformations need to return the same interface that allows one to serialize as JSON, deserialize, convert into a DOM, or apply more transformations. How should these transformations be structured, not just for these two examples, but for future use cases like a formatter that may want to preserve comment trivia but discard other trivia?
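One possible shape, purely speculative: transformations consume and produce the same lossless tree type, so they compose, and every downstream consumer hangs off that one type. Sketched here with a string stand-in for the real tree, and `?=` assumed as the exists operator's spelling:

```rust
// Stand-in for the real lossless tree; a real impl would rewrite nodes,
// not text, but the interface shape is the point here.
struct Cst(String);

trait Transform {
    fn apply(&self, cst: Cst) -> Cst;
}

// Replaces the (assumed) exists operator `?=` with `=`.
struct ReplaceExists;

impl Transform for ReplaceExists {
    fn apply(&self, cst: Cst) -> Cst {
        Cst(cst.0.replace("?=", "="))
    }
}

fn main() {
    let out = ReplaceExists.apply(Cst("a ?= b # keep me".into()));
    // Trivia (the comment) survives the transformation.
    assert_eq!(out.0, "a = b # keep me");
}
```

Because `apply` returns the same `Cst`, a formatter or interpolation pass would slot in as just another `Transform`.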
To be resilient to errors, it seems like a good idea to allow a partial tree, or one containing error nodes, so that one can still partially process it. There'll always be new syntax.
With all of these features, it seems reasonable that someone could build an LSP on top (some sort of incremental parsing seems desirable).
For reference: