LL(1) Grammar
This package uses a modified version of the KDL grammar as defined in the KDL spec. This modified version is an LL(1) grammar.
The KDL string is pre-processed before it is passed through the parser. This pre-processing takes care of stripping any leading BOM from the text and validates that no illegal characters are present in the text. This simplifies the grammar quite a bit, and makes it possible to turn certain non-terminal rules in the KDL spec's grammar into terminals.
The definitions of terminals are not included on this page, but they all map onto regular expressions.
Some of the non-terminals in the KDL spec's grammar are implemented as terminals in this grammar.
For example, we've done so in the numbers where JavaScript's parseInt
and parseFloat
can do a lot of heavy lifting if we just take the whole number string and remove any underscores and leading 0x
/0b
/0o
.
The diagrams on this page are rendered using railroad-diagrams, a lovely library by Tab Atkins Jr, who happens to be involved in KDL too!
document
In the KDL spec all line-space
are used as line-space*
so to simplify this grammar, the line-space
itself takes care of the "zero or more" part.
The document
non-terminal is used in node-children
below, because distinguishing between node
and final-node
as defined in the KDL spec is impossible without unbounded look-ahead.
Instead, the document
non-terminal is modified to support ending on a base-node
with or without node-terminator
.
There's one downside to this rewritten non-terminal: It works in the LL(1) parser but I am unable to write it down in BNF or any derivative. If someone else has any idea, feel free to make the necessary changes!
plain-node-space
The KDL spec's plain-node-space
is always used with either the +
or *
modifier.
Instead of doing the same, the plain-node-space
in this grammar is it's own +
, so it's either used plain for +
or marked optional for *
.
line-space
Compared to the line-space
defined in the KDL spec, this version includes its own "zero or more" operator.
node-space
The optional-node-space
and required-node-space
non-terminals defined in the KDL spec are combined into a single non-terminal.
This node-space
non-terminal does one extra thing that isn't shown in the diagram: it remembers whether the last subrule it applied was a plain-node-space
.
base-node
The node-prop-or-arg
and node-children
paths are only allowed if the last consumed node-space
ended with a plain-node-space
.
Note node-prop-or-arg
always ends on a node-space
.
node
node-prop-or-arg
The node-prop-or-arg
non-terminal is very different from its sibling in the KDL spec in order to remove any need for look-ahead:
- If the first token is a tag (called type in the spec), then it must be an argument
- If the first token is a number or a keyword, then it must be an argument
- If the first token is a string, then we need to check if there's an equals sign.
Looking for the equals sign required unbounded lookahead thanks to the allowed node-space
between the property name and the equals sign.
By changing this non-terminal so it also consumes any node-space
that comes after the property or argument, we can remove the need for the look-ahead.
node-children
node-terminator
The document
non-terminal supports ending on a node without node-terminator
, so this non-terminal doesn't need to include EOF.
tag
escline
multiline-comment
single-line-comment
value
keyword
The keyword
non-terminal "allows" invalid keywords, in order to throw an error when such a token is encountered.
This gives us the opportunity to throw a useful error along the lines of "Did you forget to mark a keyword with #?" rather than "Unexpected token InvalidKeyword, expected one of <list of 20 allowed tokens>"
number
string
The parser itself is greatly simplified and very lenient when it comes to strings, with post-processing added to filter out invalid strings. All multiline quoted and raw strings are post-processed to remove leading whitespace. All quoted strings are post-processed to replace any escapes.