Ignore whitespaces in the grammar #301
Comments
|
Note: related issues were discussed earlier in #56. Essentially, handling whitespace is very easy in parsers based on context-free (CF) grammars, such as plain ANTLR and Xtext (which uses ANTLR 3 internally). However, it is not trivial in PEG-based tools such as Parboiled. The Handling Whitespace wiki page of the parboiled project says this:
|
|
Thank you @szarnyasg for the reference and for raising the point of PEGs. However, PEGs usually do greedy matching, so keeping whitespaces in the grammar and dropping (hiding) them in the ANTLR4 artifact:
<repeat min="1">
<non-terminal ref="ReadingClause"/> &WS;
</repeat> |
|
I think this makes a lot of sense, and will certainly simplify the grammar as a whole. I don't think I can foresee what exact consequences (if any) this will have for the language, but I can't come up with any real problem, if the rule was formulated that whitespace is allowed between tokens in any position of the query. Some examples that I've considered that don't seem to be a problem:
(EDIT: GitHub unhelpfully renders my multi-whitespace separations as single-whitespace in the above. You'll have to imagine wider separation, although it doesn't alter the point really.) As correctly identified by @jmarton in the above, the specification for whitespace as currently done is to model this very rule, but it turns out it's pretty tricky to get it right. @thobe Do you have any input for this discussion? |
|
Looks like I'm seeing this quite late. I think what we should do is add a classification of non-terminals that says whether each one is expected to be handled on the lexer level or on the parser level. A processing tool can then use that information to inject whitespace in the appropriate places for output formats that need to be explicit about it. |
|
I agree, and we already have started such a scheme: openCypher/grammar/basic-grammar.xml Lines 546 to 559 in 64ec11f |
|
Yes. I'm not sure I'm a fan of the particular naming, but it does the trick. "All" we need to do is use that for some sort of "whitespace injection", and we're set. |
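
A minimal sketch of the "whitespace injection" idea from the comments above, in Python. Everything in it is a hypothetical illustration and not part of the openCypher tooling: the tuple-based rule representation, the `WS?` marker, the `LEXER_LEVEL` set and both helper functions are assumptions. It supposes the grammar is written without whitespace and that a tool inserts an optional-whitespace marker between adjacent items of parser-level sequences only.

```python
# Hypothetical rule representation: a node is either a string (a literal or
# the name of a non-terminal) or a tuple (kind, children).
WS = "WS?"  # assumed marker for optional whitespace in the output format

# Assumed classification: rules handled on the lexer level, inside which no
# whitespace may be injected.
LEXER_LEVEL = {"IntegerLiteral", "SymbolicName"}


def inject_whitespace(node, in_lexer_rule=False):
    """Return a copy of node with WS placed between items of each sequence."""
    if isinstance(node, str):
        return node
    kind, children = node
    new_children = [inject_whitespace(c, in_lexer_rule) for c in children]
    if kind == "seq" and not in_lexer_rule:
        spaced = []
        for child in new_children:
            if spaced:
                spaced.append(WS)  # only between items, never at the ends
            spaced.append(child)
        new_children = spaced
    return (kind, new_children)


def process_grammar(rules):
    """Apply whitespace injection to every rule, honouring LEXER_LEVEL."""
    return {
        name: inject_whitespace(body, name in LEXER_LEVEL)
        for name, body in rules.items()
    }


rules = {
    "Return": ("seq", ["RETURN", "Expression"]),
    "IntegerLiteral": ("seq", ["Digit", ("repeat", ["Digit"])]),
}
result = process_grammar(rules)
```

With these assumptions, the parser-level `Return` rule gains a whitespace marker between its two items, while the lexer-level `IntegerLiteral` rule is left untouched. A real tool would also have to decide how repetitions and optional parts are spaced, which this sketch does not attempt.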

Currently, the openCypher grammar models whitespace as in the snippet below, which describes that ReadPart is a sequence of zero or more ReadingClause separated by whitespaces.
openCypher/grammar/cypher.xml
Lines 151 to 155 in 8fa5658
To be more precise, the excerpt above also states that a ReadPart ends with a whitespace unless it is empty.
However, the next excerpt from ReadUpdateEnd is formalized more strictly to describe a sequence of one or more ReadingClauses that have whitespace separators strictly in between them, but not at the end.
openCypher/grammar/cypher.xml
Lines 111 to 114 in 8fa5658
The reason behind the difference between the two excerpts is to make the grammar unambiguous also for whitespace matches.
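
To illustrate the difference between the two shapes described above, here is a small Python sketch using regular expressions as a stand-in for the grammar rules. It is not the grammar itself: the single keyword `MATCH` playing the role of ReadingClause, the `WS` pattern, and the names `TRAILING`, `SEPARATED` and `matches` are all assumptions made for the example.

```python
import re

# Stand-ins (assumptions): one keyword plays the role of ReadingClause and a
# run of blanks plays the role of the whitespace rule.
CLAUSE = r"MATCH"
WS = r"[ \t\r\n]+"

# Shape 1: zero or more clauses, each one followed by whitespace.
TRAILING = re.compile(rf"(?:{CLAUSE}{WS})*")

# Shape 2: one or more clauses with whitespace strictly in between.
SEPARATED = re.compile(rf"{CLAUSE}(?:{WS}{CLAUSE})*")


def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None
```

Under these assumptions, `matches(TRAILING, "MATCH MATCH ")` holds but `matches(TRAILING, "MATCH MATCH")` does not, and the reverse is true for `SEPARATED`. Each shape pins down exactly where whitespace may appear, which is what keeps the whitespace matches unambiguous.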
If we could switch to whitespace-independent grammar, then the excerpt above could be rewritten in a more straightforward way like:
I think it would be beneficial to switch to whitespace-independent formalization of the grammar. Also the patches in #300 could be rewritten in a more intuitive way then.
ReadingClause sequence of ReadUpdateEnd in its natural form despite the grammar says it is one followed by zero or more.
What do you think?