Ignore whitespaces in the grammar #301

jmarton · 2018-02-13T09:13:31Z

Currently, the openCypher grammar has whitespaces modeled like the snippet below describing that ReadPart is a sequence of zero or more ReadingClause separated by whitespaces.

openCypher/grammar/cypher.xml

Lines 151 to 155 in 8fa5658

    
           <production name="ReadPart" rr:inline="true"> 
        
             <repeat> 
        
               <non-terminal ref="ReadingClause"/> &WS; 
        
             </repeat> 
        
           </production>

To be more precise, the excerpt above also states that a ReadPart ends with a whitespace unless empty.

However, the next excerpt from ReadUpdateEnd is formalized more strictly to describe a sequence of one ore more of ReadingClauses that have whitespace separators strictly in between them, but not at the end.

openCypher/grammar/cypher.xml

Lines 111 to 114 in 8fa5658

    
           <non-terminal ref="ReadingClause"/> 
        
           <repeat> 
        
             &WS; <non-terminal ref="ReadingClause"/> 
        
           </repeat>

The reason behind the difference between the two excerpts is to make the grammar unambiguous also for whitespace matches.

If we could switch to whitespace-independent grammar, then the excerpt above could be rewritten in a more straightforward way like:

<repeat min="1">
  <non-terminal ref="ReadingClause"/>
</repeat>

I think it would be beneficial to switch to whitespace-independent formalization of the grammar. Also the patches in #300 could be rewritten in a more intuitive way then.

the railroad diagram generator already seems to skip whitespaces and does the hard work to illustrate one or more ReadingClause sequence of ReadUpdateEnd in its natural form despite the grammar says it is one followed by zero or more.
the generated ANTLR4 grammar could be rewritten to skip whitespaces (at least in theory, but I didn't try to do so)
is there a use for whitespaces in the original XML grammar or in the EBNF grammar artifact or anywhere else?

What do you think?

The text was updated successfully, but these errors were encountered:

szarnyasg · 2018-02-14T18:43:57Z

Note: related issues were discussed earlier in #56. Essentially, handling whitespaces is very easy in CF grammar-based parsers such as plain ANTLR and Xtext (which uses ANTLR 3 internally).

However, it is not trivial in PEG-based tools such as Parboiled. The Handling Whitespace wikipage of the parboiled project says this:

One disadvantage of PEGs over lexer based CFGs can be the handling of white space. In a traditional CFG based parser with a separate lexer (scanner) phase this lexer might simply skip all white space and only generate tokens for the actual parser to operate on. This can free the actual parser grammar from all white space treatment. Since PEGs do not have a lexer but directly operate on the raw input they have to deal with white space in the grammar itself. Language designers with little experience in PEGs can sometime be unsure of how to best handle white space in their grammar.

jmarton · 2018-02-16T21:14:38Z

Thank you @szarnyasg for the reference and raising the point of PEGs.

However, PEGs usually do a greedy matching, so keeping whitespaces in the grammar and dropping (hiding) them in the ANTLR4 artifact:

might hide ambiguity for whitespace parsing, but greedy matching on whitespaces in PEGs will handle, i.e. for the ambiguous pattern SP* SP*, its first part will consume all whitespaces
the ANTLR4 tests will still raise ambiguity exceptions in case non-whitespace ambiguity is introduced
still allow for clarified repetition syntax similar to the one given in the issue description:

<repeat min="1">
  <non-terminal ref="ReadingClause"/> &WS;
</repeat>

Mats-SX · 2018-03-01T08:23:34Z

I think this makes a lot of sense, and will certainly simplify the grammar as a whole. I don't think I can foresee what exact consequences (if any) this will have for the language, but I can't come up with any real problem, if the rule was formulated that whitespace is allowed between tokens in any position of the query.

Some examples that I've considered that don't seem to be a problem:

function invocations: RETURN toInt ( '1' )
multi-word operators: RETURN 1 IS NOT NULL
patterns: MATCH ( a : A ) - [ r : TYPE | TYPE2 ] - > ( b : B { b : 1 } )

(EDIT: GitHub unhelpfully renders my multi-whitespace separations as single-whitespace in the above. You'll have to imagine wider separation, although it doesn't alter the point really.)

As correctly identified by @jmarton in the above, the specification for whitespace as currently done is to model this very rule, but it turns out it's pretty tricky to get it right.

@thobe Do you have any input for this discussion?

thobe · 2018-12-20T14:14:17Z

Looks like I'm seeing this quite late.

I think what we should do is add a classification of non-terminals whether they are expected to be handled on a lexer level or on a parser level. A processing tool can then use that information to inject whitespace in appropriate places for output formats that needs to be explicit about it.

Mats-SX · 2018-12-20T17:10:18Z

I agree, and we already have started such a scheme:

openCypher/grammar/basic-grammar.xml

Lines 546 to 559 in 64ec11f

    
           <production name="HexInteger" oc:lexer="true"> 
        
             <literal value="0x"/> 
        
             <repeat min="1"><non-terminal ref="HexDigit"/></repeat> 
        
           </production> 
        
           <production name="DecimalInteger" oc:lexer="true"> 
        
             <alt> 
        
               <non-terminal ref="ZeroDigit"/> 
        
               <seq> 
        
                 <non-terminal ref="NonZeroDigit"/> 
        
                 <repeat><non-terminal ref="Digit"/></repeat> 
        
               </seq> 
        
             </alt> 
        
           </production>

thobe · 2018-12-21T00:08:02Z

Yes. I'm not sure I'm a fan of the particular naming, but it does the trick. "All" we need to do is use that for some sort of "whitespace injection", and we're set.

Nov	DEC	Jan
	18
2019	2020	2021

opencypher / openCypher

Ignore whitespaces in the grammar #301

Ignore whitespaces in the grammar #301

jmarton commented Feb 13, 2018 •

edited

szarnyasg commented Feb 14, 2018

jmarton commented Feb 16, 2018

Mats-SX commented Mar 1, 2018 •

edited

thobe commented Dec 20, 2018

Mats-SX commented Dec 20, 2018

thobe commented Dec 21, 2018

opencypher / openCypher

Ignore whitespaces in the grammar #301

Ignore whitespaces in the grammar #301

Comments

jmarton commented Feb 13, 2018 • edited

szarnyasg commented Feb 14, 2018

jmarton commented Feb 16, 2018

Mats-SX commented Mar 1, 2018 • edited

thobe commented Dec 20, 2018

Mats-SX commented Dec 20, 2018

thobe commented Dec 21, 2018

jmarton commented Feb 13, 2018 •

edited

Mats-SX commented Mar 1, 2018 •

edited