How to parse different number types with LALR(1)

Question

Consider an LALR(1) parser for a file format that allows integer numbers and floating point numbers.

As usual, something like 42 shall be a valid integer and a valid float (with some automagic conversion in the background).

There might be parsing rules where a floating point number or an integer number is expected, and other rules where only an integer number is expected, e.g.:

foo1
    : bar FLOAT buzz
    | bar INT buzz
    ;

foo2
    :  some INT other stuff
    ;

Now consider something like

foo3
    : bar FLOAT xyz FLOAT abc FLOAT buzz
    ;

but at each position in this rule, instead of FLOAT, also INT shall be allowed.

Turning this rule into 8 rules (one for each combination of FLOAT and INT) isn’t an option. (A rule with 4 or 5 numbers would need 16 or 32 alternatives.)
Using a rule like
```
float_or_int : FLOAT | INT;
```
won’t help, because in general this rule reduces every INT to float_or_int, so rules like foo2 can no longer be parsed. (In a sufficiently large grammar, one token of lookahead is not enough to avoid the shift-reduce conflicts that this rule introduces.)
When the lexer sees a number without a decimal point, it cannot decide whether the parser currently expects an int or a float-or-int.

How can this be handled in an elegant way?

Could you give a practical example of the problems you fear with float_or_int : FLOAT | INT? I've used that rule in the past without running into them in practice, and in the artificial examples I can build, the ambiguity is fundamental.
– AProgrammer, Commented Aug 8, 2014 at 9:25
@AProgrammer Take the foo1 and foo2 rules from above, add a main : foo1 | foo2;, and add rules bar : INT;, buzz : INT; etc. That’s of course (very) artificial, but in my actual project, which is larger and which I definitely cannot post here, such a rule creates a lot of shift-reduce conflicts. I have already seen the same with at least two other parsers. It (of course) depends on the tokens that can follow INT or FLOAT, and if you design a language from scratch, you can take that into account. But in my experience, parsing an existing format leads to conflicts.
– Martin, Commented Aug 8, 2014 at 9:45
@AProgrammer The point is that I’d really like to use a float_or_int rule, but the file format to be parsed doesn’t allow it and I cannot change the file format.
– Martin, Commented Aug 8, 2014 at 10:32

John R. Strohm · Accepted Answer · 2014-08-08 12:04:02Z

What is typically done is that the numeric constant is "parsed" in the lexer, with "number type" (int, float, base, ...) information made available to the parser. You use the simple int_or_float rules in the grammar, and then the associated semantic actions are responsible for verifying that you have legal number types in each place, and declaring an error if you don't.

The parser will apparently succeed in parsing the file, but you will still have flagged the errors, and you can refuse to generate the result based on those errors.
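
A minimal sketch of this idea in Python, not tied to any particular parser generator: the lexer emits one token kind for every number and records whether the text was an integer, and the semantic actions check that flag. The names `Number`, `lex_number`, `require_int` and `as_float` are hypothetical and only illustrate what the actions attached to rules like foo2 and foo3 might do.

```python
import re
from dataclasses import dataclass

# Hypothetical token class: one token kind for all numbers,
# with the number type recorded by the lexer.
@dataclass
class Number:
    text: str
    is_int: bool

    @property
    def value(self):
        return int(self.text) if self.is_int else float(self.text)

# Assumed number syntax: optional sign, digits, optional fraction and exponent.
_NUMBER = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")
_INTEGER = re.compile(r"[+-]?\d+")

def lex_number(text):
    """Turn a numeric lexeme into a Number token, tagging its type."""
    if _NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a number: {text!r}")
    return Number(text, _INTEGER.fullmatch(text) is not None)

def require_int(tok, where, errors):
    """Action for integer-only positions (e.g. foo2): record an error, keep parsing."""
    if not tok.is_int:
        errors.append(f"{where}: integer expected, got {tok.text}")
        return None
    return tok.value

def as_float(tok):
    """Action for positions that accept either type (e.g. foo3): convert to float."""
    return float(tok.value)
```

With this arrangement the grammar never has to choose between INT and FLOAT, so the conflict does not arise; the price is that a type error is reported by the action rather than as a syntax error.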

That is, one single token for all number types?
– Martin, Commented Aug 8, 2014 at 12:07
