
The following lexer rules are copied from the integer-literal section of the Rust nightly reference.

INTEGER_LITERAL ->
    ( DEC_LITERAL | BIN_LITERAL | OCT_LITERAL | HEX_LITERAL ) SUFFIX_NO_E?

DEC_LITERAL -> DEC_DIGIT (DEC_DIGIT|`_`)*

BIN_LITERAL -> `0b` (BIN_DIGIT|`_`)* BIN_DIGIT (BIN_DIGIT|`_`)*

OCT_LITERAL -> `0o` (OCT_DIGIT|`_`)* OCT_DIGIT (OCT_DIGIT|`_`)*

HEX_LITERAL -> `0x` (HEX_DIGIT|`_`)* HEX_DIGIT (HEX_DIGIT|`_`)*

BIN_DIGIT -> [`0`-`1`]

OCT_DIGIT -> [`0`-`7`]

DEC_DIGIT -> [`0`-`9`]

HEX_DIGIT -> [`0`-`9` `a`-`f` `A`-`F`]

SUFFIX -> IDENTIFIER_OR_KEYWORD

SUFFIX_NO_E -> SUFFIX _not beginning with `e` or `E`_
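
To make these rules concrete, here are a few literals of my own (not taken from the reference) and the productions they match:

    fn main() {
        let a = 123_456i64;    // DEC_LITERAL with suffix i64
        let b = 0b1010_1010u8; // BIN_LITERAL with suffix u8
        let c = 0o755u32;      // OCT_LITERAL with suffix u32
        let d = 0xABCu32;      // HEX_LITERAL with suffix u32
        println!("{a} {b} {c} {d}");
    }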

My problem is: how does the Rust lexer handle integer literals like 0xabc?

If the lexer does not check whether the suffix is one of i32, u32, isize, usize, etc., how can it tell the difference between DEC_LITERAL + SUFFIX_NO_E and HEX_LITERAL?

For example, although 0xabc is commonly read as a hex literal, it could also be lexed as 0 + xabc, with xabc a suffix and 0 a DEC_LITERAL.
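
A quick check (a snippet of my own) shows that rustc takes the hex reading, so the question is how the lexer arrives at that choice:

    fn main() {
        // 0xabc is lexed as one HEX_LITERAL with value 2748,
        // not as DEC_LITERAL `0` followed by the suffix `xabc`.
        assert_eq!(0xabc, 2748);
    }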

I'm working as a TA on a project that requires students to write a Rust-like compiler in any language, targeting the RV32I assembly language, but I have no idea how the real Rust compiler handles this situation.

  • Well, this says that a numeric literal with a suffix other than i32, u32, ... and friends is rejected when it is interpreted as a literal expression. Commented Aug 15 at 13:39
  • This is probably a decent starting point: github.com/rust-lang/rust/blob/…. The docs say that even 1f32 is classified as an int initially, and the fact that it's a float is determined later. That should help to answer your question about 0xabc. Commented Aug 16 at 1:43

1 Answer


The Rust lexer doesn't consume the grammar you copied the way flex or similar lexer generators would. Instead it uses hand-written code that reads the characters one by one. That is more work, but it allows for better error messages and more control over ambiguities.

The documentation does present that grammar, but since it is meant to be read by humans rather than compilers, it is sometimes a bit ambiguous, as in the case you point out.

The parsing of a numeric literal is done in this function; let me copy the relevant code here:

    fn number(&mut self, first_digit: char) -> LiteralKind {
        debug_assert!('0' <= self.prev() && self.prev() <= '9');
        let mut base = Base::Decimal;
        if first_digit == '0' {
            // Attempt to parse encoding base.
            match self.first() {
                // ...
                'x' => {
                    base = Base::Hexadecimal;
                    // ...
                }
                // Not a base prefix; consume additional digits.
                '0'..='9' | '_' => {
                    // ...
                }
                // Also not a base prefix; nothing more to do here.
                '.' | 'e' | 'E' => {}
                // ...

You can think of it as the lexer reading the sub-tokens 0x and abc and joining them to form the final token.
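
If you are writing a Rust-like lexer by hand, a minimal sketch of that greedy strategy could look like the following. This is my own illustration, not the rustc code; the function name lex_number and its return shape are made up, only the 0x case is handled, and floats are ignored:

    // Hand-written sketch: after a leading `0`, peek at the next character to
    // decide the base, greedily consume the digits of that base, and only then
    // treat whatever identifier-like characters remain as the suffix.
    fn lex_number(input: &str) -> (String, Option<String>) {
        let mut chars = input.chars().peekable();
        let mut digits = String::new();

        // The caller guarantees the input starts with a decimal digit.
        digits.push(chars.next().expect("number starts with a digit"));

        if digits == "0" && matches!(chars.peek(), Some(&'x')) {
            // Greedy choice: `0x` is taken as a base prefix, so the token can
            // no longer be "DEC_LITERAL `0` followed by a suffix".
            digits.push(chars.next().unwrap());
            while matches!(chars.peek(), Some(c) if c.is_ascii_hexdigit() || *c == '_') {
                digits.push(chars.next().unwrap());
            }
        } else {
            while matches!(chars.peek(), Some(c) if c.is_ascii_digit() || *c == '_') {
                digits.push(chars.next().unwrap());
            }
        }

        // Everything that is left is the (unvalidated) suffix; checking that it
        // is a real type like i32 or u32 happens in a later compiler stage.
        let rest: String = chars.collect();
        let suffix = if rest.is_empty() { None } else { Some(rest) };
        (digits, suffix)
    }

    fn main() {
        // `0x` wins: the whole input is one hex literal with no suffix.
        let (digits, suffix) = lex_number("0xabc");
        assert_eq!(digits, "0xabc");
        assert_eq!(suffix, None);

        // A leading `0` followed by something other than `x` stays a
        // DEC_LITERAL, and the rest becomes the suffix.
        let (digits, suffix) = lex_number("0usize");
        assert_eq!(digits, "0");
        assert_eq!(suffix.as_deref(), Some("usize"));
    }

The point this tries to mirror is visible in the quoted rustc code: the branch for 'x' commits to Base::Hexadecimal as soon as it sees the character after the leading 0, and validating the suffix against i32, u32 and friends is left to a later stage of the compiler.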
