16 votes
Accepted

How would you test a lexer?

Your grammar probably has some rules for each token on how it can be produced (for example, that a { signifies a BLOCK_START token, or that a string-literal token is delimited by " characters). ...
Bart van Ingen Schenau
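As a rough illustration of the excerpt above, here is a minimal sketch of testing one token rule at a time. The `lex` function, the `(kind, text)` tuple representation, and the `STRING` token name are assumptions made up for this example; only `BLOCK_START` and the `"`-delimited string rule come from the answer.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); a hypothetical representation


def lex(source: str) -> List[Token]:
    """Toy lexer: '{' -> BLOCK_START, '"..."' -> STRING, whitespace skipped."""
    tokens: List[Token] = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch == "{":
            tokens.append(("BLOCK_START", ch))
            i += 1
        elif ch == '"':
            # A string literal runs up to the next double quote.
            end = source.find('"', i + 1)
            if end == -1:
                raise ValueError(f"unterminated string literal at offset {i}")
            tokens.append(("STRING", source[i:end + 1]))
            i = end + 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens


# One small test per token rule, plus one for the malformed case.
def test_block_start() -> None:
    assert lex("{") == [("BLOCK_START", "{")]


def test_string_literal() -> None:
    assert lex('"hi"') == [("STRING", '"hi"')]


def test_unterminated_string_is_rejected() -> None:
    try:
        lex('"hi')
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Each test exercises exactly one rule, so a failure points at the rule that broke rather than at some combination of tokens.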
16 votes

How would you test a lexer?

If you're writing the lexer yourself, this seems like an ideal case for test-driven development. While “the number of combinations of tokens in a source file can be huge,” the number of branches in ...
Arseni Mourzenko
15 votes

Rewrite or Transpiler - How to move away from a proprietary SAAS solution

Let me take on your issues one by one: "the current implementation has bugs": Yep, and when you transpile such code, what makes you think those bugs will not be transpiled as well? With a rewrite, ...
Doc Brown
  • 220k
11 votes
Accepted

Should my lexer allow what is obviously a syntax error?

Your lexer is never going to be able to diagnose all syntax errors unless you make it as powerful as the parser itself. This would be a large and totally unnecessary amount of work, and the only ...
Kilian Foth
8 votes

What should be the datatype of the tokens a lexer returns to its parser?

As said in the title, which data type should a lexer return/give the parser? "Token", obviously. A lexer produces a stream of tokens, so it should return a stream of tokens. He mentioned Flex, a ...
Eric Lippert
  • 46.6k
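To make "a lexer should return a stream of tokens" from the excerpt above concrete, here is a small sketch. The `TokenKind` members, the `Token` fields, and the `tokenize` generator are hypothetical names chosen for this example, not something the answer prescribes.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator


class TokenKind(Enum):
    NUMBER = auto()
    PLUS = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    text: str
    offset: int  # index of the token's first character in the source


def tokenize(source: str) -> Iterator[Token]:
    """Lazily yield tokens, so the parser can pull them one at a time."""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            yield Token(TokenKind.NUMBER, source[start:i], start)
        elif ch == "+":
            yield Token(TokenKind.PLUS, ch, i)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    # An explicit end marker spares the parser a separate end-of-input check.
    yield Token(TokenKind.EOF, "", len(source))
```

A parser would typically consume this with `next(...)` or a `for` loop; calling `list(tokenize("12 + 3"))` materializes the whole stream when that is more convenient.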
7 votes

How would you test a lexer?

One alternative that others aren't mentioning is to use a generative testing approach, like QuickCheck from Haskell, to generate the edge cases from the grammar you've defined. Now, once they're generated, ...
A T
  • 761
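The generative idea in the excerpt above can be sketched without a QuickCheck-style library by using only Python's `random` module. The toy grammar (numbers, identifiers, braces), the `lex` function, and `check_round_trip` are all assumptions invented for this example; a real property-testing library would also shrink failing inputs, which this sketch does not.

```python
import random
import re
import string
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Toy token grammar; leading whitespace is skipped before each token.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<IDENT>[A-Za-z_]\w*)|(?P<LBRACE>\{)|(?P<RBRACE>\}))"
)


def lex(source: str) -> List[Token]:
    tokens: List[Token] = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"cannot lex input at offset {pos}")
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


def random_token(rng: random.Random) -> Token:
    """Generate one token directly from the grammar's rules."""
    kind = rng.choice(["NUMBER", "IDENT", "LBRACE", "RBRACE"])
    if kind == "NUMBER":
        return kind, str(rng.randint(0, 10**6))
    if kind == "IDENT":
        first = rng.choice(string.ascii_letters + "_")
        tail = string.ascii_letters + string.digits + "_"
        rest = "".join(rng.choice(tail) for _ in range(rng.randint(0, 5)))
        return kind, first + rest
    return kind, "{" if kind == "LBRACE" else "}"


def check_round_trip(seed: int, cases: int = 200) -> None:
    """Property: lexing the rendered tokens gives back the same tokens."""
    rng = random.Random(seed)
    for _ in range(cases):
        expected = [random_token(rng) for _ in range(rng.randint(0, 20))]
        source = " ".join(text for _, text in expected)
        assert lex(source) == expected, source
```

Fixing the seed keeps a failing run reproducible, which matters more for generated inputs than for hand-written ones.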
6 votes

Rewrite or Transpiler - How to move away from a proprietary SAAS solution

Obviously you ask "rewrite or transpile?", but I'm not clear what the underlying issue is. You mention the existing implementation is riddled with bugs. You mention that there is limited ...
Steve
  • 12.6k
4 votes
Accepted

Do lexers have to go word by word or can they go line by line

If you have a grammar, then that should be your guide. Going line-by-line is reasonable in a grammar and would simply include newlines in the grammar as starting or finishing syntactic constructs (...
Erik Eidt
  • 34.8k
4 votes
Accepted

Is it a good idea to let keywords have different lexical rules from names of types, variables, functions, etc?

Distinguishing keywords/operators from user-defined names is not strictly necessary. Scannerless parsers can do just fine regardless. For example, it would be feasible to define a language where the ...
amon
  • 136k
3 votes

Do any programming languages let you use other languages without restriction within them?

Not really. Language interop is a difficult problem, and language embedding even more so. Many languages have nontrivial syntax constructs that cannot be easily parsed by a general purpose parser. ...
amon
  • 136k
2 votes

Should my lexer allow what is obviously a syntax error?

First a caveat: It very much depends on which subset of HTML. HTML5 does not really have the concept of errors at all. Basically any sequence of characters is valid and has a defined parse. I will ...
JacquesB
  • 62.4k
2 votes
Accepted

How is it possible to store the AST nodes location in the source code?

Yes, you have described the standard approach. Creating a raw text node type which has a line number, and then having others inherit from that might be attractive. Error messages will typically want ...
J_H
  • 7,891
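The approach in the excerpt above, a base node type that carries the source position and that other node types inherit from, might look roughly like this. The class names, the `column` field, and the `error_message` helper are assumptions added for illustration; the answer itself only mentions a line number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Base AST node: every node remembers where it came from."""
    line: int
    column: int

    def where(self) -> str:
        return f"{self.line}:{self.column}"


@dataclass(frozen=True)
class Number(Node):
    value: int


@dataclass(frozen=True)
class BinaryOp(Node):
    op: str
    left: Node
    right: Node


def error_message(node: Node, message: str) -> str:
    """Format a diagnostic that points at the node's source position."""
    return f"{node.where()}: error: {message}"
```

For example, `error_message(Number(line=3, column=7, value=0), "division by zero")` produces `3:7: error: division by zero`.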
2 votes
Accepted

How does a lexer handle template strings?

Two typical solutions: give up on using a separate lexer. This is easy and efficient with top-down parsing approaches such as recursive descent, PEG, or parser combinators. Such an approach makes it ...
amon
  • 136k
2 votes

How Should Lexers Be Stateful?

It depends on the kind of languages and it depends on whether you see statefulness on the input or the output side of the lexer. On the input side, lexers are often stateful: If you parse a string ...
Christophe
  • 82.2k
2 votes

Rewrite or Transpiler - How to move away from a proprietary SAAS solution

Why the whole hog? You have listed numerous issues that make wholesale change risky: novice staff; tight budgets; third-party supporters (misaligned goals); buggy/weird behaviors; lack of tests. From the ...
Kain0_0
  • 16.6k
1 vote

Rewrite or Transpiler - How to move away from a proprietary SAAS solution

Transpiler or rewrite? Just convert it. Easily 80% of my entire career has been doing this. You get one requirement: make it do what it did before on this new system. It isn't a rewrite and you don't ...
candied_orange
1 vote
Accepted

Concatenating strings given a BNF grammar

A production rule for an empty string (or equivalently an empty token) can always succeed by consuming nothing from the input. So, when you peek '(', the parser first tries the production <Letter&...
Bart van Ingen Schenau
1 vote

How Should Lexers Be Stateful?

They should not be stateful. No mutations rightfully belong in a lexer. All you're doing is transforming one stream (usually characters) into another (usually strings). That sort of thing is best ...
Telastyn
  • 110k
1 vote

Do lexers have to go word by word or can they go line by line

"It depends." In early languages such as the original FORTRAN, and some COBOLs, which assumed that input would be provided on 80-column punched cards, we have the notion of a continuation ...
Mike Robinson
1 vote

Should my lexer allow what is obviously a syntax error?

Your lexer should catch syntax errors for malformed tokens, and only those. But in general, your tokens should be complex enough to avoid returning tokens whose sole purpose is to delimit other token ...
Diane M
  • 2,116
1 vote

Should my lexer allow what is obviously a syntax error?

It depends on the scope of your target language and your use cases. If you want to consider </b> (but not </ b>) as a keyword then the lexer can identify that as such. </b would ...
Telastyn
  • 110k
1 vote

Differences between enumeration-based and hierarchical token typing

An enumeration is the classic/C-ish way to define tokens, but that doesn't make it extraordinarily good – precisely because it is difficult to keep track of associated values. A token might contain ...
amon
  • 136k
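The contrast drawn in the excerpt above, between an enumeration of token kinds and a hierarchy of token types, can be sketched as follows. All names here (`Kind`, `FlatToken`, `IntToken`, and so on) are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


# Enumeration style: one record type for every token. The payload field is
# only meaningful for some kinds, and nothing ties its type to the kind.
class Kind(Enum):
    INT = auto()
    IDENT = auto()
    PLUS = auto()


@dataclass(frozen=True)
class FlatToken:
    kind: Kind
    value: Optional[Union[int, str]] = None


# Hierarchical style: each token type declares exactly the data it carries.
@dataclass(frozen=True)
class Token:
    pass


@dataclass(frozen=True)
class IntToken(Token):
    value: int


@dataclass(frozen=True)
class IdentToken(Token):
    name: str


@dataclass(frozen=True)
class PlusToken(Token):
    pass


def describe(token: Token) -> str:
    """Dispatch on the token's type; each branch sees only valid fields."""
    if isinstance(token, IntToken):
        return f"integer {token.value}"
    if isinstance(token, IdentToken):
        return f"identifier {token.name}"
    if isinstance(token, PlusToken):
        return "plus sign"
    raise TypeError(f"unknown token type: {type(token).__name__}")
```

With the flat form, `FlatToken(Kind.PLUS, "oops")` is accepted even though a plus sign has no payload; the hierarchical form has no field in which to make that mistake.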

Only top scored, non community-wiki answers of a minimum length are eligible