None of this matters for tokenisation. Groups in regular expressions are not real. Both the parenthesisation and the limited repetition syntax are just shorthand for generating the (N)DFA. In this case, the minimal DFA you can produce from that is:

[Figure: the DFA for the language in the question]

which is exactly the language that would be created by (abc)(abc)*.
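For concreteness, here is a minimal sketch in Python of that machine as a plain transition table; the state numbering and the names DELTA and ACCEPT are mine, invented for illustration, not anything from your question:

```python
# Minimal DFA for (abc)(abc)*, i.e. (abc)+. Missing entries mean "reject";
# state 3 is the accepting (double-circle) state with an outgoing "a" edge.
DELTA = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,  # one complete "abc" seen: accept
    (3, "a"): 1,  # loop back around for another "abc"
}
ACCEPT = {3}

def matches(s: str) -> bool:
    state = 0
    for ch in s:
        state = DELTA.get((state, ch))
        if state is None:   # no edge for this symbol: reject
            return False
    return state in ACCEPT

assert matches("abc") and matches("abcabcabc")
assert not matches("") and not matches("abca")
```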

The groups within the expression as you wrote it are not part of tokenisation: if you need those groups to have meaning, what you have is a parsing problem, not a tokenisation problem, and you can apply all of the usual parsing techniques to get what you want. If the totality of the string is what matters, then it doesn't matter how you determine what matched, just that you get the correct string in the end.
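If you do want both, one common arrangement (sketched here with Python's re module; the pattern and helper are hypothetical, just for illustration) is to let the tokeniser match the whole token and recover any internal structure in a separate pass afterwards:

```python
import re

TOKEN = re.compile(r"(?:abc)+")   # the tokeniser's view: no groups needed
UNIT = re.compile(r"abc")         # the "parser's" view of one repetition

def tokenise_then_parse(s, pos=0):
    m = TOKEN.match(s, pos)
    if m is None:
        return None
    token = m.group(0)            # tokenisation only needs the whole match
    units = UNIT.findall(token)   # recovering the groups is a second pass
    return token, units

print(tokenise_then_parse("abcabcx"))  # ('abcabc', ['abc', 'abc'])
```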


There are a few ways to produce this minimal automaton, where it's obvious when to stop; the most straightforward (though not efficient) is the double-reverse approach: produce a DFA, reverse all the arrows to produce an NFA, convert that to a DFA, then do the same again. This isn't typically done, because it's an expensive operation and can result in an exponential blowup of states, but if you need the minimal machine then you can build it this way. However, none of this is necessary: the standard non-minimised DFA or NFA will already give you exactly the same matching power and accept exactly the same strings.
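Here's a rough sketch of that double-reverse construction in Python, in case it's useful; the encoding of a DFA as a transition dict, a start state, and a set of accepting states is my own choice for this sketch:

```python
from itertools import count

def reverse(delta, start, accepts):
    """Reverse every arrow: an NFA whose start states are the old accepts."""
    rdelta = {}
    for (q, a), r in delta.items():
        rdelta.setdefault((r, a), set()).add(q)
    return rdelta, set(accepts), {start}

def determinise(ndelta, starts, naccepts, alphabet):
    """Subset construction: reachable subsets of NFA states become DFA states."""
    names, fresh = {}, count()
    def name(subset):
        if subset not in names:
            names[subset] = next(fresh)
        return names[subset]
    delta, accepts = {}, set()
    start = frozenset(starts)
    todo, seen = [start], {start}
    while todo:
        subset = todo.pop()
        if subset & naccepts:
            accepts.add(name(subset))
        for a in alphabet:
            target = frozenset(q for s in subset for q in ndelta.get((s, a), ()))
            if not target:
                continue              # dead transitions are simply omitted
            delta[(name(subset), a)] = name(target)
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return delta, name(start), accepts

def minimise(delta, start, accepts, alphabet):
    """Brzozowski's double reverse: reverse, determinise, and again."""
    for _ in range(2):
        ndelta, starts, naccepts = reverse(delta, start, accepts)
        delta, start, accepts = determinise(ndelta, starts, naccepts, alphabet)
    return delta, start, accepts

# A deliberately redundant six-state DFA for (abc)+ minimises to four states.
big = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3,
      (3, "a"): 4, (4, "b"): 5, (5, "c"): 3}
delta, start, accepts = minimise(big, 0, {3}, "abc")
print(len({q for q, _ in delta} | set(delta.values())))  # 4
```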

You can explore and reverse this expression interactively using this tool to see that process in action. You'll see that the entire string is consumed regardless, and that's a successful match.


When used for tokenisation, it is typical to follow the "longest-token rule" or "maximal munch" and always take the longest possible match at the tokenisation point. This is necessary even for simple cases: an "identifier" pattern [a-z]+ would otherwise only match single characters, but presumably should take all of them. In the example above, this means looping until you reach the acceptor (double circle) state and cannot take the outgoing a edge from it because the next symbol is not an "a". If you use an off-the-shelf regular expression engine, which you probably should, opt into greedy matching where the engine requires it; lexer systems typically do this automatically.
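As a sketch (again in Python, with the same invented transition table as above, repeated so the snippet stands alone), maximal munch amounts to scanning while edges exist, remembering the last accepting position, and cutting the token there even if the scan overshot past it:

```python
DELTA = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "a"): 1}
ACCEPT = {3}

def next_token(s, pos):
    state, i, last_accept = 0, pos, None
    while i < len(s):
        state = DELTA.get((state, s[i]))
        if state is None:
            break                  # no outgoing edge: stop scanning
        i += 1
        if state in ACCEPT:
            last_accept = i        # remember the longest match so far
    if last_accept is None:
        raise ValueError(f"no token at position {pos}")
    return s[pos:last_accept], last_accept

print(next_token("abcabcabca", 0))  # ('abcabcabc', 9); the dangling 'a'
                                    # is left for the next tokenisation step
```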
