Regular expressions vs. framework (in the sense of being separate from a language) is a false dichotomy. Counter-example: parser combinators. These can easily handle regular, context-free, and context-sensitive languages, and nice libraries exist for them in many languages. Major advantages are that language integration lets them snarf their host language's features (type system, unit-testing framework, the ability to extend parsers with ordinary classes/objects/functions, composability) and that they suffer fewer limitations, such as fixed token lookahead; major disadvantages are that an implementation is tied to a specific language (duh), that decent error messages are often difficult to generate, and that they can easily be inefficient if care isn't taken.
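
To make that concrete, here's a minimal parser-combinator sketch in Python. It's purely illustrative -- the names char, seq, and alt are mine, not from any particular library. A parser is just a function from (text, position) to (value, new position) on success or None on failure, and combinators are ordinary functions that build bigger parsers from smaller ones:

    # A parser is any function: (text, pos) -> (value, new_pos) or None.

    def char(c):
        # Match exactly one expected character.
        def parse(text, pos):
            if pos < len(text) and text[pos] == c:
                return c, pos + 1
            return None
        return parse

    def seq(*parsers):
        # Run parsers one after another; fail if any one of them fails.
        def parse(text, pos):
            values = []
            for p in parsers:
                result = p(text, pos)
                if result is None:
                    return None
                value, pos = result
                values.append(value)
            return values, pos
        return parse

    def alt(*parsers):
        # Ordered choice: return the result of the first parser that succeeds.
        def parse(text, pos):
            for p in parsers:
                result = p(text, pos)
                if result is not None:
                    return result
            return None
        return parse

The sketches after the list below all build on these three functions.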

Here are the factors that I look at when deciding what tool to use to parse something:

  • level of the language, and of the parsing tool, on the Chomsky hierarchy. Is it a regular language? Is it context-free? Is it context-sensitive? If it's context-sensitive, I'd better make sure my parsing tool can handle it. (The balanced-parentheses sketch after this list shows a classic non-regular language handled by a recursive rule.)

  • testability of parsers, including those for sub-rules. It's pretty easy to build parsers that look right but are totally wrong. Being able to independently test sub-parsers makes it far, far easier to get the whole parser right. This goes hand-in-hand with composability: building big parsers by putting small parsers together. (See the unit-test sketch after this list.)

  • whether the language is ambiguous. Some languages are ambiguous, and some tools can't deal with them. If there are multiple valid parses, I probably want to be notified of that, possibly by an error or multiple results, instead of just getting a single parse and thinking it's the only one. (The ordered-choice sketch after this list shows how easily a valid parse can be silently lost.)

  • control over parse results. What output do parsers produce -- a concrete parse tree? Are the nodes strongly typed or stringly typed? I find it very useful to have some control over how the parse tree is built, as it is being built; having to post-process the tree is often more painful and error-prone, in my experience. An example is an integer parser -- should it return an integer or a string? (See the typed-result sketch after this list.)

  • efficiency. I don't know a whole lot about this, but there are definitely differences between different approaches, especially when there's a lot of backtracking.

  • error reporting. At the very least, parse errors should be reportable with position and current-rule information (see the error-reporting sketch after this list); a trace can also be useful, but I haven't really seen any decent solutions to this problem (although I bet there are some).

  • expressivity of the parsing tool. Does it make me jump through hoops to express simple things like a sequence of values separated by commas, with no trailing comma? I'm a big fan of BNF, but not being able to extend it is frustrating, and it can make grammars look far more complicated, and thus harder to maintain, than they actually need to be. Being able to capture patterns and abstract over them shouldn't be a luxury, but a requirement. (The sep_by1 sketch after this list shows the comma-separated case directly.)
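
On the Chomsky-hierarchy point: balanced parentheses are the textbook example of a context-free but non-regular language, so a plain regular expression can't recognize them, while a recursive rule built from the combinators sketched above handles them directly (again, hypothetical code):

    def balanced(text, pos):
        # balanced := '(' balanced ')' balanced | empty
        result = seq(char("("), balanced, char(")"), balanced)(text, pos)
        if result is not None:
            return result
        return [], pos  # the empty production always succeeds

    balanced("(()())", 0)  # succeeds, consuming all six characters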
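
On testability: because each sub-parser is a plain function, it can be unit-tested in isolation with the host language's ordinary test framework. A sketch, adding a satisfy helper and a digit sub-parser to the code above:

    import unittest

    def satisfy(pred):
        # Match one character for which pred returns True.
        def parse(text, pos):
            if pos < len(text) and pred(text[pos]):
                return text[pos], pos + 1
            return None
        return parse

    digit = satisfy(str.isdigit)

    class TestDigit(unittest.TestCase):
        def test_accepts_a_digit(self):
            self.assertEqual(digit("7x", 0), ("7", 1))

        def test_rejects_a_non_digit(self):
            self.assertIsNone(digit("x7", 0))

    if __name__ == "__main__":
        unittest.main()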
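
On ambiguity: the alt above is PEG-style ordered choice, which commits to the first branch that succeeds -- when two branches would both be valid parses, the second is silently discarded rather than reported. A small demonstration, adding a string combinator:

    def string(s):
        # Match an exact literal string.
        def parse(text, pos):
            if text.startswith(s, pos):
                return s, pos + len(s)
            return None
        return parse

    word = alt(string("for"), string("format"))
    word("format", 0)  # ("for", 3) -- the equally valid "format" parse is lost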
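
On control over results: a mapping combinator lets each sub-parser build its result as it goes, so an integer parser can return an actual int rather than a string that needs post-processing. Building on digit from the testing sketch:

    def many1(p):
        # One or more repetitions of p.
        def parse(text, pos):
            result = p(text, pos)
            if result is None:
                return None
            values, pos = [result[0]], result[1]
            while True:
                result = p(text, pos)
                if result is None:
                    return values, pos
                values.append(result[0])
                pos = result[1]
        return parse

    def mapped(p, f):
        # Transform a parser's result with an ordinary function.
        def parse(text, pos):
            result = p(text, pos)
            if result is None:
                return None
            value, new_pos = result
            return f(value), new_pos
        return parse

    integer = mapped(many1(digit), lambda ds: int("".join(ds)))
    integer("42,7", 0)  # (42, 2) -- an int, not the string "42"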
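
On error reporting: the bare None-on-failure convention above throws away every diagnostic, but a label wrapper can attach a rule name and report the failure position at points where failure should be fatal rather than just another choice to backtrack out of. A sketch:

    class ParseError(Exception):
        pass

    def label(name, p):
        # Wrap a committed rule: raise with position and rule name on failure.
        def parse(text, pos):
            result = p(text, pos)
            if result is None:
                raise ParseError(f"at position {pos}: expected {name}")
            return result
        return parse

    count = label("integer literal", integer)
    # count("abc", 0) raises: ParseError: at position 0: expected integer literal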
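
Finally, on expressivity: the values-separated-by-commas-with-no-trailing-comma pattern is one combinator definition, written once and reused everywhere -- exactly the kind of abstraction plain BNF doesn't give you. A sketch:

    def sep_by1(item, sep):
        # item (sep item)* -- one or more items, no trailing separator consumed.
        def parse(text, pos):
            result = item(text, pos)
            if result is None:
                return None
            values, pos = [result[0]], result[1]
            while True:
                s = sep(text, pos)
                if s is None:
                    return values, pos
                nxt = item(text, s[1])
                if nxt is None:
                    return values, pos  # separator but no item: stop before it
                values.append(nxt[0])
                pos = nxt[1]
        return parse

    csv_ints = sep_by1(integer, char(","))
    csv_ints("1,2,3", 0)  # ([1, 2, 3], 5)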
