flatparse
 

flatparse is a high-performance parsing library, focusing on programming languages and human-readable data formats. The "flat" in the name
refers to the ByteString parsing input, which has pinned contiguous data, and also to the library internals, which avoids indirections and heap allocations
whenever possible.
Features and non-features
- Excellent performance. On microbenchmarks, flatparseis around 10 times faster thanattoparsecormegaparsec. On larger examples with heavier use of source positions and spans and/or indentation parsing, the performance difference grows to 20-30 times. Compile times and exectuable sizes are also significantly better withflatparsethan withmegaparsecorattoparsec.flatparseinternals make liberal use of unboxed tuples and GHC primops. As a result, pure validators (parsers returning()) inflatparseare not difficult to implement with zero heap allocation.
- No incremental parsing, and only strict ByteStringis supported as input. However, it can be still useful to convert fromText,Stringor other types toByteString, and then useflatparsefor parsing, sinceflatparseperformance usually more than makes up for the conversion costs.
- Only little-endian 64 bit systems are currently supported. This may change in the future. Getting good performance requires architecture-specific optimizations; I've only considered the most common setting at this point.
- Support for fast source location handling, indentation parsing and informative error messages. flatparseprovides a low-level interface to these. Batteries are not included, but it should be possible for users to build custom solutions, which are more sophisticated, but still as fast as possible. In my experience, the included batteries in other libraries often come with major unavoidable overheads, and often we still have to extend existing machinery in order to scale to production features.
- The backtracking model of flatparseis different to parsec libraries, and is more close to the nom library in Rust. The idea is that parser failure is distinguished from parsing error. The former is used for control flow, and we can backtrack from it. The latter is used for unrecoverable errors, and by default it's propagated to the top.flatparsedoes not track whether parsers have consumed inputs. In my experience, what we really care about is the failure/error distinction, and inparsecormegaparsecthe consumed/non-consumed separation is often muddled and discarded in larger parser implementations. By default, basicflatparseparsers can fail but can not throw errors, with the exception of the specifically error-throwing operations. Hence,flatparseusers have to be mindful about grammar, and explicitly insert errors where it is known that the input can't be valid.
flatparse comes in two flavors: FlatParse.Basic and FlatParse.Stateful. Both support a custom error type.
- FlatParse.Basiconly supports the above features. If you don't need indentation
parsing, this is sufficient.
- FlatParse.Statefuladditionally supports a built-in- Intworth of internal state
and an additional- Intreader environment. This can support a wide range of indentation parsing
features. There is a slight overhead in performance and code size compared to- Basic. However, in
small parsers and microbenchmarks the difference between- Basicand- Statefulis often reduced
to near zero by GHC and/or LLVM optimization.
Tutorial
Informative tutorials are work in progress. See src/FlatParse/Examples
for a lexer/parser example with acceptably good error messages.
Contribution
Pull requests are welcome. I'm fairly quick to add PR authors as collaborators.
Some benchmarks
Execution times below. See source code in bench. Compiled with GHC 8.8.4 -O2 -fllvm.
| benchmark | runtime | 
| sexp/fpbasic | 3.345 ms | 
| sexp/fpstateful | 3.441 ms | 
| sexp/bytesmith | 5.646 ms | 
| sexp/attoparsec | 43.58 ms | 
| sexp/megaparsec | 57.76 ms | 
| sexp/parsec | 182.4 ms | 
| long keyword/fpbasic | 306.1 μs | 
| long keyword/fpstateful | 220.3 μs | 
| long keyword/bytesmith | 1.707 ms | 
| long keyword/attoparsec | 5.420 ms | 
| long keyword/megaparsec | 3.605 ms | 
| long keyword/parsec | 50.10 ms | 
| numeral csv/fpbasic | 898.4 μs | 
| numeral csv/fpstateful | 868.3 μs | 
| numeral csv/bytesmith | 2.412 ms | 
| numeral csv/attoparsec | 21.30 ms | 
| numeral csv/megaparsec | 10.37 ms | 
| numeral csv/parsec | 78.16 ms | 
Object file sizes for each module containing the s-exp, long keyword and numeral csv benchmarks.
| library | object file size (bytes) | 
| fpbasic | 26456 | 
| fpstateful | 30008 | 
| bytesmith | 39240 | 
| attoparsec | 83288 | 
| megaparsec | 188696 | 
| parsec | 75880 |