Dash One

The Easiest String Parsing in Java

Parsing structured strings in Java has always been painful. Most developers reach for regular expressions, split(), or manual slicing. But these techniques are error-prone, hard to read, and, most importantly, not checked at compile time.

The StringFormat class (from the Mug library) makes parsing easy enough that even a beginner can do it in a one-liner.


🧩 Regex is Not Easy

Take a common use case: parsing a file path like this:

/logs/2024/05/16/system.log

You want to extract year, month, day, and filename. Here’s what most people do:

private static final Pattern LOG_PATH = Pattern.compile(
    "/logs/(\\d{4})/(\\d{2})/(\\d{2})/(.+)\\.log"
);

// elsewhere in code:
Matcher matcher = LOG_PATH.matcher(path);
if (matcher.matches()) {
    String year = matcher.group(1);
    String month = matcher.group(2);
    String day = matcher.group(3);
    String file = matcher.group(4);
}

Some problems:

  1. Regex pattern readability sucks.
  2. Use group names and your pattern readability sucks even more.
  3. With pattern and extraction logic far apart, you could get the groups out of order.
  4. Group indices are magic numbers.
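
To make the second and fourth points concrete, here is a minimal sketch of the same pattern rewritten with named groups, using only java.util.regex. The class name NamedGroupLogPath and the describe() helper are made up for illustration. The magic numbers go away, but the pattern gets longer and the group names are still plain strings that nothing checks until runtime.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class NamedGroupLogPath {
    // Same pattern as above, with named groups instead of numeric indices.
    static final Pattern LOG_PATH = Pattern.compile(
        "/logs/(?<year>\\d{4})/(?<month>\\d{2})/(?<day>\\d{2})/(?<file>.+)\\.log");

    // Returns "year-month-day:file", or null when the path does not match.
    static String describe(String path) {
        Matcher matcher = LOG_PATH.matcher(path);
        if (!matcher.matches()) {
            return null;
        }
        // A typo in any of these names compiles fine and fails only at runtime.
        return matcher.group("year") + "-" + matcher.group("month") + "-"
            + matcher.group("day") + ":" + matcher.group("file");
    }

    public static void main(String[] args) {
        System.out.println(describe("/logs/2024/05/16/system.log"));
    }
}
```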

✅ Structured Parsing with StringFormat

The same logic using StringFormat:

private static final StringFormat FORMAT =
    new StringFormat("/logs/{year}/{month}/{day}/{file}.log");

// parseInt is Integer.parseInt, statically imported
Log log = FORMAT.parseOrThrow(
    "/logs/2024/05/16/system.log",
    (year, month, day, file) ->
        new Log(parseInt(year), parseInt(month), parseInt(day), file));

That’s it. No string math, no group numbers.


🛡️ Compile-time Safety

StringFormat performs compile-time validation of the lambda:

// ❌ Error: parameter count mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b, c) -> { });
//                       ~~~~~~~~~~~~~~~
//                       Compilation error: too many parameters

// ❌ Error: parameter order mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (b, a) -> { });
//                       ~~~~~~~
//                       Compilation error: expected order (a, b)

// ✅ Correct: order matches field declaration
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b) -> ...);

This level of safety is not possible with regex, split(), or ad-hoc parsing.

No need to “remember” group indices or keep documentation in sync with code—the compiler checks it for you.


✅ Scanning repeatedly

Say, if you have many such file paths in the input string, you can lazily scan them all:

List<Log> logs = FORMAT.scan(input, (year, month, day, file) -> ...)
    .filter(...)  // apply whatever filtering you care about
    .limit(10)    // if you want up to 10 such files
    .toList();

✅ Regex is Unpredictable

Regex engines in Java (and many other platforms) use NFA-based backtracking, which means certain patterns (especially involving nested repetitions or ambiguous alternations) can cause catastrophic slowdowns.

Even simple-looking regexes like:

href="([^"]*)"

or:

((a|b|c)+)+

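can backtrack heavily on hostile input such as unterminated quotes or long runs of repeated characters; nested quantifiers like those in the second pattern are the classic trigger for exponential blowup. These aren’t toy problems. Real-world systems have gone down because of regex backtracking: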

  • Stack Overflow 2016: a whitespace-trimming regex, run on a post containing roughly 20,000 consecutive whitespace characters, backtracked heavily and took the site down for about half an hour (postmortem).
  • Cloudflare 2019: a single WAF rule with a pathological regex caused CPU saturation and took down large parts of the network (incident report).

RE2, Google's safe regex engine, avoids this with automata-based matching that guarantees linear time, but at the cost of expressive power (e.g., no backreferences or lookaround), and it is still often slower than hand-written substring logic.


✅ StringFormat uses indexOf() calls

StringFormat avoids regex entirely. It splits the template into static fragments and uses a left-to-right scan driven by String.indexOf(...) to locate placeholder matches.

This gives several benefits:

  • Deterministic linear scanning
  • No backtracking
  • No regex syntax or escaping
  • Fast fragment matching (in practice, indexOf(...) is extremely fast for short needles)
  • Safe under adversarial input (no regex ReDoS risk)
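
The left-to-right indexOf() scan described above can be sketched in a few lines of plain Java. This is a simplified illustration under stated assumptions, not Mug's actual implementation: the class name IndexOfTemplate and its parse() method are hypothetical, placeholder names are ignored, and the last fragment is anchored to the end of the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of template parsing driven by String.indexOf().
class IndexOfTemplate {
    // Static text around the placeholders: always one more fragment
    // than there are placeholders.
    private final List<String> fragments = new ArrayList<>();

    IndexOfTemplate(String template) {
        int pos = 0;
        while (true) {
            int open = template.indexOf('{', pos);
            if (open < 0) break;
            int close = template.indexOf('}', open);
            if (close < 0) throw new IllegalArgumentException("unclosed placeholder");
            fragments.add(template.substring(pos, open));
            pos = close + 1;
        }
        fragments.add(template.substring(pos));
    }

    // Returns the placeholder values in template order, or null on mismatch.
    List<String> parse(String input) {
        String first = fragments.get(0);
        if (!input.startsWith(first)) return null;
        int pos = first.length();
        List<String> values = new ArrayList<>();
        for (int i = 1; i < fragments.size(); i++) {
            String fragment = fragments.get(i);
            int next;
            if (i == fragments.size() - 1) {
                // The last fragment must sit at the very end of the input.
                next = input.length() - fragment.length();
                if (next < pos || !input.endsWith(fragment)) return null;
            } else {
                // Each scan starts where the previous one ended: no backtracking.
                next = input.indexOf(fragment, pos);
                if (next < 0) return null;
            }
            values.add(input.substring(pos, next));
            pos = next + fragment.length();
        }
        return pos == input.length() ? values : null;
    }

    public static void main(String[] args) {
        IndexOfTemplate template =
            new IndexOfTemplate("/logs/{year}/{month}/{day}/{file}.log");
        System.out.println(template.parse("/logs/2024/05/16/system.log"));
    }
}
```

Because every search resumes from the end of the previous match, the scan position only moves forward through the input, which is where the linear, no-backtracking behavior comes from.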

In real workloads, where fragments are small (/, :, @, etc.) and inputs are well-structured, indexOf() typically outperforms regex by a wide margin, often several times faster per match.


🧠 Summary

If you're building tooling that parses structured text:

  • Don't use regex unless you need full pattern generality.
  • Don’t trust that it will behave the same under scale or attack.
  • Use simple substring matching where structure allows—like StringFormat does.

↔️ Bidirectional

While parsing is the main use case, the same format string also supports formatting:

String path = FORMAT.format(year, month, day, file);

The format string is always the source of truth. No string concatenation. No misplaced .append() chains.

Similar compile-time protection exists: you'll get a compilation error if you pass in the wrong number of args or, for example, pass the file and year/month/day in the wrong order.

With traditional String.format("input size for %s is %s", user.id(), input.size()), it's not a good idea to stash away the format string as a constant because then it's easy to pass the format arguments in the wrong order.

But with StringFormat and its compile-time check, it's safe to do so, making it easier to reuse the same format string at different places.
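
The argument-order hazard with a shared String.format() constant can be shown with the JDK alone. In this small sketch the class name FormatOrderHazard and the sample values are hypothetical; both calls compile and run, and only one of them says what was intended.

```java
// Hypothetical demo class showing the hazard of a shared format constant.
class FormatOrderHazard {
    // A format string stashed away as a constant, far from its call sites.
    static final String INPUT_SIZE = "input size for %s is %s";

    // Arguments in the intended order.
    static String intended(String userId, int inputSize) {
        return String.format(INPUT_SIZE, userId, inputSize);
    }

    // Arguments swapped by mistake: the compiler has no way to notice.
    static String swapped(String userId, int inputSize) {
        return String.format(INPUT_SIZE, inputSize, userId);
    }

    public static void main(String[] args) {
        System.out.println(intended("alice", 42)); // input size for alice is 42
        System.out.println(swapped("alice", 42));  // input size for 42 is alice
    }
}
```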

format() uses direct string concatenation (+), which on Java 9 and higher is generally at least as fast as an explicit StringBuilder.


🐍 Python's parse

Looking around, Python offers similar syntax in its parse library:

from parse import parse

result = parse("/logs/{year}/{month}/{day}/{file}.log", "/logs/2024/05/17/server.log")
print(result.named)  # {'year': '2024', 'month': '05', 'day': '17', 'file': 'server'}

However, parse() offers no compile-time enforcement, and under the hood it is a wrapper around regex, so it carries the same backtracking-regex performance overhead and the same risk of catastrophic backtracking.


🌍 Other Languages

| Language | Closest Equivalent | Readability | Compile-Time Safety | Bidirectional | Notes |
| --- | --- | --- | --- | --- | --- |
| Java | StringFormat | ✅ High | ✅ Yes | ✅ Yes | Template-based parsing with lambda + type safety |
| Python | parse | ✅ High | ❌ No | ✅ Yes | Clean syntax, runtime-only verification |
| JS/TS | path-to-regexp, match() | ⚠️ Medium | ❌ No | ⚠️ Partial | URL-focused, lacks general structure matching |
| Go | regexp (manual group extraction) | ❌ Low | ❌ No | ❌ No | Verbose and error-prone, no field names |
| C++ | std::regex, manual parsing | ❌ Low | ❌ No | ❌ No | No built-in structure mapping; verbose |
| Kotlin | Regex, manual destructuring | ⚠️ Medium | ❌ No | ❌ No | No declarative templates, no type checks |

🔍 Observations

  • Python is the only language with syntax on par with StringFormat, but all verification is deferred to runtime.
  • JavaScript, Go, C++, and Kotlin rely on regex or ad-hoc logic without structural templates.
  • Only Java + StringFormat delivers template-based parsing with lambda field binding, and compiler-checked safety.


👉 GitHub Repo: Mug

Top comments (2)

Michelle Lei

Can StringFormat support automatic type validation and conversion like %d or Python parse's :d?

It seems parse can also support datetime parsing, like ti?

Dash One • Edited

Yeah. You've hit a major design trade-off of StringFormat.

There were two considerations that prevented the addition of format specifiers:

No Backtracking

StringFormat does a linear search (using String.indexOf()), which is possible because its pattern syntax is so simple.

If we had to support a custom pattern language like datetime, particularly with variable match length, we'd have faced the typical choice of "shift or reduce" in parsers, and we may have to backtrack.

Python parse does it by delegating to regex, which has full backtracking support. But it also suffers from potential exponential backtracking given a malicious or unfortunate input.

And that would undermine a key design goal of StringFormat: to be usable in servers that are highly sensitive to stability and efficiency.

Readability

The format specifiers, along with regex, improve the expressivity, but at the cost of readability.

For example, almost every time I need to use format_timestamp() in BigQuery, I have to look up which specifier I need to use for the datetime format I have in my head.

The same goes for the JDK DateTimeFormatter specifiers.

And when I read the specifiers, same thing: I don't really know what they mean without looking up the documentation.

That is inconsistent with another goal of StringFormat: what you see is what you get and you can immediately know what the format is.

Workarounds

That's a lot of "philosophy". But it doesn't answer your question of how you can parse out the datetime embedded in your input string. For example: "My birthday is: Thu, May 22 2025.", assuming you want to parse that date into a LocalDate.

By StringFormat's "philosophy" (again): no cryptic specifiers, no DSL. You parse it out explicitly, in Java.

For example, this is how I would do it myself:

import static com.google.mu.time.DateTimeFormats.parseLocalDate;

LocalDate birthday = new StringFormat("My birthday is: {date}.")
    .parseOrThrow(input, date -> parseLocalDate(date));

Or, if you use your own DateTimeFormatter, or directly use LocalDate.parse(), your choice.
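
As a rough sketch of the "your own DateTimeFormatter" route, here is a JDK-only version. The class name BirthdayParse, the pattern string, and the manual prefix/suffix slicing (standing in for StringFormat) are all assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo class; the slicing stands in for a StringFormat template.
class BirthdayParse {
    // Pattern for dates like "Thu, May 22 2025". The locale is pinned so that
    // day and month names parse the same way on any machine.
    static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("EEE, MMM d yyyy", Locale.ENGLISH);

    static LocalDate parseBirthday(String input) {
        String prefix = "My birthday is: ";
        if (!input.startsWith(prefix) || !input.endsWith(".")) {
            throw new IllegalArgumentException("unexpected input: " + input);
        }
        // Cut out the text between the prefix and the trailing period.
        String date = input.substring(prefix.length(), input.length() - 1);
        // Throws DateTimeParseException, with a stack trace through this line,
        // if the extracted text is not a valid date.
        return LocalDate.parse(date, DATE);
    }

    public static void main(String[] args) {
        System.out.println(parseBirthday("My birthday is: Thu, May 22 2025."));
    }
}
```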

The idea is that you do it explicitly, and if it fails, you get the stack trace pointing to the offending line.

Hope that makes sense.