Parsing structured strings in Java has always been painful. Most developers reach for regular expressions, split(), or manual slicing. But these techniques are error-prone, hard to read, and, most importantly, not checked at compile time.
The StringFormat class makes parsing so easy that even a beginner can implement it with a one-liner.
🧩 Regex is Not Easy
Take a common use case: parsing a file path like this:
/logs/2024/05/16/system.log
You want to extract year, month, day, and filename. Here’s what most people do:
private static final Pattern LOG_PATH = Pattern.compile(
    "/logs/(\\d{4})/(\\d{2})/(\\d{2})/(.+)\\.log"
);
// elsewhere in code:
Matcher matcher = LOG_PATH.matcher(path);
if (matcher.matches()) {
  String year = matcher.group(1);
  String month = matcher.group(2);
  String day = matcher.group(3);
  String file = matcher.group(4);
}
Some problems:
- Regex pattern readability sucks.
- Use group names and your pattern readability sucks even more.
- With pattern and extraction logic far apart, you could get the groups out of order.
- Group indices are magic numbers.
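For comparison, here is a sketch of the same extraction with java.util.regex named groups. The class name LogPathRegex and its parse method are made up for illustration; only Pattern and Matcher come from the JDK. Named groups remove the magic indices, but the pattern gets longer, and the group names are still plain strings that the compiler never checks against the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class LogPathRegex {
  private static final Pattern LOG_PATH = Pattern.compile(
      "/logs/(?<year>\\d{4})/(?<month>\\d{2})/(?<day>\\d{2})/(?<file>.+)\\.log");

  /** Returns {year, month, day, file}, or null if the path does not match. */
  static String[] parse(String path) {
    Matcher matcher = LOG_PATH.matcher(path);
    if (!matcher.matches()) {
      return null;
    }
    // A typo in any of these names only fails at runtime (IllegalArgumentException).
    return new String[] {
      matcher.group("year"),
      matcher.group("month"),
      matcher.group("day"),
      matcher.group("file"),
    };
  }
}
```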
✅ Structured Parsing with StringFormat
The same logic using StringFormat:
private static final StringFormat FORMAT =
    new StringFormat("/logs/{year}/{month}/{day}/{file}.log");
FORMAT.parseOrThrow(
    "/logs/2024/05/16/system.log",
    (year, month, day, file) ->
        new Log(parseInt(year), parseInt(month), parseInt(day), file));
That’s it. No string math, no group numbers.
🛡️ Compile-time Safety
StringFormat performs compile-time validation of the lambda:
// ❌ Error: parameter count mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b, c) -> { });
//                       ~~~~~~~~~~~~~~~
// Compilation error: too many parameters

// ❌ Error: parameter order mismatch
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (b, a) -> { });
//                       ~~~~~~~
// Compilation error: expected order (a, b)

// ✅ Correct: order matches field declaration
new StringFormat("{a}-{b}")
    .parseOrThrow("1-2", (a, b) -> ...);
This level of safety is not possible with regex, split(), or ad-hoc parsing.
No need to “remember” group indices or keep documentation in sync with code: the compiler checks it for you.
✅ Scanning repeatedly
If the input string contains many such file paths, you can lazily scan them all:
List<Log> logs = FORMAT.scan(input, (year, month, day, file) -> ...)
    .filter(...) // apply whatever filtering you care about
    .limit(10) // if you want up to 10 such files
    .toList();
✅ Regex is Unpredictable
Regex engines in Java (and many other platforms) use NFA-based backtracking, which means certain patterns (especially involving nested repetitions or ambiguous alternations) can cause catastrophic slowdowns.
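Even short regexes with nested repetition, like:
((a|b|c)+)+
can trigger catastrophic backtracking when a long run of matching characters is followed by one that makes the overall match fail. Simple-looking patterns like:
href="([^"]*)"
are linear on their own, but can become just as expensive once they are embedded in a larger pattern with overlapping repetitions. These aren't toy examples; real-world systems have gone down because of such patterns: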
- Stack Overflow 2016: a regex used to trim whitespace from lines of a post hit a line with about 20,000 consecutive whitespace characters, and the resulting backtracking took the site down for roughly half an hour (postmortem).
- Cloudflare 2019: a single WAF rule with a pathological regex caused CPU saturation and took down large parts of the network (incident report).
RE2, Google's safe regex engine, avoids this by using automata-based matching with linear-time guarantees, but at the cost of expressive power (e.g., no backreferences or lookaround), and it is often slower than hand-written substring logic.
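As a rough sketch of what hand-written substring logic can look like for the href case, the helper below finds every href="..." value with plain indexOf calls. HrefScanner and hrefs are hypothetical names used only for this illustration, and the code makes no attempt to be a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class HrefScanner {
  private static final String OPEN = "href=\"";

  /** Returns the text between href=" and the next " for every occurrence, left to right. */
  static List<String> hrefs(String html) {
    List<String> result = new ArrayList<>();
    int pos = 0;
    while (true) {
      int start = html.indexOf(OPEN, pos);
      if (start < 0) {
        return result;
      }
      int valueStart = start + OPEN.length();
      int valueEnd = html.indexOf('"', valueStart);
      if (valueEnd < 0) {
        return result; // unterminated attribute: stop instead of retrying
      }
      result.add(html.substring(valueStart, valueEnd));
      pos = valueEnd + 1; // always move forward; the scan never revisits earlier text
    }
  }
}
```

Because the position only ever moves forward, there is nothing to backtrack over, whatever the input looks like.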
✅ StringFormat uses indexOf() calls
StringFormat avoids regex entirely. It splits the template into static fragments and uses a left-to-right scan driven by String.indexOf(...) to locate placeholder matches.
This gives several benefits:
- ✅ Deterministic linear scanning
- ✅ No backtracking
- ✅ No regex syntax or escaping
- ✅ Fast fragment matching (in practice, indexOf(...) is extremely fast for short needles)
- ✅ Safe under adversarial input (no regex ReDoS risk)
In real workloads, where fragments are small (/, :, @, etc.) and inputs are well-structured, indexOf() outperforms regex by a wide margin, often many times faster per match.
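To make the fragment-splitting idea concrete, here is a minimal sketch of a template matcher built on indexOf. It is not Mug's actual implementation: TinyTemplate is a hypothetical class written for this illustration, it ignores placeholder names, and it assumes no two placeholders sit directly next to each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified illustration; not Mug's actual implementation.
final class TinyTemplate {
  private final List<String> fragments = new ArrayList<>(); // static text around placeholders

  TinyTemplate(String template) {
    int pos = 0;
    while (true) {
      int open = template.indexOf('{', pos);
      int close = open < 0 ? -1 : template.indexOf('}', open);
      if (close < 0) {
        fragments.add(template.substring(pos)); // trailing static text
        return;
      }
      fragments.add(template.substring(pos, open)); // static text before "{name}"
      pos = close + 1;
    }
  }

  /** Returns the placeholder values in template order, or empty if input does not match. */
  Optional<List<String>> parse(String input) {
    if (!input.startsWith(fragments.get(0))) {
      return Optional.empty();
    }
    List<String> values = new ArrayList<>();
    int pos = fragments.get(0).length();
    for (int i = 1; i < fragments.size(); i++) {
      String fragment = fragments.get(i);
      boolean last = i == fragments.size() - 1;
      // The last fragment must be a suffix; earlier ones are found by a forward scan.
      int found = last ? input.length() - fragment.length() : input.indexOf(fragment, pos);
      if (found < pos || (last && !input.endsWith(fragment))) {
        return Optional.empty();
      }
      values.add(input.substring(pos, found));
      pos = found + fragment.length();
    }
    return pos == input.length() ? Optional.of(values) : Optional.empty();
  }
}
```

For the log path template, the fragments are "/logs/", "/", "/", "/" and ".log", and each placeholder value is simply the text between two consecutive fragment matches.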
🧠 Summary
If you're building tooling that parses structured text:
- Don't use regex unless you need full pattern generality.
- Don't assume a regex will behave the same at scale or under attack.
- Use simple substring matching where structure allows, like StringFormat does.
↔️ Bidirectional
While parsing is the main use case, the same format string also supports formatting:
String path = FORMAT.format(year, month, day, file);
The format string is always the source of truth. No string concatenation. No misplaced .append() chains.
Similar compile-time protection exists: you'll get a compilation error if you pass in the wrong number of args, or if you, for example, get file and year/month/day in the wrong order.
With traditional String.format("input size for %s is %s", user.id(), input.size()), it's not a good idea to stash away the format string as a constant, because then it's easy to pass the format arguments in the wrong order.
But with StringFormat and its compile-time check, it's safe to do so, making it easier to reuse the same format string in different places.
format() uses direct string concatenation (+), which on Java 9 and higher compiles to invokedynamic-based concatenation and is typically faster than a hand-written StringBuilder chain.
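The formatting direction can be sketched in a few lines as well. TinyFormat below is a hypothetical helper, not Mug's code: it fills placeholders left to right and can only detect a wrong argument count at runtime, which is exactly the class of mistake the article says StringFormat reports at compile time.

```java
// Hypothetical, simplified illustration; not Mug's actual implementation.
final class TinyFormat {
  /** Replaces each {placeholder} in template, left to right, with the next argument. */
  static String format(String template, Object... args) {
    StringBuilder out = new StringBuilder();
    int pos = 0;
    int argIndex = 0;
    while (true) {
      int open = template.indexOf('{', pos);
      int close = open < 0 ? -1 : template.indexOf('}', open);
      if (close < 0) {
        break; // no more placeholders
      }
      if (argIndex == args.length) {
        throw new IllegalArgumentException("too few arguments for " + template);
      }
      out.append(template, pos, open).append(args[argIndex++]);
      pos = close + 1;
    }
    if (argIndex != args.length) {
      throw new IllegalArgumentException("too many arguments for " + template);
    }
    return out.append(template, pos, template.length()).toString();
  }
}
```

Note that nothing here can tell whether the arguments arrive in the right order; a swapped year and day would format without complaint.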
🐍 Python's parse
Looking around, Python offers similar syntax in its parse library:
from parse import parse
result = parse("/logs/{year}/{month}/{day}/{file}.log", "/logs/2024/05/17/server.log")
print(result.named) # {'year': '2024', 'month': '05', 'day': '17', 'file': 'server'}
However, parse() offers no compile-time enforcement, and under the hood it is a wrapper around regex, so it suffers from the same backtracking-regex performance overhead and the same risk of disastrous backtracking.
🌍 Other Languages
Language | Closest Equivalent | Readability | Compile-Time Safety | Bidirectional | Notes |
---|---|---|---|---|---|
Java | StringFormat | ✅ High | ✅ Yes | ✅ Yes | Template-based parsing with lambda + type safety |
Python | parse | ✅ High | ❌ No | ✅ Yes | Clean syntax, runtime-only verification |
JS/TS | path-to-regexp, match() | ⚠️ Medium | ❌ No | ⚠️ Partial | URL-focused, lacks general structure matching |
Go | regexp (manual group extraction) | ❌ Low | ❌ No | ❌ No | Verbose and error-prone, no field names |
C++ | std::regex, manual parsing | ❌ Low | ❌ No | ❌ No | No built-in structure mapping; verbose |
Kotlin | Regex, manual destructuring | ⚠️ Medium | ❌ No | ❌ No | No declarative templates, no type checks |
🔍 Observations
- Python is the only language with syntax on par with StringFormat, but all verification is deferred to runtime.
- JavaScript, Go, C++, and Kotlin rely on regex or ad-hoc logic without structural templates.
- Only Java + StringFormat delivers template-based parsing with lambda field binding and compiler-checked safety.
👉 GitHub Repo: Mug
Top comments (2)
Can StringFormat support automatic type validation and conversion like %d or Python parse's :d? It seems parse can also support datetime parsing, like ti?
Yeah. You've hit a major design trade-off of StringFormat.
There were two considerations that prevented the addition of format specifiers:
No Backtracking
StringFormat is a linear search (using String.indexOf()) thanks to its pattern simplicity.
If we had to support a custom pattern language like datetime, particularly with variable match length, we'd have faced the typical parser choice of "shift or reduce", and we might have had to backtrack.
Python parse does it by delegating to regex, which has full backtracking support. But it also suffers from potential exponential backtracking given a malicious or unfortunate input.
And that would take away a key design goal of StringFormat: to be used in highly stability- and efficiency-sensitive servers.
Readability
The format specifiers, along with regex, improve expressivity, but at the cost of readability.
For example, almost every time I need to use format_timestamp() in BigQuery, I have to look up which specifier I need to use for the datetime format I have in my head. The same goes for the JDK DateTimeFormatter specifiers.
And when I read the specifiers, same thing: I don't really know what they mean without looking up the documentation.
That is inconsistent with another goal of StringFormat: what you see is what you get, and you can immediately know what the format is.
Workarounds
That's a lot of "philosophy". But it doesn't answer your question of how you can parse out the datetime embedded in your input string. For example: "My birthday is: Thu, May 22 2025.", assuming you want to parse that date into a LocalDate.
By StringFormat's "philosophy" (again): no cryptic specifiers, no DSL. You parse it out explicitly, in Java.
For example, this is how I would do it myself:
Or you can use your own DateTimeFormatter, or call LocalDate.parse() directly; your choice. The idea is that you do it explicitly, and if it fails, you get the stack trace pointing to the offending line.
Hope that makes sense.
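A minimal sketch of the explicit two-step approach described in the reply above, using only JDK classes so it stands on its own. BirthdayParser, its PREFIX and SUFFIX constants, and the "EEE, MMM d yyyy" pattern are assumptions chosen to fit the example sentence; with StringFormat, the first step would presumably be a template with a single placeholder instead of the manual substring.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration; the prefix, suffix, and pattern are assumptions.
final class BirthdayParser {
  private static final String PREFIX = "My birthday is: ";
  private static final String SUFFIX = ".";
  private static final DateTimeFormatter DATE =
      DateTimeFormatter.ofPattern("EEE, MMM d yyyy", Locale.ENGLISH);

  static LocalDate parse(String input) {
    if (!input.startsWith(PREFIX) || !input.endsWith(SUFFIX)
        || input.length() < PREFIX.length() + SUFFIX.length()) {
      throw new IllegalArgumentException("unexpected input: " + input);
    }
    // Step 1: cut out the raw text between the fixed fragments.
    String raw = input.substring(PREFIX.length(), input.length() - SUFFIX.length());
    // Step 2: convert explicitly; a bad date throws DateTimeParseException right here.
    return LocalDate.parse(raw, DATE);
  }
}
```

Keeping the conversion as ordinary Java means a malformed date fails on the LocalDate.parse line, so the stack trace points directly at the offending step.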