
Painless regex with Apache Groovy
Regular expressions – love them or hate them, they’re an inescapable reality for most programmers, especially for data wranglers like me who constantly wrestle data into submission. (If you haven’t installed Groovy yet, please read the intro to this series.)
This oft-cited quote, whose original author remains murky, seems obligatory:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Even so, regular expressions remain a powerful tool for recognizing and transforming text, especially text that was never meant to be recognized or transformed. Groovy has taken a stab at making regular expressions, if not friendlier, then at least, well, groovier.

So, what does Groovy bring to regular expressions beyond Java’s base capabilities?
In the last article, you discovered slashy strings, which eliminate the need to escape the backslash. These help in dealing with regular expressions.
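As a quick refresher, here is a minimal sketch of what the slashy string saves you. The variable names quoted and slashy, and the decimal-number pattern, are made up for illustration and do not come from the earlier article.

```groovy
// In an ordinary double-quoted string every backslash must be doubled.
def quoted = "\\d+\\.\\d+"

// The slashy string holds exactly the same characters without the doubling.
def slashy = /\d+\.\d+/

assert slashy == quoted          // same text either way
assert slashy instanceof String  // a slashy string is still just a String
assert '3.14' ==~ slashy         // and it works as a regular expression

println slashy
```

Both forms produce the same pattern text; the slashy form is simply easier to read and to get right.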
Let’s say you’re reading some HTML from somewhere and you’re looking for terms defined using the description list.
A description list looks like this:
<dl>
<dt>winter</dt>
<dd>winter solstice to spring equinox</dd>
<dt>spring</dt>
<dd>spring equinox to summer solstice</dd>
<dt>summer</dt>
<dd>summer solstice to fall equinox</dd>
<dt>autumn</dt>
<dd>fall equinox to winter solstice</dd>
</dl>
Which would render as:
winter
winter solstice to spring equinox
spring
spring equinox to summer solstice
summer
summer solstice to fall equinox
autumn
fall equinox to winter solstice
The terms being defined occur between the <dt> and </dt> tags.
A regular expression that would locate these terms, using slashy string notation, would look like:
~/<dt>.*<\/dt>/
Note the ~ in front, which indicates that the following string will be compiled into a regular expression. Note also that you had to escape the / character, because an unescaped / would end the slashy string.
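To make the effect of the ~ concrete, the following small sketch checks what Groovy actually builds from that expression. The variable name termPattern is an assumption chosen for this example.

```groovy
import java.util.regex.Pattern

// The ~ in front of a string compiles it into a java.util.regex.Pattern.
def termPattern = ~/<dt>.*<\/dt>/

assert termPattern instanceof Pattern

// The escaped \/ in the slashy string becomes a plain / in the pattern text.
assert termPattern.pattern() == '<dt>.*</dt>'

// The compiled pattern can be used through the ordinary Java API.
assert termPattern.matcher('<dt>winter</dt>').matches()

println termPattern.pattern()
```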
Suppose you’re reading an HTML document and looking to see what terms are defined in the document.
A script like this would come close to meeting your needs:
String html = """
<html>
<head>
<title>Seasons</title>
</head>
<body>
<h1>Learning about the seasons</h1>
<p>What is the formal definition of the four seasons of the year?</p>
<dl>
<dt>winter</dt>
<dd>winter solstice to spring equinox</dd>
<dt>spring</dt>
<dd>spring equinox to summer solstice</dd>
<dt>summer</dt>
<dd>summer solstice to fall equinox</dd>
<dt>autumn</dt>
<dd>fall equinox to winter solstice</dd>
</dl>
</body>
</html>
"""
// Find every <dt>...</dt> occurrence in the HTML text
def dts = html =~ /<dt>.*<\/dt>/
// Each iteration yields one matched substring
for (f in dts) {
    println f
}
When you run it, you get:
$ groovy Groovy11a.groovy
<dt>winter</dt>
<dt>spring</dt>
<dt>summer</dt>
<dt>autumn</dt>
That symbol =~ is the Groovy find operator. Using it in this way yields an instance of the Matcher class. You can iterate over the matches in the Matcher instance to get the matched substrings.
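If you want only the term and not the surrounding tags, one option is to add a capturing group to the pattern. The sketch below is an illustration rather than part of the original script: it assumes a shortened HTML sample, the variable names matcher and terms are invented here, and it swaps in the reluctant quantifier .*? so that a line holding two <dt> elements would not be swallowed as a single match.

```groovy
// A shortened stand-in for the HTML used in the article.
def html = '''<dl>
<dt>winter</dt>
<dd>winter solstice to spring equinox</dd>
<dt>spring</dt>
<dd>spring equinox to summer solstice</dd>
</dl>'''

// Group 1 captures the text between the tags.
def matcher = html =~ /<dt>(.*?)<\/dt>/

// With a group in the pattern, each match is a list:
// element 0 is the whole match, element 1 is the first group.
def terms = matcher.collect { it[1] }

assert terms == ['winter', 'spring']
println terms
```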
Sometimes, dealing with a Matcher object is too complicated. Maybe you just want to identify whether there’s a pattern of interest on any of the lines of data. In that case, you can use the match operator, ==~:
String html = """
<html>
<head>
<title>Seasons</title>
</head>
<body>
<h1>Learning about the seasons</h1>
<p>What is the formal definition of the four seasons of the year?</p>
<dl>
<dt>winter</dt>
<dd>winter solstice to spring equinox</dd>
<dt>spring</dt>
<dd>spring equinox to summer solstice</dd>
<dt>summer</dt>
<dd>summer solstice to fall equinox</dd>
<dt>autumn</dt>
<dd>fall equinox to winter solstice</dd>
</dl>
</body>
</html>
"""
// Test each line as a whole against the pattern
for (line in html.split(/\n/)) {
    if (line ==~ /.*<dt>.*<\/dt>.*/)
        println line
}
When you run this, you see:
$ groovy Groovy11b.groovy
<dt>winter</dt>
<dt>spring</dt>
<dt>summer</dt>
<dt>autumn</dt>
A couple of important things to note: the match operator tests the entire string and simply answers true or false, so you had to change the pattern to allow for text before and after the pattern of interest. Note also that all you have done is identify that the test string has the pattern somewhere in it; you have not extracted the matching text.
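The difference between the two operators is easy to check on a single line. This is a minimal sketch with made-up names (line, pattern, found) and a sample line padded with spaces to stand in for indented HTML.

```groovy
// A line with leading and trailing whitespace around the element.
def line = '  <dt>winter</dt>  '
def pattern = /<dt>.*<\/dt>/

// The match operator must account for the whole string, so the bare pattern fails...
assert !(line ==~ pattern)

// ...while the padded pattern from the script succeeds.
assert line ==~ /.*<dt>.*<\/dt>.*/

// The find operator only needs the pattern to occur somewhere in the string.
def found = (line =~ pattern).find()
assert found

// The match operator returns a plain Boolean, not a Matcher.
assert (line ==~ pattern) instanceof Boolean
```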
Where this kind of matching has been especially useful for my work is in Groovy’s switch statement, whose case clauses can take patterns. For example, you can pull the terms and definitions from the sample HTML using a switch statement and the Groovy String enhancement method takeBetween(), which you learned about two articles ago, as follows:
1 String html = """
2 <html>
3 <head>
4 <title>Seasons</title>
5 </head>
6 <body>
7 <h1>Learning about the seasons</h1>
8 <p>What is the formal definition of the four seasons of the year?</p>
9 <dl>
10 <dt>winter</dt>
11 <dd>winter solstice to spring equinox</dd>
12 <dt>spring</dt>
13 <dd>spring equinox to summer solstice</dd>
14 <dt>summer</dt>
15 <dd>summer solstice to fall equinox</dd>
16 <dt>autumn</dt>
17 <dd>fall equinox to winter solstice</dd>
18 </dl>
19 </body>
20 </html>
21 """
22 def definitionList = []
23 def definition = [:]
24 html.split(/\n/).each { line ->
25 switch (line) {
26 case ~/.*<dt>.*<\/dt>.*/:
27 definition.term = line.takeBetween("<dt>", "</dt>")
28 break
29 case ~/.*<dd>.*<\/dd>.*/:
30 definition.definition = line.takeBetween("<dd>", "</dd>")
31 break
32 default:
33 break
34 }
35 if (definition.containsKey("term") && definition.containsKey("definition")) {
36 definitionList << definition
37 definition = [:]
38 }
39 }
40 println "Definition(s) encountered in HTML:"
41 definitionList.each { d ->
42 println "${d.term}: ${d.definition}"
43 }
Here are some comments about the above code:
- In lines 24-39 I use the each() method defined for lists with a closure to loop over each line of the HTML (in the previous example I used the Groovy for… in statement but generally I prefer to use each in this situation).
- In lines 25-34 I use the Groovy switch statement with pattern matching to detect lines with <dt>… </dt> and <dd>… </dd> in them. I use the takeBetween() method to get the text from between those elements, saving that text in the definition map defined on line 23.
- In lines 35-38 I look to see if the definition map has both the term and definition keys and if so, I append it to the definitionList before resetting it to the empty map.
- In lines 41-43 I use the each() method defined for lists with a closure to loop over each definition map in the definitionList.
- I could make this groovier by using the specialized collect() method for lists but I’ll leave that for a subsequent article.
- I could have just printed the definition map each time it was full rather than accumulating it into the definitionList. But in my experience, there’s often a need to separate the two steps and carry out intermediate processing between them.
When you run this, you get:
$ groovy Groovy11c.groovy
Definition(s) encountered in HTML:
winter: winter solstice to spring equinox
spring: spring equinox to summer solstice
summer: summer solstice to fall equinox
autumn: fall equinox to winter solstice
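The pattern-based dispatch used in that script can be looked at in isolation. The sketch below is only an illustration: classify is a hypothetical helper that does not appear in the script, and the labels it returns are invented for the example. A case clause holding a pattern succeeds only when the whole switch value matches, which is why the patterns keep their leading and trailing .* parts.

```groovy
// classify is a hypothetical helper, named here only for illustration.
def classify(String line) {
    switch (line) {
        // A pattern in a case clause must match the entire line.
        case ~/.*<dt>.*<\/dt>.*/:
            return 'term'
        case ~/.*<dd>.*<\/dd>.*/:
            return 'definition'
        default:
            return 'other'
    }
}

assert classify('<dt>winter</dt>') == 'term'
assert classify('  <dd>winter solstice to spring equinox</dd>') == 'definition'
assert classify('<h1>Learning about the seasons</h1>') == 'other'

println classify('<dt>winter</dt>')
```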
Conclusion
Groovy has some excellent support for programmers who need regular expressions. You can use slashy strings that simplify the expressions by removing the need to escape the backslash character. You also have find and match operators, or you can use the enhanced switch statement that can take regular expressions in its case clauses.
Again, the syntactic support for these elements makes the code much more compact and readable than the Java equivalents (where they even exist). About the only thing that messes me up is that the match operator in Groovy is similar to, but different from, the match operator in awk (yes, I still use awk a fair bit, though not as often as Groovy these days).