Painless regex with Apache Groovy

Regular expressions – love them or hate them, they’re an inescapable reality for most programmers, especially for data wranglers like me who constantly wrestle data into submission. (If you haven’t installed Groovy yet, please read the intro to this series.)

This quote, often cited but of murky origin, seems obligatory:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Even so, regular expressions remain a powerful tool for recognizing and transforming text, especially text that was never meant to be recognized or transformed. Groovy has taken a stab at making them, if not friendlier, at least groovier.

So, what does Groovy bring to regular expressions beyond Java’s base capabilities?

In the last article, you discovered slashy strings, which eliminate the need to escape the backslash. That matters for regular expressions, which tend to be full of backslashes.
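To make the difference concrete, here is a minimal sketch comparing an ordinary string with a slashy string. The weight pattern is my own invention for illustration; it doesn't come from the earlier article.

```groovy
// The same regular expression written two ways.
String javaStyle = "\\d+\\s*kg"   // ordinary string: every backslash doubled
String slashy = /\d+\s*kg/        // slashy string: backslashes written once

// Both strings hold exactly the same characters.
assert javaStyle == slashy

// Either one can be used as a pattern.
assert "12 kg" ==~ slashy
assert !("12 pounds" ==~ javaStyle)

println slashy   // prints the pattern text: \d+\s*kg
```

The slashy version reads the way the regular expression is meant to be read, which is the whole point.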

Let’s say you’re reading some HTML from somewhere and you’re looking for terms defined using the description list.
A description list looks like this:

<dl>
  <dt>winter</dt>
    <dd>winter solstice to spring equinox</dd>
  <dt>spring</dt>
    <dd>spring equinox to summer solstice</dd>
  <dt>summer</dt>
    <dd>summer solstice to fall equinox</dd>
  <dt>autumn</dt>
    <dd>fall equinox to winter solstice</dd>
</dl>

Which would render as:

winter
winter solstice to spring equinox
spring
spring equinox to summer solstice
summer
summer solstice to fall equinox
autumn
fall equinox to winter solstice

The terms being defined occur between the <dt> and </dt> tags.

A regular expression that would locate these terms, using slashy string notation, would look like:

~/<dt>.*<\/dt>/

Note the ~ in front, which indicates that the following string will be compiled into a regular expression. Note also that you had to escape the / character, since an unescaped / would end the slashy string.
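If you want to convince yourself of what the ~ does, a short sketch like the following should help. The variable names dtPattern and dtText are mine, chosen just for this illustration.

```groovy
import java.util.regex.Pattern

// The ~ in front of a string compiles it into a java.util.regex.Pattern.
// Keep a space between = and ~ here; written together, =~ is a different operator.
def dtPattern = ~/<dt>.*<\/dt>/
assert dtPattern instanceof Pattern

// Without the ~, the slashy string is just a String holding the pattern text.
def dtText = /<dt>.*<\/dt>/
assert dtText instanceof String
assert dtText == '<dt>.*</dt>'   // the \/ escape yields a plain / in the string

// The compiled pattern works with the usual Java API.
assert dtPattern.matcher('<dt>winter</dt>').matches()
```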

Suppose you’re reading an HTML document and looking to see what terms are defined in the document.
A script like this would come close to meeting your needs:

String html = """
<html>
  <head>
    <title>Seasons</title>
  </head>
  <body>
    <h1>Learning about the seasons</h1>
    <p>What is the formal definition of the four seasons of the year?</p>
    <dl>
      <dt>winter</dt>
      <dd>winter solstice to spring equinox</dd>
      <dt>spring</dt>
      <dd>spring equinox to summer solstice</dd>
      <dt>summer</dt>
      <dd>summer solstice to fall equinox</dd>
      <dt>autumn</dt>
      <dd>fall equinox to winter solstice</dd>
    </dl>
  </body>
</html>
"""
def dts = html =~ /<dt>.*<\/dt>/
for (f in dts) {
  println f
}

When you run it, you get:

$ groovy Groovy11a.groovy
<dt>winter</dt>
<dt>spring</dt>
<dt>summer</dt>
<dt>autumn</dt>

The =~ symbol is the Groovy find operator. Used in this way, it yields an instance of the Matcher class. You can iterate over the matches in the Matcher instance to get the matched substrings.
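The matches above still carry the surrounding tags. One way to get just the terms, sketched here on a shortened version of the HTML, is to add a capture group to the pattern. The parentheses and the variable names are my additions, not part of the script above.

```groovy
String html = '''
<dl>
  <dt>winter</dt>
  <dd>winter solstice to spring equinox</dd>
  <dt>spring</dt>
  <dd>spring equinox to summer solstice</dd>
</dl>
'''

// Parentheses create a capture group around the text between the tags.
def matcher = html =~ /<dt>(.*)<\/dt>/

// With a group present, each match is a list: [whole match, group 1].
def terms = matcher.collect { match -> match[1] }
assert terms == ['winter', 'spring']

// Matches can also be indexed directly: match number, then group number.
assert matcher[0][0] == '<dt>winter</dt>'
assert matcher[1][1] == 'spring'

println terms   // prints [winter, spring]
```

This works line by line because . does not match a newline by default. If two <dt> elements could share a line, the greedy .* would run from the first <dt> to the last </dt>, and a reluctant .*? would be the safer choice.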

Sometimes, dealing with a Matcher object is too complicated. Maybe you just want to identify whether there’s a pattern of interest on any of the lines of data. In that case, you can use the match operator, ==~:

String html = """
<html>
  <head>
    <title>Seasons</title>
  </head>
  <body>
    <h1>Learning about the seasons</h1>
    <p>What is the formal definition of the four seasons of the year?</p>
    <dl>
      <dt>winter</dt>
      <dd>winter solstice to spring equinox</dd>
      <dt>spring</dt>
      <dd>spring equinox to summer solstice</dd>
      <dt>summer</dt>
      <dd>summer solstice to fall equinox</dd>
      <dt>autumn</dt>
      <dd>fall equinox to winter solstice</dd>
    </dl>
  </body>
</html>
"""

for (line in html.split(/\n/)) {
  if (line ==~ /.*<dt>.*<\/dt>.*/)
    println line
}

When you run this, you see:

$ groovy Groovy11b.groovy
      <dt>winter</dt>
      <dt>spring</dt>
      <dt>summer</dt>
      <dt>autumn</dt>

There are a couple of important things to note. The match operator tests the entire string, so you had to change the pattern to allow for text before and after the pattern of interest. Also, all you have done is identify that the test string has the pattern somewhere in it; you haven't extracted the matching text.
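The difference between the two operators is easy to see side by side. Here is a small sketch using one line from the sample HTML; the variable names are mine.

```groovy
String line = '      <dt>winter</dt>'

// Find (=~): true in a boolean context if the pattern occurs anywhere.
assert line =~ /<dt>.*<\/dt>/

// Match (==~): true only if the pattern accounts for the whole string,
// so the bare pattern fails because of the leading spaces...
assert !(line ==~ /<dt>.*<\/dt>/)

// ...and succeeds once the pattern allows text on either side.
assert line ==~ /.*<dt>.*<\/dt>.*/

// ==~ gives back a plain boolean rather than a Matcher.
def matched = line ==~ /.*<dt>.*<\/dt>.*/
assert matched instanceof Boolean
```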

Where the match operator has been especially useful for my work is in Groovy’s switch statement. For example, you can pull the terms and definitions from the sample HTML using a switch statement and the Groovy String enhancement method takeBetween(), which you learned about two articles ago, as follows:

1  String html = """
2  <html>
3    <head>
4      <title>Seasons</title>
5    </head>
6    <body>
7      <h1>Learning about the seasons</h1>
8      <p>What is the formal definition of the four seasons of the year?</p>
9      <dl>
10        <dt>winter</dt>
11        <dd>winter solstice to spring equinox</dd>
12        <dt>spring</dt>
13        <dd>spring equinox to summer solstice</dd>
14        <dt>summer</dt>
15        <dd>summer solstice to fall equinox</dd>
16        <dt>autumn</dt>
17        <dd>fall equinox to winter solstice</dd>
18      </dl>
19    </body>
20  </html>
21  """
22  def definitionList = []
23  def definition = [:]
24  html.split(/\n/).each { line ->
25    switch (line) {
26    case ~/.*<dt>.*<\/dt>.*/:
27      definition.term = line.takeBetween("<dt>", "</dt>")
28      break
29    case ~/.*<dd>.*<\/dd>.*/:
30      definition.definition = line.takeBetween("<dd>", "</dd>")
31      break
32    default:
33      break
34    }
35    if (definition.containsKey("term") && definition.containsKey("definition")) {
36      definitionList << definition
37      definition = [:]
38    }
39  }
40  println "Definition(s) encountered in HTML:"
41  definitionList.each { d ->
42    println "${d.term}: ${d.definition}"
43  }

Here are some comments about the above code:

In lines 24-39 I use the each() method with a closure to loop over each line of the HTML. (The split() call returns an array, and each() works on arrays as well as lists. In the previous example I used the Groovy for… in statement, but generally I prefer each() in this situation.)
In lines 25-34 I use the Groovy switch statement with pattern matching to detect lines with <dt>… </dt> and <dd>… </dd> in them. I use the takeBetween() method to get the text from between those elements, saving that text in the definition map defined on line 23.
In lines 35-38 I look to see if the definition map has both the term and definition keys and if so, I append it to the definitionList before resetting it to the empty map.
In lines 41-43 I use the each() method defined for lists with a closure to loop over each definition map in the definitionList.
I could make this groovier by using the specialized collect() method for lists but I’ll leave that for a subsequent article.
I could have just printed the definition map each time it was full rather than accumulating it into the definitionList. But in my experience, there’s often a need to separate the two steps and carry out intermediate processing between them.

When you run this, you get:

$ groovy Groovy11c.groovy
Definition(s) encountered in HTML:
winter: winter solstice to spring equinox
spring: spring equinox to summer solstice
summer: summer solstice to fall equinox
autumn: fall equinox to winter solstice
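If you're curious why a regular expression works in a case clause at all: Groovy's switch calls the isCase() method on each case value, and for a Pattern that amounts to a whole-string match against the switch value. The sketch below shows this outside the script above. The classify() helper is hypothetical, something I made up for this illustration.

```groovy
def termLine = ~/.*<dt>.*<\/dt>.*/

// A case clause calls isCase() on the case value; for a Pattern,
// that is a whole-string match against the switch value.
assert termLine.isCase('  <dt>winter</dt>')
assert !termLine.isCase('  <dd>winter solstice to spring equinox</dd>')

// The in operator goes through isCase() as well.
assert '  <dt>winter</dt>' in termLine

// Hypothetical helper (not from the script above) that classifies a line.
String classify(String line) {
    switch (line) {
        case ~/.*<dt>.*<\/dt>.*/: return 'term'
        case ~/.*<dd>.*<\/dd>.*/: return 'definition'
        default: return 'other'
    }
}

assert classify('<dt>spring</dt>') == 'term'
assert classify('<dd>spring equinox to summer solstice</dd>') == 'definition'
assert classify('<dl>') == 'other'
```

Because the case clause matches the whole string, the patterns need the leading and trailing .* just as they did with the match operator.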

Conclusion

Groovy has some excellent support for programmers who need regular expressions. You can use slashy strings, which simplify the expressions by removing the need to escape the backslash character. You also have the find and match operators, and the enhanced switch statement can take regular expressions in its case clauses.

Again, the syntactic support for these elements makes the code much more compact and readable than the Java equivalents (where they even exist). About the only thing that trips me up is that the match operator in Groovy is similar to, but different from, the match operator in awk (yes, I still use awk a fair bit, though not as often as Groovy these days).