I have a huge list of regexes (>1,000 but <1,000,000) that I want to test against (many) single strings.
It is unlikely and unintended that more than one such expression would match a single string. I could just maintain a big list of individually compiled regexes and iterate over that for every input string. However, I have it in my head that I should hand the problem over to the regex compiler to factor out the common substrings, since it can (at least theoretically) produce a very neat single DFA.
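For reference, the baseline I'd otherwise fall back to looks something like this (a minimal sketch; naivemultiregex is just my name for it):

import re

class naivemultiregex(object):
    def __init__(self, rules):
        # Compile every rule separately, preserving order.
        self._rules = [(re.compile(regex), text) for regex, text in rules]

    def __call__(self, s):
        # Try each compiled rule in turn; first hit wins.
        for regex, text in self._rules:
            result = regex.search(s)
            if result:
                return (text, result.group(0))
        return None

What I have instead merges everything into a single pattern: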
import re
import uuid

class multiregex(object):
    def __init__(self, rules):
        merge = []
        self._messages = {}
        for regex, text in rules:
            # Group names must be valid identifiers, so prefix the
            # random hex digits with a letter.
            name = "g" + uuid.uuid4().hex
            merge.append("(?P<%s>%s)" % (name, regex))
            self._messages[name] = text
        # One big alternation, leaving any common-substring
        # optimisation to the regex compiler.
        self._re = re.compile('|'.join(merge))

    def __call__(self, s):
        result = self._re.search(s)
        if result:
            groups = result.groupdict()
            # Exactly one named group should have matched; compare
            # against None because a matched group may capture "".
            return next((self._messages[x], groups[x])
                        for x in groups if groups[x] is not None)
rules = [("foobar", "Hit a foobar"),
("f.*b.*r", "fbr"),
("foob.z", "Frobination"),
("baz", "Hit a baz"),
("b(ingo)?", "b with optional ingo")]
m=multiregex(rules)
tests=["foobar", "foobaz", "foobazr", "b", "bingo"]
for text,hit in (m(x) for x in tests):
print "Message: '%s' (because of '%s')" % (text,hit)
The code above works, but I have a few outstanding issues with it:
- Is it needlessly overcomplicating the whole thing, or is it pushing the problem off to code that's heavily researched and optimised?
- Is there a neater way of finding just the named capture group that matched than what I've done with groupdict()? And are there any more gotchas beyond the obvious one of two 'rules' containing the same group name (explored below)? e.g.:
rules = [("(?P<hello>foobar)", "Hit a foobar"), ("(?P<hello>foob.z)", "Frobination")]
(The issue of a single syntax error in one 'rule' killing the whole thing is easy enough to work around by validating the inputs at rule-creation time, as sketched below.)
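That validation could be as simple as compiling each rule on its own before merging, so a broken pattern is reported with its own error (a sketch; validate_rules is a made-up name):

import re

def validate_rules(rules):
    # Compile each pattern individually so one bad rule is reported
    # on its own instead of taking the whole merged regex down.
    errors = []
    for pattern, text in rules:
        try:
            re.compile(pattern)
        except re.error as e:
            errors.append((pattern, str(e)))
    return errors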