I have a huge list of regexes (>1,000 but <1,000,000) that I want to test against (many) single strings.
It is unlikely and unintended that more than one such expression would match a single string. I could just maintain a big list of individually compiled regexes and iterate over that for every input string. However, I have it in my head that I should hand the problem over to the regex compiler to factor out the common substrings, since it can (at least theoretically) produce a very neat single DFA.
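For reference, the baseline I'd otherwise fall back to looks something like this (a minimal sketch; naivemultiregex is just my name for it):

import re

class naivemultiregex(object):
    def __init__(self, rules):
        # Compile every rule separately, preserving order.
        self._rules = [(re.compile(regex), text) for regex, text in rules]

    def __call__(self, s):
        # Try each compiled rule in turn; first hit wins.
        for regex, text in self._rules:
            result = regex.search(s)
            if result:
                return (text, result.group(0))
        return None

What I have instead merges everything into a single pattern: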
import re
import uuid

class multiregex(object):
    def __init__(self, rules):
        merge = []
        self._messages = {}
        for regex, text in rules:
            # Group names must be valid identifiers, so prefix the
            # random hex digits with a letter.
            name = "g" + uuid.uuid4().hex
            merge.append("(?P<%s>%s)" % (name, regex))
            self._messages[name] = text
        # One big alternation, leaving any common-substring
        # optimisation to the regex compiler.
        self._re = re.compile('|'.join(merge))

    def __call__(self, s):
        result = self._re.search(s)
        if result:
            groups = result.groupdict()
            # Exactly one named group should have matched; compare
            # against None because a matched group may capture "".
            return next((self._messages[x], groups[x])
                        for x in groups if groups[x] is not None)
rules = [("foobar", "Hit a foobar"),
("f.*b.*r", "fbr"),
("foob.z", "Frobination"),
("baz", "Hit a baz"),
("b(ingo)?", "b with optional ingo")]
m=multiregex(rules)
tests=["foobar", "foobaz", "foobazr", "b", "bingo"]
for text,hit in (m(x) for x in tests):
print "Message: '%s' (because of '%s')" % (text,hit)
The code above works, but I have a few outstanding issues with it:
- Is it needlessly overcomplicating the whole thing, or is it pushing the problem off to code that's heavily researched and optimised?
- Is there a neater way of finding just the named capture group that matched than what I've done with groupdict()? And are there any more gotchas beyond the obvious one of two 'rules' containing the same group name (explored below)? e.g.:
rules = [("(?P<hello>foobar)", "Hit a foobar"), ("(?P<hello>foob.z)", "Frobination")]
(The issue of a single syntax error in one 'rule' killing the whole thing is easy enough to work around by validating the inputs at rule-creation time, as sketched below.)
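That validation could be as simple as compiling each rule on its own before merging, so a broken pattern is reported with its own error (a sketch; validate_rules is a made-up name):

import re

def validate_rules(rules):
    # Compile each pattern individually so one bad rule is reported
    # on its own instead of taking the whole merged regex down.
    errors = []
    for pattern, text in rules:
        try:
            re.compile(pattern)
        except re.error as e:
            errors.append((pattern, str(e)))
    return errors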