6

I have an exepression to validate an email address:

string REGEX_EMAIL = @"^\w+([\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$";

If address is correct IsMatch() method quickly shows true result. But if address string is long and wrong this method hangs.

What can I do to raise speed of this method?

Thanks.

4
  • Can you show us how you are creating the Regex object? Commented Nov 29, 2011 at 19:08
  • I'm not liking the look of that regex. Commented Nov 29, 2011 at 19:10
  • I'm no regex expert, but I wouldn't be surprised if those nested variable length quantifiers lead to exponential runtime. Commented Nov 29, 2011 at 19:14
  • Also, the pipe symbol is a broken bar (¦, U+00A6) instead of a vertical line (|, U+007C). Where did you get that regex from? Commented Nov 29, 2011 at 19:14

3 Answers 3

8

You are experiencing catastrophic backtracking.

The simplified regex:

Regex regexObj = new Regex(@"^\w+([-.!#$%&'*+/=?^`{¦}~]*\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*$");

Has potential problems e.g. ([.-]\w+)*\.

If the . is missing for example and you have a long string of characters before it then all possible combinations must be taken into account for your regex to find out that it actually fails.

Sign up to request clarification or add additional context in comments.

Comments

4

You have a couple things going on which are hurting the performance in this regular expression.

  1. Catastrophic backtracking
  2. Too many optional statements

You can definitely improve performance by using the + instead of the * in a few key places, but this of course changes what the regular expression will and won't match. So the easiest fix I found is actually covered in the catastrophic backtracking article above. You can use the nonbacktracking subexpression to drastically improve performance in this case, without changing the regular expression's behavior in any way that matters.

The nonbacktracking subexpression looks like this... (?>pattern)

So try this regular expression instead:

^\w+(?>[\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*\w+)*@\w+([\.\-]\w+)*\.\w+([\.\-]\w+)*$

On a slightly related topic, my philosophy for checking for a valid email address is a bit different. For one, long regular expressions like this one can potentially have performance problems as you've found.

Secondly, there's the upcoming promise of email address internationalization which complicates all of this even more.

Lastly, the main purpose of any regular expression based email validation is to catch typos and blatant attempts to get through your form without entering a real email address. But to check if an email address is genuine requires that you send an email to that address.

So my philosophy is to err on the side of accepting too much. And that, in fact, is a very simple thing to do...

^.+@.+\..+$

This should match any conceivably valid email address, and some invalid ones as well.

2 Comments

How about: ([\w-\.]+)@((?:[\w]+)+)\.([\w]{2,4}) That way you don't have to specify all the characters individually.
@p.wilt - I was just trying to fix the performance problem with minimal changes. If I were to suggest my own regular expression for email validation, it'd be the little one in my latest update.
0

So, there is some backtracking issues. You can reduce those issues with a prudent use of the independent subexpression constructs, but you will still have issues because inner expressions won't have that constraint. The best thing to do is to separate the major parts.

Changing it to this helps alot (expanded):

^
  (?>
     \w+
     (
       [\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~]*
       \w+
     )*
     @
     \w+
   )
   (?>
     ([\.\-]\w+)*
     \.
     \w+
     ([\.\-]\w+)*
   )
$

However, if you refactor to the equivalent expression by putting some well placed assertions, then re-adding the independent subexpressions grouping, you can virtually eliminate backtracking. Running this through my regex dubugger shows it takes only a very few steps to either pass or fail (expanded):

^
  (?>
     \w+
     [\.\!\#\$\%\&\'\*\+\-\/\=\?\^\`\{\¦\}\~\w]*
     (?<=\w)
     @
     \w+
  )     
  (?=.*\.\w)
  (?>    
     ([\.\-]\w+)+
  )
$

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.