Regex to filter out strings with certain pattern

Question

I want to identify a string, for instance:

a = 'KI83949 anythingHere 900.00 1 900.00'

The string consists of three parts:

The index part is the text before the first space:
- 'KI83949'

It can be anything, but most of the time it is a mix of letters and digits.

The second part is the text between the index part and the first floating-point number with two decimal places:
- 'anythingHere'

It can be anything.

The last part starts at that first two-decimal float:
- '900.00 1 900.00'

It can be any of:

'900.00' or '900.00 1 1003.00' or '900.00 100.00'
(float, or float + int + float, or float + float)

The numbers will vary. The number part is always present, while the first two parts may be absent. I am trying to pick out strings with these features from thousands of other strings. I have tried several ways to express this but have not succeeded; sorry for my poor regex knowledge. My closest attempt is the following:

'.*\s?[\d.]+(\s\d)?[\s\d.]+$'

However, it also matches strings like 'TS90190' or '80 thda 4318'. After spending hours on this, it is driving me crazy. Can someone help me with this?

Andrew Cheong · Accepted Answer · 2013-10-24 06:25:51Z


.* is greedy: it will attempt to match as much as possible, i.e. more than the first word, which is probably the primary reason you're finding unexpected results. To start, you can make it non-greedy by adding a question mark, giving .*?
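To make the greedy versus non-greedy difference concrete, here is a minimal sketch assuming Python's re module. The two simplified patterns and the variable names are illustrative only and are not part of the patterns proposed in this answer.

```python
import re

s = 'KI83949 anythingHere 900.00 1 900.00'

# Greedy: .* first consumes the whole string, then backtracks only as far
# as needed, so the float it finds is the LAST one in the string.
greedy = re.match(r'(.*)\s(\d+\.\d\d)', s)

# Non-greedy: .*? grows one character at a time, so the float it finds
# is the FIRST one in the string.
lazy = re.match(r'(.*?)\s(\d+\.\d\d)', s)

print(repr(greedy.group(1)))  # 'KI83949 anythingHere 900.00 1'
print(repr(lazy.group(1)))    # 'KI83949 anythingHere'
```

In both cases the second group is '900.00'; what changes is which occurrence of it the engine settles on.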

But, a more stringent method would be to match only non-space characters to start:

^[^\s]+

The ^ in the beginning is known as an anchor, and asserts that the match starts at the beginning of the string (or line, in multi-line mode).
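As a rough illustration of the anchor, the sketch below assumes Python's re module; the sample strings other than the one from the question are made up for the example.

```python
import re

index_re = re.compile(r'^[^\s]+')

# The anchored pattern grabs only the leading run of non-space characters.
first = index_re.search('KI83949 anythingHere 900.00 1 900.00')
print(first.group(0))  # KI83949

# Because of ^, a string that starts with whitespace does not match at all,
# even though search() is allowed to look at later positions.
no_match = index_re.search(' leading space')
print(no_match)  # None

# In multi-line mode, ^ also matches right after each newline.
text = 'AB1 x\nCD2 y'
single = re.findall(r'^[^\s]+', text)
multi = re.findall(r'^[^\s]+', text, re.MULTILINE)
print(single, multi)  # ['AB1'] ['AB1', 'CD2']
```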

Let's see what's next. You want to match up to the first float, right? Sounds like we need a non-greedy quantifier of some sort!

^[^\s]+\s+(.*?)\d+\.\d\d
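Here is a small sketch of how this pattern behaves, assuming Python's re module. The test strings come from the question, except 'KI83949 900.00', which is a made-up case with the middle part missing.

```python
import re

pattern = re.compile(r'^[^\s]+\s+(.*?)\d+\.\d\d')

m = pattern.match('KI83949 anythingHere 900.00 1 900.00')
# The lazy group stops right before the first float, so it keeps the
# space that separates the middle part from the number.
print(repr(m.group(1)))  # 'anythingHere '
print(repr(m.group(0)))  # 'KI83949 anythingHere 900.00'

# The two false positives from the question contain no two-decimal float,
# so they are rejected.
rejected = [pattern.match(t) for t in ('TS90190', '80 thda 4318')]
print(rejected)  # [None, None]

# With the middle part missing, the lazy group simply matches nothing.
empty_middle = pattern.match('KI83949 900.00')
print(repr(empty_middle.group(1)))  # ''
```

If the trailing space in the captured group is unwanted, calling .strip() on it is one simple option.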

The above can misbehave under certain circumstances, which are perhaps a bit too complex to explain at your current level. However, if you know that your language or regex implementation supports lookahead assertions, then this will be much more robust:

^[^\s]+\s+(.(?!\d+\.\d\d))+

This matches each character (the .) only as long as it is not immediately followed by a float. The (?!...) part is what's called a negative lookahead assertion.
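The lookahead version can be tried out as follows. This is a sketch assuming Python's re module, which supports lookahead. The variant with an extra non-capturing group, named capture_all here, is my own addition to make the whole middle part available as a group; it is not part of the answer's pattern.

```python
import re

s = 'KI83949 anythingHere 900.00 1 900.00'

# Pattern exactly as given above.
pattern = re.compile(r'^[^\s]+\s+(.(?!\d+\.\d\d))+')
m = pattern.match(s)

# The match stops before the space that precedes the first float,
# because that space IS followed by a float.
print(repr(m.group(0)))  # 'KI83949 anythingHere'

# A repeated capturing group only remembers its last iteration.
print(repr(m.group(1)))  # 'e'

# Hypothetical variant: repeat a non-capturing group inside one capture
# so that group 1 holds the whole middle part.
capture_all = re.compile(r'^[^\s]+\s+((?:.(?!\d+\.\d\d))+)')
middle = capture_all.match(s).group(1)
print(repr(middle))  # 'anythingHere'

# The group needs at least one character, so a line with no middle part
# is not matched by this pattern.
no_middle = pattern.match('KI83949 900.00')
print(no_middle)  # None
```

If lines without a middle part must also be accepted, the pattern would need adjusting, for example by making the group optional.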


5 Comments

fyr91 Over a year ago

Thank you so much for saving my life, this is working perfectly. Do you have any recommendations of where to learn regex?

fyr91 Over a year ago

Thank you for the explanation of the second one. Although it is quite complicated for me, it is still very helpful.

Andrew Cheong Over a year ago

Glad to help. There are a lot of tutorials online, among which regular-expressions.info seems to be a popular one. But frankly I only know what I know from years of experience (sometimes unwanted experience!) and especially from hanging around SO and answering regex questions. (As they say, there's no better teacher than teaching.) Feel free to always come by to ask for guidance though; most people here are happy to explain and teach according to the scope you need. (As long as you don't ask about regex'ing HTML, in which case people will bite your head off.)

Andrew Cheong Over a year ago

Also, you can check this out: debuggex.com. You type in your regex and your test data, and it draws a diagram that might help clarify what you're constructing.

fyr91 Over a year ago

Thank you for the recommendations! :)
