
I want to identify a string, for instance:

a = 'KI83949 anythingHere 900.00 1 900.00'

The string consists of three parts:

The index part is the string before the first space:
- 'KI83949'

It can be anything, and most of the time it is letters plus digits.

The second part is the string between the index part and the first floating-point number with two decimal places:
- 'anythingHere'

It can be anything.

The last part starts at the first floating-point number with two decimal places:
- '900.00 1 900.00'

It can be

'900.00' or '900.00 1 1003.00' or '900.00 100.00',
that is, float, or float + int + float, or float + float.

The numbers will vary. The number part is always present, while the first two parts may be missing. I am trying to filter strings with these features out of thousands of other strings. I have tried several ways to express this but have not succeeded; sorry for my poor regex knowledge. My closest attempt is the following:

'.*\s?[\d.]+(\s\d)?[\s\d.]+$'

However, it also matches strings such as 'TS90190' or '80 thda 4318'. I have spent hours on this and it is driving me crazy. Can someone help me with it?
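
The false positives can be reproduced with a short script. This is only a sketch and assumes Python's re module; the question does not name a language, although a = '...' and a[0] read like Python. Nothing in the pattern requires a digit, a dot, and two digits in sequence, so a string that merely ends in a couple of digits already passes.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r'.*\s?[\d.]+(\s\d)?[\s\d.]+$')

samples = [
    'KI83949 anythingHere 900.00 1 900.00',  # wanted
    'TS90190',                               # not wanted
    '80 thda 4318',                          # not wanted
]

# Record whether each sample is accepted by the pattern.
results = {s: bool(attempt.match(s)) for s in samples}
for s, matched in results.items():
    print(f'{s!r}: {matched}')
```

All three samples are accepted, which matches the behaviour described above.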

1 Answer


.* is greedy: it will attempt to match as much as possible, i.e. more than the first word, which is probably the primary reason you're finding unexpected results. To start, you can make it non-greedy by adding a question mark: .*?
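
To see the difference between the greedy and non-greedy forms, here is a small sketch. It assumes Python's re module (the question does not say which language is in use) and reuses the sample string from the question.

```python
import re

a = 'KI83949 anythingHere 900.00 1 900.00'

# Greedy: .* grabs as much as it can, then backs off to the LAST space.
greedy = re.match(r'(.*)\s', a).group(1)

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST space.
lazy = re.match(r'(.*?)\s', a).group(1)

print(repr(greedy))  # 'KI83949 anythingHere 900.00 1'
print(repr(lazy))    # 'KI83949'
```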

A stricter approach, though, is to match only non-whitespace characters at the start:

^[^\s]+

The ^ in the beginning is known as an anchor, and asserts that the match starts at the beginning of the string (or line, in multi-line mode).
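
A quick sketch of the anchor in action, assuming Python's re module. The two-line text below is made up for illustration; only the first sample line's index comes from the question.

```python
import re

a = 'KI83949 anythingHere 900.00 1 900.00'

# ^ pins the match to the start; [^\s]+ stops at the first whitespace.
index_part = re.match(r'^[^\s]+', a).group(0)
print(index_part)  # KI83949

# Hypothetical two-line input to show multi-line mode.
text = 'KI83949 foo 1.00\nTS90190 bar 2.00'

# Without re.MULTILINE, ^ only matches at the very start of the string.
single = re.findall(r'^[^\s]+', text)

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r'^[^\s]+', text, flags=re.MULTILINE)

print(single)  # ['KI83949']
print(multi)   # ['KI83949', 'TS90190']
```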

Let's see what's next. You want to match up to the first float, right? Sounds like we need a non-greedy quantifier of some sort!

^[^\s]+\s+(.*?)\d+\.\d\d
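
As a rough check of that pattern, again assuming Python's re module, this is what it captures on the sample string from the question:

```python
import re

a = 'KI83949 anythingHere 900.00 1 900.00'

m = re.match(r'^[^\s]+\s+(.*?)\d+\.\d\d', a)

# The whole match runs from the index through the first float.
matched_text = m.group(0)

# The lazy group holds everything between the index and that float,
# including the space just before the float.
middle = m.group(1)

print(repr(matched_text))  # 'KI83949 anythingHere 900.00'
print(repr(middle))        # 'anythingHere '
```

Note the trailing space in the captured group; calling middle.rstrip() removes it if it is not wanted.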

The non-greedy pattern above can misbehave in certain edge cases, which are a bit too involved to explain here. If you know that your language or implementation supports lookahead assertions, however, then this will be much more robust:

^[^\s]+\s+(.(?!\d+\.\d\d))+

This matches each character (the .) as long as it is not immediately followed by a float; the (?!...) part is called a negative lookahead assertion.
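
Here is a sketch of the lookahead pattern on the sample string, assuming Python's re module, which supports lookahead. The second pattern is a small variation that is not part of the answer as written: it wraps the repetition in an outer capturing group so the whole middle part can be read back.

```python
import re

a = 'KI83949 anythingHere 900.00 1 900.00'

# The pattern exactly as given above.
as_written = re.match(r'^[^\s]+\s+(.(?!\d+\.\d\d))+', a)
whole = as_written.group(0)
# A repeated group only remembers its last iteration.
last_char = as_written.group(1)

# Variation (an assumption, not from the answer): an outer group
# around the repetition captures the full middle part.
wrapped = re.match(r'^[^\s]+\s+((?:.(?!\d+\.\d\d))+)', a)
middle = wrapped.group(1)

print(repr(whole))      # 'KI83949 anythingHere'
print(repr(last_char))  # 'e'
print(repr(middle))     # 'anythingHere'
```

The space directly before 900.00 is left out of the match because it is the one character that is immediately followed by a float.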


5 Comments

Thank you so much for saving my life, this is working perfectly. Do you have any recommendations of where to learn regex?
Thank you for the explanation of the second one. Although it is quite complicated for me, it is still very helpful.
Glad to help. There are a lot of tutorials online, among which regular-expressions.info seems to be a popular one. But frankly I only know what I know from years of experience (sometimes unwanted experience!) and especially from hanging around SO and answering regex questions. (As they say, there's no better teacher than teaching.) Feel free to always come by to ask for guidance though; most people here are happy to explain and teach according to the scope you need. (As long as you don't ask about regex'ing HTML, in which case people will bite your head off.)
Also, you can check this out: debuggex.com. You type in your regex and your test data, and it draws a diagram that might help clarify what you're constructing.
Thank you for the recommendations! :)
