I am parsing a text log where each line contains an id enclosed in parentheses and one or more (possibly hundreds of) chunks of data (alphanumeric, always 20 characters), such as this:

id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)

At this point in the program I am not interested in the data, just in quickly fetching all the ids. The problem is that, due to a noisy communication channel, an error may appear, denoted by the string Error, which can be anywhere on the line. The goal is to ignore these lines.

What worked for me so far was a simple negative lookahead:

^id=\((\d+)\),(?!.*Error)

But I forgot that there is a tiny probability that the string Error may actually appear as a valid sequence of characters somewhere in the data, which has backfired on me just now.

The only way to distinguish between a valid and an invalid appearance of the Error string in a data chunk is to check the chunk's length. If it is 20 characters, this is the rare valid occurrence and I want to keep the line (unless Error also appears elsewhere on the line). If it is longer, I want to discard the line.

Is it still possible to treat this situation with a regular expression or is it already too much for the regex monster?

Thanks a lot.

Edit: Adding examples of error lines - all these should be ignored.

iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)
id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)
id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)
id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)

However, this one should not be ignored. The Error is just incidentally contained in the data part, but the line is currently ignored:

id=(702834), data1=(bx5EsH6dMzpQDErrorKA)
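
To make the problem concrete, here is a minimal sketch in Python (the language is an assumption, since the question does not name one; the variable names are made up for illustration). It runs the lookahead pattern above against the corrupted 25-character line and the legitimate 20-character line. Both are rejected, although only the first should be.

```python
import re

# Pattern from the question: capture the id, reject any line containing "Error".
NAIVE = re.compile(r"^id=\((\d+)\),(?!.*Error)")

# Corrupted line: the chunk has grown to 25 characters.
noisy = "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)"
# Legitimate line: the chunk is 20 characters and merely contains "Error".
valid = "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)"

naive_noisy = NAIVE.match(noisy)
naive_valid = NAIVE.match(valid)

# Neither line matches, so the legitimate id is lost as well.
print(naive_noisy, naive_valid)  # None None
```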
  • Can you add examples of log lines with Errors? Commented Jan 23, 2014 at 15:16
  • OK, added some examples. Commented Jan 23, 2014 at 15:19
  • Is there a reason you need this in one regex? Can't you just first ignore all lines that have Error in them and then only allow the ones having valid id=([0-9]+) in them? Commented Jan 23, 2014 at 15:23
  • No, it can be in two. But still you face the same problem imho - that you cannot simply ignore all lines containing Error. Commented Jan 23, 2014 at 15:27

2 Answers

Alright, it's not exactly what you were thinking about, but here's a suggestion:

Can't you simply match only the lines that follow the expected pattern, undisturbed by an Error somewhere?

Here's the regexp that'll do it:

^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$

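If Error appears anywhere outside a data chunk, or makes a chunk longer than 20 characters, the line no longer fits the pattern, so the regexp does not match and the line is ignored, which is the wanted result. A chunk that contains Error but is still exactly 20 alphanumeric characters is accepted.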
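
As a rough check, here is a minimal Python sketch (Python is an assumption, as the question does not name a language; `extract_ids` and `sample` are names made up for this illustration) that applies the regexp above to the example lines from the question:

```python
import re

# Pattern from this answer: an id followed by one or more 20-character
# alphanumeric chunks, with nothing else allowed before the end of the line.
STRICT = re.compile(r"^id=\((\d+)\), (data\d+=\([a-zA-Z\d]{20}\)(, )?)+$")

def extract_ids(lines):
    """Return the ids of all lines that fit the expected layout exactly."""
    ids = []
    for line in lines:
        m = STRICT.match(line)
        if m:
            ids.append(m.group(1))
    return ids

# Example lines taken from the question.
sample = [
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",  # 25 characters: rejected
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)",       # 20 characters: kept
]

print(extract_ids(sample))  # ['702831', '702832', '702834']
```

The trailing `$` matters here: without it, a line with Error appended after the last chunk would still match, because the pattern would be satisfied before reaching the end of the line.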

If this doesn't please you, you have to add more lookahead and lookbehind groups. I'll try to do that and edit if I write a good regexp.


2 Comments

No, no, this works perfectly! Just when the Error is at the end of the line, it still is accepted, but that can be easily fixed by adding $ at the end of the regex. I was also thinking about somehow checking the length, but was unable to do it for any amount of chunks and yet it's this simple. Thanks!
You are right about the end of the line! I added the $ in my answer.
Since your chunks of data are always 20 characters long, a chunk of 25 characters means that Error was inserted into it. So you can check whether there is a chunk of that length, and also check whether Error appears outside the parentheses. If either is the case, the line should not match. If not, it is valid.

Something like

(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))

might do the trick.
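
A minimal Python sketch of how this could be tried out (Python is an assumption, since the question names no language; `extract_ids_lookahead` and `sample` are hypothetical names). The pattern is not anchored, so the sketch uses `re.search`. Note that, as written, it only rejects chunks of exactly 25 characters, that is, a 20-character chunk with a single Error inserted.

```python
import re

# Pattern from this answer: reject the line if "Error" appears before the
# first ")", if any parenthesised group is exactly 25 characters long, or if
# "Error" follows a ")" without an intervening "(".
LOOKAHEAD = re.compile(
    r"(?![^)]*Error)id=\((\d+)(?!.*(?:\(.{25}\)|\)[^(]*Error))"
)

def extract_ids_lookahead(lines):
    """Return the ids of all lines that the lookahead pattern accepts."""
    ids = []
    for line in lines:
        m = LOOKAHEAD.search(line)
        if m:
            ids.append(m.group(1))
    return ids

# Example lines taken from the question.
sample = [
    "id=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(702832), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "iErrord=(702831), data1=(Ub9fS97Hkc570Vvqkdy1), data2=(Hd7t553df8mnOa84wTcF)",
    "id=(7028Error32), data1=(Ba6FGoP5Dzxwmb6JhJ5a)",
    "id=(702833), daErrorta1=(hF6eDpLxbnFS5PfKaCds)",
    "id=(702834), data1=(bx5EsH7BCsk6dMzpQDErrorKA)",  # 25 characters: rejected
    "id=(702834), data1=(bx5EsH6dMzpQDErrorKA)",       # 20 characters: kept
]

print(extract_ids_lookahead(sample))  # ['702831', '702832', '702834']
```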

4 Comments

Just saw Theox's answer, probably way more elegant if the pattern of your string is indeed fixed :)
Still I appreciate it, thanks! Actually I was trying to construct something like that (forgetting I can make things simpler, such as is shown by Theox), but my regex skills are way too small for that. So if I got it well, you are ensuring, that after the id=(123) part there is NOT a sequence of 25 chars closed in () and there is NOT an Error string outside the parenthesis. Correct?
Yep, ensuring that there is no chunk of 25 characters OR an Error outside the parentheses. But as usual in regex, half the work is asking oneself the right question!
Nice and straightforward. Thanks!
