I am trying to write a regex to use with grep that finds lines containing a PHP variable whose value is an HTML element, but I am having trouble with it.

I managed to make this:

.*(\$)*(\=)*(\<).*\n?

It should match lines which have $, = and < characters.

For example:

$var = "<h1>test</h1>";

Grep command I'm using:

grep -Pro ".*(\$)*(\=)*(\<).*\n?"

And for some reason it seems to match lines like this too:

echo "</td> \n";
  • Please edit your question and i) show us an example of your input and the output you want to see from it and ii) if you want to know why your grep fails, show us the exact grep command you ran. Commented Aug 8, 2017 at 12:02
  • Yes, please include some more detail. Also, you do not need to match the whole line when using grep. I suggest using egrep (grep -E) or grep -P to search with extended or Perl-style regular expressions. To match lines containing a $, =, or < character you could do this: 'egrep "[\$=<]" filename.php', which finds every line with at least one of those characters on it. It will match </td> since there is a < char in it. Commented Aug 8, 2017 at 13:42

1 Answer


The *s after (\$) and (\=) mean, as always, zero-or-more.

The .*\n? means zero-or-more of any character(s) optionally (due to the ?, which means zero-or-one) followed by a \n.

That means that .*(\$)*(\=)*(\<).*\n? will match any line with (\<) regardless of whether it is preceded by an escaped $ and/or an = or not.

In English, the regexp reads as "zero-or-more characters, optionally followed by a $, then maybe an =, then a < (not optional), then zero-or-more characters, optionally followed by a newline."

In other words, the entire regexp, ignoring the captures, is equivalent to just <. It's the only thing in the regexp which isn't optional.
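To see that equivalence in practice, here is a minimal sketch that runs both patterns over the two sample lines from the question. It assumes GNU grep built with PCRE support (for -P); the temporary file and the variable names are only for illustration.

```shell
# Two sample lines taken from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' '$var = "<h1>test</h1>";' 'echo "</td> \n";' > "$sample"

# The original pattern, single-quoted so the shell passes it to grep unchanged.
original=$(grep -cP '.*(\$)*(\=)*(\<).*\n?' "$sample")

# Only the one mandatory character from that pattern.
only_lt=$(grep -c '<' "$sample")

echo "original pattern: $original matching lines, bare '<': $only_lt matching lines"
rm -f "$sample"
```

Both counts should come out the same, because every line that contains a < satisfies the original pattern.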

BTW, use + instead of * if you mean one-or-more.

You might want to try something more like:

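grep -P '\$var\s*=\s*['\''"].*<[^>]+>'

That matches a literal $var followed by zero-or-more whitespace chars, then an =, then zero-or-more whitespace again, followed by an ' or an ", then zero-or-more of any character followed by a <, then one-or-more characters other than >, followed finally by a >. The '\'' in the middle is only shell quoting: it closes the single-quoted string, adds a literal ', and reopens the string, so grep itself sees ['"].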

e.g. $var='....<h1>' would match.
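As a quick check, the sketch below runs that kind of pattern over a few sample lines. It assumes GNU grep with PCRE support. The second grep, which swaps the literal $var for \$\w+ so that any variable name is accepted, is my own guess at what the question is after rather than part of the suggestion above, and the $title line is made up for the example.

```shell
sample=$(mktemp)
printf '%s\n' \
  '$var = "<h1>test</h1>";' \
  "\$title = '<b>hi</b>';" \
  'echo "</td> \n";' > "$sample"

# '\'' ends the single-quoted string, adds a literal ', and reopens it,
# so the bracket expression reaches grep as ['"].
literal_var=$(grep -P '\$var\s*=\s*['\''"].*<[^>]+>' "$sample")

# Hypothetical generalisation: \w+ accepts any variable name, not just "var".
any_var_count=$(grep -cP '\$\w+\s*=\s*['\''"].*<[^>]+>' "$sample")

echo "literal \$var pattern matched: $literal_var"
echo "any-variable pattern matched $any_var_count lines"
rm -f "$sample"
```

Neither pattern matches the echo line, since it has no $ followed by a name and an =.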

Note, this won't catch any $var='htmlcode' where there's a newline between the 'var=' and any html.
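If you need to catch that case, one possible workaround is GNU grep's -z option, which makes grep treat the whole file as a single record so that \s* can run across a newline. This is only a sketch, assuming GNU grep with PCRE support and a file that contains no NUL bytes; the sample file is invented for the example.

```shell
sample=$(mktemp)
# The assignment and its HTML value sit on different lines.
printf '%s\n' '$var =' '    "<h1>test</h1>";' > "$sample"

pattern='\$var\s*=\s*['\''"].*<[^>]+>'

# Default mode searches each line separately, so nothing matches here.
# "|| true" keeps the non-zero exit status of a failed match from aborting a script.
per_line=$(grep -cP "$pattern" "$sample" || true)

# With -z the file is one record, so \s* can span the line break.
whole_file=$(grep -czP "$pattern" "$sample" || true)

echo "per line: $per_line, whole file: $whole_file"
rm -f "$sample"
```

Keep in mind that with -z the count is per file rather than per line, and -o output would be NUL-terminated, so it is less convenient for listing the individual lines.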

  • Parsing HTML with regexps is difficult (actually, impossible to do reliably) even if you have an excellent grasp of the regexp language and how it works. Use an HTML parser. That's what they're for. PHP has a good selection of HTML parsing libraries to choose from - e.g. see stackoverflow.com/questions/3577641/… Commented Aug 8, 2017 at 14:11
  • On re-reading, it seems you're doing this from the shell, not with PHP code. The right solution is still to use an HTML parser, but you've now got more options to choose from. Hooray for choice-paralysis :) There are many for perl, python, java, and other languages. There are even decent tools for shell scripts / CLI usage. There are many questions with good answers on this site on this topic. Commented Aug 8, 2017 at 14:17
