Extracting text between 2 specific strings with multiple occurrences in bash

Question

I have a big XHTML file with lots of junk text that I don't need. I only need the text that lies between two specific strings that occur many times within that file, e.g.

<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>

My output should be:

important text1
important text2
important text3

I need to do this with a Bash script.

Thanks for your help

Please note that <xyz> and </xyz> are not fixed strings. There are a lot of different unneeded tags.
– Ahmed Tawfik, Commented Apr 25, 2016 at 8:47

Kent · Accepted Answer · 2016-04-25 09:03:48Z

2

Using regex on XML is risky, particularly with a line-based text-processing tool such as grep. You cannot be sure that the result is always correct.

If your input is valid XML, I would go the XML way: an XPath expression.

With the tool xmlstarlet, you can do:

xmlstarlet sel -t -v "//mytag/text()" file.xml

It gives the desired output.

You can also do it with xmllint; however, you need to do some further filtering on the output.
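One way that further filtering might look is sketched below. It is an illustration, not Kent's own method: the function name extract_mytag_xmllint is hypothetical, and the sketch assumes that xmllint (from libxml2) and GNU grep with PCRE support (-P) are installed. It also assumes the elements are in no namespace, as in the sample; a namespaced XHTML file would need a different XPath expression.

```shell
# Hypothetical helper, not from the answer. Assumes xmllint (libxml2) and
# GNU grep built with PCRE support (-P).
extract_mytag_xmllint() {
  local file=$1
  # xmllint prints the matching elements including their tags; depending on
  # the libxml2 version they arrive one per line or run together on one line.
  # The lazy grep pattern then keeps only the text between the tags and
  # drops the padding whitespace, so both layouts are handled.
  xmllint --xpath '//mytag' "$file" |
    grep -oP '<mytag>\s*\K.*?(?=\s*</mytag>)'
}
```

It would be called as extract_mytag_xmllint file.xml. Selecting the elements rather than //mytag/text() is a deliberate choice here: the tags give grep something to split on even when xmllint prints every result on a single line.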



2 Comments

Ahmed Tawfik Over a year ago

Thanks Kent, but the problem is that it's an XHTML file, so it's not well formatted at all. Will xmllint be able to handle that?

Kent Over a year ago

@SoCRaT Extensible Hypertext Markup Language (XHTML) is part of the family of XML markup languages

heemayl · Accepted Answer · 2016-04-25 08:49:53Z

0

Using an XML parser would be the best way to go.

Solution using grep with PCRE:

grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)'

Example:

$ cat file.xml
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>

$ grep -Po '^<mytag>\s*\K.*?(?=\s*</mytag>$)' file.xml
important text1
important text2
important text3
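The ^ and $ anchors in the pattern above mean it only matches when a <mytag> element sits alone on its line. As a sketch for input where elements are indented or share a line, the variant below drops the anchors and takes the tag name as a parameter. The function name extract_tag_text and the sample input are hypothetical, and GNU grep with PCRE support (-P) is assumed.

```shell
# Hypothetical helper (not from the answer): print the trimmed text of every
# <tag>...</tag> pair, including several pairs on one line.
# Assumes GNU grep built with PCRE support (-P).
extract_tag_text() {
  local tag=$1
  shift
  # Remaining arguments are file names; with none, grep reads standard input.
  grep -oP "<${tag}>\s*\K.*?(?=\s*</${tag}>)" "$@"
}

# Two elements on one indented line, separated by an unneeded tag.
printf '%s\n' '  <mytag> one </mytag><xyz> junk </xyz><mytag> two </mytag>' |
  extract_tag_text mytag
# prints:
# one
# two
```

This is still line-based matching, so an element whose text spans several lines would be missed; that case is where an XML parser is the safer choice.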


1 Comment

Ahmed Tawfik Over a year ago

Thanks a lot. I will test and get back to you.

riteshtch · Accepted Answer · 2016-04-25 08:50:40Z

Using an XML parser is a better approach, and there are command-line tools for XML parsing on Linux, e.g. xmllint. But you can do it using grep like this:

$ cat data1 
<html>
<xyz> unneeded text </xyz>
<mytag> important text1 </mytag>
<xyz> unneeded text </xyz>
<xyz> unneeded text </xyz>
<mytag> important text2 </mytag>
<mytag> important text3 </mytag>
<xyz> unneeded text </xyz>
</html>
$ grep -oP '(?<=<mytag>).*(?=</mytag>)' data1
 important text1 
 important text2 
 important text3  
$

The pattern (?<=<mytag>).*(?=</mytag>) extracts the text using a positive lookbehind assertion and a positive lookahead assertion.
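Two details of this pattern are worth seeing in action: the greedy .* runs to the last </mytag> on a line, and the lookarounds leave the padding spaces in the output. The sketch below contrasts it with a lazy .*? followed by a sed trim. The helper names greedy and lazy_trimmed and the one-line input are hypothetical, and GNU grep with PCRE support (-P) is assumed.

```shell
# Hypothetical one-line input with two <mytag> elements.
line='<mytag> a </mytag><xyz> junk </xyz><mytag> b </mytag>'

# Greedy .* : a single match from the first <mytag> to the last </mytag>.
greedy() {
  grep -oP '(?<=<mytag>).*(?=</mytag>)'
}

# Lazy .*? : one match per element; sed then strips the padding whitespace.
lazy_trimmed() {
  grep -oP '(?<=<mytag>).*?(?=</mytag>)' |
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

printf '%s\n' "$line" | greedy        # one line, junk included
printf '%s\n' "$line" | lazy_trimmed  # prints "a" and "b" on separate lines
```

On the sample file from the question both versions find the same three elements, because each line holds only one <mytag>; the difference only shows up once elements share a line.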
