How to delete string between two HTML tags in one row using bash script

Question

I have recently been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt so that all the data extracted from the page (stored in file1.txt) ends up on one line. Now I need to match every string between <th>...</th> tags on that line and delete it or replace it with a space (' '). Here is some example input:

    <td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>

I have used sed and tried something like

    sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt

But it didn't work. I think the problem is the ? sign. A lazy quantifier like .*? works in other regular expression engines, but apparently not in sed.

You're using a Unix variant; use one of the many languages available, such as Perl, Python, Ruby, etc., to parse that.
– Augusto, Commented Oct 18, 2012 at 20:13
I know that this is not the ideal solution, but solving this task is the key to finishing what I am working on. So is there some form of, e.g., a sed command that solves this problem? I just need to select all those strings at once.
– UncleSam, Commented Oct 18, 2012 at 22:24

weldabar · Accepted Answer · 2012-10-19 05:51:51Z

4

While I agree with sputnick and others, the answer to your immediate question would be:

    sed -i -r 's/<th>[^<]+<\/th>//g' output.txt

This works on your sample data just fine.
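
As a rough illustration of why this pattern helps: sed's regular expressions have no lazy quantifier such as .*?, so a negated character class like [^<] is a common way to stop each match at the next tag. The sketch below runs the same substitution on the sample line from the question, reading from stdin rather than editing a file in place. The function name strip_th is an assumption made for illustration, and the installed sed is assumed to accept -E (extended regular expressions).

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption, not from the answer):
# remove every <th>...</th> pair from text read on stdin.
strip_th() {
  # -E turns on extended regular expressions, so + is a quantifier.
  # [^<]+ cannot run past the next "<", so each match ends at its own </th>.
  sed -E 's/<th>[^<]+<\/th>//g'
}

# Sample line taken from the question.
sample='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

result=$(printf '%s\n' "$sample" | strip_th)
printf '%s\n' "$result"
```

Because [^<]+ cannot cross a "<", this only fits th cells that contain plain text; a cell with nested markup such as <th><b>x</b></th> would be left untouched.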


Community · Accepted Answer · 2017-05-23 11:51:20Z

0

 <td>
     Abaktal hm
 </td>
 <th>
     Package
 </th> 
 <td>
     flm 10x400 mg</td>
 <th> 
     Indesit
 </th>

If the input is spread over several lines like this, the following command works:

    sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p' output.txt

It deletes the lines between the <th> and </th> tags and keeps the tag lines themselves.

For more information, see "Removing lines between two patterns (not inclusive) with sed".
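
To make the multi-line variant concrete, here is a minimal sketch that feeds similar input through sed on stdin. It assumes the command is meant to use /<th>/ as the opening pattern and /<\/th>/ as the closing one, and it assumes GNU sed (a semicolon after a label and "." matching a newline in the pattern space are relied on here). The function name strip_th_lines is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an assumption): drop the lines between a
# line containing <th> and the next line containing </th>, keeping both tag lines.
strip_th_lines() {
  # -n             : print only what the script prints explicitly
  # /<th>/{ ... }  : on an opening tag line, print it, then
  # :a; N          : keep appending input lines to the pattern space
  # /<\/th>/!ba    : loop back to :a until the closing tag has been read
  # s/.*\n//       : discard everything up to the last newline (the cell body)
  # p              : print what is left (the closing tag line, or any other line)
  sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p'
}

result=$(strip_th_lines <<'EOF'
<td>
    Abaktal hm
</td>
<th>
    Package
</th>
<td>
    flm 10x400 mg</td>
EOF
)
printf '%s\n' "$result"
```

With this input, the "Package" line is removed while the <th> and </th> lines and all td lines remain.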

answered Aug 20, 2015 at 10:31 by Triangle, edited May 23, 2017 at 11:51 by Community Bot

Cœur · Accepted Answer · 2018-10-21 10:41:25Z

0

Your attempt seems definitely wrong.

You can't realistically parse tag-based markup languages like HTML and XML using Bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (links|links2|lynx|w3m) -dump, html2text, vilistextum. For parsing out pieces of data, see tidy+(xmlstarlet|xmllint|xmlgawk|xpath|xml2), or learn xslt.


answered Oct 18, 2012 at 20:11 by Gilles Quénot, edited Oct 21, 2018 at 10:41 by Cœur
