3

I have recently been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt to make sure all the data extracted from the page ends up on a single line. Now I need to match every string between <th>...</th> tags on that line and either delete it or replace it with a space (' '). Here is some example input:

    <td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>

I have used sed and tried something like

    sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt

But it didn't work. I think the problem is the ? sign. A non-greedy .*? works in other regular expression engines, but apparently not in sed as run from bash.

3
  • It's a bad idea to parse html with shell. Commented Oct 18, 2012 at 20:08
  • You're using a unix variant, use one of the many languages available, such as perl, python, ruby, etc. to parse that. Commented Oct 18, 2012 at 20:13
  • I know that this is not the ideal solution, but solving this task is the key to finish what I am working on. So is there some form of e.g. sed command to solve this problem? Just need to select all those strings at once. Commented Oct 18, 2012 at 22:24

3 Answers

4

While I agree with sputnick and others, the answer to your immediate question would be:

    sed -r -i 's/<th>[^<]+<\/th>//g' output.txt

This works on your sample data just fine.
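
To try the substitution without touching a file, here is a minimal sketch that pipes the sample line from the question through sed. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do; -r is the older GNU spelling), and the variable names line and result are only for illustration. The options are kept separate because GNU sed reads a combined -ir as -i with the backup suffix r, which leaves extended regular expressions switched off.

```shell
# Sample line from the question, already flattened to one line by the tr step.
line='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# -E turns on extended regular expressions, so + means "one or more".
# [^<]+ cannot run past the next "<", which gives the effect the
# non-greedy .*? was meant to have (sed has no non-greedy quantifiers).
result=$(printf '%s\n' "$line" | sed -E 's/<th>[^<]+<\/th>//g')

# Show the line with both <th>...</th> elements removed.
printf '%s\n' "$result"
```

This only holds as long as the text inside <th> contains no nested tags, since [^<]+ stops at the first "<" it meets.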

0
 <td>
     Abaktal hm
 </td>
 <th>
     Package
 </th> 
 <td>
     flm 10x400 mg</td>
 <th> 
     Indesit
 </th>

If your input looks like this, with the tags and their content on separate lines, the command below will work:

    sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p' output.txt

It will delete the content between the <th>...</th> tags.

For more info, see "removing lines between two patterns (not inclusive) with sed".
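
As a rough check of the multi-line approach, the sketch below feeds the sample input above to sed through a pipe instead of reading output.txt. It assumes GNU sed (the one-line `:a; ... ba;` label syntax is not portable to every sed), and the variable names input and result are just for illustration.

```shell
# Multi-line sample: each tag and its content sit on separate lines.
input=' <td>
     Abaktal hm
 </td>
 <th>
     Package
 </th>
 <td>
     flm 10x400 mg</td>
 <th>
     Indesit
 </th>'

# On a line containing <th>: print it, then keep appending lines (N) until
# the pattern space contains </th>; s/.*\n// drops everything up to the last
# newline, so only the closing-tag line is left for the final p.
# All other lines are printed unchanged by that same final p.
result=$(printf '%s\n' "$input" |
  sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p')

printf '%s\n' "$result"
```

The <th> and </th> lines themselves are kept; only the lines between them are dropped. If both tags were on the same line, the loop would read on until a later line containing </th>, so this form only suits input laid out like the sample.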

0

Your attempt seems definitely wrong.

You can't realistically parse tag-based markup languages like HTML and XML using Bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (links|links2|lynx|w3m) -dump, html2text, vilistextum. For parsing out pieces of data, see tidy+(xmlstarlet|xmllint|xmlgawk|xpath|xml2), or learn xslt.
