Struggles with grep, sed, awk to filter html

Question

I am trying to teach myself how to use Linux tools on a Cygwin install. I made up a project to learn the basics of shell scripting and pick up some personal education along the way. My original plan was to save the HTML page of every winner of the Sakharov Prize into a folder, then write a script that would process all the HTML files and return each winner's name, year, birth and death dates in a hyphenated format, and country of origin. I've basically given up on that project: the date formats are inconsistent (18 July 1918 vs. January 23, 1938), I couldn't handle living people who have no death date, and I couldn't figure out how to make a computer recognize country names without manually listing all the countries myself.

Now I'm just trying to return the year, name, and country of origin of each recipient from the HTML table on the Sakharov Prize Wikipedia page.

So, given the following sample HTML:

<tr>
<td>1988</td>
<td><span style="display:none;">Mandela, Nelson</span><span class="vcard"><span class="fn"><a href="/wiki/Nelson_Mandela" title="Nelson Mandela">Nelson Mandela</a></span></span></td>
<td><a href="/wiki/South_Africa" title="South Africa">South Africa</a></td>
<td>Anti-apartheid activist and later President of South Africa</td>
<td><sup id="cite_ref-twentyyears_5-0" class="reference"><a href="#cite_note-twentyyears-5"><span>[</span>5<span>]</span></a></sup></td>
</tr>
<tr>
<td>1988</td>
<td><span style="display:none;">Marchenko, Anatoly</span><span class="vcard"><span class="fn"><a href="/wiki/Anatoly_Marchenko" title="Anatoly Marchenko">Anatoly Marchenko</a></span></span> (posthumously)</td>
<td><a href="/wiki/Soviet_Union" title="Soviet Union">Soviet Union</a></td>
<td>Soviet dissident, author and humans rights activist</td>
<td><sup id="cite_ref-twentyyears_5-1" class="reference"><a href="#cite_note-twentyyears-5"><span>[</span>5<span>]</span></a></sup></td>
</tr>

What is the best way to return just the year, name, and country of origin of each recipient? Right now I'm thinking about writing an awk script that returns everything that does not match /<*>/, but that is not exactly what I want. Can someone give me some pointers or ideas on how to pick out the names, years, and countries specifically? Or at least recommend some books with better and more manageable sample problems than the ones I could come up with myself? None of this sounded unreasonable when I started...

Regex is not the proper tool for parsing tag-based markup languages such as HTML.
– jordanm, Commented Mar 20, 2013 at 5:12
Ditto. That is not to say it can't be done in this particular case, or any particular case, but in general markup languages should be parsed with a parser designed for the task. A method that is not based on formally decomposing the structure becomes nothing but a headache as you try to apply it to more and more general cases. Using a particular case as an exercise will not teach you good habits. Crass analogy: managing to fix a car with dinner utensils is not a worthwhile exercise, even if you did it "this time". (Quite the SO case study in jordanm's link, btw...)
– goldilocks, Commented Mar 20, 2013 at 5:52
Agreed, and on top of that I find Cygwin to be more of a hard-core enthusiast's environment than a productive one. As a self-professed newbie you now have to wrestle with the peculiarities and eccentricities (is that a word?) of Cygwin while also trying to learn something about the tools. I highly suggest a full Linux installation in a VM such as VirtualBox, or even a second PC under the desk that you can connect to.
– Johan, Commented Mar 20, 2013 at 8:11
I use Cygwin every day, nothing wrong with it. It's a lot more convenient than cmd. For XML and HTML processing, I use xmllint, xsltproc, and Perl scripts using XML::LibXML.
– reinierpost, Commented Sep 15, 2016 at 16:02

N.N. · Accepted Answer · 2017-05-23 12:40:03Z

As has been mentioned, regex is not good for parsing HTML. Similar to another answer on parsing, you can use a Ruby one-liner such as the following to do it for you. Note that it requires Nokogiri, which you can install as a gem (sudo gem install nokogiri).

ruby -rnokogiri -e 'Nokogiri::HTML(readlines.join).css("tr").each { |tr| tr.xpath(".//td").take(3).each { |td| puts td.content } }' sample.html

It reads the given file, in this case sample.html, gets all tr elements and for each such element it prints the content of the first three td elements.

For your sample it will output:

1988
Mandela, NelsonNelson Mandela
South Africa
1988
Marchenko, AnatolyAnatoly Marchenko (posthumously)
Soviet Union

The problem is the lines that contain the name twice, e.g. (formatted to be easier to read):

<td>
  <span style="display:none;">Mandela, Nelson</span>
  <span class="vcard"><span class="fn">
      <a href="/wiki/Nelson_Mandela" title="Nelson Mandela">Nelson Mandela</a>
    </span>
  </span>
</td>

in which the name appears first in a span with style="display:none;" and then again in another span. I am not sure how to extract only the name that is not within an element with style="display:none;". (I have found https://stackoverflow.com/q/6096327/789593 and https://stackoverflow.com/q/11602077/789593, but they do not describe the right technique. Perhaps someone can come up with a fix via http://nokogiri.org/Nokogiri/XML/Node.html?)

– N.N., answered Mar 20, 2013 at 8:42 (edited May 23, 2017 at 12:40)

Appreciate it. I'll grab the Ruby package and give it a shot today and let you know how it goes.
– tzisc, Commented Mar 20, 2013 at 17:37
@tzisc If you are on a Debian-based system, sudo apt-get install ruby1.9.1-dev should be enough to install the Ruby parts you need; then you can grab nokogiri as I described in the answer.
– N.N., Commented Mar 20, 2013 at 20:15
So I got the Ruby package on my Ubuntu VPS and messed around with nokogiri for a bit. I think my lack of understanding of Ruby syntax is holding me back, as I don't really understand the tutorials on the site, but you gave me the idea of trying a jQuery script to do this. I think I'm on the right track now, so thank you.
– tzisc, Commented Mar 21, 2013 at 20:43
@tzisc Ruby is not hard to learn. Maybe you just lack an understanding of its iterators; see e.g. www-rohan.sdsu.edu/doc/ruby/chp_03/iterators.html. Otherwise, see ruby-doc.org/docs/ProgrammingRuby for learning Ruby as a whole.
– N.N., Commented Mar 22, 2013 at 5:51

welldan97 · Accepted Answer · 2013-04-11 15:19:56Z

I have created a node.js package that can be used here: gumba. It's a kind of awk/sed replacement.

In your example it works like this:

cat file.html | gumba "stripTags()"

which outputs:

1988
Mandela, NelsonNelson Mandela
South Africa
Anti-apartheid activist and later President of South Africa
[5]


1988
Marchenko, AnatolyAnatoly Marchenko (posthumously)
Soviet Union
Soviet dissident, author and humans rights activist
[5]

That said, I think it's better here not to use one-liners, but to write an actual script in any language you know.
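
As a rough illustration of the "write a script" suggestion, here is a minimal sketch using only sed; it is not part of the original answer. The function name strip_tags is hypothetical. It assumes, as in the sample, that no tag is split across lines and that the hidden sort key is always written exactly as a span with style="display:none;". Being regex-based, it is subject to the caveats raised in the comments on the question and is tied to this particular markup.

```shell
# Hypothetical helper: print the visible text of an HTML fragment.
# Assumption: no tag spans a line break, as in the sample.
strip_tags() {
    # 1. drop the hidden sort-key span together with its text
    # 2. delete every remaining tag
    # 3. delete lines that are now empty
    sed -e 's#<span style="display:none;">[^<]*</span>##g' \
        -e 's#<[^>]*>##g' \
        -e '/^[[:space:]]*$/d' "$@"
}
```

Called as strip_tags file.html, or with the HTML on standard input, it prints every cell of each row on its own line with the name appearing only once, so for the sample the year, name, and country are the first three lines of each group of five.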

mug896 · Accepted Answer · 2017-01-24 22:37:04Z

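sed -rn '
    /<tr>/ {
        # 1st cell: keep the year and store it in the hold space
        n
        s#<td>([^<]*)</td>#\1#
        h
        # 2nd cell: keep the text of the first span (the hidden sort name)
        n
        s#<td><span[^>]*>([^<]*)</span>.*#\1#
        H
        # 3rd cell: keep the text of the country link
        n
        s#<td><a href=[^>]*>([^<]*)</a>.*#\1#
        H
        # swap the three collected lines back in and print them
        x;p
    }
' file

For the sample, this prints:

1988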
Mandela, Nelson
South Africa
1988
Marchenko, Anatoly
Soviet Union
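
The script above takes the name from the hidden sort-key span, which is why it prints "Mandela, Nelson". If the display form is wanted instead, one possible variant matches the link text inside the span with class="fn". This is a sketch rather than the answerer's method: the function name extract_rows is hypothetical, it relies on the sample's layout (a <tr> on its own line followed by one <td> per line in the order year, name, country), and it is written with basic regular expressions so that it does not need the -r option.

```shell
# Hypothetical helper: print year, visible name, and country per table row.
# Assumption: <tr> is alone on a line and each <td> is on its own line.
extract_rows() {
    sed -n '
        /<tr>/ {
            # year cell: keep its text and start a new hold space
            n
            s#<td>\([^<]*\)</td>#\1#
            h
            # name cell: keep the link text inside class="fn"
            n
            s#.*<span class="fn"><a [^>]*>\([^<]*\)</a>.*#\1#
            H
            # country cell: keep the link text
            n
            s#<td><a href=[^>]*>\([^<]*\)</a>.*#\1#
            H
            # print the three collected lines
            x
            p
        }
    ' "$1"
}
```

Running extract_rows file on the sample should give the same years and countries as above, with the names as "Nelson Mandela" and "Anatoly Marchenko". Note that the trailing "(posthumously)" is dropped, because only the link text is kept.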

– mug896, answered Jan 24, 2017 at 22:16 (edited Jan 24, 2017 at 22:37)
