I am trying to teach myself how to use Linux tools on a Cygwin install. I decided to make up a project that would teach me the basics of shell scripting and let me learn something along the way. My original project was to save the HTML page of every winner of the Sakharov Prize into a folder, then write a script that would process all of the HTML files and return each winner's name, year, birth and death dates in a hyphenated format, and country of origin. I've basically given up on that project, for three reasons: the date formats are inconsistent (18 July 1918 vs. January 23, 1938), I couldn't handle living people who have no death date, and I couldn't figure out how to make a computer recognize country names without manually listing every country myself.
Now I'm just trying to return the year, name, and country of origin of each recipient from the HTML table on the Sakharov Prize Wikipedia page.
So, given the following sample HTML:
<tr>
<td>1988</td>
<td><span style="display:none;">Mandela, Nelson</span><span class="vcard"><span class="fn"><a href="/wiki/Nelson_Mandela" title="Nelson Mandela">Nelson Mandela</a></span></span></td>
<td><a href="/wiki/South_Africa" title="South Africa">South Africa</a></td>
<td>Anti-apartheid activist and later President of South Africa</td>
<td><sup id="cite_ref-twentyyears_5-0" class="reference"><a href="#cite_note-twentyyears-5"><span>[</span>5<span>]</span></a></sup></td>
</tr>
<tr>
<td>1988</td>
<td><span style="display:none;">Marchenko, Anatoly</span><span class="vcard"><span class="fn"><a href="/wiki/Anatoly_Marchenko" title="Anatoly Marchenko">Anatoly Marchenko</a></span></span> (posthumously)</td>
<td><a href="/wiki/Soviet_Union" title="Soviet Union">Soviet Union</a></td>
<td>Soviet dissident, author and humans rights activist</td>
<td><sup id="cite_ref-twentyyears_5-1" class="reference"><a href="#cite_note-twentyyears-5"><span>[</span>5<span>]</span></a></sup></td>
</tr>
What is the best way to return just the year, name, and country of origin of each recipient? Right now I'm thinking about writing an awk script that returns everything that does not match /<*>/, but that is not exactly what I want. Can someone give me some pointers or ideas on how to pick out the years, names, and countries specifically? Or at least recommend some books with better and more manageable sample problems than the ones I can come up with myself? None of this sounded unreasonable when I started...
For XML and HTML processing, I use xmllint, xsltproc, and Perl scripts using XML::LibXML.
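To illustrate the general idea of walking the markup with a real parser instead of matching tags with regular expressions, here is a minimal sketch. Note that it is written in Python with the standard library's html.parser, which is my substitution for illustration and not one of the tools named above. The names PrizeTableParser and extract_rows are made up, and the code assumes every data row looks like the sample in the question: year, name, and country in the first three <td> cells, with the sort key hidden in a span styled display:none.

```python
from html.parser import HTMLParser


class PrizeTableParser(HTMLParser):
    """Collect (year, name, country) from each <tr> of the sample table.

    Hypothetical helper: assumes the first three <td> cells of a row hold
    the year, the name, and the country, as in the question's sample.
    """

    def __init__(self):
        super().__init__()
        self.rows = []         # finished (year, name, country) tuples
        self.cells = []        # visible text of each <td> in the current row
        self.in_cell = False   # True while inside a <td>
        self.span_stack = []   # one bool per open <span>: is it hidden?
        self.hidden_depth = 0  # > 0 while inside a display:none span

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cells = []
        elif tag == "td":
            self.in_cell = True
            self.cells.append("")
        elif tag == "span":
            style = dict(attrs).get("style") or ""
            hidden = "display:none" in style.replace(" ", "")
            self.span_stack.append(hidden)
            self.hidden_depth += hidden

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.hidden_depth -= self.span_stack.pop()
        elif tag == "td":
            self.in_cell = False
        elif tag == "tr" and len(self.cells) >= 3:
            # Rows with fewer than three <td> cells (e.g. header rows) are skipped.
            year, name, country = (c.strip() for c in self.cells[:3])
            self.rows.append((year, name, country))

    def handle_data(self, data):
        # Keep only text that is inside a cell and not in a hidden span.
        if self.in_cell and self.hidden_depth == 0:
            self.cells[-1] += data


def extract_rows(html):
    """Return a list of (year, name, country) tuples found in html."""
    parser = PrizeTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Feeding the saved table to extract_rows and printing "\t".join(row) for each result gives one tab-separated line per recipient, which is then easy to hand to cut, sort, or awk. The sketch keeps all visible text in the name cell, so a note such as "(posthumously)" stays attached to the name; trimming it would be a separate step.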