
As has been mentioned, regex is not a good tool for parsing HTML. Similar to the other answer that uses a parser, you can write a Ruby one-liner such as the following to do it for you. Note that it requires Nokogiri, which you can install as a gem (sudo gem install nokogiri).

ruby -rnokogiri -e 'Nokogiri::HTML(readlines.join).css("tr").each { |tr| tr.xpath(".//td").take(3).each { |td| puts td.content } }' sample.html

It reads the given file, in this case sample.html, selects all tr elements and, for each one, prints the content of its first three td elements.

For your sample it will output:

1988
Mandela, NelsonNelson Mandela
South Africa
1988
Marchenko, AnatolyAnatoly Marchenko (posthumously)
Soviet Union

The problem is the lines that contain the name twice, e.g. (formatted to be easier to read):

<td>
  <span style="display:none;">Mandela, Nelson</span>
  <span class="vcard"><span class="fn">
      <a href="/wiki/Nelson_Mandela" title="Nelson Mandela">Nelson Mandela</a>
    </span>
  </span>
</td>

in which the name appears first in a span with style="display:none;" and then again in another span. I am not sure how to extract only the name that is not within an element with style="display:none;". (I have found https://stackoverflow.com/q/6096327/789593 and https://stackoverflow.com/q/11602077/789593 but they do not describe the right technique. Perhaps someone can come up with a fix via http://nokogiri.org/Nokogiri/XML/Node.html?)

N.N.