Revisions to Struggles with grep, sed, awk to filter html

replaced http://stackoverflow.com/ with https://stackoverflow.com/

edited May 23, 2017 at 12:40

As has been mentioned, regex is not good for parsing HTML. Similar to another parse answer, you can use a Ruby one-liner such as the following to do it for you. Note that it requires Nokogiri, which you can install as a gem (sudo gem install nokogiri).

ruby -rnokogiri -e 'Nokogiri::HTML(readlines.join).css("tr").each { |tr| tr.xpath(".//td").take(3).each { |td| puts td.content } }' sample.html

It reads the given file, in this case sample.html, gets all tr elements and for each such element it prints the content of the first three td elements.

For your sample it will output:

1988
Mandela, NelsonNelson Mandela
South Africa
1988
Marchenko, AnatolyAnatoly Marchenko (posthumously)
Soviet Union

The problem is the lines which contain the names twice, e.g. (formatted to be easier to read):

<td>
  <span style="display:none;">Mandela, Nelson</span>
  <span class="vcard"><span class="fn">
      <a href="/wiki/Nelson_Mandela" title="Nelson Mandela">Nelson Mandela</a>
    </span>
  </span>
</td>

in which the name appears first in a span with style="display:none;" and then again in another span. I am not sure how to extract only the name that is not within an element with style="display:none;". (I have found https://stackoverflow.com/q/6096327/789593 and https://stackoverflow.com/q/11602077/789593 but they do not describe the right technique. Perhaps someone can come up with a fix via http://nokogiri.org/Nokogiri/XML/Node.html?)

replaced http://unix.stackexchange.com/ with https://unix.stackexchange.com/

edited Apr 13, 2017 at 12:37

Community Bot

Rollback to Revision 1

edited Apr 12, 2013 at 9:48

N.N.

