Format HTML code with Python

Question

I have a list of URLs in one column of a CSV file. I would like to use Python to go through all the URLs, download a specific part of each page's HTML, and save it to the next column.

For example: From this URL I would like to extract this div and write it to the next column.

<div class="info-holder" id="product_bullets_section">
<p>
VM−2N ist ein Hochleistungs−Verteilverstärker für Composite− oder SDI−Videosignale und unsymmetrisches Stereo−Audio. Das Eingangssignal wird entkoppelt und isoliert, anschließend wird das Signal an zwei identische Ausgänge verteilt.
<span id="decora_msg_container" class="visible-sm-block visible-md-block visible-xs-block visible-lg-block"></span>
</p>
<ul>
<li>
<span>Hohe Bandbreite — 400 MHz (–3 dB).</span>
</li>
<li>
<span>Desktop–Grösse — Kompakte Bauform, zwei Geräte können mithilfe des optionalen Rackadapters RK–1 in einem 19 Zoll Rack auf 1 HE nebeneinander montiert werden.</span>
</li>
</ul>
</div>

I have this code, the HTML code is saved in the variable html:

import csv
import urllib.request

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

with open("results.csv", "w", newline="") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()

Currently the variable html holds the whole page, which is pretty messy. How can I remove everything from html except the div I would like to extract?

etaloof · Accepted Answer · 2017-12-12 15:58:36Z

3

Assuming all of your pages share the same structure, you can parse the HTML with the code below. It looks for div elements with the id product_bullets_section. An id should be unique within an HTML document, but this site uses the same id twice, so we take the first match by indexing with [0] and convert the parsed div back to a string containing its HTML.

import csv
import urllib.request

from bs4 import BeautifulSoup

with open("urls.csv", "r", newline="", encoding="cp1252") as f_input:
    csv_reader = csv.reader(f_input, delimiter=";", quotechar="|")
    header = next(csv_reader)
    items = [row[0] for row in csv_reader]

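# For a quick test, uncomment the next line to override the list with one URL:
# items = ['https://www.kramerav.com/de/Product/VM-2N']
with open("results.csv", "w", newline="") as f_output:
    csv_writer = csv.writer(f_output, delimiter=";")
    for item in items:
        html = urllib.request.urlopen(item).read()
        # Name the parser explicitly so BeautifulSoup does not have to guess one
        the_div = str(BeautifulSoup(html, "html.parser").select('div#product_bullets_section')[0])
        # Write the URL and the extracted div as the next column
        csv_writer.writerow([item, the_div])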


4 Comments

dun Over a year ago

Hi thank you so much for your help! Sometimes the sites in my list don't have the div I'm searching for. Python exits then with this error:

File "/home/dun/_workspace/py/search-articles/save-html.py", line 15, in <module>
    div = str(BeautifulSoup(html).select("div#product_bullets_section")[0])
IndexError: list index out of range

Do you know how I could just write a space in the row and use the next URL afterwards?

etaloof Over a year ago

For this I would need to know the rest of your code, specifically the part that deals with the file. But you can use a try-except block to write an empty value if a site doesn't have this div. It might look like this:

try:
    the_div = str(BeautifulSoup(html).select('div#product_bullets_section')[0])
except IndexError:
    the_div = ''
finally:
    f_output.write(the_div)

dun Over a year ago

Thank you so much! I used except IndexError: div="". Everything works now.

etaloof Over a year ago

I'm glad that I could help you.
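
The try-except fallback from these comments can be combined with the CSV writer so that every URL gets a row, even when the div is missing. This is only a minimal sketch. It assumes two hypothetical callables that are not part of the original code: fetch, which returns the page content for a URL, and extract, which returns the div as a string and raises IndexError when nothing matches (as the [0] lookup in the answer does).

```python
import csv
import io


def write_results(urls, fetch, extract, f_output):
    """Write one row per URL: the URL, then the extracted HTML ('' if absent)."""
    csv_writer = csv.writer(f_output, delimiter=";")
    for url in urls:
        try:
            fragment = extract(fetch(url))
        except IndexError:
            # No matching div on this page: leave the column empty and move on.
            fragment = ""
        csv_writer.writerow([url, fragment])


# Demo with stand-in "pages" instead of real downloads (hypothetical data):
# each URL maps to the list of matches a selector might have returned.
pages = {"http://a.example": ["<div>found</div>"], "http://b.example": []}
buffer = io.StringIO()
write_results(pages, fetch=pages.get, extract=lambda matches: matches[0], f_output=buffer)
print(buffer.getvalue())
```

In the real script, fetch could be something like lambda url: urllib.request.urlopen(url).read(), and extract could wrap the BeautifulSoup select(...)[0] expression from the answer.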

Tanphi · Accepted Answer · 2017-12-12 15:56:13Z

2

In this example, you can use BeautifulSoup to get the div with a specific id:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# find() returns None if no element has this id
div = soup.find(id="product_bullets_section")


Rushikumar · Accepted Answer · 2017-12-12 15:59:13Z

Why not use html.parser, the simple HTML and XHTML parser from the standard library?

Example:

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("     attr:", attr)

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

# Create the parser at module level, outside the class body
parser = MyHTMLParser()

and then use parser.feed(data) (where data is a str)
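
The parser above only prints events. To get back the div itself with html.parser, the handlers have to collect the markup instead. The following is a rough sketch under stated assumptions, not a drop-in replacement for BeautifulSoup: DivExtractor and extract_div are hypothetical names, the target is assumed to be a div, and the markup is rebuilt from parser events, so it can differ slightly from the source (for example, end tags come out lowercased).

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        # Keep entities such as &amp; unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0       # nesting depth of <div> tags inside the target
        self.parts = []      # collected markup pieces
        self.done = False    # True once the target div has been closed

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            # Still searching: only react to the div with the wanted id.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the nesting depth.
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth > 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth > 0:
            self.parts.append(f"&#{name};")


def extract_div(html_text, target_id):
    """Return the markup of the first matching div, or '' if there is none."""
    parser = DivExtractor(target_id)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<html><body><div id="other">x</div>'
    '<div class="info-holder" id="product_bullets_section">'
    "<p>A &amp; B<br/></p><div>inner</div></div>"
    '<div id="product_bullets_section">second</div></body></html>'
)
result = extract_div(sample, "product_bullets_section")
print(result)
```

Note that feed() expects a str, while urllib.request.urlopen(...).read() returns bytes, so the downloaded page has to be decoded first, for instance with .decode("utf-8") if the page is known to use that encoding (an assumption here).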
