Reading and returning the HTML content from a specified URL

Question

I wanted to create a simple function that can read and return the HTML content from a specified URL. This is what reading here and there lead me to:

from socket import timeout
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = ""
    attempt = 1
    while True:
        try:
            html_page = urlopen(req, timeout=10)
            content = html_page.read()
        except (HTTPError, URLError, timeout) as e:
            if isinstance(e, HTTPError):
                print("The server couldn\'t fulfill the request....attempt %d/%d" % (attempt, max_attempt))
                print('Error code: ', e.code)
            if isinstance(e, URLError):
                print("We failed to reach a server....attempt %d/%d" % (attempt, max_attempt))
                print('Reason: ', e.reason)
            if isinstance(e, timeout):
                print('timeout...attempt %d/%d' % (attempt, max_attempt))
            attempt += 1
            if attempt > max_attempt:
                break
            continue
        else:
            break

    return content

I would use this function to parse the content of many URLs. For if content = "", I would raise a random exception after writing to some file whatever I had already successfully gathered.

you may want to consider using requests, possibly with an extension to account for the multiple attempts, or just iterating as you did) — WoJ
– WoJ, Commented Jul 28, 2018 at 19:24

janos · Accepted Answer · 2018-07-28 19:15:24Z

There are a couple of minor technical issues:

The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.
In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.
Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.
You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn\'t fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout as e:
            print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))

Stack Exchange Network

Reading and returning the HTML content from a specified URL

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Reading and returning the HTML content from a specified URL

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions