There are a couple of minor technical issues:
- The `content` variable is unnecessary, because you can simply return `html_page.read()` directly. (And you could just as well return `urlopen(req, timeout=10).read()` directly.) When the max attempts are reached, you could `return ""` explicitly instead of relying on `content` having been initialized to `""`. And how about returning `None`? Then you could simply omit the `return` statement to the same effect.
- In the exception handling, there are multiple `if` statements with mutually exclusive conditions: only one can match at a time. In such a situation you should chain them together with `elif`.
- Instead of a single `except` statement with multiple error types followed by conditionals to identify the correct one, it would be better to use multiple `except` statements, each with a single error type.
- You could iterate using `range` for slightly more compact code.
Like this:
    from socket import timeout
    from urllib.error import HTTPError, URLError
    from urllib.request import Request, urlopen

    def get_html_content(url, max_attempt=3):
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        for attempt in range(max_attempt):
            try:
                return urlopen(req, timeout=10).read()
            except HTTPError as e:
                print("The server couldn't fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
                print('Error code: ', e.code)
            except URLError as e:
                print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
                print('Reason: ', e.reason)
            except timeout:
                print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))
        # All attempts failed: the function falls through and implicitly returns None.
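
One caveat when splitting the handler into separate `except` clauses: `HTTPError` is a subclass of `URLError`, so the more specific clause has to come first, otherwise the `URLError` branch would catch both. Here is a minimal sketch of that, runnable without any network access. The `describe` helper and the example URL are hypothetical, made up purely for illustration.

```python
from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and report which except clause handles it (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:
        # Must precede the URLError clause, because HTTPError subclasses URLError.
        return "HTTPError {}".format(e.code)
    except URLError as e:
        return "URLError {}".format(e.reason)


print(issubclass(HTTPError, URLError))  # True
print(describe(HTTPError("http://example.com", 404, "Not Found", None, None)))
print(describe(URLError("no route to host")))
```

If the two clauses were swapped, the 404 case would be reported by the `URLError` branch and the `HTTPError` branch would never run.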