Skip to main content
added 12 characters in body
Source Link
janos
  • 113.1k
  • 15
  • 154
  • 396

There are a couple of minor technical issues:

  • The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.

  • In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.

  • Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.

  • You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn\'t fulfill the request....attempt %d{}/%d" % {}".format(attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt %d{}/%d" % {}".format(attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout as e:
            print('timeout...attempt %d{}/%d' % {}'.format(attempt + 1, max_attempt))

There are a couple of minor technical issues:

  • The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.

  • In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.

  • Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.

  • You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn\'t fulfill the request....attempt %d/%d" % (attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt %d/%d" % (attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout as e:
            print('timeout...attempt %d/%d' % (attempt + 1, max_attempt))

There are a couple of minor technical issues:

  • The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.

  • In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.

  • Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.

  • You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn\'t fulfill the request....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt {}/{}".format(attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout as e:
            print('timeout...attempt {}/{}'.format(attempt + 1, max_attempt))
Source Link
janos
  • 113.1k
  • 15
  • 154
  • 396

There are a couple of minor technical issues:

  • The content variable is unnecessary, because you can simply return html_page.read() directly. (And you could as well return urlopen(req, timeout=10).read() directly...) When the max attempts are reached, you could return "" instead of relying on that content was initialized to "". And how about returning None? Then you could simply omit the return statement to the same effect.

  • In the exception handling, there are multiple if statements with conditions that are mutually exclusive, only one can match at a time. In such situation you should chain them together with elif.

  • Instead of doing a single except statement with multiple error types and then using conditionals to identify the correct one, it would be better to use multiple except statements each with a single error type.

  • You could iterate using range for slightly more compact code.

Like this:

def get_html_content(url, max_attempt = 3):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    for attempt in range(max_attempt):
        try:
            return urlopen(req, timeout=10).read()
        except HTTPError as e:
            print("The server couldn\'t fulfill the request....attempt %d/%d" % (attempt + 1, max_attempt))
            print('Error code: ', e.code)
        except URLError as e:
            print("We failed to reach a server....attempt %d/%d" % (attempt + 1, max_attempt))
            print('Reason: ', e.reason)
        except timeout as e:
            print('timeout...attempt %d/%d' % (attempt + 1, max_attempt))