Extracting and normalizing URLs in an HTML document

Question

I have written code to get all urls on a webpage & put them in a set, and would like tips on simple changes I can make to increase its performance.

soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None or ' ' in url or '<' in url or '>' in url:
        continue
    if url.startswith('//'):
        url = url.replace('//', 'http://')
    if url.startswith('/'):
        url = hostname + url
    if '?' in url:
        url = url.split('?')[0]
    if '#' in url:
        url = url.split('#')[0]
    if url.endswith('/'):
        url = url[:-1]
    if url.endswith(excluded_extensions):
        continue
    if url.startswith(hostname):
        urls_set.add(url)

@sunny urllib.parse doesn't work for me because "urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component." link — Waqqas Dadabhoy
– Waqqas Dadabhoy, Commented Aug 11, 2015 at 13:32

Jaime · Accepted Answer · 2015-08-11 04:14:29Z

Some stuff you could perhaps do differently:

# your code
if url is None or ' ' in url or '<' in url or '>' in url:
    continue

# the alternative
if url is None or any(char in url for char in ' <>'):
    continue

Also, you can call the split method directly, without the if statement, as it will return a single item list with the full string inside if the character is not in the string:

# your code
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

# the alternative
for splitter in '?#':
    url = url.split(splitter, 1)[0]

Notice the micro-optimization of using the second argument of split, so that the string is only split at the first occurrence if there is more than one.

200_success · Accepted Answer · 2015-08-10 16:26:36Z

1

url.replace('//', 'http://') is not quite right: it does a global replacement, but you intend to replace only the leading //. You should write url.replace('//', 'http://', 1) instead. Or better yet, url = 'http:' + url.

answered Aug 10, 2015 at 16:26

200_success

146k22 gold badges191 silver badges481 bronze badges

Add a comment |

Stack Exchange Network

Extracting and normalizing URLs in an HTML document

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Extracting and normalizing URLs in an HTML document

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions