4
\$\begingroup\$

I have written code to get all urls on a webpage & put them in a set, and would like tips on simple changes I can make to increase its performance.

soup = BeautifulSoup(html_doc)
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None or ' ' in url or '<' in url or '>' in url:
        continue
    if url.startswith('//'):
        url = url.replace('//', 'http://')
    if url.startswith('/'):
        url = hostname + url
    if '?' in url:
        url = url.split('?')[0]
    if '#' in url:
        url = url.split('#')[0]
    if url.endswith('/'):
        url = url[:-1]
    if url.endswith(excluded_extensions):
        continue
    if url.startswith(hostname):
        urls_set.add(url)
\$\endgroup\$
2
  • \$\begingroup\$ Because you don't want to use the urllib.parse module? \$\endgroup\$ Commented Aug 11, 2015 at 4:21
  • \$\begingroup\$ @sunny urllib.parse doesn't work for me because "urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component." link \$\endgroup\$ Commented Aug 11, 2015 at 13:32

2 Answers 2

3
\$\begingroup\$

Some stuff you could perhaps do differently:

# your code
if url is None or ' ' in url or '<' in url or '>' in url:
    continue

# the alternative
if url is None or any(char in url for char in ' <>'):
    continue

Also, you can call the split method directly, without the if statement, as it will return a single item list with the full string inside if the character is not in the string:

# your code
if '?' in url:
    url = url.split('?')[0]
if '#' in url:
    url = url.split('#')[0]

# the alternative
for splitter in '?#':
    url = url.split(splitter, 1)[0]

Notice the micro-optimization of using the second argument of split, so that the string is only split at the first occurrence if there is more than one.

\$\endgroup\$
1
\$\begingroup\$

url.replace('//', 'http://') is not quite right: it does a global replacement, but you intend to replace only the leading //. You should write url.replace('//', 'http://', 1) instead. Or better yet, url = 'http:' + url.

\$\endgroup\$

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.