2
\$\begingroup\$

I'm working through a scraping function where pages of results lead to product pages. I've added a default maximum number of results pages, and pages per set of results, to prevent a simple mistake of crawling the whole site.

Here's what I have so far. Does the way I'm implementing the maximums with the for loops make sense? Is there a more "pythonic" way? I'm coming at this from a completely learning perspective.

def my_crawler(url, max_pages = 1, max_items = 1):

    for page_number in range(1, max_pages + 1):
        source_code = requests.get(url + str(page_number)).text

        products = SoupStrainer(class_ = 'productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print(str(item_number) + ': ' + a['href'])

            if item_number == max_items - 1: break

my_crawler('http://www.thesite.com/productResults.aspx?&No=')
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

There's not a lot to say about such a simple program, mainly nitpicks.

You are violating PEP8 at a few places. The code reformatted to follow PEP8 would look like this:

def my_crawler(url, max_pages=1, max_items=1):
    for page_number in range(1, max_pages + 1):
        source_code = requests.get(url + str(page_number)).text

        products = SoupStrainer(class_='productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print(str(item_number) + ': ' + a['href'])

            if item_number == max_items - 1:
                break

In particular:

  • Write method parameters compactly as max_pages=1 instead of max_pages = 1
  • Break line after :

Instead of range(1, max_pages + 1) I would find range(max_pages) simpler if you only need to do max_pages + 1 in the code. So I'd rewrite the above as:

for page_number in range(max_pages):
    source_code = requests.get(url + str(page_number + 1)).text

And as a mild case of paranoia, I'd prefer the if statement for breaking out of the loop this way:

if item_number >= max_items - 1:
    break

And yes, it's fine to break out of the loop this way. I don't think there's a better way.

When you print the item number and the link, I think this is slightly more readable:

print('{}: {}'.format(item_number, a['href']))

Finally, don't execute code in global scope, it's recommended to wrap code in an if __name__ == '__name__': guard:

if __name__ == '__name__':
    my_crawler('http://www.thesite.com/productResults.aspx?&No=')

These are mainly just nitpicks though. You're doing fine.

\$\endgroup\$
2
  • 1
    \$\begingroup\$ Awesome. Thanks @janos. This is exactly the kind of feedback I was looking for. Is the paranoia for item_number >= max_items from experiencing a bug or just good practice? \$\endgroup\$ Commented Feb 21, 2015 at 20:06
  • 1
    \$\begingroup\$ I'm not aware if it's a known "good practice". I can't recall experiencing a bug, but then again, I've always been this way so maybe that's why it never happened to me. \$\endgroup\$ Commented Feb 21, 2015 at 20:15

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.