
I'm working through a scraping preparation function where pages of results lead to product pages. The function has default maximums for the number of results pages (pages per set of results) to crawl and for the number of items per page, to keep a simple mistake from crawling more than intended.

Here's what I have so far. Does the way I'm implementing the maximums with the for loops make sense? Is there a more "pythonic" way? I'm approaching this purely as a learning exercise. Thanks.

import requests
from bs4 import BeautifulSoup, SoupStrainer

def my_crawler(url, max_pages = 1, max_items = 1):

    for page_number in range(1, max_pages + 1):
        url = url + str(page_number)
        source_code = requests.get(url).text

        products = SoupStrainer(class_ = 'productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print(str(item_number) + ': ' + a['href'])

            if item_number == max_items - 1: break

my_crawler('http://www.thesite.com/productResults.aspx?&No=')
  • You should consider using string formatting on url instead of just appending the number (a small sketch follows these comments). Commented Feb 20, 2015 at 3:02
  • Your question is much more suitable for codereview.SE (considering your code actually works). Commented Feb 20, 2015 at 3:06
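
A minimal sketch of what the first comment is suggesting: build each page URL with str.format rather than concatenation. The base URL is the one from the question; the page range here is only illustrative.

base_url = 'http://www.thesite.com/productResults.aspx?&No='

for page_number in range(1, 4):
    # Format the page number into a fresh string each time;
    # base_url itself is never modified.
    page_url = '{0}{1}'.format(base_url, page_number)
    print(page_url)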

1 Answer


A for loop is fine, but

def my_crawler(url, max_pages = 1, max_items = 1):
    for page_number in range(1, max_pages + 1):
        url = url + str(page_number)
         ^
         |

You have rebound the url parameter; the next time through the loop the request will not be what you intend (you will fetch page 1, then page 12, then page 123, and so on).

Try instead

    source_code = requests.get(url + str(page_number)).text

This makes a temporary string without changing url.
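
Putting that together, here is a sketch of the whole function with the fix applied; everything else is kept as in the question, with the per-item print switched to str.format.

import requests
from bs4 import BeautifulSoup, SoupStrainer

def my_crawler(url, max_pages=1, max_items=1):
    for page_number in range(1, max_pages + 1):
        # Build the request URL on the fly; url itself is never rebound.
        source_code = requests.get(url + str(page_number)).text

        # Parse only the elements carrying the 'productTags' class.
        products = SoupStrainer(class_='productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print('{0}: {1}'.format(item_number, a['href']))
            if item_number == max_items - 1:
                break

my_crawler('http://www.thesite.com/productResults.aspx?&No=')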

