
I'm working through a scraping preparation function where pages of results lead to product pages. The function has default maximums for the number of results pages (pages per set of results) to crawl and for the number of items per page, to keep a simple mistake from crawling more than intended.

Here's what I have so far. Does the way I'm implementing the maximums with the for loops make sense? Is there a more "pythonic" way? I'm approaching this purely as a learning exercise. Thanks.

import requests
from bs4 import BeautifulSoup, SoupStrainer

def my_crawler(url, max_pages = 1, max_items = 1):

    for page_number in range(1, max_pages + 1):
        url = url + str(page_number)
        source_code = requests.get(url).text

        products = SoupStrainer(class_ = 'productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print(str(item_number) + ': ' + a['href'])

            if item_number == max_items - 1: break

my_crawler('http://www.thesite.com/productResults.aspx?&No=')
  • You should consider using string formatting on url instead of just appending the number (a small sketch follows these comments). Commented Feb 20, 2015 at 3:02
  • Your question is much more suitable for codereview.SE (considering your code actually works). Commented Feb 20, 2015 at 3:06
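
A minimal sketch of what the first comment is suggesting: build each page URL with str.format rather than concatenation. The base URL is the one from the question; the page range here is only illustrative.

base_url = 'http://www.thesite.com/productResults.aspx?&No='

for page_number in range(1, 4):
    # Format the page number into a fresh string each time;
    # base_url itself is never modified.
    page_url = '{0}{1}'.format(base_url, page_number)
    print(page_url)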

1 Answer


A for loop is fine, but

def my_crawler(url, max_pages = 1, max_items = 1):
    for page_number in range(1, max_pages + 1):
        url = url + str(page_number)
         ^
         |

You have rebound the url parameter; the next time through the loop the request will not be what you intend (you will fetch page 1, then page 12, then page 123, and so on).

Try instead

    source_code = requests.get(url + str(page_number)).text

This makes a temporary string without changing url.
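
Putting that together, here is a sketch of the whole function with the fix applied; everything else is kept as in the question, with the per-item print switched to str.format.

import requests
from bs4 import BeautifulSoup, SoupStrainer

def my_crawler(url, max_pages=1, max_items=1):
    for page_number in range(1, max_pages + 1):
        # Build the request URL on the fly; url itself is never rebound.
        source_code = requests.get(url + str(page_number)).text

        # Parse only the elements carrying the 'productTags' class.
        products = SoupStrainer(class_='productTags')
        soup = BeautifulSoup(source_code, 'html.parser', parse_only=products)

        for item_number, a in enumerate(soup.find_all('a')):
            print('{0}: {1}'.format(item_number, a['href']))
            if item_number == max_items - 1:
                break

my_crawler('http://www.thesite.com/productResults.aspx?&No=')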

