
Kind of a long-winded question, and I probably just need someone to point me in the right direction. I'm building a web scraper to grab basketball player info from ESPN's website. The URL structure is pretty simple in that each player card has a specific ID in the URL. To obtain information, I'm writing a loop from 1 to roughly 6000 to grab players from their database. My question is whether there is a more efficient way of doing this?

from bs4 import BeautifulSoup
import requests


player_id = []  # Empty list to store player IDs
age = []  # Empty list to store player ages

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of player card URL

def get_age(BASE):
    for i in range(1, 6000):  # Loop over candidate player IDs
        BASE_U = BASE + str(i) + '/'  # Create the URL for this player
        r = requests.get(BASE_U)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Prior to this step, I had to print out the soup object and look
        # through the HTML in order to find the tag that contained my desired information
        # Get age of players
        age_tables = soup.find_all('ul', class_="player-metadata")  # Grabs all text in the metadata tag
        p = str(age_tables)  # Turns the tag list into a string
        # At this point I had to look at all the text in the p object and
        # determine a way to capture the age info
        if "Age: " not in p:  # Player ID doesn't exist, so skip it to avoid an error
            continue
        else:
            start = p.index("Age: ") + len("Age: ")  # Gets the location of the player's age
            end = p[start:].index(")") + start
            player_id.append(i)  # Adds the player ID to the player_id list
            age.append(p[start:end])  # Adds the player's age to the age list

get_age(BASE)

Any help, even something small, would be much appreciated, even if it's just pointing me in the right direction rather than a direct solution.

Thanks, Ben

3 Answers


It's like a port scanner in network security: multi-threading will speed your program up a lot.
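
For example, here is a minimal sketch using Python's concurrent.futures (the worker count of 20 and the fetch helper are my own choices, not part of your code):

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE = 'http://espn.go.com/nba/player/stats/_/id/'

def fetch(i):
    # Fetch one player page; return (id, html), or (id, None) on failure
    try:
        r = requests.get(BASE + str(i) + '/', timeout=10)
        return i, r.text
    except requests.RequestException:
        return i, None

# Run up to 20 requests at a time instead of one after another
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fetch, i) for i in range(1, 6000)]
    for future in as_completed(futures):
        i, html = future.result()
        if html is not None:
            pass  # parse html with BeautifulSoup exactly as in the question

Since the script spends almost all of its time waiting on the network, threads overlap that waiting even though Python's GIL prevents CPU-bound parallelism.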


2 Comments

Ah, I've heard about multi-threading. Do you know of any easy-to-follow online tutorials?
I personally thought the documentation for the multiprocessing library was a good place to start. You might look into guides for that library if the documentation isn't enough for you.

A not only more efficient, but also more organized and scalable, approach would be to switch to the Scrapy web-scraping framework.

The main performance problem you have is caused by the "blocking" nature of your current approach - Scrapy would solve it out of the box because it is based on Twisted and is completely asynchronous.
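
As a rough, untested sketch of what such a spider might look like (the selector and the string handling just mirror the question's code):

import scrapy

class PlayerAgeSpider(scrapy.Spider):
    name = 'player_age'
    # Same ID range as the question's loop; Scrapy fetches these concurrently
    start_urls = ['http://espn.go.com/nba/player/stats/_/id/%d/' % i
                  for i in range(1, 6000)]

    def parse(self, response):
        # Called as each response arrives, in whatever order they come back
        metadata = response.css('ul.player-metadata').extract_first()
        if metadata and 'Age: ' in metadata:
            start = metadata.index('Age: ') + len('Age: ')
            end = metadata.index(')', start)
            yield {'url': response.url, 'age': metadata[start:end]}

You could run it with "scrapy runspider player_age.py -o ages.json" and get the scraped results written out for free.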

2 Comments

@Ben Sure, let me know if you need help making a Scrapy spider. You would be surprised how easy it would be to start working with Scrapy.
Thanks alecxe! I may reach out.

I'd probably start with http://espn.go.com/nba/players and use the following Regular Expression to get the Team Roster URLs...

\href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>\

Then I'd get the resulting match groups, where \1 is the last portion of the URL and \2 is the Team Name. Then I'd use those URLs to scrape each team roster page looking for Player URLs...

\href="(http://espn.go.com/nba/player/_/id/[^"]+)">([^<]+)</a>\

I'd finally get the resulting match groups, where \1 is the URL for the player page and \2 is the Player Name. I'd scrape each resulting URL for the info I needed.
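
If it helps, here's a rough sketch of wiring those two expressions together with requests (untested against the live site; I've escaped the dots in the domain, and the actual markup may not match exactly):

import re
import requests

ROSTER_RE = re.compile(r'href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>')
PLAYER_RE = re.compile(r'href="(http://espn\.go\.com/nba/player/_/id/[^"]+)">([^<]+)</a>')

index_html = requests.get('http://espn.go.com/nba/players').text
for roster_path, team_name in ROSTER_RE.findall(index_html):
    # Group 1 is the roster path, group 2 is the team name
    roster_html = requests.get('http://espn.go.com' + roster_path).text
    for player_url, player_name in PLAYER_RE.findall(roster_html):
        # Group 1 is the player page URL, group 2 is the player name
        print(team_name, player_name, player_url)
        # ...then fetch player_url and pull the age out as in the question

This way you only visit real player pages instead of probing 6000 candidate IDs.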

Regular Expressions are the bomb.

Hope this helps.

