
Kind of a long-winded question, and I probably just need someone to point me in the right direction. I'm building a web scraper to grab basketball player info from ESPN's website. The URL structure is pretty simple in that each player card has a specific ID in the URL. To obtain information, I'm writing a loop from 1 to roughly 6000 to grab players from their database. My question is whether there is a more efficient way of doing this?

from bs4 import BeautifulSoup
import requests


player_id = []  # Empty list to store player IDs
age = []  # Empty list to store player ages

BASE = 'http://espn.go.com/nba/player/stats/_/id/'  # Base structure of player card URL

def get_age(BASE):
    for i in range(1, 6000):  # Loop over candidate player IDs
        BASE_U = BASE + str(i) + '/'  # Create the URL for this player
        r = requests.get(BASE_U)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Prior to this step, I had to print out the soup object and look
        # through the HTML in order to find the tag that contained my desired information
        # Get age of players
        age_tables = soup.find_all('ul', class_="player-metadata")  # Grabs all text in the metadata tag
        p = str(age_tables)  # Turns the tag list into a string
        # At this point I had to look at all the text in the p object and
        # determine a way to capture the age info
        if "Age: " not in p:  # Player ID doesn't exist, so skip it to avoid an error
            continue
        else:
            start = p.index("Age: ") + len("Age: ")  # Gets the location of the player's age
            end = p[start:].index(")") + start
            player_id.append(i)  # Adds the player ID to the player_id list
            age.append(p[start:end])  # Adds the player's age to the age list

get_age(BASE)

Any help, even something small, would be much appreciated, even if it's just pointing me in the right direction rather than a direct solution.

Thanks, Ben

3 Answers


It's like a port scanner in network security: multi-threading will speed your program up a lot.
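
For example, here is a minimal sketch using Python's concurrent.futures (the worker count of 20 and the fetch helper are my own choices, not part of your code):

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

BASE = 'http://espn.go.com/nba/player/stats/_/id/'

def fetch(i):
    # Fetch one player page; return (id, html), or (id, None) on failure
    try:
        r = requests.get(BASE + str(i) + '/', timeout=10)
        return i, r.text
    except requests.RequestException:
        return i, None

# Run up to 20 requests at a time instead of one after another
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(fetch, i) for i in range(1, 6000)]
    for future in as_completed(futures):
        i, html = future.result()
        if html is not None:
            pass  # parse html with BeautifulSoup exactly as in the question

Since the script spends almost all of its time waiting on the network, threads overlap that waiting even though Python's GIL prevents CPU-bound parallelism.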


2 Comments

Ah, I've heard about multi-threading. Do you know of any easy-to-follow online tutorials?
I personally thought the documentation for the multiprocessing library was a good place to start. You might look into guides for that library if the documentation isn't enough for you.

A not only more efficient, but also more organized and scalable, approach would be to switch to the Scrapy web-scraping framework.

The main performance problem you have is caused by the "blocking" nature of your current approach - Scrapy would solve it out of the box because it is based on Twisted and is completely asynchronous.
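
As a rough, untested sketch of what such a spider might look like (the selector and the string handling just mirror the question's code):

import scrapy

class PlayerAgeSpider(scrapy.Spider):
    name = 'player_age'
    # Same ID range as the question's loop; Scrapy fetches these concurrently
    start_urls = ['http://espn.go.com/nba/player/stats/_/id/%d/' % i
                  for i in range(1, 6000)]

    def parse(self, response):
        # Called as each response arrives, in whatever order they come back
        metadata = response.css('ul.player-metadata').extract_first()
        if metadata and 'Age: ' in metadata:
            start = metadata.index('Age: ') + len('Age: ')
            end = metadata.index(')', start)
            yield {'url': response.url, 'age': metadata[start:end]}

You could run it with "scrapy runspider player_age.py -o ages.json" and get the scraped results written out for free.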

2 Comments

@Ben Sure, let me know if you need help making a Scrapy spider. You would be surprised how easy it would be to start working with Scrapy.
Thanks alecxe! I may reach out.

I'd probably start with http://espn.go.com/nba/players and use the following Regular Expression to get the Team Roster URLs...

\href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>\

Then I'd get the resulting match groups, where \1 is the last portion of the URL and \2 is the Team Name. Then I'd use those URLs to scrape each team roster page looking for Player URLs...

\href="(http://espn.go.com/nba/player/_/id/[^"]+)">([^<]+)</a>\

I'd finally get the resulting match groups, where \1 is the URL for the player page and \2 is the Player Name. I'd scrape each resulting URL for the info I needed.
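
If it helps, here's a rough sketch of wiring those two expressions together with requests (untested against the live site; I've escaped the dots in the domain, and the actual markup may not match exactly):

import re
import requests

ROSTER_RE = re.compile(r'href="(/nba/teams/roster\?team=[^"]+)">([^<]+)</a>')
PLAYER_RE = re.compile(r'href="(http://espn\.go\.com/nba/player/_/id/[^"]+)">([^<]+)</a>')

index_html = requests.get('http://espn.go.com/nba/players').text
for roster_path, team_name in ROSTER_RE.findall(index_html):
    # Group 1 is the roster path, group 2 is the team name
    roster_html = requests.get('http://espn.go.com' + roster_path).text
    for player_url, player_name in PLAYER_RE.findall(roster_html):
        # Group 1 is the player page URL, group 2 is the player name
        print(team_name, player_name, player_url)
        # ...then fetch player_url and pull the age out as in the question

This way you only visit real player pages instead of probing 6000 candidate IDs.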

Regular Expressions are the bomb.

Hope this helps.

