I'm working on some NFL statistics web scraping; honestly, the activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what the code was doing: either I'm going crazy or there is some sort of bug in a package or in Python itself. Here's the code I'm working with:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np

#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)

game_df = pd.DataFrame()
def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)
players.apply(apply_test,axis=1)


Again, I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should set yr to 2013 and then 2014. But print(yr) prints 2013 twice. If I simply comment out the game_df = game_df.append(temp_row,ignore_index=True) line, the printed values of yr are correct. There is an error shortly after the first two printed lines, but that is expected and one I am comfortable debugging. The fact that appending to a global DataFrame makes a for loop behave differently is blowing my mind right now. Can someone help with this?

Thanks.

1 Answer

I don't really follow what the overall aim is, but I do note two things:

  1. You either need to declare game_df as global inside the function (global game_df) before game_df = game_df.append(temp_row,ignore_index=True), or, better still, pass it as an argument in the def signature, though you would then need to amend players.apply(apply_test,axis=1) accordingly.

  2. You need to handle the cases where find returns None, e.g. soup.find("thead").find_all("tr")[1].find_all("th") for the page https://www.pro-football-reference.com/players/A/AaitIs00/gamelog/2014. Perhaps wrap these calls in try/except blocks that supply appropriate default values.
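For the second point, here is a minimal sketch of guarding against a missing table. It uses explicit None checks rather than try/except, which is a swapped-in alternative to what is suggested above. The helper name header_cells and the two HTML snippets are made up for illustration; they are not taken from the real site.

```python
from bs4 import BeautifulSoup


def header_cells(html):
    """Return the text of the second header row, or [] if it is missing.

    `header_cells` is a hypothetical helper, not part of the question's code.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "all_stats"})
    if container is None:  # the page has no stats block at all
        return []
    thead = container.find("thead")
    if thead is None:  # stats block present, but no table header
        return []
    header_rows = thead.find_all("tr")
    if len(header_rows) < 2:  # no second header row to read
        return []
    return [th.text for th in header_rows[1].find_all("th")]


# Made-up HTML standing in for a page with and without a game log table.
WITH_TABLE = """
<div id="all_stats"><table><thead>
<tr><th></th><th colspan="2">Passing</th></tr>
<tr><th>Rk</th><th>Cmp</th><th>Att</th></tr>
</thead><tbody></tbody></table></div>
"""
WITHOUT_TABLE = "<div id='content'><p>No games played</p></div>"

print(header_cells(WITH_TABLE))     # ['Rk', 'Cmp', 'Att']
print(header_cells(WITHOUT_TABLE))  # []
```

Returning an empty list lets the caller skip seasons that have no game log instead of crashing on an AttributeError.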

5 Comments

Yeah, checking for None is normal; I'm aware of how to troubleshoot that. As for global variables, I'm a bit confused: I thought variables defined outside functions were global by default. Is there a global tag in Python? The other question is: why does whether the variable is global or not cause the for loop to act differently? Thanks.
Yes, it is causing different behaviour. The code sees the game_df in your def as a local variable, because you haven't used the global keyword, and as being referenced before assignment. At least that is my interpretation, but I am still fairly new to Python.
Some quick googling suggests that global is not something you actually have to declare in Python. There is no global keyword, and given that I defined the DataFrame outside of any function, it should be global by default. I still don't see how any issue of this sort could cause the for loop to have yr=2013 on both iterations when it should be 2013 and then 2014.
geeksforgeeks.org/global-local-variables-python (first Google result for "python altering global variables in a function")
Oh wow, interesting. I thought Python acted like other languages such as JS: if a variable weren't defined in the local scope, it would realize it was the global one. I added global game_df to the start of the function and it worked. I guess I need to do some reading. Thanks :)
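The scoping rule this thread converges on can be reproduced without pandas or any network access. The sketch below uses a plain list named rows in place of game_df; all function names are illustrative. As for the repeated 2013, one plausible explanation (an assumption, not confirmed anywhere in the thread) is that some older pandas versions silently retried the applied function on the first row after an exception on an internal fast path, so the UnboundLocalError would only surface on the second call.

```python
rows = []  # module-level (global) name, standing in for game_df


def read_only():
    # Merely reading a global needs no declaration.
    return len(rows)


def rebind_without_global(item):
    # Assigning to `rows` anywhere in the body makes the name local to the
    # function, so the right-hand side reads a local that is not yet bound.
    rows = rows + [item]


def rebind_with_global(item):
    global rows  # rebind the module-level name instead of creating a local
    rows = rows + [item]


def collect(item, acc):
    # Alternative without `global`: mutate a list passed in by the caller.
    acc.append(item)
    return acc


try:
    rebind_without_global("game 1")
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)

rebind_with_global("game 1")
collect("game 2", rows)
print(rows)  # ['game 1', 'game 2']
```

Passing the accumulator in, or returning the rows and building the DataFrame once at the end, avoids the global statement altogether.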
