I'm working on some NFL statistics web scraping, though honestly the activity doesn't matter much. I spent a ton of time debugging because I couldn't believe what it was doing. Either I'm going crazy, or there is some sort of bug in a package or in Python itself. Here's the code I'm working with:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests
import string
import numpy as np
#get player list
players = pd.DataFrame({"name":[],"url":[],"positions":[],"startYear":[],"endYear":[]})
letters = list(string.ascii_uppercase)
for letter in letters:
    print(letter)
    players_html = requests.get("https://www.pro-football-reference.com/players/"+letter+"/")
    soup = bs(players_html.content,"html.parser")
    for player in soup.find("div",{"id":"div_players"}).find_all("p"):
        temp_row = {}
        temp_row["url"] = "https://www.pro-football-reference.com"+player.find("a")["href"]
        temp_row["name"] = player.text.split("(")[0].strip()
        years = player.text.split(")")[1].strip()
        temp_row["startYear"] = int(years.split("-")[0])
        temp_row["endYear"] = int(years.split("-")[1])
        temp_row["positions"] = player.text.split("(")[1].split(")")[0]
        players = players.append(temp_row,ignore_index=True)
players = players[players.endYear > 2000]
players.reset_index(inplace=True,drop=True)
game_df = pd.DataFrame()
def apply_test(row):
    #print(row)
    url = row['url']
    #print(list(range(int(row['startYear']),int(row['endYear'])+1)))
    for yr in range(int(row['startYear']),int(row['endYear'])+1):
        print(yr)
        content = requests.get(url.split(".htm")[0]+"/gamelog/"+str(yr)).content
        soup = bs(content,'html.parser').find("div",{"id":"all_stats"})
        #overheader
        over_headers = []
        for over in soup.find("thead").find("tr").find_all("th"):
            if("colspan" in over.attrs.keys()):
                for i in range(0,int(over['colspan'])):
                    over_headers = over_headers + [over.text]
            else:
                over_headers = over_headers + [over.text]
        #headers
        headers = []
        for header in soup.find("thead").find_all("tr")[1].find_all("th"):
            headers = headers + [header.text]
        all_headers = [a+"___"+b for a,b in zip(over_headers,headers)]
        #remove first column, it's meaningless
        all_headers = all_headers[1:len(all_headers)]
        for row in soup.find("tbody").find_all("tr"):
            temp_row = {}
            for i,col in enumerate(row.find_all("td")):
                temp_row[all_headers[i]] = col.text
            game_df = game_df.append(temp_row,ignore_index=True)
players.apply(apply_test,axis=1)
Again, I could get into what I'm trying to do, but there seems to be a much higher-level issue here. startYear and endYear in the for loop are 2013 and 2014, so the loop should set yr to 2013 and then to 2014. But the output of print(yr) shows 2013 printed twice. If I simply comment out the game_df = game_df.append(temp_row,ignore_index=True) line, the printed values of yr are correct. There is an error shortly after the first two lines are printed, but that is expected and one I am comfortable debugging. What is blowing my mind is that appending to a global DataFrame appears to change how a for loop behaves. Can someone help with this?
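To take the scraping out of the picture, here is a minimal sketch of just the assignment pattern on that line. The names results, record, and try_record are made up for illustration; results stands in for game_df. It only shows that assigning to a module-level name inside a function makes Python treat that name as local to the function. It does not by itself show why yr is printed twice, which may depend on how DataFrame.apply invokes the function in the installed pandas version.

```python
# Hypothetical stand-in for the module-level game_df.
results = []


def record(value):
    # Because `results` is assigned somewhere in this function body,
    # Python treats it as a local name for the whole function, so the
    # read on the right-hand side happens before any local value exists.
    results = results + [value]


def try_record(value):
    # Call record() and report which exception type, if any, it raised.
    try:
        record(value)
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "ok"


print(try_record(2013))  # prints "UnboundLocalError"
```

The module-level results list is never modified here, because the failing assignment only ever targets the function's local name.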
Thanks.