LE: And the approach below could also fail for the following test case:

  • I'm always storing the first date's data in main_html_data
  • Then, if there's a new date, I add it to my list

So now my list would look like: dates = ['date_1', 'date_2']

  • Now, if the date on a later row repeats one of those dates, I can end up with the wrong date's HTML: main_html_data only ever holds the first date's data, and html_data only holds the last page fetched. No idea how to resolve this yet (a possible fix is sketched right after this note).
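
One possible way out (just a sketch, untested) would be to cache the parsed data per date in a dict keyed by the date string, so that a repeated date always maps back to its own data. This assumes get_html_data_from_url exactly as defined in the snippet further down; get_cached_html_data is a hypothetical helper name:

html_cache = {}


def get_cached_html_data(custom_date):
    # Hypothetical helper: fetch and parse each date's page only once; a
    # repeated date always maps back to its own parsed data, never to the
    # first or the last page fetched.
    if custom_date not in html_cache:
        html_cache[custom_date] = get_html_data_from_url(custom_date)
    return html_cache[custom_date]

With that, the whole index/dates branching inside process_data would collapse to a single html_data = get_cached_html_data(row['Date']) per row.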

OK, so far so good: I found a way to reduce the execution time by storing the dates in a list.

The process would be:

  • store the parsed HTML data for the first date in main_html_data;
  • for each row, if the row's date has already been seen, reuse data that was already fetched;
  • if the date is new, fetch its page and append the date to the list.

That being said, I've got the following snippet:

import pickle
import requests
import pandas as pd

from bs4 import BeautifulSoup


FILE_TO_PROCESS = 'pickle_file.txt'


def get_df_from_file():
    # Load the pickled DataFrame and add the three empty columns that
    # will be filled in while processing.
    with open(FILE_TO_PROCESS, "rb") as openfile:
        return pickle.load(openfile).join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))


def get_html_data_from_url(custom_date):
    # Fetch and parse the scoresandodds grid page for the given date.
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')

    # Every team row, whether it is styled as odd or even.
    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []

    for row in rows:
        cells = row.find_all("td")

        # Keep only the cells of interest: team, current, halftime and score.
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))

    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }

    return data


def process_data():
    df_objects = get_df_from_file()

    dates = []
    first_date = df_objects.iloc[0]['Date']
    main_html_data = get_html_data_from_url(first_date)

    for index, row in df_objects.iterrows():
        if index < 1:
            html_data = main_html_data
            dates.append(first_date)
        elif row['Date'] in dates:
            # Known flaw (see the LE above): main_html_data only ever holds
            # the first date's data, so a repeated date is matched against
            # the wrong page.
            html_data = main_html_data
        else:
            # New date: fetch its page and remember the date.
            html_data = get_html_data_from_url(row['Date'])
            dates.append(row['Date'])

        # Copy the scraped cells for the matching team into the new columns.
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])

    df_objects.to_csv('results.csv', sep='\t')


if __name__ == '__main__':
    process_data()
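
For reference, the snippet assumes that FILE_TO_PROCESS holds a pickled DataFrame with at least 'Date' and 'Team' columns. A hypothetical way to produce such a file (the date and team strings here are made-up placeholders, not my real data):

import pickle

import pandas as pd

df = pd.DataFrame({
    'Date': ['date_1', 'date_1', 'date_2'],  # placeholders; real values must fit the grid_{}.html URL
    'Team': ['team_a', 'team_b', 'team_c'],  # placeholders; matched against the scraped team cells
})
with open('pickle_file.txt', 'wb') as f:
    pickle.dump(df, f)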

Moreover, I realized there's no need to store the DataFrame objects in a list when I can simply return the DataFrame and join the needed extra columns, all in the same function (which is what get_df_from_file does now).

If you have any other suggestions, I'd strongly encourage you to share them.