Return to Revisions

2 of 5

deleted 13 characters in body

edited Oct 9, 2016 at 6:59

9.3k
4
32
71

Matching values from html table for updating values in pandas dataframe

This is more of an exercise for me to get use to pandas and its dataframes. For those who didn't hear of it:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive

I'll make this sound like an exercise:

Given some link http://ABCD.abc/some_date.html, take the necessary information from the table on the page. Say the information looks like this:

Team  | Another Team | Col2 | Current  | Col4 | Halftime  | Scores

Team1 | TeamX        | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY        | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW        | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ        | info | Current4 | Col4 | Halftime4 | Scores4

From fileA (data from the file is pickled), add the info at the end of the dataframe in another 3 new columns: Current, Halftime and Scores.

Let's suppose the data in the dataframe looks like this:

  | Team  | Opponent | Col2 | Col3   Col4 | Col5 | Col6 | Date

0 | Team1 | TeamX    | info | info | info | info | info | some_date1 <-- see the link. date goes there in the link 
1 | TeamX | Team1    | info | info | info | info | info | some_date2 <-- see the link. date goes there in the link                                             
2 | Team3 | TeamW    | info | info | info | info | info | some_date3 <-- see the link. date goes there in the link
3 | TeamW | Team3    | info | info | info | info | info | some_date4 <-- see the link. date goes there in the link
...
and so on

Now, the task:

Parse each row from the dataframe (access the link using the date from the Date column of that row), and check if the team from this row can be found in the HTML table.
If you find it, take Current, Halftime and Scores from the table and add the info into the newly created dataframe columns.
do this for each row from the dataframe.

Now, I did solve this pretty easy, but it takes up to 1 minute to resolve 137 rows in the dataframe.

I'd like some ideas on how can I optimise it, make better use of pandas modules and if there's something wrong with the logic.

You can find the code below:

import pickle
import requests
import pandas as pd

from bs4 import BeautifulSoup


def get_df_from_file(pickle_filename):
    objects = []
    with open(pickle_filename, "rb") as openfile:
        objects.append(pickle.load(openfile))
    return objects


def add_new_df_columns():
    return get_df_from_file('CFB_15_living-2.p')[0].join(pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))


def get_html_data_from_url(custom_date):
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')

    rows = soup.find("table", {'class': 'data'}).find_all("tr", {'class': ['team odd', 'team even']})
    teams, currents, halftimes, scores = [], [], [], []

    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))

    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }

    return data


def process_data():
    df_objects = add_new_df_columns()  # data from file

    for index, row in df_objects.iterrows():
        html_data = get_html_data_from_url(row['Date'])  # dict from html
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    print(df_objects)


if __name__ == '__main__':
    process_data()

asked Oct 8, 2016 at 17:37

Grajdeanu Alex

9.3k
4
32
71