This is more of an exercise for me to get used to Pandas and its dataframes. For those who haven't heard of it:
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and intuitive.
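To make that concrete, here is a toy dataframe (the column names are made up purely for illustration):

import pandas as pd

# A dataframe is a 2-D labelled data structure: named columns plus a row index.
toy = pd.DataFrame({
    'Team': ['Team1', 'Team2'],
    'Score': [21, 14],
})
print(toy)
#     Team  Score
# 0  Team1     21
# 1  Team2     14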
Given a link like http://ABCD.abc/some_date.html, I have to take the necessary information from the table on that page.
Say the information looks like this:
Team | Another Team | Col2 | Current | Col4 | Halftime | Scores
Team1 | TeamX | info | Current1 | Col4 | Halftime1 | Scores1
Team2 | TeamY | info | Current2 | Col4 | Halftime2 | Scores2
Team3 | TeamW | info | Current3 | Col4 | Halftime3 | Scores3
Team4 | TeamZ | info | Current4 | Col4 | Halftime4 | Scores4
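As an aside, pandas can sometimes pull such a table in a single call. This is only a sketch, assuming the page serves a plain <table class="data"> that parses cleanly; the real page may still need BeautifulSoup, as in my code below:

import pandas as pd

# pd.read_html returns a list of dataframes, one per matching <table>;
# attrs narrows the search to the table with class="data".
tables = pd.read_html('http://www.scoresandodds.com/grid_some_date.html',
                      attrs={'class': 'data'})
grid = tables[0]
print(grid.head())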
I also have a dataframe, loaded from a pickle file, which looks like this:
| Team | Opponent | Col2 | Col3 | Col4 | Col5 | Col6 | Date
0 | Team1 | TeamX | info | info | info | info | info | some_date1 <-- each row's date goes into the link above
1 | TeamX | Team1 | info | info | info | info | info | some_date2
2 | Team3 | TeamW | info | info | info | info | info | some_date3
3 | TeamW | Team3 | info | info | info | info | info | some_date4
... and so on
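A quick way to inspect that pickle is pandas' own reader (assuming the file holds a single pickled dataframe, which is what my loader below expects):

import pandas as pd

# pd.read_pickle unpickles one object from the file; for a file written
# with pickle.dump(df, ...) this returns the dataframe directly.
df = pd.read_pickle('CFB_15_living-2.p')
print(df.head())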
The logic:

- Parse each row from the dataframe (access the link using the date from the `Date` column of that row), and check if the team from this row can be found in the HTML table.
- If you find it, take `Current`, `Halftime` and `Scores` from the table and add the info into the newly created dataframe columns.
- Do this for each row from the dataframe.
I'd like some ideas on how I can optimise it, how to make better use of pandas, and whether there's anything wrong with the logic.
You can find the code below:
import pickle

import requests
import pandas as pd
from bs4 import BeautifulSoup


def get_df_from_file(pickle_filename):
    # Load the first pickled object from the file and return it in a list.
    objects = []
    with open(pickle_filename, "rb") as openfile:
        objects.append(pickle.load(openfile))
    return objects


def add_new_df_columns():
    # Attach the three (empty) result columns to the pickled dataframe.
    return get_df_from_file('CFB_15_living-2.p')[0].join(
        pd.DataFrame(columns=['Currents', 'Halftimes', 'Scores']))


def get_html_data_from_url(custom_date):
    # Fetch the grid page for the given date and collect the team rows
    # from the <table class="data"> element into parallel lists.
    url = 'http://www.scoresandodds.com/grid_{}.html'.format(custom_date)
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'lxml')
    rows = soup.find("table", {'class': 'data'}).find_all(
        "tr", {'class': ['team odd', 'team even']})

    teams, currents, halftimes, scores = [], [], [], []
    for row in rows:
        cells = row.find_all("td")
        teams.append(cells[0].get_text().encode('utf-8'))
        currents.append(cells[3].get_text().encode('utf-8'))
        halftimes.append(cells[5].get_text().encode('utf-8'))
        scores.append(cells[6].get_text().encode('utf-8'))

    data = {
        'teams': teams,
        'currents': currents,
        'halftimes': halftimes,
        'scores': scores
    }
    return data


def process_data():
    df_objects = add_new_df_columns()  # data from file
    for index, row in df_objects.iterrows():
        html_data = get_html_data_from_url(row['Date'])  # dict from html
        for index_1, item in enumerate(html_data['teams']):
            if row['Team'] in item:
                # print('True: {} -> {}; Index: {}'.format(row['Team'], item, index))
                df_objects.set_value(index, 'Currents', html_data['currents'][index_1])
                df_objects.set_value(index, 'Halftimes', html_data['halftimes'][index_1])
                df_objects.set_value(index, 'Scores', html_data['scores'][index_1])
    print(df_objects)


if __name__ == '__main__':
    process_data()
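One optimisation I already suspect matters: several rows share the same date, so the same page gets fetched and parsed once per row. A small memoising wrapper around my existing function (just a sketch; the name get_html_data_cached is mine) would avoid the repeat requests:

_page_cache = {}

def get_html_data_cached(custom_date):
    # Fetch and parse each date's page only once; rows that share a
    # date reuse the parsed dict instead of hitting the site again.
    if custom_date not in _page_cache:
        _page_cache[custom_date] = get_html_data_from_url(custom_date)
    return _page_cache[custom_date]

process_data would then call get_html_data_cached(row['Date']) instead of get_html_data_from_url(row['Date']).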