Reformatting scraped selenium table

Question

I'm scraping a table that displays info for a sporting league. So far so good for a Selenium beginner:

from selenium import webdriver
import re
import pandas as pd

driver = webdriver.PhantomJS(executable_path=r'C:/.../bin/phantomjs.exe')

driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

infotable = driver.find_elements_by_class_name("table-main")
matches = driver.find_elements_by_class_name("table-participant")
ilist, match = [], []

for i in infotable:
    ilist.append(i.text)
# text of the first (main) table; assigned once, after the loop
infolist = ilist[0]

for i in matches:
    match.append(i.text)

driver.close()

home = pd.Series([item.split(' - ')[0] for item in match])
away = pd.Series([item.strip().split(' - ')[1] for item in match])

df = pd.DataFrame({'home' : home, 'away' : away})

date = re.findall(r"\d\d\s\w\w\w\s\d\d\d\d", infolist)

In the last line, date picks up every date in the table, but I can't link each date to its corresponding games.

My thinking is: for child/element "under the date", date = last_found_date.
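
That "last found date" idea can be sketched on plain data. This is only an illustration: it assumes the table rows have already been read in document order into (kind, text) pairs, and both that layout and the name attach_dates are hypothetical, not something Selenium returns.

```python
def attach_dates(rows):
    """Pair each match with the most recent date header seen before it.

    rows: iterable of (kind, text) pairs in document order, where kind is
    "date" for a header row and "match" for a game row (hypothetical layout).
    """
    last_found_date = None
    paired = []
    for kind, text in rows:
        if kind == "date":
            # remember the header; it applies to every match until the next header
            last_found_date = text
        elif kind == "match":
            paired.append((last_found_date, text))
    return paired


rows = [
    ("date", "25 Apr 2015 - Play Offs"),
    ("match", "New York Islanders - Washington Capitals"),
    ("match", "St.Louis Blues - Minnesota Wild"),
    ("date", "24 Apr 2015 - Play Offs"),
    ("match", "Vancouver Canucks - Calgary Flames"),
]
for date, match in attach_dates(rows):
    print(date, "|", match)
```

The hard part on the real page is getting header rows and match rows in one ordered sequence, which is exactly where separate find_elements calls lose information.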

The ultimate goal is to have two more columns in df: one with the date of the match and another with any text found beside the date, for example 'Play Offs' (I can figure that out myself once the date issue is sorted).

Should I be incorporating another program/method to retain the order of the tags/elements in the table?

The bottleneck is that I can't access the date tags. Neither date = driver.find_elements_by_class_name("nob-border") nor date = driver.find_elements_by_class_name("first2") lets me locate the elements, so I can't count how many "table-participant" elements belong to each date. I don't know what to use to preserve structure or order; a gap in knowledge is the main problem.
– noblerthanoedipus, Commented Feb 6, 2016 at 0:06

alecxe · Accepted Answer · 2016-02-06 00:34:06Z

You would need to change the way you extract the match information. Instead of extracting the home and away teams separately, loop over the match rows once and, for each row, look up the nearest preceding date header to get the date and event:

from selenium import webdriver

import pandas as pd

driver = webdriver.PhantomJS()
driver.get("http://www.oddsportal.com/hockey/usa/nhl-2014-2015/results/#/page/2.html")

data = []
for match in driver.find_elements_by_css_selector("div#tournamentTable tr.deactivate"):
    home, away = match.find_element_by_class_name("table-participant").text.split(" - ")
    # the nearest preceding header cell holds the date (and event, if any) for this row
    date = match.find_element_by_xpath(".//preceding::th[contains(@class, 'first2')][1]").text

    # split "date - event" headers; split at most once so the event text stays whole
    if " - " in date:
        date, event = date.split(" - ", 1)
    else:
        event = "Not specified"

    data.append({
        "home": home.strip(),
        "away": away.strip(),
        "date": date.strip(),
        "event": event.strip()
    })

driver.close()

df = pd.DataFrame(data)
print(df)

Prints:

                     away         date          event                 home
0     Washington Capitals  25 Apr 2015      Play Offs   New York Islanders
1          Minnesota Wild  25 Apr 2015      Play Offs       St.Louis Blues
2         Ottawa Senators  25 Apr 2015      Play Offs   Montreal Canadiens
3     Pittsburgh Penguins  25 Apr 2015      Play Offs     New York Rangers
4          Calgary Flames  24 Apr 2015      Play Offs    Vancouver Canucks
5      Chicago Blackhawks  24 Apr 2015      Play Offs  Nashville Predators
6     Tampa Bay Lightning  24 Apr 2015      Play Offs    Detroit Red Wings
7      New York Islanders  24 Apr 2015      Play Offs  Washington Capitals
8          St.Louis Blues  23 Apr 2015      Play Offs       Minnesota Wild
9           Anaheim Ducks  23 Apr 2015      Play Offs        Winnipeg Jets
10     Montreal Canadiens  23 Apr 2015      Play Offs      Ottawa Senators
11       New York Rangers  23 Apr 2015      Play Offs  Pittsburgh Penguins
12      Vancouver Canucks  22 Apr 2015      Play Offs       Calgary Flames
13    Nashville Predators  22 Apr 2015      Play Offs   Chicago Blackhawks
14    Washington Capitals  22 Apr 2015      Play Offs   New York Islanders
15    Tampa Bay Lightning  22 Apr 2015      Play Offs    Detroit Red Wings
16          Anaheim Ducks  21 Apr 2015      Play Offs        Winnipeg Jets
17         St.Louis Blues  21 Apr 2015      Play Offs       Minnesota Wild
18       New York Rangers  21 Apr 2015      Play Offs  Pittsburgh Penguins
19      Vancouver Canucks  20 Apr 2015      Play Offs       Calgary Flames
20     Montreal Canadiens  20 Apr 2015      Play Offs      Ottawa Senators
21    Nashville Predators  19 Apr 2015      Play Offs   Chicago Blackhawks
22    Washington Capitals  19 Apr 2015      Play Offs   New York Islanders
23          Winnipeg Jets  19 Apr 2015      Play Offs        Anaheim Ducks
24    Pittsburgh Penguins  19 Apr 2015      Play Offs     New York Rangers
25         Minnesota Wild  18 Apr 2015      Play Offs       St.Louis Blues
26      Detroit Red Wings  18 Apr 2015      Play Offs  Tampa Bay Lightning
27         Calgary Flames  18 Apr 2015      Play Offs    Vancouver Canucks
28     Chicago Blackhawks  18 Apr 2015      Play Offs  Nashville Predators
29        Ottawa Senators  18 Apr 2015      Play Offs   Montreal Canadiens
30     New York Islanders  18 Apr 2015      Play Offs  Washington Capitals
31          Winnipeg Jets  17 Apr 2015      Play Offs        Anaheim Ducks
32         Minnesota Wild  17 Apr 2015      Play Offs       St.Louis Blues
33      Detroit Red Wings  17 Apr 2015      Play Offs  Tampa Bay Lightning
34    Pittsburgh Penguins  17 Apr 2015      Play Offs     New York Rangers
35         Calgary Flames  16 Apr 2015      Play Offs    Vancouver Canucks
36     Chicago Blackhawks  16 Apr 2015      Play Offs  Nashville Predators
37        Ottawa Senators  16 Apr 2015      Play Offs   Montreal Canadiens
38     New York Islanders  16 Apr 2015      Play Offs  Washington Capitals
39        Edmonton Oilers  12 Apr 2015  Not specified    Vancouver Canucks
40          Anaheim Ducks  12 Apr 2015  Not specified      Arizona Coyotes
41     Chicago Blackhawks  12 Apr 2015  Not specified   Colorado Avalanche
42    Nashville Predators  12 Apr 2015  Not specified         Dallas Stars
43          Boston Bruins  12 Apr 2015  Not specified  Tampa Bay Lightning
44    Pittsburgh Penguins  12 Apr 2015  Not specified       Buffalo Sabres
45      Detroit Red Wings  12 Apr 2015  Not specified  Carolina Hurricanes
46      New Jersey Devils  12 Apr 2015  Not specified     Florida Panthers
47  Columbus Blue Jackets  12 Apr 2015  Not specified   New York Islanders
48     Montreal Canadiens  12 Apr 2015  Not specified  Toronto Maple Leafs
49         Calgary Flames  11 Apr 2015  Not specified        Winnipeg Jets
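
If the date column is needed as real dates rather than strings, the header-splitting step can be pulled out into a small helper and tested without a browser. The following is a sketch only: split_header is a hypothetical name, and it assumes the header text has the form "25 Apr 2015" or "25 Apr 2015 - Play Offs" with English month abbreviations, as in the output above.

```python
from datetime import datetime


def split_header(text, default_event="Not specified"):
    """Split a date header into (date, event).

    Assumes text looks like "25 Apr 2015" or "25 Apr 2015 - Play Offs".
    """
    # partition splits on the first " - " only; sep is "" when there is no event
    date_part, sep, event = text.partition(" - ")
    # %b matches the locale's month abbreviation (English by default)
    date = datetime.strptime(date_part.strip(), "%d %b %Y").date()
    return date, (event.strip() if sep else default_event)


print(split_header("25 Apr 2015 - Play Offs"))
print(split_header("12 Apr 2015"))
```

Inside the scraping loop, a call such as date, event = split_header(header_text) could stand in for the if/else branch, where header_text is the text of the preceding header cell.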

This has helped me a lot. It removes the need for any regex, and now I'm clued in on how to use XPath and CSS selectors. Many thanks for taking the time to code a better solution.
