asked Dec 28, 2022 at 22:18

Python BeautifulSoup - preparing HTML rows and td tags for Pandas

I'm using BeautifulSoup to parse the rows of a bunch of combined tables, row by row and column by column, to prepare them for import into Pandas. I can't use read_html() because one of the columns has a list of tag links in each cell.

The data structure is the same in all the tables, but there are a couple of particular columns I'd like to drop: td.a.img and td.div with {'class': 'stars'}. So, a couple of questions:

1. Should I drop the columns before importing to Pandas, or add extra column names to the header, import everything, and then drop the unwanted columns? I've done both and don't know if one is better than the other. Going back and forth between the two is messing me up.
2. Assuming I drop the columns BEFORE importing to Pandas: I can't figure out the correct way to skip a td.div tag whose class attribute contains 'stars'. My code below works, but it doesn't seem like the correct method. I can't just do an if col.div: continue, because some of the required columns have extra <div> tags that I need later.
3. As a secondary question, my collect_book_rows_from_html_files() takes about 9-10 seconds to run on 34 files. Each table has 100 rows, and the total number of rows is ~3380. Is there a better way to get the table data from each file? I have used 2 methods to read in the rows. Method 2 seems better to me, but maybe there's a better way?

I'd appreciate any insights. Thank you!

    def collect_book_rows_from_html_files(self):
        tic = time.perf_counter()
        for file in self.cache_files:
            with open(file, 'r', encoding='utf-8') as f:
                soup = BeautifulSoup(f.read(), features="lxml", parse_only=SoupStrainer('table'))
                rows = soup.find_all('tr')[1:]
                ## method 1: 
                ## self.html_rows.extend([row for row in soup.find_all('tr')[1:]])
            self.html_rows.extend(rows) # method 2
            
        toc = time.perf_counter()
        print(f"Total time to extract HTML tables from files: {toc - tic:0.4f} seconds")

    def extract_data_from_html_rows(self):
        for row in self.html_rows:
            new_row = self.rebuild_row(row)
            self.final_html_rows.append(new_row.copy())

    def rebuild_row(self, row):
        new_row = []
        for col in row.find_all('td'):
            if col.div:
                if 'stars' in str(col.div.attrs):
                    continue
            if col.a:
                if col.a.img:
                    continue
                else:
                    new_row.append(self.handle_links(col)) # returns a tuple of (name, url)
            else:
                if not col.text or not col.text.strip():
                    new_row.append(['NaN'])
                else:
                    new_text = self.clean_tag_text(col)
                    new_row.append(new_text)
        return new_row
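To make question 2 concrete, here is a minimal sketch of the kind of check I think I'm looking for. It matches the class through BeautifulSoup's `class_` filter instead of converting `attrs` to a string. The sample row, the `note` class, and the helper names `is_stars_cell`, `is_image_cell` and `wanted_cells` are made up for illustration and are not part of my real code. I'm also assuming the stars `<div>` is always a direct child of its `<td>`.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down row modelled on the real rows in this question.
SAMPLE_ROW = """
<table><tr>
<td><a href="/book/show/1"><img alt="cover" src="cover.jpg"/></a></td>
<td><a href="/book/show/1">Some Title</a></td>
<td><div class="stars" data-rating="0"><a class="star off" href="#">1 of 5 stars</a></div></td>
<td><div class="note">keep me</div></td>
<td>2022/12/25</td>
</tr></table>
"""


def is_stars_cell(col):
    """True if the cell has a direct child <div> whose class list contains 'stars'."""
    return col.find("div", class_="stars", recursive=False) is not None


def is_image_cell(col):
    """True if the cell contains a link that directly wraps an image (CSS: a > img)."""
    return col.select_one("a > img") is not None


def wanted_cells(row):
    """Return the <td> cells of a row, minus the cover-image and stars cells."""
    return [
        col
        for col in row.find_all("td")
        if not (is_stars_cell(col) or is_image_cell(col))
    ]


# html.parser is used here only so the sketch has no lxml dependency.
soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
kept = wanted_cells(soup.find("tr"))
texts = [col.get_text(strip=True) for col in kept]
print(texts)
```

Cells with other `<div>` tags (the `note` cell here) are kept, which is the behaviour a plain `if col.div: continue` can't give me.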

HTML row example:

```html
<tr id="groupBook3144889">
<td width="5%"><a href="/book/show/40851529-the-7th-of-victorica"><img alt="The 7th of Victorica by Beau Schemery" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1531785241l/40851529._SY75_.jpg" title="The 7th of Victorica by Beau Schemery"/></a></td>
<td width="30%">
<a href="/book/show/40851529-the-7th-of-victorica">The 7th of Victorica (Gadgets and Shadows, #2)</a>
</td>
<td width="10%">
<a href="/author/show/6594115.Beau_Schemery">Schemery, Beau</a>
<span title="Goodreads Author!">*</span>
</td>
<td width="1%">
<div class="stars" data-rating="0" data-resource-id="40851529" data-restore-rating="null" data-submit-url="/review/rate/40851529?stars_click=false" data-user-id="0"><a class="star off" href="#" ref="" title="did not like it">1 of 5 stars</a><a class="star off" href="#" ref="" title="it was ok">2 of 5 stars</a><a class="star off" href="#" ref="" title="liked it">3 of 5 stars</a><a class="star off" href="#" ref="" title="really liked it">4 of 5 stars</a><a class="star off" href="#" ref="" title="it was amazing">5 of 5 stars</a></div>
</td>
<td width="1%">
<a class="actionLinkLite" href="/group/bookshelf/64285-lgbtq-gsm-fantasy-science-fiction?shelf=read">read</a>, 
                    <a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-action-adventure">genre-action-adve...</a>, 
                    <a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-steampunk-dieselpunk">genre-steampunk-d...</a>, 
                    <a class="actionLinkLite" href="/group/bookshelf/64285?shelf=genre-young-adult">genre-young-adult</a>
</td>
<td width="1%">
                 
          </td>
<td width="1%">
                 
          </td>
<td width="1%">
<a href="/user/show/4872508-meghan"> Meghan</a>
</td>
<td width="1%">2022/12/25</td>
<td class="view" width="1%">
<a class="actionLink" href="/group/show_book/64285?group_book_id=3144889" style="white-space: nowrap">view activity »</a>
</td>
</tr>
```
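For question 1, this is a rough sketch of the two options side by side, to show what I mean by dropping before versus after the import. The column names in `ALL_COLUMNS` are placeholders I invented (the real header isn't shown here), and the single row is a cut-down version of the example above, with link cells stored as `(name, url)` tuples.

```python
import pandas as pd

# Hypothetical column names; the real table header is not shown in this question.
ALL_COLUMNS = ["cover", "title", "author", "stars", "date_added"]
UNWANTED = ["cover", "stars"]

# One cut-down row: link cells are (name, url) tuples, unwanted cells are None.
raw_rows = [
    [
        None,
        ("The 7th of Victorica (Gadgets and Shadows, #2)",
         "/book/show/40851529-the-7th-of-victorica"),
        ("Schemery, Beau", "/author/show/6594115.Beau_Schemery"),
        None,
        "2022/12/25",
    ],
]

# Option A: import every column, then drop the unwanted ones by name.
df_all = pd.DataFrame(raw_rows, columns=ALL_COLUMNS)
df_after = df_all.drop(columns=UNWANTED)

# Option B: filter the cells first, then import only what is left.
keep_idx = [i for i, name in enumerate(ALL_COLUMNS) if name not in UNWANTED]
filtered_rows = [[row[i] for i in keep_idx] for row in raw_rows]
df_before = pd.DataFrame(
    filtered_rows, columns=[ALL_COLUMNS[i] for i in keep_idx]
)

print(df_after.columns.tolist())
print(df_before.columns.tolist())
```

Both routes give the same columns and values in this toy case, so what I'm really asking is which one is easier to maintain and less error-prone.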

Meghan M.