Revisions to Python web-scraper to download table of transistor counts from Wikipedia

added 75 characters in body

Source Link

edited Sep 13, 2019 at 19:15

Carcigenicate

16.6k
3
37
82

Those are just normal variables, not class names, so they should be lower-case:

added 580 characters in body


edited Sep 13, 2019 at 18:19

Carcigenicate


import bs4

def eliminate_newlines(tag: bs4.element.Tag) -> str:  # Maybe pick a better name
    try:
        return tag.text.replace('\n', '')

    except AttributeError:  # I'm assuming this is what you intend to catch
        return ""

Edit: I was in the shower, and realized that you're actually probably intending to catch an IndexError in case the page is malformed or something. Same idea though, move that code out into a function to reduce duplication. Something like:

from typing import List

def eliminate_newlines(tags: List[bs4.element.Tag], i: int) -> str:
    return tags[i].text.replace('\n', '') if i < len(tags) else ""  # only index when i is in range

This could also be done using a conditional statement instead of a conditional expression. I figured that it's pretty simple though, so a one-liner should be fine.
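For comparison, the conditional-statement form might look like the sketch below. To keep it runnable without Beautiful Soup installed, it types the cells with a hypothetical HasText protocol and uses a hypothetical FakeCell stand-in for bs4.element.Tag; neither name comes from the original code.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class HasText(Protocol):
    """Hypothetical stand-in for bs4.element.Tag: anything with a .text string."""
    text: str


@dataclass
class FakeCell:
    """Hypothetical test double, used here only so the sketch runs on its own."""
    text: str


def eliminate_newlines(tags: Sequence[HasText], i: int) -> str:
    # Check the index first, so a short row yields "" instead of raising IndexError.
    if i < len(tags):
        return tags[i].text.replace('\n', '')
    return ""


cells = [FakeCell("first cell\n"), FakeCell("second cell\n")]
print(eliminate_newlines(cells, 0))        # in range: newline stripped
print(repr(eliminate_newlines(cells, 5)))  # out of range: empty string
```

It behaves the same as the one-liner; the statement form just leaves more room if extra checks are needed later.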

answered Sep 13, 2019 at 17:47

Carcigenicate


Make sure to follow naming conventions. You name two variables inappropriately:

My_table = soup.find('table',{'class':'wikitable sortable'})
My_second_table = My_table.find_next_sibling('table')

Those are just normal variables, not class names, so they should be lower-case:

my_table = soup.find('table', {'class': 'wikitable sortable'})
my_second_table = my_table.find_next_sibling('table')

Twice you do

try:
    title = tds[0].text.replace('\n','')
except:
    title = ""

I'd specify what exact exception you want to catch so you don't accidentally hide a "real" error if you start making changes in the future. I'm assuming here you're intending to catch an AttributeError.
Because you have essentially the same code twice, and because the code is bulky, I'd factor that out into its own function.

Something like:

import bs4

def eliminate_newlines(tag: bs4.element.Tag):  # Maybe pick a better name
    try:
        return tag.text.replace('\n', '')

    except AttributeError:  # I'm assuming this is what you intend to catch
        return ""

Now that with open block is much neater:

with open('data.csv', "a", encoding='UTF-8', newline='') as csv_file:  # the csv module expects newline=''
    writer = csv.writer(csv_file, delimiter=',')
    for tr in my_table.find_all('tr')[2:]:  # [2:] skips the empty row and the header row
        tds = tr.find_all('td')

        title = eliminate_newlines(tds[0])
        year = eliminate_newlines(tds[2])

        writer.writerow([title, year])
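To illustrate the earlier point about naming the exception, here is a small sketch of how a bare except can hide a "real" error. The function names and the deliberate typo (cels) are made up for the demonstration, and plain strings stand in for the table cells.

```python
def title_bare(cells):
    try:
        return cels[0].replace('\n', '')  # deliberate typo: 'cels' is undefined
    except:  # swallows the NameError too, so the bug goes unnoticed
        return ""


def title_narrow(cells):
    try:
        return cels[0].replace('\n', '')  # same deliberate typo
    except IndexError:  # only the expected failure is handled
        return ""


print(repr(title_bare(["some title\n"])))  # silently returns an empty string

try:
    title_narrow(["some title\n"])
except NameError as err:
    print("surfaced:", err)  # the typo is reported instead of hidden
```

With the narrow handler, the mistake shows up the first time the code runs instead of quietly producing empty titles.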

If you're using a newer version of Python, lines like:

"{}, {}".format(tds[0].text.replace('\n',''), tds[2].text.replace('\n',''))

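can make use of f-strings (Python 3.6+) to do in-place string interpolation. Before Python 3.12, though, a backslash isn't allowed inside the braces of an f-string, so the '\n' replacements need to happen first:

title = tds[0].text.replace('\n', '')
year = tds[2].text.replace('\n', '')
f"{title}, {year}"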

In this particular case, the gain isn't much. They're very helpful for more complicated formatting though.
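As a rough idea of what "more complicated formatting" can mean, the sketch below uses format specifications inside an f-string for padding, alignment, and thousands separators. The format_row helper and the sample values are placeholders, not data from the scraped table.

```python
def format_row(name: str, year: int, count: int) -> str:
    # <12 left-aligns the name in 12 columns;
    # >10, right-aligns the count in 10 columns with thousands separators.
    return f"{name:<12}|{year}|{count:>10,}"


# Placeholder rows, purely for illustration.
rows = [("Chip A", 1971, 2250), ("Chip B", 1978, 29000)]
for row in rows:
    print(format_row(*row))
```

Doing the same with str.format works too, but the values end up further from the placeholders they fill.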
