Use `values` instead of `keys`
Reinderien

Is there a way to simplify this code?

Yes. Don't scrape Wikipedia. Your first question, before "how do I scrape this thing?", should be "is there an API that can give me the data I want?" In this case, there super is.

There are many informative links, such as this Stack Overflow question, but in the end reading the MediaWiki API documentation really is the right thing to do. This should get you started:

from pprint import pprint

import requests
import wikitextparser

# Ask the MediaWiki Action API for the raw wikitext of the article,
# rather than scraping its rendered HTML.
r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'Transistor_count',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    },
)
r.raise_for_status()

# The response keys pages by an opaque page ID. We queried exactly one
# title, so take the only value rather than caring what the key is.
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']

doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')

pprint(doc.tables[0].data())

This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
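Once you have the table, `Table.data()` gives you a plain list of rows, with the header as the first row. A small helper can reshape that into records; this is a sketch under the assumption (true for the tables in this article, but not guaranteed in general) that the first row holds the column names:

```python
def rows_to_dicts(rows):
    """Convert a header-plus-rows table, as returned by
    wikitextparser's Table.data(), into a list of dicts keyed
    by the column names in the first row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical usage with the document parsed above:
# records = rows_to_dicts(doc.tables[0].data())
```

From there each row is addressable by column name instead of positional index, which is far less brittle than scraping `<td>` cells out of rendered HTML.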
