Use `values` instead of `keys`
Reinderien

Is there a way to simplify this code?

Yes. Don't scrape Wikipedia. Your first question, before "how do I scrape this thing?", should be "is there an API that can give me the data I want?" In this case, there super is.

There are many informative links, such as this Stack Overflow question, but in the end reading the MediaWiki API documentation really is the right thing to do. This should get you started:

from pprint import pprint

import requests
import wikitextparser

# Ask the MediaWiki Action API for the raw wikitext of the article,
# rather than scraping its rendered HTML.
r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'Transistor_count',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    },
)
r.raise_for_status()

# The response keys pages by an opaque page ID. We queried exactly one
# title, so take the only value rather than caring what the key is.
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']

doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')

pprint(doc.tables[0].data())

This may seem more roundabout than scraping the page, but API access gets you structured data, which bypasses an HTML rendering step that you shouldn't have to deal with. This structured data is the actual source of the article and is more reliable.
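Once you have the table, `Table.data()` gives you a plain list of rows, with the header as the first row. A small helper can reshape that into records; this is a sketch under the assumption (true for the tables in this article, but not guaranteed in general) that the first row holds the column names:

```python
def rows_to_dicts(rows):
    """Convert a header-plus-rows table, as returned by
    wikitextparser's Table.data(), into a list of dicts keyed
    by the column names in the first row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical usage with the document parsed above:
# records = rows_to_dicts(doc.tables[0].data())
```

From there each row is addressable by column name instead of positional index, which is far less brittle than scraping `<td>` cells out of rendered HTML.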
