Scraping next page using BeautifulSoup

Question

I have created a script for scraping articles: it finds the title, subtitle, link (href), and time of publication. Once retrieved, the information is converted to a pandas DataFrame, and the link to the next page is returned as well (so that the script parses page after page).

Everything works as expected, though I feel there should be an easier, or more elegant, way of loading the subsequent page within the main function.

import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep


def read_page(url):
    r = requests.get(url)
    return BeautifulSoup(r.content, "lxml")


def news_scraper(soup):
    BASE = "https://www.pravda.com.ua"
    container = []
    for i in soup.select("div.news.news_all > div"):
        container.append(
            [
                i.a.text,  # title
                i.find(class_="article__subtitle").text,  # subtitle
                i.div.text,  # time
                BASE + i.a["href"],  # link
            ]
        )

    dataframe = pd.DataFrame(container, columns=["title", "subtitle", "time", "link"])
    dataframe["date"] = (
        dataframe["link"]
        .str.extract("(\d{4}/\d{2}/\d{2})")[0]
        .str.cat(dataframe["time"], sep=" ")
    )
    next_page = soup.select_one("div.archive-navigation > a.button.button_next")["href"]

    return dataframe.drop("time", axis=1), BASE + next_page


def main(START_URL):
    print(START_URL)

    results = []

    soup = read_page(START_URL)
    df, next_page = news_scraper(soup)
    results.append(df)

    while next_page:
        print(next_page)

        try:
            soup = read_page(next_page)
            df, next_page = news_scraper(soup)
            results.append(df)
        except:
            next_page = False

        sleep(1)

    return pd.concat([r for r in results], ignore_index=True)


if __name__ == "__main__":
    df = main("https://www.pravda.com.ua/archives/date_24122019/")
    assert df.shape == (120, 4) # it's true as of today, 12.26.2019

Note that the last page in the sequence ends with an <a href="/archives/" class="button button_next">...</a> link, which assigns /archives/ to next_page. How are you handling that case?
– RomanPerekhrest, Commented Dec 26, 2019 at 19:16
@Zchpyvr Since I want to scrape a lot of pages, I thought that pausing between requests would be necessary to avoid getting banned.
– Hryhorii Pavlenko, Commented Dec 26, 2019 at 19:30
@RomanPerekhrest Regarding /archives/, I didn't think it all through. Since /archives/ would throw an error if I tried to scrape it with the news_scraper function, I thought I'd just stop at that point and set next_page to False to exit the while loop.
– Hryhorii Pavlenko, Commented Dec 26, 2019 at 19:32

RomanPerekhrest · Accepted Answer · 2019-12-27 06:33:41Z

Optimization and restructuring

Function's responsibility

The initial approach makes the read_page function depend on both the requests and BeautifulSoup modules (though read_page only constructs the soup and uses none of its features). A soup instance is then passed to the news_scraper(soup) function.
To reduce dependencies, let read_page fetch the remote page and just return its contents, r.content. That also decouples news_scraper from soup instance arguments and allows any markup content to be passed in, making the function more general.

Namings

BASE = "https://www.pravda.com.ua" within the news_scraper function essentially acts as a local variable. Since it is really a constant, it should be moved to the top level and given a meaningful name: BASE_URL = "https://www.pravda.com.ua".

i is not a good variable name for a document element in for i in soup.select("div.news.news_all > div"). Better names are node, el, article ...

The main function is better renamed to news_to_df to reflect its actual purpose.
main(START_URL): don't give arguments uppercase names; it should be start_url.

Parsing news items and composing "date" value

As you are parsing HTML pages, specifying html.parser or html5lib (rather than lxml) is preferable when creating the BeautifulSoup instance.

Extracting an article's publication time with the generic i.div.text is fragile, as the parent node div.article could contain other child div nodes with text content. Therefore, the search query should be more exact: news_time = el.find(class_='article__time').text.
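To illustrate the difference, here is a minimal sketch on made-up markup. The article__badge node is hypothetical and only serves to put another div ahead of the time node; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an extra <div> precedes the time node.
SAMPLE = """
<div class="article">
  <div class="article__badge">Photo</div>
  <div class="article__time">23:36</div>
  <a href="/news/2019/12/24/0000000/">Some title</a>
</div>
"""

el = BeautifulSoup(SAMPLE, "html.parser").find(class_="article")

# Generic lookup: returns the first <div> inside the article, whatever it is.
first_div_text = el.div.text

# Exact lookup: returns the node carrying the expected class.
news_time = el.find(class_="article__time").text

print(first_div_text)  # Photo
print(news_time)       # 23:36
```

With the class-based lookup, the result no longer depends on the order of child nodes.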
Instead of assigning, traversing and dropping the "time" column and aggregating:

dataframe["date"] = (
    dataframe["link"]
    .str.extract("(\d{4}/\d{2}/\d{2})")[0]
    .str.cat(dataframe["time"], sep=" ")
)

all of that can be eliminated: the date column can be computed in one step by combining the extracted date value (using a precompiled regex pattern, DATE_PAT = re.compile(r'\d{4}/\d{2}/\d{2}')) with the news_time value.
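As a small standalone sketch of that step: compose_date is a hypothetical helper name, the href is invented for the example, and the fallback for links without a date is an assumption that the answer itself does not make.

```python
import re

DATE_PAT = re.compile(r"\d{4}/\d{2}/\d{2}")


def compose_date(href, news_time):
    """Join the date found in the link with the publication time."""
    match = DATE_PAT.search(href)
    if match is None:
        # Assumption: a link without a date part keeps only the time.
        return news_time
    return f"{match.group()} {news_time}"


print(compose_date("/news/2019/12/24/0000000/", "23:36"))  # 2019/12/24 23:36
```

Compiling the pattern once at module level avoids passing the pattern string on every call and gives it a descriptive name.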

Instead of collecting a list of lists, a more robust way is to collect a list of dictionaries such as {'title': ..., 'subtitle': ..., 'date': ..., 'link': ...}, since matching values by name prevents mixing up their order against a fixed list of column names.

Furthermore, instead of appending to a list, the sequence of dictionaries can be produced by a generator function. See the full implementation below.
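The pattern itself can be sketched on plain data, without any scraping. The input records and their keys (headline, lead, when, href) are invented for illustration.

```python
COLUMNS = ["title", "subtitle", "date", "link"]


def collect_items(records):
    """Yield one dictionary per record instead of appending lists."""
    for rec in records:
        # Keys may be written in any order; values are matched by name later.
        yield {
            "link": rec["href"],
            "title": rec["headline"],
            "date": rec["when"],
            "subtitle": rec["lead"],
        }


# Made-up input records standing in for parsed article nodes.
records = [
    {"headline": "Title A", "lead": "Lead A", "when": "2019/12/24 23:36", "href": "/news/a/"},
    {"headline": "Title B", "lead": "Lead B", "when": "2019/12/24 22:41", "href": "/news/b/"},
]

# Rows are assembled by column name, so the dictionary key order is irrelevant.
rows = [[item[col] for col in COLUMNS] for item in collect_items(records)]
print(rows[0])  # ['Title A', 'Lead A', '2019/12/24 23:36', '/news/a/']
```

pd.DataFrame matches a list of dictionaries to its columns by key in the same way, which is why the dictionary form is harder to get wrong than positional lists.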

The main function (new name: news_to_df)

The while next_page: loop becomes while True:.

except: do not use a bare except; at the very least, catch the base Exception class: except Exception:.

The repeated blocks of read_page, news_scraper and results.append(df) statements can be reduced to a single block (see below).
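One subtle nuance is that the final "next" page has '/archives/' as the href of its a.button.button_next link, signaling the end of paging. It is worth handling that situation explicitly. Because news_scraper returns the link with the base URL prepended, the comparison has to use the absolute form:

if next_page == f'{BASE_URL}/archives/':
    break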
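The stopping rule can be tried without network access. The sketch below assumes the scraper returns absolute next-page URLs (as news_scraper does by prepending the base URL); crawl, fake_scrape, and FAKE_SITE are hypothetical names, and the fake pages hold placeholder items rather than real articles.

```python
BASE_URL = "https://www.pravda.com.ua"
LAST_PAGE = f"{BASE_URL}/archives/"  # "next" link found on the final page

# In-memory stand-in for the site: page URL -> (items, absolute next URL).
FAKE_SITE = {
    f"{BASE_URL}/archives/date_24122019/": (["a", "b"], f"{BASE_URL}/archives/date_25122019/"),
    f"{BASE_URL}/archives/date_25122019/": (["c"], LAST_PAGE),
}


def fake_scrape(url):
    """Pretend to fetch and parse a page."""
    return FAKE_SITE[url]


def crawl(start_url, scrape):
    """Follow next-page links until the archive root is reached."""
    results, next_page = [], start_url
    while next_page != LAST_PAGE:
        items, next_page = scrape(next_page)
        results.extend(items)
    return results


visited = crawl(f"{BASE_URL}/archives/date_24122019/", fake_scrape)
print(visited)  # ['a', 'b', 'c']
```

Passing the scraping function in as an argument keeps the loop testable; the real version would also pause between requests.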

The final optimized solution:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
import re

BASE_URL = "https://www.pravda.com.ua"
DATE_PAT = re.compile(r'\d{4}/\d{2}/\d{2}')


def read_page(url):
    r = requests.get(url)
    return r.content


def _collect_newsitems_gen(articles):
    for el in articles:
        a_node = el.a
        news_time = el.find(class_='article__time').text
        yield {'title': a_node.text, 
               'subtitle': el.find(class_="article__subtitle").text,
               'date': f'{DATE_PAT.search(a_node["href"]).group()} {news_time}',
               'link': f'{BASE_URL}{a_node["href"]}'}


def news_scraper(news_content):
    soup = BeautifulSoup(news_content, "html5lib")
    articles = soup.select("div.news.news_all > div")
    next_page_url = soup.select_one("div.archive-navigation > a.button.button_next")["href"]
    df = pd.DataFrame(list(_collect_newsitems_gen(articles)),
                      columns=["title", "subtitle", "date", "link"])

    return df, f'{BASE_URL}{next_page_url}'


def news_to_df(start_url):
    next_page = start_url
    results = []

    while True:
        print(next_page)
        try:
            content = read_page(next_page)
            df, next_page = news_scraper(content)
            results.append(df)
            # news_scraper returns an absolute URL, so compare the absolute form
            if next_page == f'{BASE_URL}/archives/':
                break
        except Exception:
            break

        sleep(1)

    return pd.concat(results, ignore_index=True)


if __name__ == "__main__":
    df = news_to_df("https://www.pravda.com.ua/archives/date_24122019/")        
    assert df.shape == (120, 4)  # it's true as of today, 12.26.2019

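Printing the final df with print(df.to_string()) gave the output below when this answer was written (the middle part is cut to keep it shorter). The URLs printed first are the pages requested; the last one, /archives/, is requested only when the end-of-paging check does not match.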

https://www.pravda.com.ua/archives/date_24122019/
https://www.pravda.com.ua/archives/date_25122019/
https://www.pravda.com.ua/archives/
                                                 title                                           subtitle              date                                               link
0    Голова Закарпаття не зрозумів, за що його звіл...  Голова Закарпатської обласної державної адміні...  2019/12/24 23:36  https://www.pravda.com.ua/news/2019/12/24/7235...
1    Стало відомо коли відновлять будівництво об'єк...  На зустрічі представників керівництва ХК Київм...  2019/12/24 22:41  https://www.pravda.com.uahttps://www.epravda.c...
2          ВАКС продовжив арешт Гримчаку до 14 лютого   Вищий антикорупційний продовжив арешт для коли...  2019/12/24 22:25  https://www.pravda.com.ua/news/2019/12/24/7235...
3    Економічні новини 24 грудня: транзит газу, зни...  Про транзит газу, про зниження "платіжок", про...  2019/12/24 22:10  https://www.pravda.com.uahttps://www.epravda.c...
4    Трамп: США готові до будь-якого "різдвяного по...  Президент США Дональд Трамп на тлі побоювань щ...  2019/12/24 22:00  https://www.pravda.com.uahttps://www.eurointeg...
5    У податковій слідчі дії – електронні сервіси п...  Державна податкова служба попереджає, що елект...  2019/12/24 21:55  https://www.pravda.com.ua/news/2019/12/24/7235...
6     Мінфін знизив ставки за держборгом до 11% річних  Міністерство фінансів знизило середньозважену ...  2019/12/24 21:31  https://www.pravda.com.uahttps://www.epravda.c...
7    Україна викреслила зі списку на обмін ексберку...  Російський адвокат Валентин Рибін заявляє, що ...  2019/12/24 21:13  https://www.pravda.com.ua/news/2019/12/24/7235...
8    Посол: іспанський клуб покарають за образи укр...  Посол України в Іспанії Анатолій Щерба заявив,...  2019/12/24 20:45  https://www.pravda.com.uahttps://www.eurointeg...
9    Міністр енергетики: "Газпром" може "зістрибнут...  У Міністерстві енергетики не виключають, що "Г...  2019/12/24 20:03  https://www.pravda.com.uahttps://www.epravda.c...
10   Зеленський призначив Арахамію секретарем Націн...  Президент Володимир Зеленський затвердив персо...  2019/12/24 20:00  https://www.pravda.com.ua/news/2019/12/24/7235...
...
110  Уряд придумав, як захистити українців від шкод...  Кабінет міністрів схвалив законопроєкт, який з...  2019/12/25 06:54  https://www.pravda.com.ua/news/2019/12/25/7235...
111  Кіберполіція та YouControl домовилися про спів...  Кіберполіція та компанія YouControl підписали ...  2019/12/25 06:00  https://www.pravda.com.ua/news/2019/12/25/7235...
112  В окупованому Криму продають прикарпатські яли...  У центрі Сімферополя, на новорічному ярмарку п...  2019/12/25 05:11  https://www.pravda.com.ua/news/2019/12/25/7235...
113  У США схожий на Санту чоловік пограбував банк,...  У Сполучених Штатах чоловік з білою, як у Сант...  2019/12/25 04:00  https://www.pravda.com.ua/news/2019/12/25/7235...
114  У Росії за "дитячу порнографію" посадили блоге...  Верховний суд російської Чувашії засудив до тр...  2019/12/25 03:26  https://www.pravda.com.ua/news/2019/12/25/7235...
115  Уряд провів екстрене засідання через газові пе...  Кабінет міністрів у вівторок ввечері провів ек...  2019/12/25 02:31  https://www.pravda.com.ua/news/2019/12/25/7235...
116  Нова стратегія Мінспорту: розвиток інфраструкт...  Стратегія розвитку спорту і фізичної активност...  2019/12/25 02:14  https://www.pravda.com.ua/news/2019/12/25/7235...
117  Милованов розкритикував НБУ за курс гривні та ...  Міністр розвитку економіки Тимофій Милованов р...  2019/12/24 01:46  https://www.pravda.com.uahttps://www.epravda.c...
118  Російські літаки розбомбили школу в Сирії: заг...  Щонайменше 10 людей, в тому числі шестеро – ді...  2019/12/25 01:04  https://www.pravda.com.ua/news/2019/12/25/7235...
119  Ліквідація "майданчиків Яценка": Зеленський пі...  Президент Володимир Зеленський підписав закон,...  2019/12/25 00:27  https://www.pravda.com.ua/news/2019/12/25/7235...
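Several links in the output above start with https://www.pravda.com.uahttps://..., which suggests that those href values were already absolute when the base URL was prepended. One possible remedy, not part of the solution above, is the standard library's urllib.parse.urljoin. In this sketch absolute_link is a hypothetical helper and both href values are invented.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pravda.com.ua"


def absolute_link(href):
    """Resolve href against BASE_URL; absolute URLs pass through unchanged."""
    return urljoin(BASE_URL, href)


print(absolute_link("/news/2019/12/24/0000000/"))
# https://www.pravda.com.ua/news/2019/12/24/0000000/
print(absolute_link("https://example.com/story/"))
# https://example.com/story/
```

Using it in place of the f'{BASE_URL}{a_node["href"]}' expression would leave links to other sites intact.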

P.S. From Ukraine with love ...

Thanks so much, I learnt a lot. Дякую (thank you) :)
– Hryhorii Pavlenko, Commented Dec 27, 2019 at 8:36
@politicalscientist, you're welcome
– RomanPerekhrest, Commented Dec 27, 2019 at 8:39
