\$\begingroup\$

I am 12 days into learning Python and web scraping, and I have managed to write my first automation script. Please review my code and point out any blunders.

What do I want to achieve?

I want to scrape all chapters of each novel in each category and post them on a WordPress blog as a test. Please point out anything I have missed, including anything that is mandatory for running this script against a WordPress blog.

from requests import get
from bs4 import BeautifulSoup
import re


r = get(site,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
soup = BeautifulSoup(r.text, "lxml")
category = soup.findAll(class_="search-by-genre")

# Getting all categories
categories = []
for link in soup.findAll(href=re.compile(r'/category/\w+$')):
    print("Category:", link.text)
    category_link = link['href']
    categories.append(category_link)


# Getting all Novel Headers
for category in categories:
    r = get(category_link,
            headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
    soup = BeautifulSoup(r.text, "lxml")
    Novels_header = soup.findAll(class_="top-novel-header")


    # Getting Novels' Title and Link
    for Novel_names in Novels_header:
        print("Novel:", Novel_names.text.strip())

        Novel_link = Novel_names.find('a')['href']

        # Getting Novel's Info
        r = get(Novel_link, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
        soup = BeautifulSoup(r.text, "lxml")
        Novel_divs = soup.findAll(class_="chapter-chs")

        # Novel Chapters
        for articles in Novel_divs:
            article_ch = articles.findAll("a")
            for chapters in article_ch:
                ch = chapters["href"]


                # Getting article
                r = get(ch, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"})
                soup = BeautifulSoup(r.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()

                print(full_article.get_text(strip=True, separator='\n'))
\$\endgroup\$

2 Answers

\$\begingroup\$

Naming

Variable names should be snake_case and should describe what they contain. I would also use req instead of r. The extra two characters aren't going to cause any heartache.

Constants

You have the same headers dict in four different places. I would instead define it once at the top of the file in UPPER_CASE, then just use that wherever you need headers. I would do the same for site.
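One way to take this a step further, sketched here under assumptions: put the request itself behind a small helper so the headers are referenced in exactly one place. The name fetch_html and the timeout value are hypothetical choices for illustration, not part of your script; client is assumed to be anything with a requests-style get, such as the requests module itself or a requests.Session.

```python
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko)"
}
SITE = "https://readlightnovel.org/"


def fetch_html(client, url, timeout=10):
    """Fetch url with the shared headers and return the response body as text.

    client is assumed to expose a requests-style get(url, headers=..., timeout=...),
    for example the requests module or a requests.Session instance.
    """
    response = client.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

With import requests at the top of the file, a call would look like fetch_html(requests, SITE).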

List Comprehension

I would go about collecting categories in this way:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

It's shorter and uses a list comprehension, an idiomatic feature of the Python language. Of course, if you want to print out each one, then add this just after:

for category in categories:
    print(category)

Also, after your loop finishes, category_link is left holding the last element of the list, so that assignment can go just after the list comprehension.

Save your assignments

Instead of assigning the result of soup.findAll to a variable and then looping over that variable, simply put the soup.findAll call in the loop header. Take a look:

for articles in soup.findAll(class_="chapter-chs"):
    for chapters in articles.findAll("a"):
        ....


As a result of the above changes, your code would look something like this:

from requests import get
from bs4 import BeautifulSoup
import re

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"}
SITE = "https://readlightnovel.org/"

req = get(SITE, headers=HEADERS)
soup = BeautifulSoup(req.text, "lxml")
category = soup.findAll(class_="search-by-genre")

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]
category_link = categories[-1]

# Getting all Novel Headers
for category in categories:
    req = get(category_link, headers=HEADERS)
    soup = BeautifulSoup(req.text, "lxml")
    novels_header = soup.findAll(class_="top-novel-header")


    # Getting Novels' Title and Link
    for novel_names in novels_header:
        print("Novel:", novel_names.text.strip())

        novel_link = novel_names.find('a')['href']

        # Getting Novel's Info
        req = get(novel_link, headers=HEADERS)
        soup = BeautifulSoup(req.text, "lxml")

        # Novel Chapters
        for articles in soup.findAll(class_="chapter-chs"):
            for chapters in articles.findAll("a"):
                ch = chapters["href"]

                # Getting article
                req = get(ch, headers=HEADERS)
                soup = BeautifulSoup(req.content, "lxml")
                title = soup.find(class_="block-title")
                print(title.text.strip())
                full_article = soup.find("div", {"class": "desc"})

                # remove ads inside the text:
                for ads in full_article.select('center, small, a'):
                    ads.extract()

                print(full_article.get_text(strip=True, separator='\n'))
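
One thing the rewrite above keeps from your original: the category loop requests category_link, which is the same URL on every pass, rather than the loop variable category. If the intent is to visit each category once, a sketch like the following may be closer. The helper name category_urls is hypothetical, the sample hrefs are made up for illustration, and joining against SITE only matters on the assumption that the site may emit relative links.

```python
from urllib.parse import urljoin

SITE = "https://readlightnovel.org/"


def category_urls(site, hrefs):
    """Turn scraped hrefs into absolute URLs, dropping duplicates but keeping order."""
    urls = []
    for href in hrefs:
        url = urljoin(site, href)  # absolute hrefs pass through unchanged
        if url not in urls:
            urls.append(url)
    return urls


# Made-up sample input standing in for the hrefs collected from the front page.
sample_hrefs = [
    "/category/action",
    "https://readlightnovel.org/category/drama",
    "/category/action",  # a repeated link, to show the de-duplication
]

for category in category_urls(SITE, sample_hrefs):
    # In the real script this is where get(category, headers=HEADERS) would go.
    print("Would fetch:", category)
```

The key change is that the loop body uses the loop variable, so each pass works on a different category.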
\$\endgroup\$
\$\begingroup\$

I think you can even get rid of the regular expressions. I prefer to use the BS4 functions.

Instead of:

categories = [link['href'] for link in soup.findAll(href=re.compile(r'/category/\w+$'))]

This statement is roughly equivalent, using a CSS selector:

categories = [link['href'] for link in soup.select('a[href*="/category/"]')]

That means: fetch all the a tags whose href attribute contains the text /category/. Quoting the value avoids having to escape the slashes.
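
One caveat worth checking: the regular expression is anchored with $ and allows only \w characters after /category/, while href*= matches the text anywhere in the attribute value. A small check on plain strings can show where the two differ. This sketch uses made-up hrefs and only the standard library; the substring test stands in for what href*= does.

```python
import re

# The pattern from the question: /category/ followed by word characters up to the end.
CATEGORY_RE = re.compile(r'/category/\w+$')

# Made-up hrefs for illustration only.
hrefs = [
    "https://example.com/category/action",
    "https://example.com/category/sci-fi",
    "https://example.com/category/action/page/2",
    "https://example.com/novel/some-title",
]

regex_matches = [h for h in hrefs if CATEGORY_RE.search(h)]
substring_matches = [h for h in hrefs if "/category/" in h]

print(regex_matches)      # only the first href
print(substring_matches)  # the first three hrefs
```

Whether the extra matches matter depends on what the site's category links actually look like, so it is worth printing both lists against the real page before choosing one.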

\$\endgroup\$
