\$\begingroup\$

This is my first time web scraping, and here is the code I whipped up:

from bs4 import BeautifulSoup
import requests
import time

keywords = ['python']

for n in range(1000):
    res = requests.get(f"https://stackoverflow.com/questions?tab=newest&page={n}")
    time.sleep(3) # Sleep to avoid getting rate limited again
    soup = BeautifulSoup(res.text, "html.parser")
    questions = soup.select(".question-summary") # List of all question summaries in the current page
    for que in questions:
        found = False
        tagged = False
        q = que.select_one('.question-hyperlink').getText() # Store the title of the question
        for a in que.find_all('a', href=True):
            u = a['href'] # Store the link
            if u.split('/')[1] == 'questions' and u.split('/')[2] != 'tagged': # If this link is a question and not a tag
                res2 = requests.get("https://stackoverflow.com" + u) # Send request for that question
                time.sleep(3) # Extra precaution to avoid getting rate limited again
                soup2 = BeautifulSoup(res2.text, "html.parser")
                body = str(soup2.select(".s-prose")) # This is the body of the question
                if any(key in body for key in keywords):
                    found = True
            if 'tagged/python' in u:
                tagged = True

        if found and not tagged:
            print(q)

My code scrapes Stack Overflow posts, newest first, and prints every post that has the keyword "python" in its body but is not tagged with python. Did I implement the algorithm optimally? Can you show me where to improve?

\$\endgroup\$
  • \$\begingroup\$ Can you comment/explain some of the code, especially towards the end? \$\endgroup\$ Commented Nov 9, 2020 at 21:31
  • \$\begingroup\$ @AMC Okay, done. \$\endgroup\$ Commented Nov 10, 2020 at 2:40
  • \$\begingroup\$ @pacmaninbw Hello, what's up? \$\endgroup\$ Commented Nov 10, 2020 at 3:23
  • \$\begingroup\$ Didn't realize the comments were requested by the poster of the answer. It is better not to edit the question after it has been answered since everyone should see what the person that answered saw. \$\endgroup\$ Commented Nov 10, 2020 at 3:50

1 Answer

\$\begingroup\$

The most important change: Check the tags before even getting the question's page. If it's tagged with python, then you know the question doesn't fit your criteria, regardless of what's in the body. Considering how popular the python tag is, this should save a good amount of time and processing.
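To make the ordering concrete, here is a minimal sketch of the decision logic on its own, with no network access. The names `has_excluded_tag`, `body_mentions`, `is_candidate` and the `fetch_body` callable are hypothetical and only stand in for the real page request; the point is that the body is never fetched when the tag check already rules the question out.

```python
def has_excluded_tag(tag_names, tag_keywords):
    """Return True if any tag on the question exactly matches a keyword."""
    wanted = {keyword.casefold() for keyword in tag_keywords}
    return any(tag.casefold() in wanted for tag in tag_names)


def body_mentions(body_text, content_keywords):
    """Return True if the body contains any keyword, ignoring case."""
    folded = body_text.casefold()
    return any(keyword.casefold() in folded for keyword in content_keywords)


def is_candidate(tag_names, fetch_body, tag_keywords, content_keywords):
    """Decide whether a question mentions a keyword without carrying the tag.

    fetch_body is a zero-argument callable (hypothetical stand-in for the
    HTTP request); it is only called when the tag check passes.
    """
    if has_excluded_tag(tag_names, tag_keywords):
        return False  # cheap check first: skip the expensive request
    return body_mentions(fetch_body(), content_keywords)
```

Note that an exact tag match treats `python-3.x` as a different tag, whereas the original `'tagged/python' in u` test would also match it. Which behaviour is wanted depends on the criteria.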

I introduced a requests.Session object, which could improve performance because the underlying connection is reused across requests. I also tweaked the variable names, and replaced the CSS selectors with the find()/find_all() methods. That last part is mostly a matter of personal preference, though.

Keep in mind that this code contains no error handling.

import requests
from bs4 import BeautifulSoup

tag_keywords = ['python']
content_keywords = ['python'.casefold()]

res = []

with requests.Session() as sess:
    for page_num in range(1):
        new_questions_page_req = sess.get(f'https://stackoverflow.com/questions?tab=newest&page={page_num}')
        soup = BeautifulSoup(new_questions_page_req.content, 'lxml')
        questions_container = soup.find('div', attrs={'id': 'questions', 'class': 'flush-left'})
        questions_list = questions_container.find_all('div', attrs={'class': 'question-summary'}, recursive=False)

        for curr_question_cont in questions_list:
            # The element id looks like 'question-summary-<id>'; the prefix is 17 characters long.
            question_id = curr_question_cont['id'][17:]
            summary_elem = curr_question_cont.find('div', attrs={'class': 'summary'})
            tags_container = summary_elem.find('div', attrs={'class': 'tags'})
            tag_names = [elem.get_text() for elem in tags_container.find_all('a', recursive=False)]

            if not any(curr_tag in tag_keywords for curr_tag in tag_names):
                question_rel_url = summary_elem.find('h3').find('a', attrs={'class': 'question-hyperlink'})['href']
                question_page_req = sess.get(f'https://stackoverflow.com/q/{question_id}')
                question_page_soup = BeautifulSoup(question_page_req.content, 'lxml')
                question_body = question_page_soup.find('div', attrs={'class': 's-prose'})
                question_body_text = question_body.get_text().casefold()

                if any(curr_keyword in question_body_text for curr_keyword in content_keywords):
                    res.append(question_id)

print(res)
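
Since the code above has no error handling, here is one possible sketch of a retry wrapper, not a definitive implementation. The name `fetch_with_retries` and its parameters are hypothetical; `fetch` is any callable that returns a response or raises, so the helper can be tried without touching the network.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=3.0, retry_on=(OSError,)):
    """Call fetch(url), retrying when it raises one of the retry_on types.

    retries is the total number of attempts; delay is the pause in seconds
    between attempts. The last error is re-raised if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # back off before the next attempt
    raise last_error
```

With requests, `fetch` could be a small wrapper that calls `sess.get(url, timeout=10)` and then `raise_for_status()` on the response. The exceptions requests raises derive from `OSError`, so the default `retry_on` should cover them. A missing element is a separate failure mode: `find()` returns `None` when the markup does not match, so each result needs its own check before it is used.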
\$\endgroup\$
  • \$\begingroup\$ Wouldn't 'python'.casefold() do nothing because you've evaluated it before you used it? \$\endgroup\$ Commented Nov 10, 2020 at 14:07
  • \$\begingroup\$ @Chocolate What do you mean? The result of the case-folding is used as the element in the list literal. \$\endgroup\$ Commented Nov 11, 2020 at 3:07
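
The point in the reply above can be checked with a short sketch. The sample strings are made up for illustration; a capitalised keyword is used so that the effect of case-folding is visible.

```python
# The method call runs once, while the list literal is being built,
# so the list stores the folded string, not an unevaluated call.
content_keywords = ['Python'.casefold()]
print(content_keywords)

body = "Is this possible in PYTHON 3?"

# Without folding the body, the lowercase keyword is not found.
naive_match = any(keyword in body for keyword in content_keywords)

# Folding the body as well makes the comparison case-insensitive.
folded_match = any(keyword in body.casefold() for keyword in content_keywords)

print(naive_match, folded_match)
```

For the literal `'python'` used in the answer, the fold changes nothing; it only matters once a keyword with uppercase characters is added to the list.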
