
Vikas Gulia
🕸️ Web Scraping in Python: A Practical Guide for Data Scientists

"Data is the new oil, and web scraping is one of the drills."

Whether you’re gathering financial data, tracking competitor prices, or building datasets for machine learning projects, web scraping is a powerful tool to extract information from websites automatically.

In this blog post, we’ll explore:

  • What web scraping is
  • How it works
  • Legal and ethical considerations
  • Key Python tools for scraping
  • A complete scraping project using requests, BeautifulSoup, and pandas
  • Bonus: Scraping dynamic websites using Selenium

✅ What is Web Scraping?

Web scraping is the automated process of extracting data from websites. Think of it as teaching Python to browse the web, read pages, and pick out the data you're interested in.


⚖️ Is Web Scraping Legal?

Scraping publicly available data for personal, educational, or research purposes is usually okay. However:

  • Always check the website’s robots.txt file (e.g., https://www.example.com/robots.txt)
  • Read the Terms of Service
  • Avoid overloading servers with too many requests (use time delays)
  • Never scrape private or paywalled content without permission
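Python’s standard library can read robots.txt rules for you. Here’s a minimal sketch using urllib.robotparser — the rules below are a hypothetical example for illustration; in practice you would point the parser at the site’s real robots.txt URL and call `read()`:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
# In practice: rp.set_url("https://www.example.com/robots.txt"); rp.read()
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("*", "https://www.example.com/page/1/"))    # True
print(rp.can_fetch("*", "https://www.example.com/private/x"))  # False
```

A quick `can_fetch()` check before each request is a cheap way to stay on the right side of a site’s stated rules.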

🧰 Popular Python Libraries for Web Scraping

  • requests: sends HTTP requests
  • BeautifulSoup: parses and extracts data from HTML
  • lxml: a fast HTML/XML parser
  • pandas: organizes and analyzes scraped data
  • Selenium: drives a real browser for JavaScript-heavy sites
  • playwright: a modern alternative to Selenium

🧪 Step-by-Step Web Scraping Example

Let’s scrape quotes from http://quotes.toscrape.com — a beginner-friendly practice site.

🛠️ Step 1: Install Required Libraries

pip install requests beautifulsoup4 pandas

🧾 Step 2: Send a Request and Parse HTML

import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/page/1/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)  # Output: Quotes to Scrape

🧮 Step 3: Extract the Quotes and Authors

quotes = []
authors = []

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text.strip()
    author = quote.find("small", class_="author").text.strip()

    quotes.append(text)
    authors.append(author)

# Print sample
for i in range(3):
    print(f"{quotes[i]} - {authors[i]}")

📊 Step 4: Store Data Using pandas

import pandas as pd

df = pd.DataFrame({
    "Quote": quotes,
    "Author": authors
})

print(df.head())

# Optional: Save to CSV
df.to_csv("quotes.csv", index=False)

🔁 Scrape Multiple Pages

all_quotes = []
all_authors = []

for page in range(1, 6):  # First 5 pages
    url = f"http://quotes.toscrape.com/page/{page}/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")

    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())

df = pd.DataFrame({"Quote": all_quotes, "Author": all_authors})
df.to_csv("all_quotes.csv", index=False)

🔄 Bonus: Scraping JavaScript-Rendered Sites using Selenium

Some sites render their content with JavaScript after the initial page loads, so requests alone only sees an empty shell. A browser-automation tool like Selenium can execute the JavaScript first, then hand the fully rendered HTML to BeautifulSoup.

🛠️ Install Selenium & WebDriver

pip install selenium

Download the appropriate ChromeDriver from https://chromedriver.chromium.org/downloads and add it to your system PATH. (Recent Selenium releases, 4.6 and later, bundle Selenium Manager, which can locate or download a matching driver automatically.)

🌐 Selenium Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

service = Service("chromedriver")  # Path to your ChromeDriver
driver = webdriver.Chrome(service=service)

driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # Wait for JS to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for quote in soup.find_all("div", class_="quote"):
    print(quote.find("span", class_="text").text.strip())

🧠 Best Practices for Web Scraping

  • ✅ Use headers to mimic a browser:

    headers = {"User-Agent": "Mozilla/5.0"}
    requests.get(url, headers=headers)
  • ✅ Add delays between requests using time.sleep()
  • ✅ Handle exceptions and errors gracefully
  • ✅ Respect robots.txt and terms of use
  • ✅ Use proxies or rotate IPs for large-scale scraping
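Several of these practices can be combined into one small helper. The sketch below is my own illustration (the names polite_get and backoff_delay are hypothetical, not from any library): it sends a browser-like User-Agent, sets a timeout, and retries failed requests with an increasing delay:

```python
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # mimic a browser


def backoff_delay(attempt, base=1.0):
    """Exponential backoff: 1s, 2s, 4s, ... between retry attempts."""
    return base * (2 ** attempt)


def polite_get(url, retries=3):
    """GET a URL with browser-like headers, a timeout, and retries."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            resp.raise_for_status()  # raise on 4xx/5xx responses
            return resp
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries; let the caller handle it
            time.sleep(backoff_delay(attempt))
```

Combined with a short `time.sleep()` between page requests, a helper like this keeps a scraper from hammering a server and makes transient network errors non-fatal.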

📦 Real-World Use Cases

  • 📰 News Monitoring (e.g., scraping articles for sentiment analysis)
  • 🛒 E-commerce Price Tracking
  • 📊 Competitor Research
  • 🧠 Training Datasets for NLP/ML projects
  • 🏢 Job Listings and Market Analysis

📌 Final Thoughts

Web scraping is a foundational tool in a data scientist’s arsenal. Mastering it opens up endless possibilities — from building custom datasets to powering AI models with real-world information.

“If data is fuel, then web scraping is how you build your own pipeline.”
