"Data is the new oil, and web scraping is one of the drills."
Whether you’re gathering financial data, tracking competitor prices, or building datasets for machine learning projects, web scraping is a powerful tool to extract information from websites automatically.
In this blog post, we’ll explore:
- What web scraping is
- How it works
- Legal and ethical considerations
- Key Python tools for scraping
- A complete scraping project using `requests`, `BeautifulSoup`, and `pandas`
- Bonus: scraping dynamic websites using `Selenium`
✅ What is Web Scraping?
Web scraping is the automated process of extracting data from websites. Think of it as teaching Python to browse the web, read pages, and pick out the data you're interested in.
⚖️ Is Web Scraping Legal?
Scraping publicly available data for personal, educational, or research purposes is usually okay. However:
- Always check the website’s `robots.txt` file (`www.example.com/robots.txt`); you can even check it programmatically, as sketched after this list
- Read the Terms of Service
- Avoid overloading servers with too many requests (use time delays)
- Never scrape private or paywalled content without permission
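Python’s standard library can read `robots.txt` for you. A minimal sketch using `urllib.robotparser` (the `www.example.com` URL is just a placeholder; swap in the site you plan to scrape):

```python
from urllib.robotparser import RobotFileParser

# Load and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may crawl the path
print(rp.can_fetch("*", "https://www.example.com/some/page"))
```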
🧰 Popular Python Libraries for Web Scraping
| Library | Purpose |
|---|---|
| `requests` | To send HTTP requests |
| `BeautifulSoup` | To parse and extract data from HTML |
| `lxml` | A fast HTML/XML parser |
| `pandas` | To organize and analyze scraped data |
| `Selenium` | For dynamic websites with JavaScript |
| `playwright` | Modern alternative to Selenium |
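The table lists `playwright`, but the walkthrough below only covers Selenium. For reference, here is a minimal sketch of Playwright’s synchronous API (assumes `pip install playwright` followed by `playwright install chromium`; the target URL is the JavaScript version of the demo site used later):

```python
# Minimal Playwright sketch: launch headless Chromium, let the page render,
# then hand the resulting HTML to BeautifulSoup as usual.
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    html = page.content()  # HTML after JavaScript has run
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)
```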
🧪 Step-by-Step Web Scraping Example
Let’s scrape quotes from http://quotes.toscrape.com — a beginner-friendly practice site.
🛠️ Step 1: Install Required Libraries
```bash
pip install requests beautifulsoup4 pandas
```
🧾 Step 2: Send a Request and Parse HTML
```python
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/page/1/"
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.text)  # Output: Quotes to Scrape
```
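Before parsing, it’s worth confirming the request actually succeeded rather than silently parsing an error page. `raise_for_status()` and the `timeout` argument are standard `requests` features:

```python
# Raise an HTTPError for 4xx/5xx responses, and don't hang forever
response = requests.get(URL, timeout=10)
response.raise_for_status()
```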
🧮 Step 3: Extract the Quotes and Authors
```python
quotes = []
authors = []

for quote in soup.find_all("div", class_="quote"):
    text = quote.find("span", class_="text").text.strip()
    author = quote.find("small", class_="author").text.strip()
    quotes.append(text)
    authors.append(author)

# Print sample
for i in range(3):
    print(f"{quotes[i]} — {authors[i]}")
```
📊 Step 4: Store Data Using pandas
```python
import pandas as pd

df = pd.DataFrame({
    "Quote": quotes,
    "Author": authors
})

print(df.head())

# Optional: Save to CSV
df.to_csv("quotes.csv", index=False)
```
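Once the data is in a DataFrame, analysis is straightforward; for example, a one-liner to see which authors appear most often:

```python
# Count quotes per author, most-quoted first
print(df["Author"].value_counts().head())
```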
🔁 Scrape Multiple Pages
```python
import time

all_quotes = []
all_authors = []

for page in range(1, 6):  # First 5 pages
    url = f"http://quotes.toscrape.com/page/{page}/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    for quote in soup.find_all("div", class_="quote"):
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())
    time.sleep(1)  # Be polite: pause between requests

df = pd.DataFrame({"Quote": all_quotes, "Author": all_authors})
df.to_csv("all_quotes.csv", index=False)
```
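Hard-coding five pages works here, but a more robust pattern is to keep going until a page comes back empty. A sketch of that loop (the stopping condition is an assumption about how a site signals its last page, though a missing `div.quote` does mark the end on quotes.toscrape.com):

```python
import time

all_quotes, all_authors = [], []
page = 1

while True:
    res = requests.get(f"http://quotes.toscrape.com/page/{page}/")
    soup = BeautifulSoup(res.text, "html.parser")
    blocks = soup.find_all("div", class_="quote")
    if not blocks:  # no quotes on this page: we've run past the last page
        break
    for quote in blocks:
        all_quotes.append(quote.find("span", class_="text").text.strip())
        all_authors.append(quote.find("small", class_="author").text.strip())
    page += 1
    time.sleep(1)  # polite delay between requests
```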
🔄 Bonus: Scraping JavaScript-Rendered Sites using Selenium
Some sites load data dynamically with JavaScript, so plain `requests` won’t work: the HTML it returns doesn’t yet contain the content that JavaScript renders.
🛠️ Install Selenium & WebDriver
```bash
pip install selenium
```
Download the appropriate ChromeDriver from https://chromedriver.chromium.org/downloads and add it to your system path. (On Selenium 4.6 or newer, Selenium Manager can locate or download the driver for you automatically, so this manual step is often unnecessary.)
🌐 Selenium Example
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

service = Service("chromedriver")  # Path to your ChromeDriver
driver = webdriver.Chrome(service=service)

driver.get("https://quotes.toscrape.com/js/")
time.sleep(2)  # Wait for JS to load

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for quote in soup.find_all("div", class_="quote"):
    print(quote.find("span", class_="text").text.strip())
```
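A fixed `time.sleep(2)` is fragile: wasted time on a fast connection, too short on a slow one. Selenium’s explicit waits pause only until the element you need actually appears; a sketch of the same wait done that way:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the first quote block to be present,
# then continue immediately once it shows up.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
)
```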
🧠 Best Practices for Web Scraping
- ✅ Use headers to mimic a browser: `headers = {"User-Agent": "Mozilla/5.0"}`, then `requests.get(url, headers=headers)`
- ✅ Add delays between requests using `time.sleep()`
- ✅ Handle exceptions and errors gracefully (a sketch combining these points follows this list)
- ✅ Respect robots.txt and terms of use
- ✅ Use proxies or rotate IPs for large-scale scraping
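Putting the first three points together, here is a small sketch of a "polite" fetch helper. The function name, delay, and timeout values are illustrative choices, not a standard recipe:

```python
import time
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}

def polite_get(url, delay=1.0, timeout=10):
    """Fetch a URL with a browser-like header, a timeout, and a pause.

    Returns the Response on success, or None if the request failed.
    (Hypothetical helper for illustration; tune delay/timeout per site.)
    """
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()
        return response
    except requests.RequestException as exc:
        print(f"Request to {url} failed: {exc}")
        return None
    finally:
        time.sleep(delay)  # pause between requests regardless of outcome

res = polite_get("http://quotes.toscrape.com/page/1/")
if res is not None:
    print(len(res.text), "bytes fetched")
```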
📦 Real-World Use Cases
- 📰 News Monitoring (e.g., scraping articles for sentiment analysis)
- 🛒 E-commerce Price Tracking
- 📊 Competitor Research
- 🧠 Training Datasets for NLP/ML projects
- 🏢 Job Listings and Market Analysis
📌 Final Thoughts
Web scraping is a foundational tool in a data scientist’s arsenal. Mastering it opens up endless possibilities — from building custom datasets to powering AI models with real-world information.
“If data is fuel, then web scraping is how you build your own pipeline.”