📌 Project Overview
In this project, I built a Python-based pipeline to automate the collection of remote job listings from Indeed. It uses Selenium and BeautifulSoup for web scraping, Pandas for cleaning the extracted data, and PostgreSQL for storing the final structured dataset for analysis and reporting.
This article walks you through each step of the pipeline, including:
- Setting up the environment
- Automating job searches with Selenium
- Parsing job data using BeautifulSoup
- Data cleaning and transformation
- Storing the final dataset into PostgreSQL
🛠️ Step 1: Setting Up the Environment
First, I installed the necessary libraries:
!pip install selenium beautifulsoup4 pandas psycopg2-binary
I also downloaded the matching WebDriver (e.g., ChromeDriver for Chrome) and made sure it was available on the system PATH.
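If you'd rather not touch the PATH, Selenium 4 also lets you point at the driver binary explicitly through a Service object. A minimal sketch (the driver path below is a placeholder for wherever you saved ChromeDriver):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Placeholder path -- replace with your own ChromeDriver location
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)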
🌐 Step 2: Navigating to the Website with Selenium
Using Selenium, I navigated to the Indeed search page and triggered a search for remote jobs in tech-related fields:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get("https://www.indeed.com")
# Search for remote jobs
search_job = driver.find_element(By.NAME, "q")
search_job.send_keys("Data Analyst Remote")
search_job.submit()
time.sleep(5) # Wait for the results to load
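A fixed sleep works, but it can be flaky on slow connections. An explicit wait that polls for the result cards is usually more reliable; here is a small sketch, assuming the job_seen_beacon class used in the next step is still present in Indeed's markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 15 seconds for at least one job card to appear in the results
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "job_seen_beacon"))
)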
🧩 Step 3: Parsing the HTML with BeautifulSoup
After the search results loaded, I passed the page source to BeautifulSoup to extract job details like title, company, location, and summary:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
job_cards = soup.find_all("div", class_="job_seen_beacon")
jobs = []
for card in job_cards:
    title = card.find("h2", class_="jobTitle").text.strip()
    company = card.find("span", class_="companyName").text.strip()
    location = card.find("div", class_="companyLocation").text.strip()
    summary = card.find("div", class_="job-snippet").text.strip()
    jobs.append([title, company, location, summary])
Once done, I closed the browser:
driver.quit()
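One caveat: Indeed's class names change periodically, and individual cards sometimes omit a field, so find() can return None and crash the loop. A more defensive variant of the extraction (a sketch using a hypothetical safe_text helper, not the exact code from the notebook) looks like this:
def safe_text(card, tag, class_name):
    # Return the stripped text of a child element, or an empty string if it's missing
    element = card.find(tag, class_=class_name)
    return element.text.strip() if element else ""

jobs = []
for card in job_cards:
    jobs.append([
        safe_text(card, "h2", "jobTitle"),
        safe_text(card, "span", "companyName"),
        safe_text(card, "div", "companyLocation"),
        safe_text(card, "div", "job-snippet"),
    ])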
📊 Step 4: Storing Data in a DataFrame and Cleaning It
I converted the list of jobs into a Pandas DataFrame and performed light cleaning:
import pandas as pd
df = pd.DataFrame(jobs, columns=["Job Title", "Company", "Location", "Summary"])
df.drop_duplicates(inplace=True)
df.head()
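Depending on how the scrape comes back, a couple of extra cleaning passes can help, for example trimming whitespace and dropping rows with no title, plus an optional CSV checkpoint before touching the database. A small sketch (the file name is just an example):
# Normalize whitespace in every text column
for col in df.columns:
    df[col] = df[col].str.strip()
# Drop rows where scraping returned an empty job title
df = df[df["Job Title"] != ""]
# Optional checkpoint before loading into PostgreSQL
df.to_csv("remote_jobs_raw.csv", index=False)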
🗃️ Step 5: Storing the Data in PostgreSQL
Finally, I connected to a PostgreSQL database using psycopg2 and inserted the data:
import psycopg2
conn = psycopg2.connect(
    host="localhost",
    database="job_scraper",
    user="your_username",
    password="your_password"
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS remote_jobs (
        id SERIAL PRIMARY KEY,
        job_title TEXT,
        company TEXT,
        location TEXT,
        summary TEXT
    )
""")
for index, row in df.iterrows():
    cur.execute("""
        INSERT INTO remote_jobs (job_title, company, location, summary)
        VALUES (%s, %s, %s, %s)
    """, (row['Job Title'], row['Company'], row['Location'], row['Summary']))
conn.commit()
cur.close()
conn.close()
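Row-by-row inserts are fine for a few hundred listings. For larger scrapes, psycopg2's execute_values helper batches everything into a single statement; here is a minimal alternative sketch that would replace the loop above (run it before closing the cursor and connection, same table and column names as before):
from psycopg2.extras import execute_values
# Convert the DataFrame into a list of (title, company, location, summary) tuples
rows = list(df.itertuples(index=False, name=None))
execute_values(
    cur,
    "INSERT INTO remote_jobs (job_title, company, location, summary) VALUES %s",
    rows
)
conn.commit()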
📈 Conclusion
This project demonstrates how web scraping can automate data collection from job platforms and store the results in a relational database for deeper analysis. Whether you're building a job analytics dashboard or tracking market demand, the same pipeline can be extended to more searches, and the resulting PostgreSQL database can feed reporting tools like Power BI or Tableau.
🔗 Explore the Full Code on GitHub
Want to see the full source code for this project, including the complete Jupyter Notebook, scraping logic, and PostgreSQL integration?
👉 Check it out here: GitHub Repository - Scraping Indeed Remote Jobs
Feel free to clone it, star it ⭐, or fork it and customize for your own job data project!