📌 Project Overview
In this project, I built a Python-based pipeline to automate the collection of remote job listings from Indeed. It uses Selenium and BeautifulSoup for web scraping, Pandas for cleaning the extracted data, and PostgreSQL for storing the final structured dataset for analysis and reporting.
This article walks you through each step of the pipeline, including:
- Setting up the environment
- Automating job searches with Selenium
- Parsing job data using BeautifulSoup
- Data cleaning and transformation
- Storing the final dataset into PostgreSQL
🛠️ Step 1: Setting Up the Environment
First, I installed the necessary libraries:
!pip install selenium beautifulsoup4 pandas psycopg2-binary
I also downloaded the matching WebDriver (e.g., ChromeDriver for Chrome) and made sure it was available on the system PATH.
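If you'd rather not touch the PATH, Selenium 4 also lets you point at the driver binary explicitly through a Service object. A minimal sketch (the driver path below is a placeholder for wherever you saved ChromeDriver):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Placeholder path -- replace with your own ChromeDriver location
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)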
🌐 Step 2: Navigating to the Website with Selenium
Using Selenium, I navigated to the Indeed search page and triggered a search for remote jobs in tech-related fields:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get("https://www.indeed.com")
# Search for remote jobs
search_job = driver.find_element(By.NAME, "q")
search_job.send_keys("Data Analyst Remote")
search_job.submit()
time.sleep(5) # Wait for the results to load
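A fixed sleep works, but it can be flaky on slow connections. An explicit wait that polls for the result cards is usually more reliable; here is a small sketch, assuming the job_seen_beacon class used in the next step is still present in Indeed's markup:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 15 seconds for at least one job card to appear in the results
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "job_seen_beacon"))
)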
🧩 Step 3: Parsing the HTML with BeautifulSoup
After the search results loaded, I passed the page source to BeautifulSoup to extract job details like title, company, location, and summary:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, "html.parser")
job_cards = soup.find_all("div", class_="job_seen_beacon")
jobs = []
for card in job_cards:
    title = card.find("h2", class_="jobTitle").text.strip()
    company = card.find("span", class_="companyName").text.strip()
    location = card.find("div", class_="companyLocation").text.strip()
    summary = card.find("div", class_="job-snippet").text.strip()
    jobs.append([title, company, location, summary])
Once done, I closed the browser:
driver.quit()
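One caveat: Indeed's class names change periodically, and individual cards sometimes omit a field, so find() can return None and crash the loop. A more defensive variant of the extraction (a sketch using a hypothetical safe_text helper, not the exact code from the notebook) looks like this:
def safe_text(card, tag, class_name):
    # Return the stripped text of a child element, or an empty string if it's missing
    element = card.find(tag, class_=class_name)
    return element.text.strip() if element else ""

jobs = []
for card in job_cards:
    jobs.append([
        safe_text(card, "h2", "jobTitle"),
        safe_text(card, "span", "companyName"),
        safe_text(card, "div", "companyLocation"),
        safe_text(card, "div", "job-snippet"),
    ])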
📊 Step 4: Storing Data in a DataFrame and Cleaning It
I converted the list of jobs into a Pandas DataFrame and performed light cleaning:
import pandas as pd
df = pd.DataFrame(jobs, columns=["Job Title", "Company", "Location", "Summary"])
df.drop_duplicates(inplace=True)
df.head()
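Depending on how the scrape comes back, a couple of extra cleaning passes can help, for example trimming whitespace and dropping rows with no title, plus an optional CSV checkpoint before touching the database. A small sketch (the file name is just an example):
# Normalize whitespace in every text column
for col in df.columns:
    df[col] = df[col].str.strip()
# Drop rows where scraping returned an empty job title
df = df[df["Job Title"] != ""]
# Optional checkpoint before loading into PostgreSQL
df.to_csv("remote_jobs_raw.csv", index=False)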
🗃️ Step 5: Storing the Data in PostgreSQL
Finally, I connected to a PostgreSQL database using psycopg2 and inserted the data:
import psycopg2
conn = psycopg2.connect(
    host="localhost",
    database="job_scraper",
    user="your_username",
    password="your_password"
)
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS remote_jobs (
        id SERIAL PRIMARY KEY,
        job_title TEXT,
        company TEXT,
        location TEXT,
        summary TEXT
    )
""")
for index, row in df.iterrows():
    cur.execute("""
        INSERT INTO remote_jobs (job_title, company, location, summary)
        VALUES (%s, %s, %s, %s)
    """, (row['Job Title'], row['Company'], row['Location'], row['Summary']))
conn.commit()
cur.close()
conn.close()
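Row-by-row inserts are fine for a few hundred listings. For larger scrapes, psycopg2's execute_values helper batches everything into a single statement; here is a minimal alternative sketch that would replace the loop above (run it before closing the cursor and connection, same table and column names as before):
from psycopg2.extras import execute_values
# Convert the DataFrame into a list of (title, company, location, summary) tuples
rows = list(df.itertuples(index=False, name=None))
execute_values(
    cur,
    "INSERT INTO remote_jobs (job_title, company, location, summary) VALUES %s",
    rows
)
conn.commit()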
📈 Conclusion
This project demonstrates how web scraping can automate data collection from job platforms and store the results in a relational database for deeper analysis. Whether you're building a job analytics dashboard or tracking market demand, the same pipeline can be extended to more searches, and the resulting PostgreSQL database can feed reporting tools like Power BI or Tableau.
🔗 Explore the Full Code on GitHub
Want to see the full source code for this project, including the complete Jupyter Notebook, scraping logic, and PostgreSQL integration?
👉 Check it out here: GitHub Repository - Scraping Indeed Remote Jobs
Feel free to clone it, star it ⭐, or fork it and customize for your own job data project!