Building a Simple Web Scraper with Perl
Extracting Specific Data from HTML Elements
If you are a Perl programmer who wants to scrape data, this article will guide you from scratch. If you would rather not write code at all, Octoparse is a good no-code alternative: pick a pre-built template, set your data structure with a single click, and your scraped data is ready.
Prerequisites
Install Perl and required modules:
cpan LWP::Simple
cpan HTML::TreeBuilder
Code: Simple Web Scraper in Perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
# URL to scrape
my $url = 'https://example.com';
# Fetch the webpage content
my $html_content = get($url);
die "Couldn't fetch the webpage!" unless defined $html_content;
# Parse the HTML content
my $tree = HTML::TreeBuilder->new_from_content($html_content);
# Extract titles (assuming <h2> tags contain the titles)
my @titles = $tree->look_down(_tag => 'h2');
print "Titles found on $url:\n\n";
foreach my $title (@titles) {
print $title->as_text . "\n";
}
# Clean up
$tree->delete;
Code Explanation
1. Modules Used:
- LWP::Simple: fetches the webpage content.
- HTML::TreeBuilder: parses the HTML and lets you query the DOM.
2. Steps:
- Fetch the webpage content using get() from LWP::Simple.
- Parse the HTML content using HTML::TreeBuilder.
- Extract elements based on specific tags (e.g., <h2> for article titles) using the look_down() method.
- Print the text content of each matched tag.
3. Error Handling:
- The script ensures the webpage is fetched successfully; otherwise, it terminates with an error message.
4. Tree Cleanup:
- The delete method frees the memory used by the HTML tree.
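To see the parsing step in isolation, here is a minimal sketch that feeds HTML::TreeBuilder a literal HTML string instead of a fetched page (the markup below is invented purely for demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small hand-written HTML snippet standing in for a fetched page
my $html = '<html><body><h2>First Post</h2><p>intro</p><h2>Second Post</h2></body></html>';

# new_from_content() parses the string and finishes the tree in one step
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() in list context returns every matching element
my @titles = map { $_->as_text } $tree->look_down(_tag => 'h2');

print "$_\n" for @titles;   # prints "First Post" then "Second Post"

$tree->delete;
```

This makes it easy to experiment with look_down() queries before pointing the scraper at a live site.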
Output
If the webpage contains article titles in <h2> tags, the script outputs:
Titles found on https://example.com:
Title 1
Title 2
Title 3
...
Extending the Script
1. Target Specific Attributes: Use the look_down() method with attribute filters:
my @links = $tree->look_down(_tag => 'a', class => 'article-link');
2. Save Output to a File: Write titles to a file:
open my $fh, '>', 'titles.txt' or die $!;
print $fh $_->as_text . "\n" for @titles;
close $fh;
3. Scrape Links or Images: Extract links or images using:
my @links = $tree->look_down(_tag => 'a');
my @images = $tree->look_down(_tag => 'img');
This simple script can be the starting point for creating more advanced scrapers, like handling pagination, logging, or integrating with databases for storing scraped data. Let me know if you’d like an example for these extensions!
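For instance, pagination often comes down to generating a sequence of page URLs and scraping each one in turn. Here is a minimal sketch, assuming a hypothetical ?page=N query pattern (the URL scheme is an assumption for illustration, not part of the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pagination scheme: ?page=1, ?page=2, ...
my $base_url  = 'https://example.com/articles';
my @page_urls = map { "$base_url?page=$_" } 1 .. 3;

foreach my $page_url (@page_urls) {
    # A real scraper would fetch and parse each page here, e.g.:
    #   my $html = get($page_url);                          # LWP::Simple
    #   my $tree = HTML::TreeBuilder->new_from_content($html);
    print "Would scrape: $page_url\n";
}
```

Building the URL list separately from the fetch loop keeps the pagination logic easy to test, and a sleep() call inside the loop is a polite way to avoid hammering the target server.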
Thank you for reading. Before you go 🙋♂️:
Please clap for the post 👏
🏌️♂️ Follow me: https://medium.com/@mayurkoshti12
🤹 Follow Publication: https://medium.com/the-code-compass