Building a Simple Web Scraper with Perl
Extracting Specific Data from HTML Elements
If you are a Perl programmer who wants to scrape data, this article will guide you from scratch. If you would rather not write code at all, Octoparse is a good no-code alternative: pick a pre-built template, set your data structure with a single click, and your scraped data is ready.
Prerequisites
Install Perl and required modules:
cpan LWP::Simple
cpan HTML::TreeBuilder
Code: Simple Web Scraper in Perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;
# URL to scrape
my $url = 'https://example.com';
# Fetch the webpage content
my $html_content = get($url);
die "Couldn't fetch the webpage!" unless defined $html_content;
# Parse the HTML content
my $tree = HTML::TreeBuilder->new_from_content($html_content);
# Extract titles (assuming <h2> tags contain the titles)
my @titles = $tree->look_down(_tag => 'h2');
print "Titles found on $url:\n\n";
foreach my $title (@titles) {
print $title->as_text . "\n";
}
# Clean up
$tree->delete;
Code Explanation
1. Modules Used:
- LWP::Simple: fetches the webpage content.
- HTML::TreeBuilder: parses the HTML and lets you query the DOM.
2. Steps:
- Fetch the webpage content using get() from LWP::Simple.
- Parse the HTML content using HTML::TreeBuilder.
- Extract elements based on specific tags (e.g., <h2> for article titles) using the look_down() method.
- Print the text content of each matched tag.
3. Error Handling:
- The script ensures the webpage is fetched successfully; otherwise, it terminates with an error message.
4. Tree Cleanup:
- The delete method frees the memory used by the HTML tree.
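To see the parsing step in isolation, here is a minimal sketch that feeds HTML::TreeBuilder a literal HTML string instead of a fetched page (the markup below is invented purely for demonstration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# A small hand-written HTML snippet standing in for a fetched page
my $html = '<html><body><h2>First Post</h2><p>intro</p><h2>Second Post</h2></body></html>';

# new_from_content() parses the string and finishes the tree in one step
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() in list context returns every matching element
my @titles = map { $_->as_text } $tree->look_down(_tag => 'h2');

print "$_\n" for @titles;   # prints "First Post" then "Second Post"

$tree->delete;
```

This makes it easy to experiment with look_down() queries before pointing the scraper at a live site.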
Output
If the webpage contains article titles in <h2> tags, the script outputs:
Titles found on https://example.com:
Title 1
Title 2
Title 3
...
Extending the Script
1. Target Specific Attributes: Use the look_down() method with attribute filters:
my @links = $tree->look_down(_tag => 'a', class => 'article-link');
2. Save Output to a File: Write titles to a file:
open my $fh, '>', 'titles.txt' or die $!;
print $fh $_->as_text . "\n" for @titles;
close $fh;
3. Scrape Links or Images: Extract links or images using:
my @links = $tree->look_down(_tag => 'a');
my @images = $tree->look_down(_tag => 'img');
This simple script can be the starting point for creating more advanced scrapers, like handling pagination, logging, or integrating with databases for storing scraped data. Let me know if you’d like an example for these extensions!
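For instance, pagination often comes down to generating a sequence of page URLs and scraping each one in turn. Here is a minimal sketch, assuming a hypothetical ?page=N query pattern (the URL scheme is an assumption for illustration, not part of the original script):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pagination scheme: ?page=1, ?page=2, ...
my $base_url  = 'https://example.com/articles';
my @page_urls = map { "$base_url?page=$_" } 1 .. 3;

foreach my $page_url (@page_urls) {
    # A real scraper would fetch and parse each page here, e.g.:
    #   my $html = get($page_url);                          # LWP::Simple
    #   my $tree = HTML::TreeBuilder->new_from_content($html);
    print "Would scrape: $page_url\n";
}
```

Building the URL list separately from the fetch loop keeps the pagination logic easy to test, and a sleep() call inside the loop is a polite way to avoid hammering the target server.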
Thank you for reading. Before you go 🙋♂️:
Please clap for the post 👏
🏌️♂️ Follow me: https://medium.com/@mayurkoshti12
🤹 Follow Publication: https://medium.com/the-code-compass