Sinan Ünür

The problem with all the pipelines is that you are essentially doubling the work. No matter how fast the decompression is, the decompressed data still has to be shuttled through a pipe to another process.

Perl has PerlIO::gzip which allows you to read gzipped streams directly. Therefore, it might offer an advantage even if its decompression speed may not match that of unpigz:

#!/usr/bin/env perl

use strict;
use warnings;

use autouse Carp => 'croak';
use PerlIO::gzip;

@ARGV
    or croak "Need filename\n";

open my $in, '<:gzip', $ARGV[0]
    or croak "Failed to open '$ARGV[0]': $!";

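my $lines = 0;

{
    # A reference to an integer makes <$in> return fixed-size 2 KiB records;
    # a plain number would be treated as a separator string instead.
    local $/ = \(2 * 1024);
    while (my $chunk = <$in>) {
        # tr with identical search and replacement lists only counts matches
        $lines += ($chunk =~ tr{\n}{\n});
    }
}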

close $in
    or croak "Failed to close '$ARGV[0]': $!";

print "$lines\n";

I don't have a large enough data file to test it with, but I would be interested to find out whether it offers any benefit in your case.

Update:

I tried it with a 13 MB gzip compressed file (decompresses to 1.4 GB) on an old 2010 MacBook Pro with 16 GB RAM with the file already in the cache:

$ /usr/bin/time -p gzcat spy.gz| wc -l
real        22.24
user         2.67
sys          8.45
 1154737

and

$ /usr/bin/time -p unpigz -c spy.gz| wc -l
real        25.56
user         3.25
sys          9.98
 1154737

versus

$ /usr/bin/time -p ./gzlc.pl spy.gz
real         7.70
user         7.38
sys          0.26
1154737

On the other hand, trying the same script with the same data file on my T400 with ArchLinux installed, I get:

$ time ./gzlc.pl spy.gz 
1154737

real    0m6.836s
user    0m6.617s
sys     0m0.177s

versus

$ time zcat spy.gz | wc -l
1154737

real    0m5.865s
user    0m5.587s
sys     0m0.710s

and

$ time unpigz -c spy.gz | wc -l
1154737

real    0m3.745s
user    0m3.903s
sys     0m1.327s

confirming @marcelm and @rudimeier's remarks. So, the answer indeed seems to be "use unpigz!"
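Since unpigz is not installed everywhere, here is a minimal sketch of a wrapper that uses it when it is on the PATH and falls back to plain gzip otherwise. The function name `count_gz_lines` is my own invention for illustration, and I have not benchmarked this wrapper itself:

```shell
#!/bin/sh
# count_gz_lines: hypothetical helper name, not part of any tool.
# Prints the number of newline characters in a gzip-compressed file.
count_gz_lines() {
    if command -v unpigz >/dev/null 2>&1; then
        # Prefer unpigz when available; feed the file on stdin so that
        # unusual file names cannot be mistaken for options.
        unpigz -c < "$1" | wc -l
    else
        # Fall back to the stock gzip decompressor.
        gzip -dc < "$1" | wc -l
    fi
}
```

Call it as `count_gz_lines spy.gz`. Note that some `wc` implementations pad the count with leading spaces, as in the macOS output above.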

Finally, it turns out reading in chunks and counting newlines in them was a premature optimization with a severe penalty. Here is the fastest Perl version:

#!/usr/bin/env perl

use strict;
use warnings;

use autouse Carp => 'croak';
use PerlIO::gzip;

@ARGV
    or croak "Need filename\n";

open my $in, '<:gzip', $ARGV[0]
    or croak "Failed to open '$ARGV[0]': $!";

1 while <$in>;

print "$.\n";

close $in
    or croak "Failed to close '$ARGV[0]': $!";

which performs much better:

$ time -p ./gzlc.pl spy.gz 
1154737
real 4.49
user 4.48
sys 0.00