The problem with all the pipelines is that you are essentially doubling the work: no matter how fast the decompression is, the data still has to be shuttled to another process.
Perl has PerlIO::gzip, which lets you read gzipped streams directly. It might therefore offer an advantage even if its decompression speed does not match that of unpigz:
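#!/usr/bin/env perl
use strict;
use warnings;
use autouse Carp => 'croak';
use PerlIO::gzip;
@ARGV
    or croak "Need filename\n";
# The :gzip layer decompresses transparently while reading.
open my $in, '<:gzip', $ARGV[0]
    or croak "Failed to open '$ARGV[0]': $!";
my $lines = 0;
{
    # A reference to an integer makes readline return fixed-size records;
    # a plain number would be treated as a separator string.
    local $/ = \(2 * 1024);
    while (my $chunk = <$in>) {
        # tr with identical lists counts newlines without modifying $chunk.
        $lines += ($chunk =~ tr{\n}{\n});
    }
}
close $in
    or croak "Failed to close '$ARGV[0]': $!";
print "$lines\n";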
I don't have a large enough data file to test it with, but I would be interested to find out whether it offers any benefit in your case.
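For reference, Perl only reads fixed-size records when `$/` holds a reference to an integer; a plain number is treated as a separator string. Here is a minimal sketch of chunked newline counting on an in-memory filehandle, so it runs without PerlIO::gzip. The helper name `count_newlines_chunked` is made up for illustration and is not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: count newlines by reading fixed-size records.
sub count_newlines_chunked {
    my ($fh, $size) = @_;
    my $lines = 0;
    # Reference to an integer: readline returns at most $size bytes per read.
    local $/ = \$size;
    while (my $chunk = <$fh>) {
        # tr with an empty replacement list counts matches without changing $chunk.
        $lines += ($chunk =~ tr/\n//);
    }
    return $lines;
}

# Illustrative data only: 3 newlines repeated 1000 times.
my $data = "a\nb\nc\n" x 1000;
open my $fh, '<', \$data
    or die "Cannot open in-memory handle: $!";
print count_newlines_chunked($fh, 2 * 1024), "\n";    # prints 3000
close $fh;
```

The same loop should work on a handle opened with the `:gzip` layer, since the record size applies to the decompressed data.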
Update:
I tried it with a 13 MB gzip-compressed file (which decompresses to 1.4 GB) on an old 2010 MacBook Pro with 16 GB of RAM, with the file already in the cache:
$ /usr/bin/time -p gzcat spy.gz| wc -l
real 22.24
user 2.67
sys 8.45
1154737
and
$ /usr/bin/time -p unpigz -c spy.gz| wc -l
real 25.56
user 3.25
sys 9.98
1154737
versus
$ /usr/bin/time -p ./gzlc.pl spy.gz
real 7.70
user 7.38
sys 0.26
1154737
On the other hand, trying the same script with the same data file on my T400 with ArchLinux installed, I get:
$ time ./gzlc.pl spy.gz
1154737
real 0m6.836s
user 0m6.617s
sys 0m0.177s
versus
$ time zcat spy.gz | wc -l
1154737
real 0m5.865s
user 0m5.587s
sys 0m0.710s
and
$ time unpigz -c spy.gz | wc -l
1154737
real 0m3.745s
user 0m3.903s
sys 0m1.327s
confirming @marcelm's and @rudimeier's remarks. So the answer does indeed seem to be "use unpigz!"
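If the count is needed inside a Perl program rather than at the shell, the same pipeline can be driven through a piped open. The following is only a sketch under assumptions: the helper name `count_lines_via_pipe` is hypothetical, and it assumes the decompressor (`unpigz -c` by default, or whatever command is passed in) is on the PATH and writes decompressed data to standard output.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: count lines in the output of an external decompressor.
# The list form of open avoids the shell, so odd filenames are passed safely.
sub count_lines_via_pipe {
    my ($file, @cmd) = @_;
    @cmd = ('unpigz', '-c') unless @cmd;    # assumption: unpigz is installed
    open my $in, '-|', @cmd, $file
        or die "Failed to start '@cmd': $!";
    my $count = 0;
    $count++ while <$in>;
    # For a piped open, close also reports a non-zero exit status via $?.
    close $in
        or die "'@cmd' failed on '$file': " . ($! || "exit status $?");
    return $count;
}

# Only run when a filename is given on the command line.
if (@ARGV) {
    print count_lines_via_pipe($ARGV[0]), "\n";
}
```

A call such as `count_lines_via_pipe('spy.gz', 'gzip', '-dc')` swaps in a different decompressor. Either way the data still crosses a process boundary, so this does not avoid the extra copying.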
Finally, it turns out reading in chunks and counting newlines in them was a premature optimization with a severe penalty. Here is the fastest Perl version:
#!/usr/bin/env perl
use strict;
use warnings;
use autouse Carp => 'croak';
use PerlIO::gzip;
@ARGV
    or croak "Need filename\n";
open my $in, '<:gzip', $ARGV[0]
    or croak "Failed to open '$ARGV[0]': $!";
# Read line by line and discard; $. tracks the number of lines read.
1 while <$in>;
# Print before closing, because close resets $.
print "$.\n";
close $in
    or croak "Failed to close '$ARGV[0]': $!";
which performs much better:
$ time -p ./gzlc.pl spy.gz
1154737
real 4.49
user 4.48
sys 0.00