The problem with all the pipelines is that you are essentially doubling the work. No matter how fast the decompression is, the data still need to be shuttled to another process.

Perl has PerlIO::gzip, which allows you to read gzipped streams directly. Therefore, it might offer an advantage even if its decompression speed does not match that of unpigz:

#!/usr/bin/env perl

use strict;
use warnings;

use autouse Carp => 'croak';
use PerlIO::gzip;

@ARGV or croak "Need filename\n";

open my $in, '<:gzip', $ARGV[0]
    or croak "Failed to open '$ARGV[0]': $!";

1 while <$in>;   # read and discard each line; $. tracks the line number

print "$.\n";    # $. now holds the number of the last line read

close $in or croak "Failed to close '$ARGV[0]': $!";

I tried it with a 13 MB gzip-compressed file (1.4 GB decompressed) on an old 2010 MacBook Pro with 16 GB RAM and an old ThinkPad T400 with 8 GB RAM, with the file already in the cache. On the Mac, the Perl script was significantly faster than the pipelines (5 seconds vs 22 seconds), but on ArchLinux it lost to unpigz:

$ time -p ./gzlc.pl spy.gz 
1154737
real 4.49
user 4.47
sys 0.01

versus

$ time -p unpigz -c spy.gz | wc -l
1154737
real 3.68
user 4.10
sys 1.46

and

$ time -p zcat spy.gz | wc -l
1154737
real 6.41
user 6.08
sys 0.86

Clearly, using unpigz -c file.gz | wc -l is the winner here, both in terms of speed and of simplicity: that simple command line surely beats writing a program, however short.
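If PerlIO::gzip is not installed, the same in-process idea can be sketched with IO::Uncompress::Gunzip, which ships with modern Perls. This is an addition of mine, not part of the benchmarks above, and the module generally carries more per-read overhead than a raw PerlIO layer, so treat it as a fallback rather than a faster alternative:

#!/usr/bin/env perl

use strict;
use warnings;

use IO::Uncompress::Gunzip qw($GunzipError);

@ARGV or die "Need filename\n";

# MultiStream => 1 keeps reading across concatenated gzip members
my $z = IO::Uncompress::Gunzip->new( $ARGV[0], MultiStream => 1 )
    or die "Failed to open '$ARGV[0]': $GunzipError";

my $lines = 0;
$lines++ while <$z>;    # read and discard each line, counting as we go

print "$lines\n";

$z->close;

MultiStream => 1 matters for files produced by concatenating several gzip streams; without it the count stops at the end of the first member.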

