10

I quickly wrote a C program extracting the i-th line of a set of gzipped files (containing about 500,000 lines). Here is my C program:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <zlib.h>

/* compilation:
gcc  -o linesbyindex -Wall -O3 linesbyindex.c -lz
*/
#define MY_BUFFER_SIZE 10000000
static void extract(long int index,const char* filename)
   {
   char buffer[MY_BUFFER_SIZE];
   long int curr=1;
   gzFile in=gzopen (filename, "rb");
   if(in==NULL)
       {
       fprintf(stderr,"Cannot open \"%s\" %s.\n",filename,strerror(errno));
       exit(EXIT_FAILURE);              }
   while(gzread(in,buffer,MY_BUFFER_SIZE)!=-1 && curr<=index)
       {
       char* p=buffer;
       while(*p!=0)
           {
           if(curr==index)
               {
               fputc(*p,stdout);
               }
           if(*p=='\n')
               {
               ++curr;
               if(curr>index) break;
               }
           p++;
           }
       }
   gzclose(in);
   if(curr<index)
       {
       fprintf(stderr,"Not enough lines in %s (%ld)\n",filename,curr);
       }
   }

int main(int argc,char** argv)
   {
   int optind=2;
   char* p2;
   long int count=0;
   if(argc<3)
       {
       fprintf(stderr,"Usage: %s (count) files...\n",argv[0]);
       return EXIT_FAILURE;
       }
   count=strtol(argv[1],&p2,10);
   if(count<1 || *p2!=0)
       {
       fprintf(stderr,"bad number %s\n",argv[1]);
       return EXIT_SUCCESS;
       }
   while(optind< argc)
       {
       extract(count,argv[optind]);
       ++optind;
       }
   return EXIT_SUCCESS;
   } 

As a test, I wrote the following equivalent code in java:

import java.io.*;
import java.util.zip.GZIPInputStream;

public class GetLineByIndex{
   private int index;

   public GetLineByIndex(int count){
       this.index=count;
   }

   private String extract(File file) throws IOException
       {
       long curr=1;
       byte buffer[]=new byte[2048];
       StringBuilder line=null;
       InputStream in=null;
       if(file.getName().toLowerCase().endsWith(".gz")){
           in= (new GZIPInputStream(new FileInputStream(file)));
       }else{
           in= (new FileInputStream(file));
       }
             int nRead=0;
       while((nRead=in.read(buffer))!=-1)
           {
           int i=0;
           while(i<nRead)
               {
               if(buffer[i]=='\n')
                   {
                   ++curr;
                   if(curr>this.index) break;
                                     }
               else if(curr==this.index)
                   {
                   if(line==null) line=new StringBuilder(500);
                   line.append((char)buffer[i]);
                   }
               i++;
               }
           if(curr>this.index) break;
           }
       in.close();
       return (line==null?null:line.toString());
       }

   public static void main(String args[]) throws Exception{
       int optind=1;
       if(args.length<2){
           System.err.println("Usage: program (count) files...\n");
           return;
       }
       GetLineByIndex app=new GetLineByIndex(Integer.parseInt(args[0]));

       while(optind < args.length)
           {
           String line=app.extract(new File(args[optind]));
           if(line==null)
               {
               System.err.println("Not enough lines in "+args[optind]);
               }
           else
               {
               System.out.println(line);
               }
           ++optind;
           }
       return;
   }
} 

It happens that the java program was much faster (~1'45'') to fetch a large index than the C program (~2'15'') on the same machine (I ran that test several times).

How can I explain that difference ?

10
  • 2
    Note: The buffersizes are not equal hence the programs do not do the "exact" same thing. Commented Jan 26, 2012 at 12:27
  • @SaniHuttunen - the code is not equivalent for more reasons than that :) Commented Jan 26, 2012 at 12:32
  • @Perception: True, but that was my first observation and seemed enough to point out that the programs are indeed not equal. Commented Jan 26, 2012 at 12:35
  • The C implementation has a 10Mb array instantiated on the process stack. Does that really run? Most processes have smaller stacks than that. Commented Jan 26, 2012 at 12:48
  • 6
    Perhaps the compiler generates poor code on purpose because of the highly unorthodox coding style. That's what I would have done, had I been a compiler. Commented Jan 26, 2012 at 15:50

5 Answers 5

22

The most likely explanation for the Java version to be faster than the C version is that the C version is incorrect.

After fixing the C version, I obtained the following results (contradicting your claim that Java is faster than C):

Java 1.7 -client: 65 milliseconds (after JVM warmed up)
Java 1.7 -server: 82 milliseconds (after JVM warmed up)
gcc -O3:          37 milliseconds

The task was to print the 200000-th line from file words.gz. File words.gz was generated by gzipping /usr/share/dict/words.


...
static char buffer[MY_BUFFER_SIZE];
...
ssize_t len;
while((len=gzread(in,buffer,MY_BUFFER_SIZE)) > 0  &&  curr<=index)
    {
    char* p=buffer;
    char* endp=buffer+len;
    while(p < endp)
       {
...
Sign up to request clarification or add additional context in comments.

10 Comments

what did you change in the C version please ?
Thanks ! the first time I wrote my C code, I used gzgets instead of gzread but I didn't change the test in the loop over the buffer.
@Pierre: I see. If you rerun the benchmark on your computer with your files, is C faster than Java now?
Yes I did the test at work a few hours ago. But the difference was not as large as I expected.
for the record the standard java's gzip is inefficient.
|
15

Because fputc() isn't very fast and you're adding stuf char-by-char in your output file.

calling fputc_unlocked or rather delimiting the stuff you want to add and call fwrite() should be faster.

2 Comments

Your answer is incorrect. The author of the question did not specify the average length of a line in his GZIP files.
fputc() is only used for a single line after skipping a huge number of supposedly similar lines. Not the inner loop we should be looking for. The huge automatic buffer is a better candidate. Making it the same size as in java (2048) would allow for a fair comparison.
12

Well your programs are doing different things. I didn't profile your program, but from looking at your code I suspect this difference:

For building the line, you use this in Java:

if(curr==this.index)
{
    if(line==null) line=new StringBuilder(500);
    line.append((char)buffer[i]);
}

And this in C:

if(curr==index)
{
    fputc(*p,stdout);
}

I.e. you're printing one character at a time to stdout. Which is buffere, by default, but I suspect it's still slower than the 500 character buffer you use in Java.

Comments

0

I do not have deeper knowledge about what optimizations the compiler performs, but I guess this what makes the difference between your programs. Microbenchmarks like this very, very, very hard to get right and meaningful. Here's an article by Brian Goetz that elaborates on this: http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html

Comments

0

Very large buffers can be slower. I would suggest you make the buffer size the same. i.e. both 2 or 8 KB

2 Comments

I started using stdio: BUFSIZ : ~ same result
In C (zlib) the large buffer does not matter at all, in java it does since it's copied multiple times. You can use a memory mapped file, just as well. Java's FileInputStream is (was?) optimized for smaller buffers 2K in Win, 8K - linux, in that case uses the stack to allocate, otherwise it's malloc/free (and some malloc are much slower than stack), that's why the smaller buffer performs better. I had horrid crashes in native memory when calling in deeper recursion, double SIGSEG and the process is dead (the 2nd happens when attempting to write the crash log, hence no crash log event)

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.