Python pandas: read file skipping commented

Question

I often deal with ascii tables containing few columns (normally less than 10) and up to tens of millions of lines. They look like

176.792 -2.30523 0.430772 32016 1 1 2 
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
...
177.403 -0.657246 0.432905 31152 1 1 1

I have a number of python codes that read, manipulate and save files. I have always used numpy.loadtxt and numpy.savetxt to do it. But numpy.loadtxt takes at least 5-6Gb RAM to read 1Gb ascii file.

Yesterday I discovered Pandas, that solved almost all my problems: pandas.read_table together with numpy.savetxt improved the execution speed (of 2) of my scripts by a factor 3 or 4, while being very memory efficient.

All good until the point when I try to read in a file that contains a few commented lines at the beginning. The doc string (v=0.10.1.dev_f73128e) tells me that line commenting is not supported, and that will probably come. I think that this would be great: I really like the exclusion of line comments in numpy.loadtxt. Is there any idea on how this will become available? Would be also nice to have the possibility to skip those lines (the doc states that they will be returned as empy)

Not knowing how many comment lines I have in my files (I process thousands of them coming from different people), as now I open the file, count the number of lines starting with a comment at the beginning of the file:

def n_comments(fn, comment):
    with open(fname, 'r') as f:
        n_lines = 0
        pattern = re.compile("^\s*{0}".format(comment))
        for l in f:
            if pattern.search(l) is None:
                break
            else:
                n_lines += 1
    return n_lines

and then

pandas.read_table(fname, skiprows=n_comments(fname, '#'), header=None, sep='\s')

Is there any better way (maybe within pandas) to do it?

Finally, before posting, I looked a bit at the code in pandas.io.parsers.py to understand how pandas.read_table works under the hood, but I got lost. Can anyone point me to the places that implement the reading of the files?

Thanks

EDIT2: I thought to get some improvement getting rid of some of the if in @ThorstenKranz second implementation of FileWrapper, but did get almost no improvements

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        self._next = self._next_comment

    def next(self):
        return self._next()

    def _next_comment(self):
        while True:
            line = super(FileWrapper, self).next()
            if not line.strip()[0] == self._comment_literal:
                self._next = self._next_no_comment
                return line
    def _next_no_comment(self):
        return super(FileWrapper, self).next()

For the records: edited to include the function that counts the number of commented lines from the beginning of the file — Francesco Montesano
– Francesco Montesano, Commented Jan 11, 2013 at 16:52

Wes McKinney · Accepted Answer · 2013-01-11 14:57:15Z

read_csv and read_table have a comment option that will skip bytes starting from a comment character until the end of a line. If an entire line needs to be skipped, this isn't quite right because the parser will think that it's seen a line with no fields in it, then eventually see a valid data line and get confused.

I'd suggest using your workaround to determine the number of rows to skip manually in the file. It would be nice to have an option that enables automatically skipping lines when the entire line is a comment:

https://github.com/pydata/pandas/issues/2685

Implementing this well would require dipping into the C tokenizer code. It's not as bad as it might sound.

Thorsten Kranz · Accepted Answer · 2013-01-14 10:39:11Z

I found a compact solution by creating a class inheriting file:

import pandas as pd

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal

    def next(self):
        while True:
            line = super(FileWrapper, self).next()
            if not line.startswith(self._comment_literal):
                return line

df = pd.read_table(FileWrapper("#", "14276661.txt", "r"), delimiter=" ", header=None)

Atm, pandas (0.8.1) only uses the .next()-method to iterate over file-like objects. We can overload this method and only return those lines that do not start with the dedicated comment-literal, in my example "#".

For input file:

176.792 -2.30523 0.430772 32016 1 1 2 
# 177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1

we get

>>> df
       X.1       X.2       X.3    X.4  X.5  X.6  X.7
0  176.792 -2.305230  0.430772  32016    1    1    2
1  177.047 -1.549570  0.431853  31136    1    1    1
2  177.403 -0.657246  0.432905  31152    1    1    1

and for

176.792 -2.30523 0.430772 32016 1 1 2 
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
177.403 -0.657246 0.432905 31152 1 1 1

we get

>>> df
       X.1       X.2       X.3    X.4  X.5  X.6  X.7
0  176.792 -2.305230  0.430772  32016    1    1    2
1  177.042 -1.877290  0.430562  32016    1    1    1
2  177.047 -1.549570  0.431853  31136    1    1    1
3  177.403 -0.657246  0.432905  31152    1    1    1

Instead of inheritance you could also use delegation, it is up to your flavor.

EDIT I tried many other ways to improve on the performance. It's a tough job, though. I tried

threading: Read file ahead in one thread with low-level io-operation and large chunks, split it into lines, enqueue these and only get from queue on next()
same with multiprocessing
similar, multithreaded approach but using readlines(size_hint)
mmap for reading from file

The first three approaches surprisingly were slower, so no benefit. Using a mmap significantly improved the performance. Here is the code:

class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        self._in_comment = True
        self._prepare()

    def __iter__(self):
        return self

    def next(self):
        if self._in_comment:
            while True:
                line = self._get_next_line()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._get_next_line()
        if line == "":
            raise StopIteration()
        return line

    def _get_next_line(self):
        return super(FileWrapper, self).next()

    def _prepare(self):
        pass

class MmapWrapper(file):
    def __init__(self, fd, comment_literal = "#"):
        self._mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
        self._comment_literal = comment_literal
        self._in_comment = True

    def __iter__(self):
        return self #iter(self._mm.readline, "")#self

    def next(self):
        if self._in_comment:
            while True:
                line = self._mm.readline()
                if line == "":
                    raise StopIteration()
                if not line[0] == self._comment_literal:
                    self._in_comment = False
                    return line
        line = self._mm.readline()
        if line == "":
            raise StopIteration()
        return line

if __name__ == "__main__":
    t0 = time.time()    
    for i in range(10):    
        with open("1gram-d_1.txt", "r+b") as f:
            df1 = pd.read_table(MmapWrapper(f.fileno()), delimiter="\t", header=None)
    print "mmap:", time.time()-t0

    t0 = time.time()    
    for i in range(10):    
        df2 = pd.read_table(FileWrapper("#", "1gram-d_1.txt", "r"), delimiter="\t", header=None)
    print "Unbuffered:", time.time()-t0

    print (df1==df2).mean()

gives as output

mmap: 35.3251504898
Unbuffered: 41.3274121284
X.1    1
X.2    1
X.3    1
X.4    1

I also implemented comment-checking only until the first non-comment line is found. This complies with your solution and further improves the performance.

For mmaps, there exist some restrictions, though. If file sizes are huge, be sure to have enough RAM. If you're working on a 32bit OS, you won't be able to read files larger than 2GB.

This is not a performant solution for reading files of the size the OP is describing.
Thanks for the solution. Reading 100 times a files with 426227 lines (6 commented) as here takes 2m36.088s (real time). The same with my workaround 1m55.929s.
I worked on performance and added another example. Maybe this will help you. @WesMcKinney: Ubuntu 12.10 ships pandas v0.8.0 by default, [debian packages upstream] are also at 0.8 for wheezy and sid, only experimental is at 0.9.1. Official release is at 10.0, I know, but released in December 2012. So most users won't have it yet.
I tried the two new approaches with the same setting of my previous comment and found using the new FileWrapper: 2m26.402s, MmapWrapper: 2m11.656s.
O.k., thank you. So in terms of performance it doesn't help, but, nevertheless, it was an interesting experiment in duck typing for me.

Collectives™ on Stack Overflow

Python pandas: read file skipping commented

2 Answers 2

1 Comment

6 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

6 Comments

Linked

Related