I often deal with ASCII tables containing few columns (normally fewer than 10) and up to tens of millions of lines. They look like this:
176.792 -2.30523 0.430772 32016 1 1 2
177.042 -1.87729 0.430562 32016 1 1 1
177.047 -1.54957 0.431853 31136 1 1 1
...
177.403 -0.657246 0.432905 31152 1 1 1
I have a number of Python scripts that read, manipulate and save such files. I have always used numpy.loadtxt and numpy.savetxt for this, but numpy.loadtxt needs at least 5-6 GB of RAM to read a 1 GB ASCII file.
Yesterday I discovered pandas, which solved almost all my problems: pandas.read_table together with numpy.savetxt made two of my scripts 3 to 4 times faster, while being very memory efficient.
All was good until I tried to read a file that contains a few commented lines at the beginning. The docstring (v=0.10.1.dev_f73128e) tells me that line commenting is not supported yet, but that it will probably come. That would be great: I really like how numpy.loadtxt excludes comment lines.
Is there any indication of when this will become available? It would also be nice to be able to skip those lines (the docs state that they will be returned as empty).
Since I do not know how many comment lines my files have (I process thousands of them, coming from different people), for now I open the file and count the comment lines at the beginning of the file:
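import re

def n_comments(fname, comment):
    """Count the consecutive comment lines at the top of the file fname."""
    # re.escape keeps a comment marker such as '*' from being read as regex syntax
    pattern = re.compile(r"^\s*{0}".format(re.escape(comment)))
    n_lines = 0
    with open(fname, 'r') as f:
        for l in f:
            if pattern.search(l) is None:
                break
            n_lines += 1
    return n_lines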
and then
pandas.read_table(fname, skiprows=n_comments(fname, '#'), header=None, sep=r'\s+')
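The counting step can also be written without a regular expression. The sketch below is only an illustration of that idea: the name `count_leading_comments` and the sample lines are made up for this example, and it takes any iterable of lines (an open file works) rather than a file name.

```python
from itertools import takewhile

def count_leading_comments(lines, comment="#"):
    """Count the consecutive comment lines at the start of an iterable of lines."""
    def is_comment(line):
        # Leading whitespace is ignored, as in the regex-based version.
        return line.lstrip().startswith(comment)
    # takewhile stops at the first non-comment line, so later lines are never examined.
    return sum(1 for _ in takewhile(is_comment, lines))

# Hypothetical sample input, not taken from a real file.
sample = [
    "# produced by someone\n",
    "  # units: deg\n",
    "176.792 -2.30523 0.430772\n",
    "# not counted, it comes after the data starts\n",
]
print(count_leading_comments(sample))  # prints 2
```

Note that takewhile consumes the first data line from the iterator, so the file still has to be reopened (or rewound) before it is handed to pandas.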
Is there any better way (maybe within pandas) to do it?
Finally, before posting, I looked a bit at the code in pandas/io/parsers.py to understand how pandas.read_table works under the hood, but I got lost. Can anyone point me to the places that implement the actual reading of the files?
Thanks
EDIT2: I hoped to gain some speed by getting rid of some of the if statements in @ThorstenKranz's second implementation of FileWrapper, but saw almost no improvement:
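# Python 2 only: subclasses the built-in file type
class FileWrapper(file):
    def __init__(self, comment_literal, *args):
        super(FileWrapper, self).__init__(*args)
        self._comment_literal = comment_literal
        # start in "skip comments" mode
        self._next = self._next_comment

    def next(self):
        return self._next()

    def _next_comment(self):
        while True:
            line = super(FileWrapper, self).next()
            # lstrip().startswith() also copes with blank lines, where strip()[0] would raise IndexError
            if not line.lstrip().startswith(self._comment_literal):
                # first data line found: stop checking from now on
                self._next = self._next_no_comment
                return line

    def _next_no_comment(self):
        return super(FileWrapper, self).next()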
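On Python 3 the built-in file type no longer exists, so the class above cannot be used as written. The same "check only until the first data line" behaviour can be sketched with itertools.dropwhile. This is a minimal illustration, not a drop-in replacement: `skip_leading_comments` and the sample rows are hypothetical, and I have not checked whether pandas.read_table accepts a plain iterator like this one.

```python
from itertools import dropwhile

def skip_leading_comments(lines, comment="#"):
    """Return an iterator over lines with the comment lines at the start dropped."""
    def is_comment(line):
        return line.lstrip().startswith(comment)
    # dropwhile evaluates the predicate only until it first fails;
    # every later line is passed through without being tested.
    return dropwhile(is_comment, lines)

# Hypothetical sample input, not taken from a real file.
rows = [
    "# header\n",
    "# more header\n",
    "177.042 -1.87729\n",
    "# kept, because it comes after the first data line\n",
    "177.047 -1.54957\n",
]
print(list(skip_leading_comments(rows)))
```

Like the two-state FileWrapper, this pays for the comment test only on the leading lines, which is why it should cost next to nothing on files with millions of data rows.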