78

I think I misunderstand the intention of read_csv. If I have a file 'j' like

# notes
a,b,c
# more notes
1,2,3

How can I use pandas.read_csv to read this file, skipping any lines commented out with '#'? The help says that commenting out whole lines is not supported, but it indicates that an empty line should be returned. Instead I see an error:

df = pandas.read_csv('j', comment='#')

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

I'm currently on

In [15]: pandas.__version__
Out[15]: '0.12.0rc1'

On version '0.12.0-199-g4c8ad82':

In [43]: df = pandas.read_csv('j', comment='#', header=None)

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

6
  • What is 'j'? I'm unable to reproduce the error when replacing 'j' with a CSV file path. Commented Aug 21, 2013 at 20:17
  • 4
    Have you tried df = pandas.read_csv('j', comment='#')? Commented Aug 21, 2013 at 20:19
  • Sorry, b'#' was a typo. 'j' is an example file. It is a bug as Andy Hayden mentions below. Commented Aug 21, 2013 at 20:29
  • @mathtick Weirdly, I get a slightly different error with the above code, but I've posted an issue on GitHub with the CParserError you describe. I think it's a bug. Commented Aug 21, 2013 at 20:34
  • @AndyHayden ... yes, I was in a rush and grabbed the error from loading a different file than the one shown in the example. I just tried to reproduce it at home and discovered that the behaviour appears to have already changed slightly in the newer versions (tested on '0.12.0-199-g4c8ad82'). I've updated the example. Commented Aug 22, 2013 at 0:13

3 Answers

85

I believe that in the latest releases of pandas (version 0.16.0), you can pass the comment='#' parameter to pd.read_csv, and this should skip commented-out lines.

These GitHub issues show that you can do this:

See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
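As a minimal sketch of what this looks like on the question's file, assuming a reasonably recent pandas release and using an in-memory string (the `text` variable is only a stand-in for the file 'j'):

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of the example file 'j' from the question.
text = "# notes\na,b,c\n# more notes\n1,2,3\n"

# With comment='#', lines that start with '#' are ignored altogether,
# so the header row is found on the first non-comment line.
df = pd.read_csv(StringIO(text), comment="#")
print(df)
```

On older versions such as 0.12 the same call may still fail or produce NaN rows, as described in the question.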

1 Comment

What happens if I have more than one comment character? I found a data file which is a dump from Java; it uses # for comment lines and @ for block names, and I want to skip both. In gnuplot I could set #@ as the comment characters and it skipped both, but pandas gives an error saying only a single character is allowed.
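Since comment only accepts a single character, one possible approach is to filter the lines yourself before handing them to pandas. This is only a sketch: the helper name `read_csv_multi_comment` and the sample data are made up for illustration, and it only handles whole-line comments, not trailing ones.

```python
from io import StringIO

import pandas as pd


def read_csv_multi_comment(lines, comment_chars=("#", "@"), **kwargs):
    """Hypothetical helper: drop lines starting with any of comment_chars.

    `lines` is any iterable of text lines (an open file works too).
    Extra keyword arguments are passed through to pd.read_csv.
    """
    prefixes = tuple(comment_chars)
    # str.startswith accepts a tuple, so one call checks every prefix.
    kept = [line for line in lines if not line.lstrip().startswith(prefixes)]
    return pd.read_csv(StringIO("".join(kept)), **kwargs)


# Made-up sample mixing '#' comments and '@' block markers.
sample = StringIO("# notes\n@block\na,b,c\n1,2,3\n@end\n4,5,6\n")
df_multi = read_csv_multi_comment(sample)
print(df_multi)
```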
21

One workaround is to specify skiprows to ignore the first few entries:

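In [9]: import pandas as pd

In [10]: from io import StringIO  # on Python 2: from StringIO import StringIO

In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'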

In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
Out[12]: 
    a   b   c
0 NaN NaN NaN
1   1   2   3

Otherwise read_csv gets a little confused:

In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
Out[13]: 
        Unnamed: 0
a   b            c
NaN NaN        NaN
1   2            3

This seems to be the case in 0.12.0; I've filed a bug report.

As Viktor points out, you can use dropna to remove the NaN rows after the fact (there is a recently opened issue asking for commented lines to be ignored completely):

In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
Out[14]: 
   a  b  c
1  1  2  3

Note: the default index will "give away" the fact there was missing data.

1 Comment

Also, since you don't want the comment lines, just call .dropna(how='all').reset_index(drop=True) afterwards.
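A small sketch of that chain. Newer pandas versions skip comment lines on their own, so the all-NaN row is simulated here with an empty record (`,,`); the `raw` string is invented for the example.

```python
from io import StringIO

import pandas as pd

# The ',,' record parses as a row of three NaN values, mimicking what
# a commented-out line turned into on the older pandas versions.
raw = "a,b,c\n,,\n1,2,3\n"

df_raw = pd.read_csv(StringIO(raw))

# how='all' drops only rows where every column is NaN;
# reset_index(drop=True) renumbers the index from 0 so no gap remains.
cleaned = df_raw.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

Without reset_index the surviving row keeps its original label 1, which is the "give away" mentioned in the answer.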
5

I am on Pandas version 0.13.1 and this comments-in-csv problem still bothers me.

Here is my present workaround:

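import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

def read_csv(filename, comment='#', sep=','):
    # Drop whole-line comments, then let pandas parse the remaining text.
    with open(filename) as f:
        lines = "".join(line for line in f if not line.startswith(comment))
    return pd.read_csv(StringIO(lines), sep=sep)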

Otherwise with pd.read_csv(filename, comment='#') I get

pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 16, saw 3.
