
I have CSV files which I read into pandas with:

#!/usr/bin/env python

import pandas as pd
import sys

filename = sys.argv[1]
df = pd.read_csv(filename)

Unfortunately, the last line of these files is often corrupt (has the wrong number of commas). Currently I open each file in a text editor and remove the last line.

Is it possible to remove the last line in the same python/pandas script that loads the CSV, so I can avoid this extra manual step?

  • You deleted a question about extracting numbers; anyway, I was going to suggest using str.extract: for col in df.columns[2:]: df[col] = df[col].str.extract(r'(\d+)').astype(int) (a small sketch follows these comments) Commented Nov 13, 2015 at 9:55
  • @EdChum Does your code leave the decimal points? Commented Nov 13, 2015 at 10:01
  • @EdChum I undeleted the previous question. Commented Nov 13, 2015 at 10:06
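
For what it's worth, here is a minimal sketch of that str.extract suggestion, with made-up column names; note that the (\d+) pattern keeps only the digits before any decimal point, so fractional parts are dropped:

import pandas as pd

# Made-up frame: columns after the second hold numbers embedded in strings.
df = pd.DataFrame({'id': [1, 2], 'name': ['a', 'b'], 'score': ['12.5 pts', '7 pts']})

for col in df.columns[2:]:
    # expand=False keeps the result a Series; (\d+) captures only the
    # leading digits, so '12.5 pts' becomes 12.
    df[col] = df[col].str.extract(r'(\d+)', expand=False).astype(int)

print(df)  # score is now the integers 12 and 7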

3 Answers


Pass on_bad_lines='skip' and it will skip the corrupt line automatically:

df = pd.read_csv(filename, on_bad_lines='skip')
  • The advantage of on_bad_lines='skip' is that it skips, rather than raises an error on, any erroneous lines. But if the last line is always the bad one, then skipfooter=1 is better (a short illustration follows these notes).

  • Thanks to @DexterMorgan for pointing out that the skipfooter option forces pandas to use the Python engine, which is slower than the C engine for parsing a CSV.
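
To make the difference concrete, here is a minimal illustration using a made-up in-memory CSV whose last line has one comma too many; on_bad_lines='skip' drops any malformed line, while skipfooter=1 drops exactly the last line:

import io
import pandas as pd

# Made-up CSV: the last line has an extra field.
data = "a,b\n1,2\n3,4\n5,6,7\n"

# on_bad_lines='skip' ignores any malformed line, wherever it appears.
df_skip = pd.read_csv(io.StringIO(data), on_bad_lines='skip')

# skipfooter=1 drops exactly the last line and requires the Python engine.
df_footer = pd.read_csv(io.StringIO(data), skipfooter=1, engine='python')

print(df_skip)    # both contain the rows (1, 2) and (3, 4)
print(df_footer)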


And here is the old way (don't use it; it was removed in pandas 2.0):

df = pd.read_csv(filename, error_bad_lines=False)

Deprecated since version 1.3.0: The on_bad_lines parameter should be used instead to specify behavior upon encountering a bad line.


Comments

Regarding the skipfooter option, it might be good to know that it doesn't work with the dtype option: ValueError: Falling back to the 'python' engine because the 'c' engine does not support skipfooter, but this causes 'dtype' to be ignored as it is not supported by the 'python' engine. (Note the 'converters' option provides similar functionality.)
@DexterMorgan sure will add
There's an option 'skiprows', which is supported by the C engine. If you know the number of lines of your CSV you could add it as follows: pd.read_csv(filename, skiprows=[999]) (in my case there are 1000 lines). Note that you have to pass the rows as a list if you want to specify them by line number.
@Chaoste but the bad rows are at the end though, wouldn't you want nrows instead?
@EdChum I'm just looking into the documentation because I need it right now and I didn't see this option until now. Thank you! So in my case, instead of skiprows=[1000] I had to write nrows=999. Another solution could be removing the last line via the command line, which is very fast: head -n -1 dataframe.csv > temp.csv && mv temp.csv dataframe.csv (a sketch of the converters and nrows workarounds follows these comments)
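
For reference, here is a minimal sketch of the two workarounds mentioned in these comments, using a hypothetical filename and column name: converters with skipfooter (since the Python engine ignores dtype), and nrows derived from a line count so the faster C engine can be kept:

import pandas as pd

filename = 'dataframe.csv'  # hypothetical path

# Option 1: keep skipfooter but use converters instead of dtype,
# since the Python engine does support converters.
df1 = pd.read_csv(
    filename,
    skipfooter=1,
    engine='python',
    converters={'value': float},  # hypothetical column name
)

# Option 2: stay on the C engine by counting lines first and telling
# read_csv how many data rows to keep (all but the corrupt last one).
with open(filename) as fh:
    total_lines = sum(1 for _ in fh)

# total_lines includes the header row, so the good data rows number
# total_lines - 2 (minus the header and the corrupt last line).
df2 = pd.read_csv(filename, nrows=total_lines - 2, dtype={'value': float})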

You can leave out the last n lines when reading in a csv by using the skipfooter argument:

df = pd.read_csv(filename, skipfooter=3, engine='python')

In this example the last 3 lines are omitted.



Read http://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.read_csv.html. The 'skipfooter' argument can be used to specify the number of lines at the end of the .csv file that you don't want to read. It may help you.

