Read pandas dataframe from csv beginning with non-fixed header

Question

I have a number of data files produced by some rather hackish script used in my lab. The script is quite entertaining in that the number of lines it appends before the header varies from file to file (though they are of the same format and have the same header).

I am writing a batch script to load all of these files into dataframes. How can I make pandas identify the correct header if I do not know its position? I know the exact header text, and the text of the two lines that come directly before it (they are the only consecutive instances of \r\n in the document).

I have tried specifying that nothing should be skipped at the end of the file and selecting the (thankfully) fixed number of data rows each file contains:

df = pd.read_csv(myfile, skipfooter=0, nrows=267)

That did not work.

Do you have any further ideas?

Not sure if there is a better pandas way, but could you pre-read the csv file, count the number of empty rows, then use the skiprows named argument for read_csv? pandas.pydata.org/pandas-docs/stable/generated/…
– will-hart, Commented Nov 26, 2013 at 23:32
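
A minimal sketch of the pre-scan idea from this comment, assuming the header text is known exactly (as the question states). The helper name `find_header_row` and its parameters are hypothetical, and the sketch searches for the header line itself rather than counting empty rows:

```python
def find_header_row(path, header_text, encoding="utf-8"):
    """Return the 0-based index of the first line equal to header_text."""
    # newline="" keeps the original \r\n terminators so they can be stripped explicitly
    with open(path, encoding=encoding, newline="") as source:
        for index, line in enumerate(source):
            if line.rstrip("\r\n") == header_text:
                return index
    raise ValueError(f"header {header_text!r} not found in {path}")
```

The returned index could then be passed on, for example as `pd.read_csv(path, skiprows=find_header_row(path, header_text), nrows=267)`, on the assumption that an integer `skiprows` counts raw lines from the start of the file.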

alko · Accepted Answer · 2013-11-27 00:03:07Z

3

You can open the file and iterate over it until two consecutive \r\n lines are met, then pass the rest of the file to the parser, i.e.

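import pandas as pd

with open(csv_file_name, 'rb') as source:
    consec_empty_lines = 0
    for line in source:
        # In binary mode each line is bytes, so compare against b'\r\n'
        if line == b'\r\n':
            consec_empty_lines += 1
            if consec_empty_lines == 2:
                break
        else:
            consec_empty_lines = 0
    # The handle is now positioned at the header line
    df = pd.read_csv(source)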
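
For a batch over many files, the same loop could be wrapped in a helper. This is only a sketch: the name `open_at_header` is hypothetical, and it assumes, as in the question, that the header follows the first two consecutive lines containing nothing but \r\n.

```python
def open_at_header(path):
    """Open path in binary mode and advance past the first two consecutive blank lines."""
    source = open(path, "rb")
    consecutive_blank = 0
    # readline() returns b"" at end of file, which stops the iteration
    for line in iter(source.readline, b""):
        if line == b"\r\n":
            consecutive_blank += 1
            if consecutive_blank == 2:
                return source  # the caller is responsible for closing it
        else:
            consecutive_blank = 0
    source.close()
    raise ValueError(f"no two consecutive blank lines in {path}")
```

Each file could then be read with something like `with open_at_header(name) as source: df = pd.read_csv(source, nrows=267)`, and a file without the expected marker fails loudly instead of being parsed from the wrong line.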

edited Nov 27, 2013 at 0:03

answered Nov 26, 2013 at 23:32

alko


7 Comments

TheChymera Over a year ago

hmmm... apparently the if statement won't react to my two blank lines - two blank lines means '\n\n' - right? There are also no tabs or spaces on those lines in my document... :-/

alko Over a year ago

@TheChymera Imho, two blank lines are read as two consecutive \n, but you can test and see. As you didn't provide any test data, I didn't test it. I hope you get the idea and can work out a solution that serves your specific needs.

TheChymera Over a year ago

can I print the raw text somehow?

alko Over a year ago

@TheChymera I don't completely get your question. For testing purposes you can add print(line) in the loop to check the skipped lines, or replace df = ... with print(source.read()) to check what's left. Don't forget to change things back when running your batch.

TheChymera Over a year ago

found out how - the answer was print repr(line) - that lets me see what the "empty" lines actually contain. so both of them actually have \r\n in them. Which makes finding them a bit more problematic because I have other lines containing that (also in variable numbers 0.o) - I just have to think of some (~nice) way to check for the only two such lines which follow each other.
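
A small sketch of the inspection described in this comment, assuming Python 3 and binary mode so the line terminators show up unchanged. Both helper names are hypothetical:

```python
def describe_lines(path, limit=20):
    """Return (index, repr(line), is_blank) for the first `limit` raw lines."""
    rows = []
    with open(path, "rb") as source:
        for index, line in enumerate(source):
            if index >= limit:
                break
            # A line counts as blank when it holds nothing but its terminator
            rows.append((index, repr(line), line in (b"\r\n", b"\n")))
    return rows


def first_blank_pair(rows):
    """Return the index of the second of the first two adjacent blank lines, or None."""
    for (_, _, prev_blank), (index, _, blank) in zip(rows, rows[1:]):
        if prev_blank and blank:
            return index
    return None
```

Printing the tuples from `describe_lines` shows exactly which bytes each "empty" line contains, and `first_blank_pair` only reports blank lines that directly follow each other, so isolated \r\n lines elsewhere in the file are ignored.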
