3

I have a number of data files produced by some rather hackish script used in my lab. The script is quite entertaining in that the number of lines it appends before the header varies from file to file (though they are of the same format and have the same header).

I am writing a batch to process all of these files to dataframes. How can I make pandas identify the correct header if I do not know the position? I know the exact heder text, and the text of the two lines that come directly before it (they are the only consecutive instances of \r\n in the document).

I have tried to define null skipping at the end of the document and select the (thankfully) fixed number of data rows each file contains:

df = pd.read_csv(myfile, skipfooter=0, nrows=267)

That did not work.

Do you have any further ideas?

1
  • Not sure if there is a better Pandas way but could you pre-read the csv file, count the number of empty rows then use the skiprows named argument for read_csv? pandas.pydata.org/pandas-docs/stable/generated/… Commented Nov 26, 2013 at 23:32

1 Answer 1

3

You can open file and iterate it until consecutive \r\n are met, and pass result to parser, i.e.

with open(csv_file_name, 'rb') as source:
    consec_empty_lines = 0
    for line in source:
        if line == '\r\n':
            consec_empty_lines += 1
            if consec_empty_lines == 2: 
                break
        else:
            consec_empty_lines = 0
    df = pd.read_csv(source)
Sign up to request clarification or add additional context in comments.

7 Comments

hmmm... apparently the if statement won't react to my two blank lines - two blank lines means '\n\n' - right? There are also no tabs or spaces on those lines in my document... :-/
@TheChymera Imho, two blank lines are read as two consecutive \n, but you can test and see. As you didn't provide any test data, I didn't test it. I hope you get the idea and can elaborate solution serving your specific needs.
can I print the raw text somehow?
@TheChymera I don't completely get your question. For testing purposes you can add print line in a loop to check for skipped lines, or replace df = ... for print source.read() to check what's left. Don't forget to replace things back when running your batch.
found out how - the answer was print repr(line) - that lets me see what the "empty" lines actually contain. so both of them have actually have \r\n in them. Which makes finding them a it more problematic because I have other lines containing that (also in variable numbers 0.o) - I just have to think of some (~nice) way to check for the only to such lines which follow each other.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.