Appending blank rows to dataframe if column does not exist

Question

This question is kind of odd and complex, so bear with me, please.

I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).

The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.

Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

pml · Accepted Answer · 2017-03-30 23:44:36Z

"I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy."

Assuming some kind of master list all_cols_to_use, can you do something like:

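import numpy as np
import pandas as pd

def parse_big_csv(csvpath):
    # Read only the header line to see which columns this file contains.
    with open(csvpath, 'r') as infile:
        header = infile.readline().strip().split(',')
    cols_to_use = sorted(set(header) & set(all_cols_to_use))
    missing_cols = sorted(set(all_cols_to_use) - set(header))
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add each absent column, filled with NaN.
    for col in missing_cols:
        df[col] = np.nan
    return df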

This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you’re concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
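A variant that skips the separate header read, as a rough sketch rather than part of the original answer: usecols also accepts a callable, which pandas evaluates against each column name, so a wanted column that is absent from the file never triggers the error. DataFrame.reindex can then add the absent columns as NaN. The function name read_with_columns, the column names, and the in-memory sample data are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical master list of wanted columns (an assumption for this sketch).
all_cols_to_use = ["time", "pressure", "temperature"]


def read_with_columns(source, wanted):
    """Read only the wanted columns that exist, then add the rest as NaN."""
    wanted_set = set(wanted)
    # A callable usecols keeps the columns for which it returns True, so a
    # wanted column that is absent from the file is simply never matched.
    df = pd.read_csv(source, usecols=lambda name: name in wanted_set)
    # reindex orders columns like the master list and fills absent ones with NaN.
    return df.reindex(columns=wanted)


# Sample data standing in for a file that lacks the "temperature" column.
sample = io.StringIO("time,pressure,unused\n0,1.5,x\n1,1.7,y\n")
df = read_with_columns(sample, all_cols_to_use)
print(df)
```

With real files you would pass the CSV path instead of the StringIO object. Every frame then comes back with the same columns in the same order, which keeps later concatenation or plotting code simple.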

edited Mar 30, 2017 at 23:44

answered Mar 30, 2017 at 19:50


2 Comments

Chris M. Over a year ago

I do have a fixed list of columns that are either present or not (they don't move, they just don't exist because the system isn't initialized yet). np.nan isn't a problem, because the end result is a plot so that just won't get plotted. I don't follow your parenthetical - you're saying that, in a concatenated data frame, any rows that would be "blank" due to the column not existing for that file will be filled with np.nan, right?

pml Over a year ago

If you concatenate df1 and df2, where df1 has column A and df2 doesn't, the resultant dataframe will have column A, and all of the values in that column for the rows from df2 will be NaN.
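A small sketch of that concatenation behaviour, using two made-up frames (the names df1 and df2 follow the comment; the data is invented for illustration):

```python
import pandas as pd

# Hypothetical frames: df1 has column A, df2 does not.
df1 = pd.DataFrame({"A": [1.0, 2.0], "B": [10, 20]})
df2 = pd.DataFrame({"B": [30, 40]})

# By default concat keeps the union of the columns (join="outer"),
# so the rows that came from df2 get NaN in column A.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

The rows from df2 keep their B values and are only missing A, so a plot of A would skip those rows.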
