
This question is kind of odd and complex, so bear with me, please.

I have several massive CSV files (GB size) that I am importing with pandas. These CSV files are dumps of data collected by a data acquisition system, and I don't need most of it, so I'm using the usecols parameter to filter out the relevant data. The issue is that not all of the CSV files have all of the columns I need (a property of the data system being used).

The problem is that, if the column doesn't exist in the file but is specified in usecols, read_csv throws an error.

Is there a straightforward way to force a specified column set in a dataframe and have pandas just return blank rows if the column doesn't exist? I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

1 Answer

I thought about iterating over each column for each file and working the resulting series into the dataframe, but that seems inefficient and unwieldy.

Assuming some kind of master list all_cols_to_use, you could do something like:

import pandas as pd

def parse_big_csv(csvpath, all_cols_to_use):
    # Read only the header row to learn which columns this file has.
    header = pd.read_csv(csvpath, nrows=0).columns
    # Keep the wanted columns that are actually present in this file.
    cols_to_use = [col for col in all_cols_to_use if col in header]
    df = pd.read_csv(csvpath, usecols=cols_to_use)
    # Add the absent columns, filled with NaN, in master-list order.
    return df.reindex(columns=all_cols_to_use)

This assumes that you're okay with filling the missing columns with np.nan, but should work. (Also, if you’re concatenating the data frames, the missing columns will be in the final df and filled with np.nan as appropriate.)
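
A variant of the same idea, as a rough sketch: usecols also accepts a callable that is evaluated against each column name, so a name that is absent from the file is simply never selected and no error is raised. The names read_selected and wanted, and the sample data, are made up for this example; reindex then adds the absent columns as NaN.

```python
import io

import pandas as pd


def read_selected(source, wanted):
    """Read only the columns named in `wanted`; add absent ones as NaN."""
    wanted_set = set(wanted)
    # The callable is evaluated once per column name in the file, so names
    # missing from the file are never requested and nothing is raised.
    df = pd.read_csv(source, usecols=lambda name: name in wanted_set)
    # reindex adds the absent columns (all NaN) and fixes the column order.
    return df.reindex(columns=list(wanted))


# Small in-memory stand-in for one of the large files (hypothetical data).
sample = io.StringIO("time,temp,unused\n0,20.5,x\n1,21.0,y\n")
df = read_selected(sample, ["time", "temp", "pressure"])
print(df)
```

This skips the separate header read, at the cost of calling the function once per column in the file, which should be negligible next to parsing GB-sized data.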


2 Comments

I do have a fixed list of columns that are either present or not (they don't move, they just don't exist because the system isn't initialized yet). np.nan isn't a problem, because the end result is a plot so that just won't get plotted. I don't follow your parenthetical - you're saying that, in a concatenated data frame, any rows that would be "blank" due to the column not existing for that file will be filled with np.nan, right?
If you concatenate df1 and df2, where df1 has column A and df2 doesn't, the resultant dataframe will have column A, and all of the values in that column for the rows from df2 will be NaN.
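
A minimal sketch of that concatenation behavior, using two tiny made-up frames (the values are hypothetical; only the names df1, df2, and column A come from the comment above):

```python
import pandas as pd

# df1 has column A, df2 does not.
df1 = pd.DataFrame({"A": [1.0, 2.0], "B": [10, 20]})
df2 = pd.DataFrame({"B": [30, 40]})

# pd.concat defaults to an outer join on columns, so the result keeps the
# union of columns and fills the gaps with NaN.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```

The rows that came from df2 end up with NaN in column A, which a plot will simply skip.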
