
I'm fairly new to Python and pandas but trying to get better with them for parsing and processing large data files. I'm currently working on a project that requires me to parse a few dozen large CSV CAN files at a time. The files have 9 columns of interest (1 ID and 8 data fields), have about 1-2 million rows, and are encoded in hex.

A sample bit of data looks like this:

id        Flags  DLC  Data0  Data1  Data2  Data3  Data4  Data5  Data6  Data7
cf11505   4      1    ff
cf11505   4      1    ff
cf11505   4      1    ff
cf11a05   4      1    0
cf11505   4      1    ff
cf11505   4      1    ff
cf11505   4      1    ff
cf11005   4      8    ff     ff     ff     ff     ff     ff     ff     ff

I need to decode the hex, and then extract a bunch of different variables from it depending on the CAN ID.
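
To make the decoding concrete, the general pattern (with made-up byte values and scale factor, not an actual signal definition) is: parse each hex field to an integer, combine adjacent bytes into a raw value, then apply a scale factor:

# Illustrative only: byte order and scaling depend on the signal definition for each CAN ID.
data0 = int('ff', 16)          # low byte  -> 255
data1 = int('0a', 16)          # high byte -> 10
raw = (data1 << 8) | data0     # 0x0aff    -> 2815
value = raw / 10               # 0.1 scale factor -> 281.5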

A colleague of mine wrote a script to parse these files that looks like this (henceforth known as Script #1):

import sys
import csv      # needed for the csv.reader / csv.writer calls below
import itertools
import datetime
import time
import glob, os

for filename in glob.glob(sys.argv[1] + "/*.csv"): 
    print('working on ' + filename +'...')

    #Initialize a bunch of variables

    csvInput = open(filename, 'r') # opens the csv file
    csvOutput = open((os.path.split(filename)[0] + os.path.split(filename)[1]), 'w', newline='')

    writer = csv.writer(csvOutput) #creates the writer object
    writer.writerow([var1, var2, var3, ...])

    try:
        reader = csv.reader(csvInput)
        data=list(reader)

        if (data[3][1] == 'HEX'): dataType = 16
        elif (data[3][1] == 'DEC'): dataType = 10
        else: print('Invalid Data Type')

        if (data[4][1] == 'HEX'): idType = 16
        elif (data[4][1] == 'DEC'): idType = 10
        else: print('Invalid ID Type') 

        start_date = datetime.datetime.strptime(data[6][1],'%Y-%m-%d %H:%M:%S')      

        for row in itertools.islice(data,8,None):
            try: ID = int(row[2],idType)
            except: ID = 0

            if (ID == 0xcf11005):
                for i in range(0, 4):
                    # combine two data bytes: (high byte << 8) | low byte
                    # (the second operand is assumed; the original line was cut off here)
                    var1[i] = float((int(row[2*i+6], dataType) << 8) | int(row[2*i+5], dataType))

            #similar operations for a bunch of variables go here

            writer.writerow([var1[0], var2[1],.....])

    finally:
        csvInput.close()
        csvOutput.close()

end = time.time()   # 'start' would be set in the initialization block above
print(end - start)
print('done')

It basically uses the CSV reader and writer to generate a processed CSV file line by line for each input CSV. For a 2-million-row CAN file, it takes about 40 seconds to fully run on my work desktop. Knowing that line-by-line iteration is much slower than performing vectorized operations on a pandas dataframe, I thought I could do better, so I wrote a script that looks like this (Script #2):

from timeit import default_timer as timer
import numpy as np
import pandas as pd
import os
import datetime
from tkinter import filedialog
from tkinter import Tk

Tk().withdraw()
filename = filedialog.askopenfile(title="Select .csv log file", filetypes=(("CSV files", "*.csv"), ("all files", "*.*")))

name = os.path.basename(filename.name)
##################################################
df = pd.read_csv(name, skiprows = 7, usecols = ['id', 'Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7'], 
                 dtype = {'id':str, 'Data0':str, 'Data1':str, 'Data2':str, 'Data3':str, 'Data4':str, 'Data5':str, 'Data6':str, 'Data7':str})

log_cols = ['id', 'Data0', 'Data1','Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']

for col in log_cols: 
    df[col] = df[col].dropna().astype(str).apply(lambda x: int(x, 16))   

df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(method = 'ffill') #forward fill empty rows
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(value = 0) #replace any remaining nans with 0

df['Data0'] = df['Data0'].astype(np.uint8)
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].astype(np.uint8)

processed_df = pd.DataFrame(np.nan, index=range(0, len(df)), columns=['var1', 'var2', 'var3', ...])

start_date = datetime.datetime.strptime('7/17/2018 14:12:48','%m/%d/%Y %H:%M:%S')

processed_df ['Time Since Start (s)']  = pd.read_csv(name, skiprows = 7, usecols = ['Time'], dtype = {'Time':np.float32}, engine = 'c')
processed_df['Date'] = pd.to_timedelta(processed_df['Time Since Start (s)'], unit = 's') + start_date
processed_df['id'] = df['id']

processed_df.loc[:, 'var1':'var37'] = processed_df.loc[:, 'var1':'var37'].astype(np.float32)

##################Data Processing###########################
processed_df.loc[processed_df.id == int(0xcf11005), 'var1'] = np.bitwise_or(np.left_shift(df['Data1'], 8), df['Data0'])/10
#a bunch of additional similar vectorized calculations go here to pull useful values

name_string = "Processed_" + name
processed_df.to_csv(name_string) #dump dataframe to CSV

The processing part was definitely faster, although not as much as I had hoped: it took about 13 seconds to process the 2 million row CSV file. There's probably more I could do to optimize Script #2, but that's a topic for another post.

Anyway, my hopes that Script #2 would actually be faster than Script #1 were dashed when I tried to save the dataframe as a CSV: the .to_csv() method took 40s on its own! I tried playing around with a few parameters of .to_csv(), including chunksize and compression, as well as reducing the memory footprint of the dataframe, but even with these tweaks it still took 30s to save the dataframe, and once you factored in the initial processing time, the entire script was slower than the original row-by-row Script #1.
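
For reference, the kinds of .to_csv() tweaks described above look roughly like this (the chunksize and compression values are just examples, timed with the default_timer already imported in Script #2):

start = timer()
processed_df.to_csv(name_string, chunksize=100000)            # write 100k rows at a time
print('chunked write:', timer() - start)

start = timer()
processed_df.to_csv(name_string + '.gz', compression='gzip')  # smaller file, more CPU time
print('compressed write:', timer() - start)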

Is row-by-row iteration of a CSV file really the most computationally efficient way to parse these files?

2 Answers


The dask library might be worth a look. It implements a subset of the pandas DataFrame API, but works on the data in chunks rather than holding it all in memory, while still letting you use the DataFrame as if it were in memory. I believe it can even treat multiple files as a single DataFrame, among other things like using multiple machines to do the work in parallel.

This was faster for me when I was dealing with a 6GB CSV with millions of rows.

https://dask.pydata.org/en/latest/
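
For these log files, a rough sketch of how that could look (untested; the 'logs/*.csv' path and output pattern are placeholders, and the read_csv arguments simply mirror Script #2):

import dask.dataframe as dd

# One logical DataFrame backed by every CSV in the folder (glob patterns are accepted).
ddf = dd.read_csv('logs/*.csv', skiprows=7,
                  usecols=['id', 'Data0', 'Data1', 'Data2', 'Data3',
                           'Data4', 'Data5', 'Data6', 'Data7'],
                  dtype=str)

# ...the same vectorized decoding as in Script #2 would go here...

# Writes one CSV per partition, substituting the partition number for '*'.
ddf.to_csv('processed-*.csv', index=False)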


3 Comments

Thanks for the suggestion! If I used this library would I be taking a speed hit during some other part of the processing--like for example opening the CSV or performing vectorized calculations on the df?
@P-Rock In my limited experience, it was actually faster to use this library (I think because it parallelizes the work underneath). For me, it took longer to read the file into memory with pandas' read_csv(). Dask's read_csv() was much faster. The docs have some discussion about performance that might be worth a look. dask.pydata.org/en/latest/understanding-performance.html
I was able to convert my pandas dataframe into a dask dataframe and then ran dask's .to_csv(), and it took... 40.4s. Granted, I still don't know very much about how to use dask, so maybe there are some other things I could try with it to speed things up.

Have you tried setting chunksize, the number of rows to write at a time? If you don't set it, pandas picks a default based on the frame's column count (roughly 100,000 divided by the number of columns).

Another thing to consider is adding mode='a' (instead of the default 'w') to append rather than rewrite the file.

So I would suggest using:

processed_df.to_csv(name_string, mode='a', chunksize=100000)

I'd play with the chunksize until it suits your needs.

3 Comments

Tweaking these parameters didn't make much of a difference, unfortunately. For chunk sizes of 100k, 10k, 1k, and 100, the run times were all around 40s. A chunk size of 10 took 77s, and a chunk size of 1 took so long I had to manually stop execution. Turning append mode on/off didn't seem to make much of a difference either way; it changed run times by less than a second.
Fair enough, dask is maybe a good option then. If you're not limited to CSV, HDF5 should write very quickly, but it's a bit harder to use (a quick sketch follows these comments).
Another idea: maybe post to Code Review to see if there are any tips or optimisations they have.
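
A minimal sketch of that HDF5 route (requires the PyTables package; the output file name and key are made up):

# pip install tables   (PyTables backs pandas' HDF5 writer)
processed_df.to_hdf('processed.h5', key='can_data', mode='w')

# and to read it back later:
# df = pd.read_hdf('processed.h5', key='can_data')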
