I currently have 600 CSV files (and this number will grow) of 50K lines each that I would like to put into one single dataframe. I did this; it works well and takes 3 minutes:
import glob
import os

import pandas as pd

colNames = ['COLUMN_A', 'COLUMN_B', ..., 'COLUMN_Z']
folder = 'PATH_TO_FOLDER'
# Dictionary of types for each column of the CSV that is not a string
dictTypes = {'COLUMN_B': bool, 'COLUMN_D': int, ..., 'COLUMN_Y': float}
try:
   # Get all the column names; if a column is not in the dict of types, it's a string, so add it to the dict
   dictTypes.update({col: str for col in colNames if col not in dictTypes})  
except:
    print('Problem with the column names.')
    
# Function to parse dates from string to datetime; passed to read_csv via date_parser
cache = {}
def cached_date_parser(s):
    if s in cache:
        return cache[s]
    dt = pd.to_datetime(s, format='%Y-%m-%d', errors="coerce")
    cache[s] = dt
    return dt
# Concatenate each df into finalData
allFiles = glob.glob(os.path.join(folder, "*.csv")) 
finalData = pd.DataFrame()
finalData = pd.concat([pd.read_csv(file, index_col=False, dtype=dictTypes, parse_dates=[6,14],
                    date_parser=cached_date_parser) for file in allFiles ], ignore_index=True)
It takes one minute less without the date parsing. So I was wondering if I could improve the speed, or whether this is a standard amount of time given the number of files. Thanks!
Comments:

- pd.concat() will take not only sequences (e.g., list or tuple) but any iterable, so you don't need to create a never-used list. Instead, just give pd.concat() a generator expression -- a lightweight piece of code that pd.concat() will execute on your behalf to populate the data frame. Like this: pd.concat((pd.read_csv(...) for file in allFiles), ...)
- Where do colNames and folder come from?
- Have you tried replacing date_parser=cached_date_parser with infer_datetime_format=True in the read_csv call? The API documentation says reading could be faster if the format is correctly inferred.
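
Putting those two comment suggestions together, a minimal sketch of what the read loop could look like is shown below. This assumes a pandas version earlier than 2.0 (where infer_datetime_format and date_parser still exist); the folder, dtype mapping, and date-column positions are placeholders carried over from the question:

import glob
import os

import pandas as pd

folder = 'PATH_TO_FOLDER'                                            # placeholder path
dictTypes = {'COLUMN_B': bool, 'COLUMN_D': int, 'COLUMN_Y': float}   # placeholder dtypes

allFiles = glob.glob(os.path.join(folder, "*.csv"))

# Passing a generator expression avoids building a throwaway list just to hand it
# to pd.concat(). infer_datetime_format=True asks pandas to infer the date format
# ('%Y-%m-%d' here) and, if it succeeds, switch to a faster parsing path instead of
# calling a Python-level date_parser for every value.
finalData = pd.concat(
    (pd.read_csv(file,
                 index_col=False,
                 dtype=dictTypes,
                 parse_dates=[6, 14],
                 infer_datetime_format=True)
     for file in allFiles),
    ignore_index=True,
)

If that is still slow, another common pattern is to drop parse_dates entirely, read columns 6 and 14 as plain strings, and convert them once after the concat with pd.to_datetime(finalData['SOME_DATE_COLUMN'], format='%Y-%m-%d', errors='coerce') (column name hypothetical); that call is vectorised and caches repeated values by default.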