I have a case where I am adding a UUID column to .csv files. At the same time, I compare each source file with its processed counterpart: if the source file contains additional lines, I want to append those new lines to the destination file. I want to append rather than overwrite because the UUIDs of previously processed lines must stay the same.
To decide whether to append, I check whether the row count is the same for the source and destination files. If it is not, I create a new dataframe from the source file, starting at the row number equal to the row count of the destination file.
At that point, I try to append the newly created dataframe to the destination dataframe, but it keeps failing. I receive the following error:
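For illustration, the row-count comparison could be sketched as below. This is a minimal sketch on in-memory CSV text rather than my real files; the sample data and the helper name `new_source_rows` are made up for the example. It slices the already-parsed source frame instead of using `skiprows`, which avoids having to account for the header line, since `skiprows` counts file lines including the header.

```python
import io
import pandas as pd

def new_source_rows(source_csv, destination_csv):
    """Return the source rows that are not yet present in the destination.

    Assumes the source has a header row and uses ';' as separator, while the
    destination was written without a header and uses ','.
    """
    df_source = pd.read_csv(io.StringIO(source_csv), sep=";")
    df_destination = pd.read_csv(io.StringIO(destination_csv), sep=",", header=None)
    processed = df_destination.shape[0]
    # Rows are matched by position only: everything past the rows
    # already processed is treated as new.
    return df_source.iloc[processed:]

# Hypothetical sample data: three source rows, two already processed.
source_csv = "name;value\na;1\nb;2\nc;3\n"
destination_csv = "id-1,a,1\nid-2,b,2\n"

df_newlines = new_source_rows(source_csv, destination_csv)
print(df_newlines)
```

This only works if lines are always added at the end of the source file and never reordered or removed.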
> RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
> result = result.union(other)
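The message appears to come from combining the column labels of the two frames, so one possible reading (an assumption, not a confirmed diagnosis) is that one frame carries a mix of integer and string labels. The small sketch below, using made-up one-row data, shows how that mix can arise when a file is read with `header=None` and a named column is inserted afterwards.

```python
import io
import pandas as pd

# Hypothetical one-row stand-ins for the destination and the new source lines.
df_destination = pd.read_csv(io.StringIO("id-1,a,1\n"), sep=",", header=None)
df_newlines = pd.read_csv(io.StringIO("b;2\n"), sep=";", header=None)
df_newlines.insert(0, "uu_id", ["id-2"])

print(list(df_destination.columns))  # integer labels only
print(list(df_newlines.columns))     # a string label followed by integer labels

# True when the new frame mixes string and non-string column labels.
has_str = any(isinstance(label, str) for label in df_newlines.columns)
has_non_str = any(not isinstance(label, str) for label in df_newlines.columns)
print(has_str and has_non_str)
```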
The code I am using is below:

    import os, uuid
    import pandas as pd

    def process_files():
        source_dir = "C:\\Projects\\test\\raw"
        destination_dir = "C:\\Projects\\test\\processed"
        for file_name in os.listdir(source_dir):
            if file_name.endswith((".csv", ".new")):
                df_source = pd.read_csv(source_dir + "/" + file_name, sep=";")
                if os.path.isfile(destination_dir + "/" + file_name):
                    df_destination = pd.read_csv(destination_dir + "/" + file_name, sep=",", header=None)
                    if df_source.shape[0] != (df_destination.shape[0]):
                        df_newlines = pd.read_csv(source_dir + "/" + file_name, sep=";", skiprows=df_destination.shape[0], header=None)
                        df_newlines.insert(0, "uu_id", pd.Series([uuid.uuid4() for i in range(len(df_newlines))]))
                        df_destination.append(df_newlines, ignore_index=True)
                        df_destination.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
                    else:
                        continue
                else:
                    df_source.insert(0,"uu_id", pd.Series([uuid.uuid4() for i in range(len(df_source))]))
                    df_source.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
            else:
                continue

    process_files()

I have checked the dtypes of both dataframes, and they match column by column. I have also forced the columns to be renamed to the same strings, but that does not do the trick. Any idea what I am doing wrong with append? (Commenting out the append line lets the script run without issues.)
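For reference, here is a minimal sketch of what I am trying to achieve with the append step, written with `pd.concat` on small made-up frames instead of my real files. It assumes both frames end up with the same number of columns, so the new rows can simply take over the destination's column labels. Note that `DataFrame.append` returns a new frame rather than modifying the original, and it was removed in pandas 2.0, which is why the sketch uses `pd.concat` and assigns the result.

```python
import uuid
import pandas as pd

# Hypothetical stand-ins for the frames built in process_files().
df_destination = pd.DataFrame([["id-1", "a", 1], ["id-2", "b", 2]])
df_newlines = pd.DataFrame([["c", 3]])

# Add the UUID column as strings, then relabel the columns so that
# both frames share exactly the same labels.
df_newlines.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_newlines))])
df_newlines.columns = df_destination.columns

# concat returns a new frame, so the result has to be assigned.
df_combined = pd.concat([df_destination, df_newlines], ignore_index=True)
print(df_combined)
```

The rows that were already processed keep their original UUIDs, and only the new row gets a fresh one.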
Thank you and best regards, Bostjan