I am adding UUID columns to .csv files. At the same time, I compare each source file with its processed counterpart: if the source file contains additional lines, I want to append those new lines to the destination file. I want to append rather than overwrite the file because the UUIDs of previously processed lines must stay the same.

To decide whether to append, I check whether the row count is the same for the source and destination files. If it is not, I create a new dataframe from the source file, starting at the row number equal to the row count of the destination file.

At that point, I try to append the newly created dataframe to the destination dataframe, but it keeps failing. I receive the following error:

> RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
>   result = result.union(other)

The code I am using is below:

import os, uuid
import pandas as pd


def process_files():
    source_dir = "C:\\Projects\\test\\raw"
    destination_dir = "C:\\Projects\\test\\processed"

    for file_name in os.listdir(source_dir):
        if file_name.endswith((".csv", ".new")):
            df_source = pd.read_csv(source_dir + "/" + file_name, sep=";")

            if os.path.isfile(destination_dir + "/" + file_name):
                df_destination = pd.read_csv(destination_dir + "/" + file_name, sep=",", header=None)

                if df_source.shape[0] != (df_destination.shape[0]):
                    df_newlines = pd.read_csv(source_dir + "/" + file_name, sep=";", skiprows=df_destination.shape[0], header=None)
                    df_newlines.insert(0, "uu_id", pd.Series([uuid.uuid4() for i in range(len(df_newlines))]))
                    df_destination.append(df_newlines, ignore_index=True)
                    df_destination.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
                else:
                    continue
            else:
                df_source.insert(0,"uu_id", pd.Series([uuid.uuid4() for i in range(len(df_source))]))
                df_source.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
        else:
            continue


process_files()

I have checked the dtypes of both dataframes, and they match column by column. I have also renamed the columns so that they have the same string names, but that does not do the trick. Any idea what I am doing wrong with append? (Commenting out the append line lets the script run without issues.)

Thank you and best regards, Bostjan

1 Answer

Disclaimer: Due to a lack of reputation points, I am not allowed to comment

append does not work in place: it returns a new dataframe and leaves df_destination unchanged. Hence, I would suggest assigning the result back:

df_destination = df_destination.append(df_newlines, ignore_index=True)

Hope that's it.
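One caveat for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the line above only runs on older versions. A minimal sketch of the same step with pd.concat follows. The two small dataframes are hypothetical stand-ins for df_destination and df_newlines from the question, not the real data.

```python
import uuid

import pandas as pd

# Hypothetical stand-ins for df_destination and df_newlines from the question.
df_destination = pd.DataFrame({"uu_id": [str(uuid.uuid4())], "value": ["a"]})
df_newlines = pd.DataFrame({"uu_id": [str(uuid.uuid4())], "value": ["b"]})

# Like append, pd.concat returns a new dataframe, so the result must be
# assigned back; ignore_index=True renumbers the rows 0..n-1.
df_destination = pd.concat([df_destination, df_newlines], ignore_index=True)
```

As with append, both dataframes should use the same column labels. Otherwise the result contains the union of the columns, with missing values where a column is absent from one side.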

Apart from that, I suggest using os.walk and fnmatch to browse the files.
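A rough sketch of what that could look like is below. The helper name find_files and its default patterns are assumptions made for illustration, chosen to mirror the ".csv" and ".new" check in the question. Note that os.walk also descends into subdirectories, which os.listdir does not.

```python
import fnmatch
import os


def find_files(root_dir, patterns=("*.csv", "*.new")):
    """Yield paths of files under root_dir whose names match any pattern.

    find_files is a hypothetical helper, not part of the original script.
    """
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            # fnmatch compares the bare file name against a shell-style pattern.
            if any(fnmatch.fnmatch(file_name, pattern) for pattern in patterns):
                yield os.path.join(dir_path, file_name)
```

The loop in process_files could then iterate over find_files(source_dir) instead of filtering os.listdir(source_dir) by hand.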

1 Comment

Hello! Thank you for the help; it does indeed solve my issue. In the meantime, I also found a workaround (in case anyone finds it useful as well). Instead of using append(), I created a new dataframe with the missing lines and then used .to_csv() with the mode parameter set to "a". Best regards, Bostjan
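The workaround described in this comment might look roughly like the sketch below. The function name append_new_rows is hypothetical, and the sketch assumes the destination file already exists, has no header row, and ends with a newline.

```python
import uuid

import pandas as pd


def append_new_rows(df_newlines, destination_path):
    """Add a uu_id column to the new rows and append them to an existing CSV.

    append_new_rows is a hypothetical helper, not code from the original post.
    """
    df_out = df_newlines.copy()
    df_out.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_out))])
    # mode="a" appends to the file instead of overwriting it, so rows that
    # were written earlier keep their UUIDs; header=False avoids writing
    # a header line in the middle of the file.
    df_out.to_csv(destination_path, sep=",", header=False, mode="a", index=False)
```

Because the existing file is never rewritten, the destination dataframe does not need to be loaded and concatenated at all; only the row count of the destination file is needed to work out which source rows are new.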
