Python Pandas append dataframe

Question

I am adding UUID columns to .csv files. At the same time, I compare each source file with its processed counterpart: if the source file contains additional lines, I plan to append those new lines to the destination file. I want to append rather than overwrite the file because the UUIDs of previously processed lines must stay the same.

To append lines, I check whether the row count is the same for the source and destination files. If it is not, I create a new dataframe from the source file, starting at the row number that equals the row count of the destination file.

At that point, I try to append the newly created dataframe to the destination dataframe, but it keeps failing. I receive the following error:

> RuntimeWarning: '<' not supported between instances of 'int' and 'str', sort order is undefined for incomparable objects
> result = result.union(other)

The code I am using is below:

import os, uuid
import pandas as pd


def process_files():
    source_dir = "C:\\Projects\\test\\raw"
    destination_dir = "C:\\Projects\\test\\processed"

    for file_name in os.listdir(source_dir):
        if file_name.endswith((".csv", ".new")):
            df_source = pd.read_csv(source_dir + "/" + file_name, sep=";")

            if os.path.isfile(destination_dir + "/" + file_name):
                df_destination = pd.read_csv(destination_dir + "/" + file_name, sep=",", header=None)

                if df_source.shape[0] != (df_destination.shape[0]):
                    df_newlines = pd.read_csv(source_dir + "/" + file_name, sep=";", skiprows=df_destination.shape[0], header=None)
                    df_newlines.insert(0, "uu_id", pd.Series([uuid.uuid4() for i in range(len(df_newlines))]))
                    df_destination.append(df_newlines, ignore_index=True)
                    df_destination.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
                else:
                    continue
            else:
                df_source.insert(0,"uu_id", pd.Series([uuid.uuid4() for i in range(len(df_source))]))
                df_source.to_csv(destination_dir + "/" + file_name, sep=",", header=False, mode="w", index=False)
        else:
            continue


process_files()

I have checked the dtypes of both dataframes, and they match column by column. I have also forced the columns to be renamed to the same strings, but that does not do the trick. Any idea what I am doing wrong with append? (Commenting out the append line lets the script run without issues.)

Thank you and best regards, Bostjan

Eulenfuchswiesel · Accepted Answer · 2017-12-14 15:50:55Z


Disclaimer: Due to a lack of reputation points, I am not allowed to comment.

append does not work in place: it returns a new dataframe and leaves the original unchanged. Hence, I would suggest writing

df_destination = df_destination.append(df_newlines, ignore_index=True)

Hope that's it.
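For readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and pd.concat is the replacement. The same rule applies there, since it also returns a new dataframe that has to be assigned. Below is a minimal sketch, not the asker's actual data: the two small frames are made-up stand-ins. It also relabels the new rows' columns to match the destination, on the assumption that both frames hold the same columns in the same order. Frames read with header=None get integer labels, while the inserted "uu_id" label is a string, which is the kind of mix the warning in the question complains about.

```python
import uuid

import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
# Both mimic read_csv(..., header=None), so their columns are labeled 0, 1, 2, ...
df_destination = pd.DataFrame([["id-1", "a", 1], ["id-2", "b", 2]])
df_newlines = pd.DataFrame([["c", 3]])
df_newlines.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_newlines))])

# The new rows now have labels ["uu_id", 0, 1]; reuse the destination's
# labels so values line up by position (assumes identical column order).
df_newlines.columns = df_destination.columns

# concat returns a new DataFrame and leaves its inputs unchanged,
# so the result must be assigned back to a name.
df_destination = pd.concat([df_destination, df_newlines], ignore_index=True)
```

After this, df_destination holds the two original rows followed by the new one, with a fresh 0..2 row index because of ignore_index=True.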

Apart from that, I suggest using os.walk and fnmatch to browse the files.
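A rough sketch of what that could look like, assuming the same two extensions as in the question. The helper name find_files and its parameters are made up for illustration. Note that os.walk also descends into subdirectories, which os.listdir does not, so this is only a drop-in replacement if that is what you want.

```python
import fnmatch
import os


def find_files(root_dir, patterns=("*.csv", "*.new")):
    """Yield paths under root_dir whose file names match any of the patterns."""
    # os.walk visits root_dir and every subdirectory below it.
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            # fnmatch does shell-style wildcard matching on the bare file name.
            if any(fnmatch.fnmatch(file_name, pattern) for pattern in patterns):
                yield os.path.join(dir_path, file_name)
```

Usage would then be along the lines of `for path in find_files(source_dir): ...`, with os.path.join building the paths instead of string concatenation with "/".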


1 Comment

Bostjan Over a year ago

Hello! Thank you for the help, it does indeed solve my issue. In the meantime I also found a workaround (in case anyone finds it useful): instead of using append(), I created a new dataframe with the missing lines and then used .to_csv() with the mode parameter set to "a". Best regards, Bostjan
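The workaround described in this comment could look roughly like the sketch below. The data and the temporary file path are made up for the demo. The point is that mode="a" adds the new rows to the end of the existing file, so the UUIDs written earlier are never touched.

```python
import os
import tempfile
import uuid

import pandas as pd

# Hypothetical destination file, created in a temporary directory for the demo.
destination_path = os.path.join(tempfile.mkdtemp(), "processed.csv")

# First run: write the rows processed so far, each with a new UUID.
df_processed = pd.DataFrame([["a", 1], ["b", 2]])
df_processed.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_processed))])
df_processed.to_csv(destination_path, sep=",", header=False, index=False, mode="w")

# Later run: only the missing rows get UUIDs and are appended to the file.
df_newlines = pd.DataFrame([["c", 3]])
df_newlines.insert(0, "uu_id", [str(uuid.uuid4()) for _ in range(len(df_newlines))])
df_newlines.to_csv(destination_path, sep=",", header=False, index=False, mode="a")

# Read the file back to confirm that the earlier rows are still in place.
df_check = pd.read_csv(destination_path, sep=",", header=None)
```

This avoids loading and rewriting the whole destination file, at the cost of trusting that the file already ends with a newline (which to_csv writes by default).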
