I am processing a Large Data Set with at least 8GB in size using pandas.
I've encountered a problem in reading the whole set so I read the file chunk by chunk.
In my understanding, chunking the whole file will create many different dataframes. So using my existing routine, this only removes the duplicate values on that certain dataframe and not the duplicates on the whole file.
I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.
I tried to use the pd.concat but I also encountered a problem with the memory so I tried to write the file on a csv file and append all the results of the dataframes on it.
After running the code, the file doesn't reduce much so I think my assumption is right that the current routine is not removing all the duplicates based on the whole data set.
I'm a newbie in Python so it would really help if someone can point me in the right direction.
def removeduplicates(filename):
CHUNK_SIZE = 250000
df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
low_memory=False)
# new_df = pd.DataFrame()
for df in df_iterator:
df = df.dropna(subset=['Unique Keys'])
df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
' Step-3.csv'), mode='w', index=False, encoding='utf8')
namesparameter toread_csv. Then, drop duplicates to get the list of indices to keep and process the rest of the dataframe in the second pass.df.describe(include='all')on a chunked df.