
I am processing a large data set, at least 8 GB in size, using pandas.

I encountered a problem reading the whole set at once, so I read the file chunk by chunk.

As I understand it, chunking the file creates many separate dataframes, so my existing routine only removes the duplicate values within each individual dataframe, not the duplicates across the whole file.

I need to remove the duplicates across the whole data set based on the 'Unique Keys' column.

I tried pd.concat, but I ran into memory problems with that as well, so instead I write each dataframe's results out to a CSV file, appending as I go.

After running the code, the file size doesn't shrink much, so I think my assumption is right that the current routine is not removing duplicates across the whole data set.

I'm new to Python, so it would really help if someone could point me in the right direction.

import pandas as pd
from os.path import join

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)

    for i, df in enumerate(df_iterator):
        # Deduplication happens per chunk only, not across the whole file
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Append each processed chunk to the output file (header only once)
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'),
                  mode='w' if i == 0 else 'a', header=(i == 0),
                  index=False, encoding='utf8')
  • If the three columns needed to remove dups can fit into your RAM, you can read just those columns for all rows by passing the usecols parameter to read_csv. Then drop duplicates to get the list of indices to keep, and process the rest of the dataframe in a second pass (see the sketch after these comments). Commented Mar 11, 2020 at 5:27
  • I don't think it will fit in my RAM. I have 3,238,464,786 rows. Commented Mar 11, 2020 at 14:27
  • Have you considered checking for NaN values instead of only empty values? Are you sure that when you have missing values they're empty strings rather than NaN? Commented Mar 21, 2020 at 7:14
  • Can you update your question and add some information on the structure of the data you're reading (columns, datatype, …)? You could do a df.describe(include='all') on a chunked df. Commented Mar 22, 2020 at 9:10
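
A minimal sketch of that two-pass idea, assuming the key is the single 'Unique Keys' column from the question, that this one column fits in memory, and that the function name and output_filename are illustrative:

import pandas as pd

def two_pass_dedupe(filename, output_filename, key='Unique Keys'):
    # Pass 1: read only the key column and flag the first occurrence of each
    # non-empty key (na_filter=False keeps missing values as empty strings)
    keys = pd.read_csv(filename, usecols=[key], na_filter=False)
    keep = ~keys[key].duplicated(keep='first') & (keys[key] != '')

    # Pass 2: stream the full file in chunks and write only the flagged rows
    reader = pd.read_csv(filename, na_filter=False, chunksize=250000,
                         low_memory=False)
    start = 0
    for i, chunk in enumerate(reader):
        mask = keep.iloc[start:start + len(chunk)].to_numpy()
        start += len(chunk)
        chunk[mask].to_csv(output_filename, mode='w' if i == 0 else 'a',
                           header=(i == 0), index=False, encoding='utf8')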

3 Answers


If the set of unique keys can fit in memory:

import pandas as pd

def removeduplicates(filename, output_filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # set of all keys seen so far, across every chunk
    all_ids = set()

    for i, df in enumerate(df_iterator):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Filter out rows whose key was already seen in an earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]

        # Add the new keys to the set
        all_ids.update(df['Unique Keys'])

        # Write the surviving rows (output_filename is illustrative here);
        # write the header only for the first chunk, then append
        df.to_csv(output_filename, mode='w' if i == 0 else 'a',
                  header=(i == 0), index=False, encoding='utf8')



It's probably easier not to do this with pandas.

import csv

with open(input_csv_file, newline='') as fin, \
        open(output_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    seen_keys = set()
    header = True
    for row in csv.reader(fin):
        if header:
            writer.writerow(row)
            header = False
            continue

        # key_indices: indices of the column(s) that make up the unique key
        key = tuple(row[i] for i in key_indices)
        if not all(key):  # skip rows with an empty key
            continue

        if key not in seen_keys:
            writer.writerow(row)
            seen_keys.add(key)

2 Comments

Would hashing the key lessen the memory usage, or would it just add extra memory overhead to the program? (A sketch of this idea follows these comments.)
@BLitE.exe I do not think seen_keys will give you any trouble, as it only stores the unique keys (no duplicates) and you mentioned your CSV file is only 8 GB.
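
A rough sketch of the hashing idea from the comment above, assuming the key tuple built in the answer's loop: storing a fixed-size digest instead of the full tuple caps the per-key memory cost, at the (very small) risk of digest collisions.

import hashlib

def key_digest(key):
    # Collapse the key tuple into a 16-byte digest; the unit-separator
    # character keeps ('a', 'bc') distinct from ('ab', 'c')
    joined = '\x1f'.join(key).encode('utf8')
    return hashlib.blake2b(joined, digest_size=16).digest()

# Inside the loop above, store digests instead of raw tuples:
# digest = key_digest(key)
# if digest not in seen_keys:
#     writer.writerow(row)
#     seen_keys.add(digest)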

I think this is a clear example of when you should use Dask or PySpark. Both allow you to work with files that do not fit in memory.

As an example with Dask you could do:

import dask.dataframe as dd

# Dask reads the CSV lazily, in partitions, instead of loading it all at once
df = dd.read_csv(filename, na_filter=False)

df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])

# single_file=True writes one CSV instead of one file per partition
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)
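
For comparison, a minimal PySpark sketch of the same approach, assuming the same filename and filename_out names; note that Spark writes its CSV output as a directory of part files rather than a single file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True treats the first line of the CSV as column names
df = spark.read.csv(filename, header=True)

# Drop rows with a null key, then keep one row per distinct key
df = df.na.drop(subset=["Unique Keys"]).dropDuplicates(["Unique Keys"])

df.write.csv(filename_out, header=True, mode="overwrite")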

