
I am processing a large data set, at least 8 GB in size, using pandas.

I encountered a problem reading the whole set at once, so I read the file chunk by chunk.

As I understand it, chunking the file creates many separate dataframes, so my existing routine only removes the duplicate values within each individual dataframe, not the duplicates across the whole file.

I need to remove the duplicates across the whole data set based on the 'Unique Keys' column.

I tried pd.concat, but I ran into memory problems with that as well, so instead I write each dataframe's results out to a CSV file, appending as I go.

After running the code, the file size doesn't shrink much, so I think my assumption is right that the current routine is not removing duplicates across the whole data set.

I'm new to Python, so it would really help if someone could point me in the right direction.

import pandas as pd
from os.path import join

def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)

    for i, df in enumerate(df_iterator):
        # Deduplication happens per chunk only, not across the whole file
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Append each processed chunk to the output file (header only once)
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'),
                  mode='w' if i == 0 else 'a', header=(i == 0),
                  index=False, encoding='utf8')
  • If the three columns needed to remove dups can fit into your RAM, you can read just those columns for all rows by passing the usecols parameter to read_csv. Then drop duplicates to get the list of indices to keep, and process the rest of the dataframe in a second pass (see the sketch after these comments). Commented Mar 11, 2020 at 5:27
  • I don't think it will fit in my RAM. I have 3,238,464,786 rows. Commented Mar 11, 2020 at 14:27
  • Have you considered checking for NaN values instead of only empty values? Are you sure that when you have missing values they're empty strings rather than NaN? Commented Mar 21, 2020 at 7:14
  • Can you update your question and add some information on the structure of the data you're reading (columns, datatype, …)? You could do a df.describe(include='all') on a chunked df. Commented Mar 22, 2020 at 9:10
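
A minimal sketch of that two-pass idea, assuming the key is the single 'Unique Keys' column from the question, that this one column fits in memory, and that the function name and output_filename are illustrative:

import pandas as pd

def two_pass_dedupe(filename, output_filename, key='Unique Keys'):
    # Pass 1: read only the key column and flag the first occurrence of each
    # non-empty key (na_filter=False keeps missing values as empty strings)
    keys = pd.read_csv(filename, usecols=[key], na_filter=False)
    keep = ~keys[key].duplicated(keep='first') & (keys[key] != '')

    # Pass 2: stream the full file in chunks and write only the flagged rows
    reader = pd.read_csv(filename, na_filter=False, chunksize=250000,
                         low_memory=False)
    start = 0
    for i, chunk in enumerate(reader):
        mask = keep.iloc[start:start + len(chunk)].to_numpy()
        start += len(chunk)
        chunk[mask].to_csv(output_filename, mode='w' if i == 0 else 'a',
                           header=(i == 0), index=False, encoding='utf8')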

3 Answers


If the set of unique keys can fit in memory:

import pandas as pd

def removeduplicates(filename, output_filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # set of all keys seen so far, across every chunk
    all_ids = set()

    for i, df in enumerate(df_iterator):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')

        # Filter out rows whose key was already seen in an earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]

        # Add the new keys to the set
        all_ids.update(df['Unique Keys'])

        # Write the surviving rows (output_filename is illustrative here);
        # write the header only for the first chunk, then append
        df.to_csv(output_filename, mode='w' if i == 0 else 'a',
                  header=(i == 0), index=False, encoding='utf8')



It's probably easier not to do this with pandas.

import csv

with open(input_csv_file, newline='') as fin, \
        open(output_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    seen_keys = set()
    header = True
    for row in csv.reader(fin):
        if header:
            writer.writerow(row)
            header = False
            continue

        # key_indices: indices of the column(s) that make up the unique key
        key = tuple(row[i] for i in key_indices)
        if not all(key):  # skip rows with an empty key
            continue

        if key not in seen_keys:
            writer.writerow(row)
            seen_keys.add(key)

2 Comments

Would hashing the key lessen the memory usage, or would it just add extra memory overhead to the program? (A sketch of this idea follows these comments.)
@BLitE.exe I do not think seen_keys will give you any trouble, as it only stores the unique keys (no duplicates) and you mentioned your CSV file is only 8 GB.
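
A rough sketch of the hashing idea from the comment above, assuming the key tuple built in the answer's loop: storing a fixed-size digest instead of the full tuple caps the per-key memory cost, at the (very small) risk of digest collisions.

import hashlib

def key_digest(key):
    # Collapse the key tuple into a 16-byte digest; the unit-separator
    # character keeps ('a', 'bc') distinct from ('ab', 'c')
    joined = '\x1f'.join(key).encode('utf8')
    return hashlib.blake2b(joined, digest_size=16).digest()

# Inside the loop above, store digests instead of raw tuples:
# digest = key_digest(key)
# if digest not in seen_keys:
#     writer.writerow(row)
#     seen_keys.add(digest)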

I think this is a clear example of when you should use Dask or PySpark. Both allow you to work with files that do not fit in memory.

As an example with Dask you could do:

import dask.dataframe as dd

# Dask reads the CSV lazily, in partitions, instead of loading it all at once
df = dd.read_csv(filename, na_filter=False)

df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])

# single_file=True writes one CSV instead of one file per partition
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)
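
For comparison, a minimal PySpark sketch of the same approach, assuming the same filename and filename_out names; note that Spark writes its CSV output as a directory of part files rather than a single file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# header=True treats the first line of the CSV as column names
df = spark.read.csv(filename, header=True)

# Drop rows with a null key, then keep one row per distinct key
df = df.na.drop(subset=["Unique Keys"]).dropDuplicates(["Unique Keys"])

df.write.csv(filename_out, header=True, mode="overwrite")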

