
I'm having issues with Pandas taking up way too much RAM. I've got a 5.5 GB file with 2 columns, and I simply want to save all the unique values in the first column, like so:

Main File
    Follower    Friend
0   12          260009730
1   12          17568791
2   12          22512883
3   12          15808761
4   12          10135072
5   12          988
6   12          22424855
7   13          9163182
8   14          22990962
9   15          7681662
10  15          17289517

to

Result File
     User
0    12
1    13
2    14
3    15

Because of RAM limitations, I'm importing the main file in 30 pieces, trying to purge the dataframes from memory and only appending to the result file each time. After two iterations (out of thirty) the result file is 13.5 MB, but it consistently crashes after the 6th iteration, and I can see in my process manager that Python is taking up 4.5 GB of RAM. I'm trying to call the garbage collector but apparently it's not working. Can you guys help me out? My code is as follows:

import gc
import math
import pandas as pd

i = 0
userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
while i < 30:
    userRelationHelp = pd.DataFrame(columns=['User'])
    print(str(i))
    network = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                          low_memory=False,
                          skiprows=(i * math.ceil(284884514/30)),
                          nrows=(((i+1) * math.ceil(284884514/30))),
                          names=['Follower', 'Friend'])

    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
    lst = [userRelationHelp, network]
    del lst
    gc.collect()
    i += 1

From what I've read, the last 3 lines before i += 1 should serve to purge the larger dataframes from memory. After each iteration I can see the RAM in use at the start of the cycle constantly increasing by ~200 MB, and during the cycle it increases by more with each run.

Base Python RAM usage before running the above code: 76 MB

Approximate Python RAM usage (MB) at the start of each cycle

0: 300
1: 800
2: 1000
3: 1300

Approximate Python RAM usage (MB) at the end of each cycle

0: 633
1: 2000
2: 2900
3: 3700
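
The numbers above come from the OS process manager; something like the following (a sketch using the third-party psutil package, not code I actually ran) could log the interpreter's resident memory from inside the loop:

import os
import psutil  # third-party package, assumed installed

def rss_mb():
    # resident set size of the current Python process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

# e.g. inside the while loop, before and after gc.collect():
print(i, round(rss_mb()))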

Can anyone point out what I'm doing or assuming incorrectly?

1 Comment

  • Check out read_csv()'s chunksize param, and perhaps dask. – Commented Jul 26, 2017 at 15:43
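
A minimal sketch of the dask route suggested in that comment (assuming dask is installed; dask reads the file lazily in partitions, so the whole 5.5 GB never has to sit in RAM at once):

import dask.dataframe as dd

# lazily partitioned read; nothing is loaded until .compute()
ddf = dd.read_csv('file.txt', sep='\t', header=None, names=['Follower', 'Friend'])
users = ddf['Follower'].drop_duplicates().compute()  # returns a pandas Series of unique IDs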

2 Answers


@ypnos gave you a perfect description of how it should be done in a Pandaic way.

Here is my attempt to code it:

Generate sample data and write it to CSV:

import numpy as np
import pandas as pd

fn = r'c:/temp/data.csv'

pd.DataFrame(np.random.randint(1001, 9999, (10**5, 2))) \
  .to_csv(fn, sep='\t', index=False, header=None)

Processing only those columns from the CSV that we need:

chunksize=10**2  # you may want to use 10**7 as a chunk size
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'], 
                     chunksize=chunksize, squeeze=True)

df = pd.DataFrame(np.unique(np.concatenate([ser.unique() for ser in reader])),
                  columns=['User'])

Result:

In [62]: df
Out[62]:
      User
0     1001
1     1002
2     1003
3     1004
4     1005
5     1006
6     1007
7     1008
8     1009
9     1010
...    ...
8988  9989
8989  9990
8990  9991
8991  9992
8992  9993
8993  9994
8994  9995
8995  9996
8996  9997
8997  9998

[8998 rows x 1 columns]
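
One portability note: the squeeze parameter of read_csv was later deprecated (pandas 1.4) and removed (pandas 2.0). On a recent pandas the same idea can be sketched without it, iterating over single-column DataFrame chunks instead of Series:

# rough equivalent for pandas >= 2.0, where read_csv no longer accepts squeeze=True
reader = pd.read_csv(fn, sep='\t', usecols=[0], names=['Follower'], chunksize=chunksize)
uniques = np.unique(np.concatenate([chunk['Follower'].unique() for chunk in reader]))
df = pd.DataFrame(uniques, columns=['User'])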

7 Comments

Beautiful! A one-line solution that also shows how to combine the chunk results.
@ypnos, thank you! I like your explanation (in your answer) very much!
Thanks man! Excellent answer, directly usable. Wish I knew about chunksize before; I could've done a lot of large-file management more efficiently.
@user3394131, glad I could help :)
@user3394131, yes, ser is a pandas.Series for each chunk. The squeeze=True parameter converts a DataFrame with only one column into a Series.

Your way of splitting the huge file is extremely inefficient. When you use the skiprows parameter, the reader needs to go through the file line by line, counting line-ending characters, until the requested number of rows has passed. So in your final iteration, the whole file is read again, after you have already crunched through most of it multiple times.

Please note in passing that your usage of nrows looks wrong to me, too. I would expect nrows=math.ceil(284884514/30). I believe this is the explanation for your memory problem.
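
A rough sketch of that correction, for illustration only (still inefficient, because skiprows has to reread everything already processed):

rows_per_piece = math.ceil(284884514/30)
network = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                      skiprows=i * rows_per_piece,  # skip the pieces already processed
                      nrows=rows_per_piece,         # read exactly one piece, not (i+1) pieces
                      names=['Follower', 'Friend'])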

What you need instead is an iterative way of running through the file at once, which is provided by the chunksize parameter. The use of chunksize is very well illustrated in the Pandas IO Tools docs.

Illustratory code example:

import math
import pandas as pd

userRelation = pd.DataFrame(columns=['User'])
Location = 'file.txt'
chunksize = math.ceil(284884514/30)

reader = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                     low_memory=False, chunksize=chunksize,
                     names=['Follower', 'Friend'])
for network in reader:
    userRelationHelp = pd.DataFrame(columns=['User'])
    userRelationHelp['User'] = network['Follower'].unique()
    userRelation = userRelation.append(userRelationHelp)
    lst = [userRelationHelp, network]
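
A side note for readers on newer pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0. One sketch of the same loop that avoids it (and avoids growing a DataFrame inside the loop) is to collect the IDs in a plain Python set:

users = set()
reader = pd.read_csv(Location, sep="\t", header=None, encoding='Latin',
                     chunksize=chunksize, names=['Follower', 'Friend'])
for network in reader:
    users.update(network['Follower'].unique())  # accumulate unique follower IDs per chunk
userRelation = pd.DataFrame(sorted(users), columns=['User'])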

3 Comments

Very nice and detailed answer! ++ I think we can squeeze a little bit more by skipping the columns that we don't need ;-)
Thank you! I did not know exactly how nrows worked and that messed it up. Chunksize is way better in this context.
And you are absolutely right about the usage of nrows. I'm a buffoon and for some reason believed skiprow:nrow was the same as df[skiprow:nrow] which I know is totally wrong.
