Pandas.read_csv() MemoryError

Question

I have a 1gb csv file. The file has about 10000000(10 Mil) rows. I need to iterate through the rows to get the max of a few selected rows(based on a condition). The issue is reading the csv file.

I use the Pandas package for Python. The read_csv() function throws the MemoryError while reading the csv file. 1) I have tried to split the file into chunks and read them, Now, the concat() function has a memory issue.

tp  = pd.read_csv('capture2.csv', iterator=True, chunksize=10000, dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float,  'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float,   'vdd_ext_flash_v': float,   'vsys_i vsys_v': float, 'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float})

df = pd.concat(tp,ignore_index=True)

I have used the dtype to reduce memory hog, still there is no improvement.

Based on multiple blog posts. I have updated numpy, pandas all of them to the latest version. Still no luck.

It would be great if anyone has a solution to this issue.

Please note:

I have a 64bit operating system(Windows 7)
I am running Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit]
I have 4GB Ram.
Numpy latest (pip installer says latest version installed)
Pandas Latest.(pip installer says latest version installed)

The issue is that you are trying to read the entire file into memory. The chunk read is the write approach, but you are undoing the good by trying to use pd.concat. You should compute the maxes for each chunk and carry only the maxes from chunk to chunk. I hope that make sense. — tipanverella
– tipanverella, Commented Mar 21, 2017 at 15:19
If what you are interested in to the max of columns, the best solution might be to write a generator that streams thru your csv file, and computes the maxes iteratively, i.e. bypassing pandas all together. — tipanverella
– tipanverella, Commented Mar 21, 2017 at 15:22
Running python 64 bit will utilize the full 4gb of memory. Your 32 bit python can only use a portion of the 4gb. stackoverflow.com/a/18282931/2327328 — philshem
– philshem, Commented Mar 21, 2017 at 15:31
If the condition for finding the max is not aggregate, then you'd be better off reading the csv line by line and storing the max value you want in variable. stackoverflow.com/a/38058980/2327328 — philshem
– philshem, Commented Mar 21, 2017 at 15:36

Guillaume · Accepted Answer · 2017-03-21 15:23:22Z

If the file you are trying to read is too large to be contained as a whole in memory, you also cannot read it in chunks then reassemble it in memory, because in the end that needs at least as much memory.

You could try to read the file in chuncks, filter out unnecessary rows in each chunck (based on the condition you are mentionning), then reassemble the remaining rows in a dataframe.

Which gives something like that:

df = pd.concat(apply_your_filter(chunck_df) for chunck_df in pd.read_csv('capture2.csv', iterator=True, chunksize=10000, dtype={'timestamp': float, 'vdd_io_soc_i': float, 'vdd_io_soc_v': float, 'vdd_io_plat_i': float, 'vdd_io_plat_v': float, 'vdd_ext_flash_i': float, 'vdd_ext_flash_v': float, 'vsys_i vsys_v': float, 'vdd_aon_dig_i': float, 'vdd_aon_dig_v': float, 'vdd_soc_1v8_i': float, 'vdd_soc_1v8_v': float}), ignore_index=True)

And/or find the max of each chunck, then the max of each of those chunck maxs.

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

Pandas read_csv() has a low memory flag.

tp  = pd.read_csv('capture2.csv',low_memory=True, ...)

The low_memory flag is only available if you use the C parser

engine : {‘c’, ‘python’}, optional

Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.

You can also use the memory_map flag

memory_map : boolean, default False

If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

source

p.s. use 64bit python - see my comment

edited Jun 20, 2020 at 9:12

CommunityBot

11 silver badge

answered Mar 21, 2017 at 15:29

philshem

25.5k8 gold badges66 silver badges136 bronze badges

8 Comments

tipanverella Over a year ago

This is nice, but you might want to explore a solution that does this task using csv. You could still be limited by memory in the above implementation, but the type of computation you are trying to do, should only have a time limitation.

philshem Over a year ago

I added a comment that suggests reading csv line by line, without pandas, but perhaps OP needs aggregate function to calculate max (like moving average)

suhas Over a year ago

I have tried to make it work on a 64bit Python. No luck. I get the same Memory error. Tried working with the "Low_memory =True" and the same error. "File "pandas\parser.pyx", line 2049, in pandas.parser._concatenate_chunks (pandas\parser.c:27557) MemoryError" . Tried using memory_map =True. I got the exact same error. Am I doing somethign wrong?

suhas Over a year ago

Sorry but the First Solution of using Low_memory game me an error and below is the usage.df = pd.read_csv('capture2.csv', engine={'c', 'python'}, low_memory= True ) error: self._engine = klass(self.f, **self.options) UnboundLocalError: local variable 'klass' referenced before assignment

philshem Over a year ago

engine='c'. The {'c', 'python'} just shows what are the options

|

Collectives™ on Stack Overflow

Pandas.read_csv() MemoryError

2 Answers 2

Comments

8 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

8 Comments

Linked

Related