
My laptop has 8 GB of memory, and I was trying to read and process a big CSV file when I ran into memory issues. I found a solution, which is to use chunksize to process the file chunk by chunk, but apparently when using chunksize the object returned becomes a TextFileReader, and the code I was using to process normal CSVs no longer works with it. This is the code I'm trying to use to count how many sentences are in the CSV file:

import pandas as pd

# read only the header row to decide whether the first line is a one-word header
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
# with chunksize, read_csv returns a TextFileReader instead of a DataFrame
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000)

data = wdata.count()  # this is the line that raises the error below
print(data)

The error I'm getting is:

Traceback (most recent call last):
  File "table.py", line 24, in <module>
    data = wdata.count()
AttributeError: 'TextFileReader' object has no attribute 'count'

I also tried another way around it by running this code:


TextFileReader = pd.read_csv(fileinput, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)  # keeping every chunk rebuilds the whole file in memory

df = pd.concat(dfList, sort=False)
print(df)

and it gives this error:


   data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 4

1 Answer


You have to iterate over the chunks:

csv_length = 0
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=10000):
    csv_length += chunk.count()
print(csv_length)
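
Note that DataFrame.count() returns a per-column Series of non-null counts, so csv_length above ends up as a one-element Series rather than a plain integer. If all you want is the total number of rows, a minimal equivalent sketch (reusing the fileinput and skip variables from the question) is:

import pandas as pd

csv_length = 0
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=10000):
    csv_length += len(chunk)  # number of rows in this chunk only
print(csv_length)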

Comments

This is printing 1000, 1000, 1000, 1000 for more than 200 times.
@programmingfreak Yes, obviously: you are reading chunks with a length of 1000 and printing the length of each one.
@programmingfreak You have to add the count of each chunk to a variable to get the full length.
I know, I tried appending all of them to print the length of the file, but it killed the process automatically for some reason. Is there a way around it?
The other attempt doesn't really make sense: your memory is too small to hold the full CSV, so you can't read the chunks and append them back together. You have to process each chunk and then clear it out of memory (a sketch of this pattern follows below).
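To illustrate the last comment, here is a rough sketch of the chunk-at-a-time pattern: do all the work for a chunk inside the loop and only keep small running totals, never the chunks themselves. The word-count aggregate is just a hypothetical example of per-chunk work; fileinput and skip are the variables from the question.

import pandas as pd

sentence_total = 0
word_total = 0
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=10000):
    # aggregate what you need from this chunk, then let it be garbage-collected;
    # appending chunks to a list would rebuild the whole file in memory
    sentence_total += chunk['sentences'].count()  # non-null sentences in this chunk
    word_total += chunk['sentences'].str.split().str.len().sum()  # hypothetical extra aggregate
print(sentence_total, word_total)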
