
I have a text file; its size is 300 MB. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:-

import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import unicodedata


words_1800 = []

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))


for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))

print(common_words_1800)

It gives me the following error:-

Traceback (most recent call last):
File "C:\Python34\CommonWords.py", line 17, in <module>
words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
  • What is the for wrd_1800 in [words_1800] supposed to do, exactly? Commented Sep 17, 2015 at 9:04
  • What's your file contents look like? can you add a sample data to your question? Commented Sep 17, 2015 at 9:10
  • It's a for loop that prints the words whose length is more than 3. I also tried removing it, but when I run it, it gets stuck in a loop. Commented Sep 17, 2015 at 9:11
  • @Kasramvd yes. This file contains some books which were published in the 18th century. It looks like this: "EVERY MAN IN HIS HUMOUR By Ben Jonson INTRODUCTION THE greatest of English dramatists except Shakespeare, the first literary dictator and poet-laureate, a writer of verse, prose, satire, and criticism who most potently of all the men of his time affected the subsequent course of English letters: such was Ben Jonson, and as such his strong personality assumes an interest to us almost unparalleled, at least in his age. " Commented Sep 17, 2015 at 9:14
  • I think that words_1800.extend(re.findall('\w+', sepFile_1800)) is giving an endless loop. Commented Sep 17, 2015 at 9:14

4 Answers


You can use a generator expression instead of a list to store the results of re.findall, which is much better in terms of memory use. You can also use re.finditer instead of findall, which returns an iterator.

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall(r'\w+', line.lower()) for line in File_1800)

Then words_1800 will be a generator yielding lists of the words found. Or use

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer(r'\w+', line.lower()) for line in File_1800)

to get a generator that yields iterators of match objects.
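To tie it together, here is a sketch (not part of the original answer; the helper name and the length filter are assumptions following the question) showing that the generator must be consumed inside the with block, before the file is closed:

```python
import re
from collections import Counter

def top_words(path, n=50, min_len=4, encoding='ISO-8859-1'):
    """Stream the file line by line and count words of at least min_len characters."""
    counts = Counter()
    with open(path, 'r', encoding=encoding) as f:
        # Generator of match iterators -- nothing is materialised up front.
        matches = (re.finditer(r'\w+', line.lower()) for line in f)
        # Consume everything *inside* the with block, before the file closes,
        # otherwise you get "ValueError: I/O operation on closed file".
        for line_matches in matches:
            for m in line_matches:
                word = m.group()
                if len(word) >= min_len:
                    counts[word] += 1
    return counts.most_common(n)

# e.g. print(top_words('E:\\Book\\1800.txt'))
```

Only the Counter stays in memory; the per-line lists and match iterators are discarded as the loop advances.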


6 Comments

Traceback (most recent call last): File "C:\Python34\CommonWords.py", line 21, in <module> long_1800=[w for w in wrd_1800 if len(w)>3] File "C:\Python34\CommonWords.py", line 21, in <listcomp> long_1800=[w for w in wrd_1800 if len(w)>3] File "C:\Python34\CommonWords.py", line 17, in <genexpr> words_1800=(re.finditer('\w+', line.lower()) for line in File_1800) ValueError: I/O operation on closed file.
@Alam you should put the last line inside the with.
Still the error "ValueError: I/O operation on closed file."
You must also consume the words_1800 iterator within the with. I.e. before the file gets closed.
@MathiasEttinger can you please explain with the help of code? I didn't pick up your suggestion :)

You can use the Counter up front, saving the memory used by intermediate lists (especially words_1800, which is as big as the file you're reading):

common_words_1800 = Counter()

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1

print(common_words_1800.most_common(50))


If your file contains ASCII, you don't need a regex; you can split the words and rstrip the punctuation, creating your Counter with a generator expression:

from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f
                 for wrd in (w.rstrip(punctuation) for w in line.lower().split())
                 if len(wrd) > 3)
    print(cn.most_common(50))

If you were using a regex, you should compile it first, and you can use it with a generator expression:

from collections import Counter
import re
with open('E:\\Book\\1800.txt') as f:
    r = re.compile(r"\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line) if len(wrd) > 3)
    print(cn.most_common(50))
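As a small variant (an assumption, not part of the answer), the compiled pattern's finditer avoids building even the per-line lists that findall returns:

```python
import re
from collections import Counter

r = re.compile(r"\w+")

def count_long_words(lines, min_len=4):
    """Count words of at least min_len characters using the compiled pattern's finditer."""
    return Counter(m.group() for line in lines
                   for m in r.finditer(line.lower())
                   if len(m.group()) >= min_len)
```

This works on any iterable of lines, including an open file object, so nothing larger than one line plus the Counter is ever held in memory.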


Your code works, but it looks a bit memory-inefficient. If your file is 300 MB, there can be a lot of words to process. Try the suggestions given by @Kasramvd; it is a good idea to use iterators instead of full lists.

In addition, here is a fine blog post about checking memory usage and profiling applications in python - Python - memory usage.
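As a hedged illustration (a sketch of the idea, not taken from the linked post), Python's standard tracemalloc module can confirm that a streaming Counter keeps peak memory bounded regardless of how many lines are processed:

```python
import tracemalloc
from collections import Counter

def count_words_streaming(lines):
    """Stream over lines, keeping only the Counter in memory."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            if len(word) > 3:
                counts[word] += 1
    return counts

tracemalloc.start()
# A generator feeding 1000 synthetic lines; each is freed after processing.
counts = count_words_streaming("word " * 100 for _ in range(1000))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak memory: {peak / 1024:.1f} KiB")  # stays small: one line plus one Counter entry
```

Running the same measurement over a list-building version would show a peak proportional to the input size instead.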

