
I have a text file; its size is 300 MB. I want to read it and then print the 50 most frequently used words. When I run the program it gives me a MemoryError. My code is as under:-

import sys, string
import codecs
import re
from collections import Counter
import collections
import itertools
import csv
import unicodedata


words_1800 = []

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        sepFile_1800 = line.lower()
        words_1800.extend(re.findall('\w+', sepFile_1800))


for wrd_1800 in [words_1800]:
    long_1800 = [w for w in words_1800 if len(w) > 3]
    common_words_1800 = dict(Counter(long_1800).most_common(50))

print(common_words_1800)

It gives me the following error:-

Traceback (most recent call last):
File "C:\Python34\CommonWords.py", line 17, in <module>
words_1800.extend(re.findall('\w+', sepFile_1800))
MemoryError
  • What is the for wrd_1800 in [words_1800] supposed to do, exactly? Commented Sep 17, 2015 at 9:04
  • What's your file contents look like? can you add a sample data to your question? Commented Sep 17, 2015 at 9:10
  • It's a for loop that prints the words whose length is more than 3. I also tried removing it, but when I run it, it gets stuck in a loop. Commented Sep 17, 2015 at 9:11
  • @Kasramvd yes. This file contains some books which were published in the 18th century. It looks like this: "EVERY MAN IN HIS HUMOUR By Ben Jonson INTRODUCTION THE greatest of English dramatists except Shakespeare, the first literary dictator and poet-laureate, a writer of verse, prose, satire, and criticism who most potently of all the men of his time affected the subsequent course of English letters: such was Ben Jonson, and as such his strong personality assumes an interest to us almost unparalleled, at least in his age. " Commented Sep 17, 2015 at 9:14
  • I think that words_1800.extend(re.findall('\w+', sepFile_1800)) is giving an endless loop. Commented Sep 17, 2015 at 9:14

4 Answers


You can use a generator expression instead of a list to store the results of re.findall, which is much better in terms of memory use. You can also use re.finditer instead of findall, which returns an iterator.

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.findall(r'\w+', line.lower()) for line in File_1800)

Then words_1800 will be a generator yielding lists of the words found. Or use

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    words_1800 = (re.finditer(r'\w+', line.lower()) for line in File_1800)

to get a generator that yields iterators of match objects.
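To tie it together, here is a sketch (not part of the original answer; the helper name and the length filter are assumptions following the question) showing that the generator must be consumed inside the with block, before the file is closed:

```python
import re
from collections import Counter

def top_words(path, n=50, min_len=4, encoding='ISO-8859-1'):
    """Stream the file line by line and count words of at least min_len characters."""
    counts = Counter()
    with open(path, 'r', encoding=encoding) as f:
        # Generator of match iterators -- nothing is materialised up front.
        matches = (re.finditer(r'\w+', line.lower()) for line in f)
        # Consume everything *inside* the with block, before the file closes,
        # otherwise you get "ValueError: I/O operation on closed file".
        for line_matches in matches:
            for m in line_matches:
                word = m.group()
                if len(word) >= min_len:
                    counts[word] += 1
    return counts.most_common(n)

# e.g. print(top_words('E:\\Book\\1800.txt'))
```

Only the Counter stays in memory; the per-line lists and match iterators are discarded as the loop advances.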


6 Comments

Traceback (most recent call last): File "C:\Python34\CommonWords.py", line 21, in <module> long_1800=[w for w in wrd_1800 if len(w)>3] File "C:\Python34\CommonWords.py", line 21, in <listcomp> long_1800=[w for w in wrd_1800 if len(w)>3] File "C:\Python34\CommonWords.py", line 17, in <genexpr> words_1800=(re.finditer('\w+', line.lower()) for line in File_1800) ValueError: I/O operation on closed file.
@Alam you should put the last line inside the with.
Still the error "ValueError: I/O operation on closed file."
You must also consume the words_1800 iterator within the with. I.e. before the file gets closed.
@MathiasEttinger can you please explain with the help of code? I didn't pick up your suggestion :)

You can use the Counter up front, saving the memory used by intermediate lists (especially words_1800, which is as big as the file you're reading):

common_words_1800 = Counter()

with open('E:\\Book\\1800.txt', "r", encoding='ISO-8859-1') as File_1800:
    for line in File_1800:
        for match in re.finditer(r'\w+', line.lower()):
            word = match.group()
            if len(word) > 3:
                common_words_1800[word] += 1

print(common_words_1800.most_common(50))


If your file contains ASCII, you don't need a regex; you can split the words and rstrip the punctuation, creating your Counter with a generator expression:

from string import punctuation
from collections import Counter

with open('E:\\Book\\1800.txt') as f:
    cn = Counter(wrd for line in f
                 for wrd in (w.rstrip(punctuation) for w in line.lower().split())
                 if len(wrd) > 3)
    print(cn.most_common(50))

If you were using a regex, you should compile it first, and you can use it with a generator expression:

from collections import Counter
import re
with open('E:\\Book\\1800.txt') as f:
    r = re.compile(r"\w+")
    cn = Counter(wrd for line in f
                 for wrd in r.findall(line) if len(wrd) > 3)
    print(cn.most_common(50))
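As a small variant (an assumption, not part of the answer), the compiled pattern's finditer avoids building even the per-line lists that findall returns:

```python
import re
from collections import Counter

r = re.compile(r"\w+")

def count_long_words(lines, min_len=4):
    """Count words of at least min_len characters using the compiled pattern's finditer."""
    return Counter(m.group() for line in lines
                   for m in r.finditer(line.lower())
                   if len(m.group()) >= min_len)
```

This works on any iterable of lines, including an open file object, so nothing larger than one line plus the Counter is ever held in memory.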


Your code works, but it looks a bit memory-inefficient. If your file is 300 MB, there can be a lot of words to process. Try the suggestions given by @Kasramvd; it is a good idea to use iterators instead of full lists.

In addition, here is a fine blog post about checking memory usage and profiling applications in python - Python - memory usage.
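As a hedged illustration (a sketch of the idea, not taken from the linked post), Python's standard tracemalloc module can confirm that a streaming Counter keeps peak memory bounded regardless of how many lines are processed:

```python
import tracemalloc
from collections import Counter

def count_words_streaming(lines):
    """Stream over lines, keeping only the Counter in memory."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            if len(word) > 3:
                counts[word] += 1
    return counts

tracemalloc.start()
# A generator feeding 1000 synthetic lines; each is freed after processing.
counts = count_words_streaming("word " * 100 for _ in range(1000))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak memory: {peak / 1024:.1f} KiB")  # stays small: one line plus one Counter entry
```

Running the same measurement over a list-building version would show a peak proportional to the input size instead.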

