File handling in python for large amount of data

Question

I am looking for some efficient way to read from a large file and grep a string 'abracadabra' (=s1), here s1 is always present before eof(end of file) And if s1 is present search for another string 'P QRST'

i.e. Data of File f1:

ABCD EFGH IJKL MNOP
QRST UVW XY Z
...
...
abracadabra(eof)

I am novice for python, to do this I was using following method:

f = open('f1', 'r')
for line in f:
   if 'abracadabra' in line:
      print 'Found'
      found = True
      break
else:
   print 'EOF reached, string not found'        
f.close()

f = open('f1', 'r')
for line in f:
   if 'P QRST' in line:             # 'P QRST' can be present anywhere in the file(can be either in begging or just before s1)
      #do some operation

But this approach involves huge overhead for a large file data, as it read the files twice. Please suggest some efficient approach to do this.

Is there any efficient way like a 'grep' application in Linux shell?

Why not search for both? You are already processing the file line by line, remembering if you've seen 'P QRST' while also looking for abracadabra isn't an arduous task. — Martijn Pieters
– Martijn Pieters, Commented Apr 21, 2014 at 7:50

Roland Smith · Accepted Answer · 2014-04-21 16:08:24Z

If grep already does what you want, then just use it with the subprocess module;

rv = subprocess.check_call(['grep', first, filename])
if rv is 0:
    rv = subprocess.check_call(['grep', second, filename])
    if rv is 0:
        print 'Found', second
    elif rv is 1:
        print second, 'not found'
    else:
        print 'Error looking for', second
elif rv is 1:
    print first, 'not found'
else:
    print 'Error looking for', first

Edit: (in response to comments)

Remember that the most efficient code is the one you don't have to write, since programmer time is much more expensive than computer time. Also keep in mind that grep has been well optimized. One of the big wins was not splitting the input into lines. Which your Python solution is doing twice.

At the very least you should keep the list of lines and re-use it.

But if you want to do it in Python, there are some things that you can copy from GNU grep;

Use page-aligned, page-sized blocks of memory
Avoid splitting the data into lines
Use mmap

Edit 2:

As Spacedman correctly mentions, there is no substitute for testing. So let's do that.

I ran some tests on a 40 MB test file, searching for the non-existant string abracadabra. First, using BSD grep;

> time grep abracadabra procmail.log
0.242u 0.015s 0:00.25 100.0%    58+179k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.192u 0.016s 0:00.20 100.0%    57+176k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.023s 0:00.20 100.0%    59+183k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.199u 0.007s 0:00.20 95.0% 60+186k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.023s 0:00.20 100.0%    59+183k 0+0io 0pf+0w
> time grep abracadabra procmail.log
0.184u 0.024s 0:00.20 100.0%    57+176k 0+0io 0pf+0w

Then with the following program:

import mmap

with open('procmail.log', 'r+b') as p:
    mm = mmap.mmap(p.fileno(), 0)
    rv = mm.find('abracadabra')
    print rv

This gave:

> time python foo.py
-1
0.139u 0.024s 0:00.16 93.7% 1701+549k 0+1io 0pf+0w
> time python foo.py
-1
0.094u 0.039s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.109u 0.023s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.117u 0.015s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w
> time python foo.py
-1
0.125u 0.007s 0:00.13 92.3% 1807+583k 0+1io 0pf+0w

So;

on my machine
searching for a simple string

using mmap in Python is slightly faster than calling BSD grep.

Keep in mind that the mmap objects (at least in python 2.7) do not support searching for regular expressions. And results may differ with file size, size of the RAM, operating system et cetera.

Yes, this can also be possible. Which approach is more efficient, in terms of memory utilization and time consumption ? As far as I know grep uses regex, so searching is much faster, but I don't have idea about the memory consumption using grep and using python file handling.
grep works line by line, so doesn't chew up memory. But then reading a file in python line by line doesn't chew up memory either. But running grep TWICE could be slower than scanning it once in Python, depends on whats the bottleneck. Do some benchmarks.
@Spacedman GNU grep does not work line-by-line. See the link after "Edit".
@RolandSmith neat trick. ignore line breaks until you get a match and then check match wrt nearest line breaks. but what it doesn't do is read the whole thing into memory and then work on it, which would be the only way that memory utilization could be an issue here.
@RolandSmith Thanks for this wonderful explanation and also for sharing your observations with an example. It's useful answer and The way you have explained explain makes the people fall in love with stackoverflow.

Spacedman · Accepted Answer · 2014-04-21 07:54:22Z

1

If you find the abracadabra, set a flag, if you find the PQRST, set a flag, if you've found both, do something.

found = False
foundPQRST = False
f = open('f1', 'r')
for line in f:
   if 'P QRST' in line:
      foundPQRST= True
      PQRSTline = line   
   if 'abracadabra' in line:
      print 'Found'
      abraline = line
      found = True
   if found and foundPQRST:
      dosomething(PQRSTline, abraline)
      break
if not found:
   print "wasn't found"
f.close()

If there are multiple PQRST-matching lines, this will call dosomething with the last one before 'abracadabra' or the first one after. Not how it saves the PQRST line and the abracadabra line when it finds matches to use it in the call. You might not need to do this, depending on what dosomething does.

Also, you might want to use a with clause with the open function.

answered Apr 21, 2014 at 7:54

Spacedman

94.7k12 gold badges147 silver badges231 bronze badges

2 Comments

ronex dicapriyo Over a year ago

This can be possible but what if I can directly check for 'abracadabra'( because if s1 is present then it is just placed before eof) # so if I can seek the file pointer to pos=(total_size-sizeof(s1)) position then after reading sizeof(s1) bytes from pos, I can move the pointer at the s tart of the file then readling line by lines and searing for 'P QRST', In C, this can be done through fseek, Is there any efficient way of doing this in python ?

Spacedman Over a year ago

I don't see the asymmetry between the two things you are looking for. You only do_something if you find both. So scan the file ONCE, looking for both, and doing your do_something AS SOON AS you find the second one of the two things. That's what my code does. Python does have "seek" and "tell" methods for files but without knowing more about your specific task you may not need them: docs.python.org/release/2.4.4/lib/bltin-file-objects.html

logc · Accepted Answer · 2014-04-21 12:19:35Z

You can use the seek method of the file handle to rewind from the end of the file, by using a 2 as its second argument. Then you can check for the abracadabra string at the end of the file, and if found, scan the whole file (from the beginning) for the P QRST string:

with open('sample.txt', 'r') as fh:
    s1 = 'abracadabra'
    fh.seek(-len(s1), 2)
    if fh.read() == s1:
        fh.seek(0)
        for line in fh:
            if 'P QRST' in line:
                print 'Bingo'

Collectives™ on Stack Overflow

File handling in python for large amount of data

3 Answers 3

5 Comments

2 Comments

Comments

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

5 Comments

2 Comments

Comments

Related