
The basic requirement is that I need to process 4 GB text files on a per-line basis.

Using .readline() or for line in f is great for memory, but the I/O takes ages. I would like to use something like yield, but that (I think) will chop lines.

POSSIBLE ANSWER:

file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.

Didn't realize you could do this!
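
For what it's worth, here is a minimal sketch of how that might be used in a loop; the filename and the 1 MB size hint are just placeholders:

with open("bigfile.txt") as f:
    while True:
        # readlines(sizehint) returns a batch of whole lines totalling
        # roughly sizehint bytes, so lines are never split across batches
        lines = f.readlines(1024 * 1024)
        if not lines:
            break  # end of file
        for line in lines:
            pass  # process the line here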

  • Are you parsing log files? Then don't. There are libraries available that will do it better. Commented Jun 29, 2011 at 10:41
  • nananananananananananananana iterators! (batman theme btw) Commented Jun 29, 2011 at 10:58
  • I'm looking through big setup files and appending lines. I'm trying to get readlines() to work, but it's proving difficult; it doesn't appear to move on to the next chunk as requested. Commented Jun 29, 2011 at 11:16

3 Answers


You can just iterate over the file object:

with open("filename") as f:
    for line in f:
        pass  # do whatever you need with the line

This will do some internal buffering to improve the performance. (Note that file.readline() will perform considerably worse because it does not buffer -- that's why you can't mix iteration over a file object with file.readline().)


2 Comments

That's what I meant by using .readline(); doing it like this is fine on memory, but it takes ages.
@jdborg: file.readline() behaves quite differently to iterating over the file. Iterating will do the buffering for you and should not impose a performance bottleneck.

If you want to do something on a per-line basis you can just loop over the file object:

f = open("w00t.txt")
for line in f:
    pass  # do stuff with the line

However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should use a larger chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n, process that part, and prepend whatever is left over to the next chunk.
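
A rough sketch of that chunk-and-carry idea, assuming a placeholder process() function, a made-up filename, and a 4096-byte chunk size:

def process(line):
    pass  # placeholder for whatever per-line work you need

with open("bigfile.txt") as f:
    leftover = ""
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        chunk = leftover + chunk
        lines = chunk.split("\n")
        # the last element is an incomplete line (or "" if the chunk
        # ended exactly on a newline); carry it over to the next chunk
        leftover = lines.pop()
        for line in lines:
            process(line)
    if leftover:
        process(leftover)  # final line with no trailing newline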

2 Comments

That's what I meant by using .readline(); doing it like this is fine on memory, but it takes ages.
@jdborg: Read the second part of my answer.

You could always chunk the lines up? I mean, why open one file and iterate all the way through when you can open the same file six times and iterate over different parts of it? e.g.

a #is the first 1024 bytes
b #is the next 1024
#etcetc
f #is the last 1024 bytes

With each file handle running in a separate process, we start to cook on gas. Just remember to deal with the line endings properly.
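
Here is a hedged sketch of that idea using multiprocessing; the filename, worker count, and process_line() helper are invented for illustration. Each worker (except the first) backs up one byte and discards the partial line it landed in, so every line is handled by exactly one process:

import os
from multiprocessing import Pool

FILENAME = "bigfile.txt"  # placeholder path
NUM_WORKERS = 6

def process_line(line):
    pass  # line is a bytes object; decode() it if you need str

def worker(bounds):
    start, end = bounds
    with open(FILENAME, "rb") as f:
        if start != 0:
            # Back up one byte and throw away the rest of the line we
            # landed in; the previous worker finishes that line.
            f.seek(start - 1)
            f.readline()
        # Handle every line that *starts* inside [start, end).
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            process_line(line)

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // NUM_WORKERS
    bounds = [(i * step, (i + 1) * step) for i in range(NUM_WORKERS - 1)]
    bounds.append(((NUM_WORKERS - 1) * step, size))  # last worker takes the remainder
    Pool(NUM_WORKERS).map(worker, bounds)

Opening in binary mode keeps seek() and tell() working in plain byte offsets, which is what os.path.getsize() reports.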

Comments
