
The basic requirement is that I need to process 4 GB text files on a per-line basis.

Using .readline() or for line in f is great for memory, but the I/O takes ages. I would like to use something like yield, but that (I think) will chop lines.

POSSIBLE ANSWER:

file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines thus read. If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. Objects implementing a file-like interface may choose to ignore sizehint if it cannot be implemented, or cannot be implemented efficiently.

Didn't realize you could do this!
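
For what it's worth, here is a minimal sketch of how that might be used in a loop; the filename and the 1 MB size hint are just placeholders:

with open("bigfile.txt") as f:
    while True:
        # readlines(sizehint) returns a batch of whole lines totalling
        # roughly sizehint bytes, so lines are never split across batches
        lines = f.readlines(1024 * 1024)
        if not lines:
            break  # end of file
        for line in lines:
            pass  # process the line here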

  • Are you parsing log files? Then don't. There are libraries available that will do it better. Commented Jun 29, 2011 at 10:41
  • nananananananananananananana iterators! (batman theme btw) Commented Jun 29, 2011 at 10:58
  • I'm looking through big setup files and appending lines. I'm trying to get readlines() to work, but it's proving difficult; it doesn't appear to move on to the next chunk as requested. Commented Jun 29, 2011 at 11:16

3 Answers


You can just iterate over the file object:

with open("filename") as f:
    for line in f:
        pass  # do whatever you need with the line

This will do some internal buffering to improve the performance. (Note that file.readline() will perform considerably worse because it does not buffer -- that's why you can't mix iteration over a file object with file.readline().)


2 Comments

That's what I meant by using .readline(); doing it like this is fine on memory, but it takes ages.
@jdborg: file.readline() behaves quite differently to iterating over the file. Iterating will do the buffering for you and should not impose a performance bottleneck.

If you want to do something on a per-line basis you can just loop over the file object:

f = open("w00t.txt")
for line in f:
    pass  # do stuff with the line

However, doing stuff on a per-line basis can be an actual performance bottleneck, so perhaps you should use a larger chunk size? What you can do is, for example, read 4096 bytes, find the last line ending \n, process that part, and prepend whatever is left over to the next chunk.
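
A rough sketch of that chunk-and-carry idea, assuming a placeholder process() function, a made-up filename, and a 4096-byte chunk size:

def process(line):
    pass  # placeholder for whatever per-line work you need

with open("bigfile.txt") as f:
    leftover = ""
    while True:
        chunk = f.read(4096)
        if not chunk:
            break
        chunk = leftover + chunk
        lines = chunk.split("\n")
        # the last element is an incomplete line (or "" if the chunk
        # ended exactly on a newline); carry it over to the next chunk
        leftover = lines.pop()
        for line in lines:
            process(line)
    if leftover:
        process(leftover)  # final line with no trailing newline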

2 Comments

That's what I meant by using .readline(); doing it like this is fine on memory, but it takes ages.
@jdborg: Read the second part of my answer.

You could always chunk the lines up? I mean, why open one file and iterate all the way through when you can open the same file six times and iterate over different parts of it? e.g.

a #is the first 1024 bytes
b #is the next 1024
#etcetc
f #is the last 1024 bytes

With each file handle running in a separate process, we start to cook on gas. Just remember to deal with the line endings properly.
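
Here is a hedged sketch of that idea using multiprocessing; the filename, worker count, and process_line() helper are invented for illustration. Each worker (except the first) backs up one byte and discards the partial line it landed in, so every line is handled by exactly one process:

import os
from multiprocessing import Pool

FILENAME = "bigfile.txt"  # placeholder path
NUM_WORKERS = 6

def process_line(line):
    pass  # line is a bytes object; decode() it if you need str

def worker(bounds):
    start, end = bounds
    with open(FILENAME, "rb") as f:
        if start != 0:
            # Back up one byte and throw away the rest of the line we
            # landed in; the previous worker finishes that line.
            f.seek(start - 1)
            f.readline()
        # Handle every line that *starts* inside [start, end).
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            process_line(line)

if __name__ == "__main__":
    size = os.path.getsize(FILENAME)
    step = size // NUM_WORKERS
    bounds = [(i * step, (i + 1) * step) for i in range(NUM_WORKERS - 1)]
    bounds.append(((NUM_WORKERS - 1) * step, size))  # last worker takes the remainder
    Pool(NUM_WORKERS).map(worker, bounds)

Opening in binary mode keeps seek() and tell() working in plain byte offsets, which is what os.path.getsize() reports.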

Comments
