
I'm curious how to work with large files in Python.

For example, I have a dataset of ~20 GB on my hard drive (just an array of numbers) and I want to sort it to get the k smallest values. The dataset can't be loaded into memory (RAM).

I think the algorithm should be: load the dataset in n chunks, find the k smallest values in each chunk, keep those k values in memory, and repeat for every chunk. That gives k*n values, which can then be sorted to get the overall k smallest values.

But the questions are: how should I store the dataset (what format?), what is the fastest way to load it from disk (what chunk size should I choose for particular hardware?), and could this be done with several threads?
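A minimal sketch of the chunked approach described above, assuming the data is stored as a flat binary file of int32 values (e.g. as written by `numpy.ndarray.tofile`); the file name `data.bin` and the chunk size are placeholders to tune for your hardware:

```python
import heapq
import numpy as np

def k_smallest_chunked(path, k, chunk_size=10_000_000):
    """Scan a raw int32 binary file in chunks, keeping the k smallest
    values seen so far (assumes the file is just packed int32 numbers)."""
    best = []  # max-heap of the k smallest values seen so far (stored negated)
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            # k smallest within this chunk via a partial sort
            local = np.partition(chunk, min(k, chunk.size) - 1)[:k]
            for v in local:
                v = int(v)
                if len(best) < k:
                    heapq.heappush(best, -v)
                elif v < -best[0]:
                    heapq.heapreplace(best, -v)
    return sorted(-v for v in best)

# Example: the 1000 smallest values in a ~20 GB file of int32 numbers
# print(k_smallest_chunked("data.bin", k=1000))
```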

  • What kinds of limits are there on these numbers? Is there some upper/lower bound? This will affect the sorts of approaches you might use. Commented Apr 9, 2014 at 13:46
  • 1
    Have you read neopythonic.blogspot.de/2008/10/… ? Can you provide more details about your specific task, dataset, etc.? Commented Apr 9, 2014 at 13:48
  • 6
    You should use heapq.nsmallest() on a lazy iterator over your data. No need to get the k smallest values for each chunk -- you can get your result on the fly in a single pass (see the sketch after these comments). Commented Apr 9, 2014 at 13:52
  • For the more general questions, use PyTables. Commented Apr 9, 2014 at 14:13
  • @DonaghHatton I use int32 numbers. Commented Apr 10, 2014 at 5:50
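A sketch of the single-pass heapq.nsmallest() idea from the comment above, again assuming a raw int32 binary file; the file name and chunk size are illustrative placeholders:

```python
import heapq
import numpy as np

def iter_int32(path, chunk_size=10_000_000):
    """Lazily yield int32 values from a raw binary file, chunk by chunk."""
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            yield from chunk.tolist()

# k smallest values in one pass, without keeping per-chunk results
# smallest = heapq.nsmallest(1000, iter_int32("data.bin"))
```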

1 Answer


What you need is an external sort. Loading everything into memory and sorting it there is called an internal sort; when the data doesn't fit in memory, databases use an external sort for this kind of task.
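A rough sketch of an external merge sort under the same assumption of a raw int32 file (names and sizes are illustrative): sort each chunk in memory, write the sorted runs to temporary files, then merge the runs lazily with heapq.merge.

```python
import heapq
import tempfile
import numpy as np

def external_sort(path, chunk_size=10_000_000):
    """Sort a raw int32 file that does not fit in RAM.
    Returns a lazy iterator over the values in ascending order."""
    runs = []
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            chunk.sort()             # in-memory sort of one chunk
            run = tempfile.TemporaryFile()
            chunk.tofile(run)        # write the sorted run to disk
            run.seek(0)
            runs.append(run)

    def read_run(run, buf=100_000):
        # buf * number_of_runs values are resident at once during the merge,
        # so keep this buffer small relative to available RAM
        while True:
            part = np.fromfile(run, dtype=np.int32, count=buf)
            if part.size == 0:
                break
            yield from part.tolist()

    # k-way merge of the sorted runs
    return heapq.merge(*(read_run(r) for r in runs))

# e.g. the 1000 smallest values of the fully sorted stream:
# import itertools
# print(list(itertools.islice(external_sort("data.bin"), 1000)))
```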

Maybe the following resources would help you.
