
I'm curious how to work with large files in Python.

For example, I have a dataset of ~20 GB on my hard drive (just an array of numbers) and I want to sort it to get the k smallest values. The dataset can't be loaded into memory (RAM).

I think the algorithm should be: load the dataset in n chunks, find the k smallest values in each chunk, keep those k values in memory, and repeat for every chunk. That gives k*n values, which can then be sorted to get the overall k smallest values.

But the questions are: how should I store the dataset (what format?), what is the fastest way to load it from disk (what chunk size should I choose for particular hardware?), and could this be done with several threads?
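A minimal sketch of the chunked approach described above, assuming the data is stored as a flat binary file of int32 values (e.g. as written by `numpy.ndarray.tofile`); the file name `data.bin` and the chunk size are placeholders to tune for your hardware:

```python
import heapq
import numpy as np

def k_smallest_chunked(path, k, chunk_size=10_000_000):
    """Scan a raw int32 binary file in chunks, keeping the k smallest
    values seen so far (assumes the file is just packed int32 numbers)."""
    best = []  # max-heap of the k smallest values seen so far (stored negated)
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            # k smallest within this chunk via a partial sort
            local = np.partition(chunk, min(k, chunk.size) - 1)[:k]
            for v in local:
                v = int(v)
                if len(best) < k:
                    heapq.heappush(best, -v)
                elif v < -best[0]:
                    heapq.heapreplace(best, -v)
    return sorted(-v for v in best)

# Example: the 1000 smallest values in a ~20 GB file of int32 numbers
# print(k_smallest_chunked("data.bin", k=1000))
```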

  • What kinds of limits are there on these numbers? Is there some upper/lower bound? This will affect the sorts of approaches you might use. Commented Apr 9, 2014 at 13:46
  • 1
    Have you read neopythonic.blogspot.de/2008/10/… ? Can you provide more details about your specific task, dataset, etc.? Commented Apr 9, 2014 at 13:48
  • 6
    You should use heapq.nsmallest() on a lazy iterator over your data. No need to get the k smallest values for each chunk -- you can get your result on the fly in a single pass (see the sketch after these comments). Commented Apr 9, 2014 at 13:52
  • For the more general questions, use PyTables. Commented Apr 9, 2014 at 14:13
  • @DonaghHatton I use int32 numbers. Commented Apr 10, 2014 at 5:50
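A sketch of the single-pass heapq.nsmallest() idea from the comment above, again assuming a raw int32 binary file; the file name and chunk size are illustrative placeholders:

```python
import heapq
import numpy as np

def iter_int32(path, chunk_size=10_000_000):
    """Lazily yield int32 values from a raw binary file, chunk by chunk."""
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            yield from chunk.tolist()

# k smallest values in one pass, without keeping per-chunk results
# smallest = heapq.nsmallest(1000, iter_int32("data.bin"))
```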

1 Answer


What you need is an external sort. Loading everything into memory and sorting it there is called an internal sort; when the data doesn't fit in memory, databases use an external sort for this kind of task.
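A rough sketch of an external merge sort under the same assumption of a raw int32 file (names and sizes are illustrative): sort each chunk in memory, write the sorted runs to temporary files, then merge the runs lazily with heapq.merge.

```python
import heapq
import tempfile
import numpy as np

def external_sort(path, chunk_size=10_000_000):
    """Sort a raw int32 file that does not fit in RAM.
    Returns a lazy iterator over the values in ascending order."""
    runs = []
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=np.int32, count=chunk_size)
            if chunk.size == 0:
                break
            chunk.sort()             # in-memory sort of one chunk
            run = tempfile.TemporaryFile()
            chunk.tofile(run)        # write the sorted run to disk
            run.seek(0)
            runs.append(run)

    def read_run(run, buf=100_000):
        # buf * number_of_runs values are resident at once during the merge,
        # so keep this buffer small relative to available RAM
        while True:
            part = np.fromfile(run, dtype=np.int32, count=buf)
            if part.size == 0:
                break
            yield from part.tolist()

    # k-way merge of the sorted runs
    return heapq.merge(*(read_run(r) for r in runs))

# e.g. the 1000 smallest values of the fully sorted stream:
# import itertools
# print(list(itertools.islice(external_sort("data.bin"), 1000)))
```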

Maybe the following resources would help you.
