I'm curious how to work with large files in Python.
For example, I have a ~20 GB dataset on my hard drive (just an array of numbers), and I want to sort this array to get the k smallest values. The dataset can't be loaded into memory (RAM).
I think the algorithm should be: load the dataset in n chunks, find the k smallest values in each chunk and keep them in memory; after processing every chunk we have k*n values, which we then sort to get the final k smallest values.
But the question is: how should I store the dataset (what format?), what is the fastest way to load it from disk (what chunk size should I choose for particular hardware?), and could this be sped up by using several threads?
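A minimal sketch of the chunked approach I have in mind, assuming the data is stored as a flat binary file of float64 values (the path, K, and CHUNK below are placeholders to tune):

```python
import numpy as np

# Sketch of the chunked idea. Assumes the values are stored as a flat binary
# file of float64 numbers; "data.bin", K and CHUNK are placeholders to tune.
PATH = "data.bin"
K = 100
CHUNK = 10_000_000  # values per read, roughly 80 MB of float64

candidates = []
with open(PATH, "rb") as f:
    while True:
        chunk = np.fromfile(f, dtype=np.float64, count=CHUNK)
        if chunk.size == 0:
            break
        # k smallest of this chunk; np.partition avoids a full sort
        k = min(K, chunk.size)
        candidates.extend(np.partition(chunk, k - 1)[:k].tolist())

# at most k values per chunk remain, so sorting this small list is cheap
result = sorted(candidates)[:K]
```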
Use heapq.nsmallest() on a lazy iterator over your data. There's no need to get the k smallest values for each chunk -- you can get your result on the fly in a single pass.
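A minimal sketch, assuming the numbers are in a flat binary file of float64 values (the path and chunk size are placeholders): the generator keeps only one chunk in memory at a time, while heapq.nsmallest maintains a heap of at most k items.

```python
import heapq
import numpy as np

def values(path, chunk=10_000_000):
    # Lazily yield numbers from a flat binary file, one chunk at a time,
    # so only a single chunk is ever resident in memory.
    with open(path, "rb") as f:
        while True:
            block = np.fromfile(f, dtype=np.float64, count=chunk)
            if block.size == 0:
                return
            yield from block

# single pass over the whole file; nsmallest keeps only the k best seen so far
k_smallest = heapq.nsmallest(100, values("data.bin"))  # "data.bin" is a placeholder path
```

Yielding element by element is the simplest version; if the pure-Python loop turns out to be the bottleneck, you can first reduce each chunk to its k smallest values with NumPy (as in your own sketch) and feed only those candidates to heapq.nsmallest.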