
I am trying to read a CSV file I created earlier in Python using

import csv

with open(csvname, 'w') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',')
    csvwriter.writerows(data)

data is a random matrix with about 30k * 30k entries in np.float32 format, about 10 GB of file size in total.

When I read the file back in using this function (I already know the size of my matrix, and np.genfromtxt is incredibly slow and would need about 100 GB of RAM at this point)

import time
import numpy as np

def read_large_txt(path, delimiter=',', dtype=np.float32, nrows=0):
    t1 = time.time()
    with open(path, 'r') as f:
        out = np.empty((nrows, nrows), dtype=dtype)
        for (ii, line) in enumerate(f):
            if ii % 2 == 0:                      # every other line in the file is blank
                out[int(ii / 2)] = line.split(delimiter)
    print('Reading %s took %.3f s' % (path, time.time() - t1))
    return out

it takes me about 10 minutes to read that file. The hard drive I am using should be able to read at about 100 MB/s, which would put the reading time at about 1-2 minutes.
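To narrow down where the time goes, here is a rough sketch (not from the original post) that times the raw disk read separately from the text-to-float conversion; csvname is assumed to be the file written above:

import itertools
import time
import numpy as np

# Hypothetical check: time the raw read and the parsing separately.
t0 = time.time()
with open(csvname, 'rb') as f:
    while f.read(64 * 1024 * 1024):   # read in 64 MB chunks, no parsing
        pass
print('raw read: %.1f s' % (time.time() - t0))

t0 = time.time()
with open(csvname, 'r') as f:
    # every other line in this file is blank, so 2000 lines ~ 1000 data rows
    rows = [np.asarray(line.split(','), dtype=np.float32)
            for line in itertools.islice(f, 2000) if line.strip()]
print('parsing ~1000 rows: %.3f s' % (time.time() - t0))

If the second number, scaled up to 30k rows, dominates the first, the bottleneck is the float parsing rather than the hard drive.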

Any ideas what I may be doing wrong?

Related: why numpy narray read from file consumes so much memory? That's where the function read_large_txt is from.

  • Maybe I should add that I am using if ii%2 == 0: because otherwise I'd be trying to pass empty lines to the output matrix. Commented Apr 17, 2018 at 9:31
  • Do you have enough RAM? Commented Apr 17, 2018 at 9:34
  • Separate the array initialization from the reading time to be sure it is really related to the file size. If nrows is big, it may use swap. Commented Apr 17, 2018 at 9:34
  • Yep, I've got 120 GB of RAM. It does read the entire file; I was just wondering if there is a way to do that faster. Commented Apr 17, 2018 at 9:35
  • Maybe split has a bad implementation like in Java (causing a lot of string allocations); try with csv.reader (a sketch of that follows these comments). Commented Apr 17, 2018 at 9:36
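Picking up the last comment's suggestion, here is a sketch (not the original code) of the same reader built on csv.reader instead of str.split; whether it is actually faster depends on the Python and NumPy versions, so treat it as an experiment:

import csv
import time
import numpy as np

def read_large_csv(path, dtype=np.float32, nrows=0):
    # Variant of read_large_txt that lets csv.reader do the splitting.
    t1 = time.time()
    out = np.empty((nrows, nrows), dtype=dtype)
    with open(path, 'r', newline='') as f:
        row = 0
        for record in csv.reader(f, delimiter=','):
            if not record:        # skip the blank lines between data rows
                continue
            out[row] = record     # NumPy casts the strings to float32
            row += 1
    print('Reading %s took %.3f s' % (path, time.time() - t1))
    return out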

1 Answer


I found a quite simple solution. Since I am creating the files myself, I don't need to save them as .csv files. It is way (!) faster to load them as .npy files:

Loading (incl. splitting each line by ',') a 30k * 30k matrix stored as .csv takes about 10 minutes. Doing the same with a matrix stored as .npy takes about 10 seconds!

That's why I changed the code I wrote above to:

np.save(npyname, data)

and in the other script to

out = np.load(npyname + '.npy')

Another advantage of this method: (in my case) the .npy files are only about 40% of the size of the .csv files. :)
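For completeness, a self-contained round trip of the approach above; the file name and the random matrix are just placeholders:

import time
import numpy as np

data = np.random.rand(30000, 30000).astype(np.float32)   # ~3.6 GB in memory
npyname = 'matrix_30k'                                    # example name only

t0 = time.time()
np.save(npyname, data)              # writes matrix_30k.npy
print('np.save took %.1f s' % (time.time() - t0))

t0 = time.time()
out = np.load(npyname + '.npy')     # binary load, no text parsing
print('np.load took %.1f s' % (time.time() - t0))

assert out.shape == (30000, 30000) and out.dtype == np.float32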


Comments

I'd advise you to also look up HDF5 if you have larger data. It has native compression and, on top of that, is accessible across languages (a minimal sketch follows these comments).
@jpp Does HDF5 increase the loading speed further?
For large data (say 1 GB+), if you choose the correct options, in my experience yes.
Might be worth a try then. My files are 4 GB+ as .npy and 10 GB+ as .csv.
@jpp I checked it out. To be honest, I was not able to increase the loading speed. Probably I am already at the machine's limit (4 GB in 4-6 s is already about 1 GB/s). Anyway, using gzip compression (compression_opts=9) I could decrease the file size a little bit (3.7 GB instead of 4 GB), but my data are quite random, so that might work out better with other data structures. With that compression, though, writing to the hard drive takes much longer. For now I will probably keep using .npy files.
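Following up on the HDF5 discussion above, here is a minimal h5py sketch with the gzip compression mentioned in the last comment; the file and dataset names are examples, not from the post:

import h5py
import numpy as np

def save_hdf5(path, data):
    # One dataset per file, gzip-compressed (compression_opts=9 is the maximum).
    with h5py.File(path, 'w') as f:
        f.create_dataset('matrix', data=data,
                         compression='gzip', compression_opts=9)

def load_hdf5(path):
    with h5py.File(path, 'r') as f:
        return f['matrix'][:]   # read the whole dataset into memory

# Usage (data as defined in the question):
# save_hdf5('matrix.h5', data)
# out = load_hdf5('matrix.h5')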
