save will almost always be hugely faster than savetxt. It just dumps the raw bytes, without having to format them as text. It also writes smaller files, which means less I/O. And you'll get equal benefits at load time: less I/O, and no text parsing.
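For example, the basic round trip looks like this (a minimal sketch; the filename is just a placeholder):

import numpy as np

arr = np.random.rand(1000000, 12)
np.save('testfile.npy', arr)      # raw binary dump of the array's bytes
arr2 = np.load('testfile.npy')    # read back with no text parsing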
Everything else below is basically a variant on top of the benefits of save. And if you look at the times at the end, all of them are within an order of magnitude of each other, but all around two orders of magnitude faster than savetxt. So, you may just be happy with the 200:1 speedup and not care about trying to tweak things any further. But, if you do need to optimize further, read on.
savez_compressed saves the array with DEFLATE compression. This means you waste a bunch of CPU, but save some I/O. If it's a slow disk that's slowing you down, that's a win. Note that with smallish arrays, the constant overhead will probably hurt more than the compression speedup will help, and if you have a random array there's little to no compression possible.
savez_compressed is also a multi-array save. That may seem unnecessary here, but if you chunk a huge array into, say, 20 smaller ones, this can sometimes go significantly faster (though I'm not sure why). The cost is that if you just load up the .npz and stack the arrays back together, you don't get contiguous storage, so if that matters, you have to write more complicated code.
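If you want to try the chunked version, it looks roughly like this (the 20-way split is arbitrary and something to tune on your own data; arr_0, arr_1, … are just how savez_compressed names positional arguments):

# split the big array into 20 pieces and save them all into one compressed .npz
chunks = np.array_split(test_data, 20)
np.savez_compressed('testfile.npz', *chunks)

# loading gives you the pieces back; stitching them together again costs a copy
with np.load('testfile.npz') as npz:
    restored = np.concatenate([npz['arr_%d' % i] for i in range(20)])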
Notice that my test below uses a random array, so the compression is just wasted overhead. But testing against zeros or arange would be just as misleading in the opposite direction, so… this is something to test on your real data.
Also, I'm on a computer with a pretty fast SSD, so the tradeoff between CPU and I/O may not be as imbalanced as on whatever machine you're running on.
numpy.memmap, or an array allocated into a stdlib mmap.mmap, is backed to disk with a write-through cache. This shouldn't reduce the total I/O time, but it means that the I/O doesn't happen all at once at the end, but is instead spread around throughout your computation—which often means it can happen in parallel with your heavy CPU work. So, instead of spending 50 minutes calculating and then 10 minutes saving, you spend 55 minutes calculating-and-saving.
This one is hard to test in any sensible way with a program that isn't actually doing any computation, so I didn't bother.
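For reference, the pattern looks something like this (a sketch only; the shape, dtype, and filename are placeholders, and the random fill stands in for your real computation):

# create a disk-backed array; writes get flushed out to the file as you go
mm = np.memmap('testfile.dat', dtype=np.float64, mode='w+', shape=(1000000, 12))
mm[:] = np.random.rand(1000000, 12)   # stand-in for the real computation
mm.flush()                            # make sure everything has hit the disk
del mm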
pickle or one of its alternatives like dill or cloudpickle. There's really no good reason a pickle should be faster than a raw array dump, but occasionally it seems to be.
For a simple contiguous array like the one in my tests, the pickle is just a small wrapper around the exact same bytes as the binary dump, so it's just pure overhead.
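Roughly, a pickle version looks like this (the filename and the HIGHEST_PROTOCOL choice are my assumptions; dill and cloudpickle can be swapped in as drop-in replacements for the pickle module):

import pickle

with open('testfile.pickle', 'wb') as f:
    pickle.dump(test_data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open('testfile.pickle', 'rb') as f:
    loaded = pickle.load(f)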
For comparison, here's how I'm testing each one:
In [70]: test_data = np.random.rand(1000000,12)
In [71]: %timeit np.savetxt('testfile', test_data)
9.95 s ± 222 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [72]: os.stat('testfile').st_size
Out[72]: 300000000
Notice the use of %timeit there. If you're not using IPython, use the timeit module in the stdlib to do the same thing a little more verbosely. Testing with time has all kinds of problems (as described in the timeit docs), but the biggest is that you're only doing a single rep. And for I/O-based benchmarks, that's especially bad.
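Roughly the same measurement with the stdlib module would look like this (the repeat/number values here are just a guess at something reasonable for an I/O benchmark):

import timeit

times = timeit.repeat(lambda: np.savetxt('testfile', test_data), repeat=7, number=1)
print('best: %.3fs, mean: %.3fs' % (min(times), sum(times) / len(times)))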
Here are the results for each, but given the caveats above, you should really only consider the first two meaningful.
savetxt: 9.95 s, 300 MB
save: 45.8 ms, 96 MB
savez_compressed: 360 ms, 90 MB
pickle: 287 ms, 96 MB