Saving and loading a custom dtype to/from a text file with numpy

Question

I have defined my own data type and want to give it save/load functions that operate on text files, but I cannot find a proper fmt string for numpy.savetxt(). The problem arises because one of the fields of my dtype is a subarray (two floats in the naive example below), which I think effectively results in an attempt to save a 3D object with savetxt().

I can make it work only by saving the floats with "%s" (but then I cannot loadtxt() them; variant 1 in the code) or by introducing an inefficient my_repr() function (variant 2 below).

I cannot believe that numpy does not provide an efficient formatter/save/load API for custom types. Does anyone have an idea for solving this nicely?

import numpy as np


def main():
    my_type = np.dtype([('single_int', np.int64),
                        ('two_floats', np.float64, (2,))])
    my_var = np.array(  [(1, (2., 3.)), 
                         (4, (5., 6.))
                        ],  
                        dtype=my_type)
    # Verification
    print(my_var)
    print(my_var['two_floats'])
    # Let's try to save and load it in three variants
    variant = 2
    if variant == 0:
        # the line below would not work: "ValueError: fmt has wrong number of % formats:  %d %f %f"
        np.savetxt('f.txt', my_var, fmt='%d %f %f')
        # so I don't even try to load
    elif variant == 1:
        # The line below does work, but saves floats between '[]' which makes them not loadable later
        np.savetxt('f.txt', my_var, fmt='%d %s')
        # lines such as "1 [2. 3.]" won't load, the line below raises an Exception
        my_var_loaded = np.loadtxt('f.txt', dtype=my_type)
    elif variant == 2:
        # An ugly workaround:
        def my_repr(o):
            return [(elem['single_int'], *elem['two_floats']) for elem in o]
        # and then the rest works fine:
        np.savetxt('f.txt', my_repr(my_var), fmt='%d %f %f')
        my_var_loaded = np.loadtxt('f.txt', dtype=my_type)
        print('my_var_loaded')
        print(my_var_loaded)

if __name__ == '__main__':
    main()

savetxt just does fmt % tuple(row). In other words, it is simple Python %-style formatting.
– hpaulj, Commented Apr 9, 2020 at 5:20
What do you want the csv to look like? It is fundamentally a flat 2D format. genfromtxt can transform that into a complex structured dtype.
– hpaulj, Commented Apr 9, 2020 at 5:21
@hpaulj I want to achieve what is shown in the working variant 2 above, but without using the inefficient my_repr(). If I save as in variant 1, I can then load the data with a specific loader, but I lose the formatting functionality (such as precision) at the saving stage, because I save a bunch of floats as one string.
– Maciek, Commented Apr 9, 2020 at 5:28
One more comment: it is not precise to say that np.savetxt() just does fmt % tuple(row), because before that it calls asarray(). If I pass a list, all column types are converted to one and the same type, which invalidates the fmt string I was trying to use.
– Maciek, Commented Apr 9, 2020 at 15:41
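To make the two comments above concrete, here is a minimal sketch of what the formatting step sees. It assumes the dtype from the question (with np.int64 standing in for the integer field); the variable names n_items, format_error, savetxt_error, and flat are only illustrative.

```python
import io
import numpy as np

my_type = np.dtype([('single_int', np.int64),
                    ('two_floats', np.float64, (2,))])
my_var = np.array([(1, (2., 3.)), (4, (5., 6.))], dtype=my_type)

# A record has two fields, so tuple(row) has two items:
# an integer and a 2-element array, not three scalars.
row = tuple(my_var[0])
n_items = len(row)

# Plain % formatting with three specifiers therefore fails.
format_error = None
try:
    '%d %f %f' % row
except TypeError as exc:
    format_error = exc

# savetxt rejects the same fmt before it writes anything.
savetxt_error = None
try:
    np.savetxt(io.StringIO(), my_var, fmt='%d %f %f')
except ValueError as exc:
    savetxt_error = exc

# A list of mixed int/float tuples becomes a single float64 2D array.
flat = np.asarray([(1, 2.0, 3.0), (4, 5.0, 6.0)])
print(n_items, flat.dtype, flat.shape)
```

The last lines illustrate the asarray() point: once the rows are passed as a plain list, the integer column is stored as float64 like the others.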

hpaulj · Accepted Answer · 2020-04-09 16:19:47Z

In [115]: my_type = np.dtype([('single_int', np.int64),
     ...:                         ('two_floats', np.float64, (2,))])                                   
In [116]: my_var = np.array(  [(1, (2., 3.)),  
     ...:                          (4, (5., 6.)) 
     ...:                         ],   
     ...:                         dtype=my_type)                                                       
In [117]: my_var                                                                                       
Out[117]: 
array([(1, [2., 3.]), (4, [5., 6.])],
      dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])

Jumping straight to the loading step:

In [118]: txt = """1 2. 3. 
     ...: 4 5. 6."""                                                                                   
In [119]: np.genfromtxt(txt.splitlines(), dtype=my_type)                                               
Out[119]: 
array([(1, [2., 3.]), (4, [5., 6.])],
      dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])

As I commented, savetxt is essentially doing:

for row in my_var:
    f.write(fmt % tuple(row))

So we have to, in one way or another, work around or with the basic Python % formatting. Either that, or write our own text file. There's nothing magical about savetxt. It's plain Python.
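As a rough sketch of the "write our own text file" option, the loop below flattens each record by hand and applies the % format itself. The helper name save_records and the temporary file location are assumptions made for illustration; the dtype is the one from the question with np.int64 for the integer field.

```python
import os
import tempfile
import numpy as np

my_type = np.dtype([('single_int', np.int64),
                    ('two_floats', np.float64, (2,))])
my_var = np.array([(1, (2., 3.)), (4, (5., 6.))], dtype=my_type)


def save_records(path, records, fmt='%d %f %f'):
    """Hypothetical helper: write one text line per record."""
    with open(path, 'w') as f:
        for rec in records:
            # Flatten by hand: the scalar field, then the subarray items.
            values = (rec['single_int'], *rec['two_floats'])
            f.write(fmt % values + '\n')


path = os.path.join(tempfile.mkdtemp(), 'f.txt')
save_records(path, my_var, fmt='%d %.2f %.2f')

with open(path) as f:
    text = f.read()

# The flat text file loads back into the structured dtype.
loaded = np.genfromtxt(path, dtype=my_type)
print(text)
print(loaded)
```

This keeps full control over precision through fmt, at the cost of a Python-level loop over the records.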

===

Recent numpy versions (1.16 and later) include a function to 'flatten' a structured array:

In [120]: import numpy.lib.recfunctions as rf                                                          
In [121]: arr = rf.structured_to_unstructured(my_var)                                                  
In [122]: arr                                                                                          
Out[122]: 
array([[1., 2., 3.],
       [4., 5., 6.]])
In [123]: np.savetxt('test.csv', arr, fmt='%d %f %f')                                                  
In [124]: cat test.csv                                                                                 
1 2.000000 3.000000
4 5.000000 6.000000
In [125]: np.genfromtxt('test.csv', dtype=my_type)                                                     
Out[125]: 
array([(1, [2., 3.]), (4, [5., 6.])],
      dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
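The same round trip can also be sketched without touching the disk, which makes it easy to see that the flattened array is plain float64 and that the precision is still controlled by fmt. This is only an illustration under the question's dtype (np.int64 assumed for the integer field); it uses rf.unstructured_to_structured as the inverse step instead of passing the dtype to the loader.

```python
import io
import numpy as np
import numpy.lib.recfunctions as rf

my_type = np.dtype([('single_int', np.int64),
                    ('two_floats', np.float64, (2,))])
my_var = np.array([(1, (2., 3.)), (4, (5., 6.))], dtype=my_type)

# Flatten: the integer field is promoted to float64 with the others.
flat = rf.structured_to_unstructured(my_var)

# Save to an in-memory text buffer with an explicit precision.
buf = io.StringIO()
np.savetxt(buf, flat, fmt='%d %.3f %.3f')
saved_text = buf.getvalue()

# Load the flat 2D table and rebuild the structured array.
buf.seek(0)
loaded = np.loadtxt(buf)
restored = rf.unstructured_to_structured(loaded, dtype=my_type)

print(saved_text)
print(restored)
```

Because the values pass through float64, very large integers in single_int could lose precision on the way; for small integers like the ones here the round trip is exact.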

edit

Saving an object dtype array gets around a lot of the formatting issues:

In [182]: my_var                                                                                       
Out[182]: 
array([(1, [2., 3.]), (4, [5., 6.])],
      dtype=[('single_int', '<i8'), ('two_floats', '<f8', (2,))])
In [183]: def my_repr(o): 
     ...:             return [(elem['single_int'], *elem['two_floats']) for elem in o] 
     ...:                                                                                              
In [184]: my_repr(my_var)                                                                              
Out[184]: [(1, 2.0, 3.0), (4, 5.0, 6.0)]
In [185]: np.array(_,object)                                                                           
Out[185]: 
array([[1, 2.0, 3.0],
       [4, 5.0, 6.0]], dtype=object)
In [186]: np.savetxt('f.txt', _, fmt='%d %f %f')                                                       
In [187]: cat f.txt                                                                                    
1 2.000000 3.000000
4 5.000000 6.000000

Thank you. I knew most of what you wrote, but not rf.structured_to_unstructured(). It is a replacement for my_repr() in variant 2 of the question. I +1 your answer instead of accepting it: I timed both functions, and the custom list comprehension in my_repr() is an order of magnitude faster than the numpy function (for the data type from my example). I don't understand that (I was expecting the opposite), but I see no point in using the slower one, so my_repr() is so far the best solution, although it probably consumes more memory. Perhaps numpy does many internal checks in structured_to_unstructured().
Great, I didn't know it would fix the problem, that was the last element of the puzzle ;-)
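For anyone who wants to repeat the timing comparison mentioned in the comment, a minimal sketch with timeit follows. The array size n and the repeat count are arbitrary assumptions, and the outcome is likely to depend on the array length and the numpy version, so no result is claimed here.

```python
import timeit
import numpy as np
import numpy.lib.recfunctions as rf

my_type = np.dtype([('single_int', np.int64),
                    ('two_floats', np.float64, (2,))])

# Build a larger test array; n is an arbitrary choice for this sketch.
n = 1000
big = np.zeros(n, dtype=my_type)
big['single_int'] = np.arange(n)
big['two_floats'] = np.arange(2 * n, dtype=float).reshape(n, 2)


def my_repr(o):
    # The list comprehension from variant 2 of the question.
    return [(elem['single_int'], *elem['two_floats']) for elem in o]


t_list = timeit.timeit(lambda: my_repr(big), number=10)
t_numpy = timeit.timeit(lambda: rf.structured_to_unstructured(big), number=10)

print(f'my_repr:                    {t_list:.4f} s')
print(f'structured_to_unstructured: {t_numpy:.4f} s')
```

Trying a few values of n, including the two-record array from the question, shows whether the observed difference is a fixed overhead or grows with the data.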
