I need to be able to store a numpy array in a dict for caching purposes. Hash speed is important.
The array represents indicies, so while the actual identity of the object is not important, the value is. Mutabliity is not a concern, as I'm only interested in the current value.
What should I hash in order to store it in a dict?
My current approach is to use str(arr.data), which is faster than md5 in my testing.
I've incorporated some examples from the answers to get an idea of relative times:
In [121]: %timeit hash(str(y))
10000 loops, best of 3: 68.7 us per loop
In [122]: %timeit hash(y.tostring())
1000000 loops, best of 3: 383 ns per loop
In [123]: %timeit hash(str(y.data))
1000000 loops, best of 3: 543 ns per loop
In [124]: %timeit y.flags.writeable = False ; hash(y.data)
1000000 loops, best of 3: 1.15 us per loop
In [125]: %timeit hash((b*y).sum())
100000 loops, best of 3: 8.12 us per loop
It would appear that for this particular use case (small arrays of indicies), arr.tostring offers the best performance.
While hashing the read-only buffer is fast on its own, the overhead of setting the writeable flag actually makes it slower.
arr.tostring()does the same and is more aesthetically pleasing. If you have really big arrays you could try stringifying only a small part of the array.tostringalso appears to be orders of magnitude faster for small arrays (although 4× slower for an array of 10000 elements).stronly formats the head and tail of the array.str(arr.data)simply wrong? I used this on different arrays and got the same strings back...!?