Speeding up loops over a Numpy array

Question

In my code I have for loop that indexes over a multidimensional numpy array and does some operation using the sub-array that is obtained at each iteration. It looks like this

for sub in Arr:
  #do stuff using sub

Now the stuff that is done using sub is fully vectorized, so it should be efficient. On the other hand this loop iterates about ~10^5 times and is the bottleneck. Do you think I will get an improvement by offloading this part to C. I am somewhat reluctant to do so because the do stuff using sub uses broadcasting, slicing, smart-indexing tricks that would be tedious to write in plain C. I would also welcome thoughts and suggestions about how to deal with broadcasting, slicing, smart-indexing when offloading computation to C.

I am not sure, but it seems to me that you are using an optimized processes for what it was intended for. I don't see a reason not to use it... — soandos
– soandos, Commented May 12, 2011 at 3:27
@soandos Could you elaborate. I did not quite understand you, especially what the 'it' in your comment refers to. — san
– san, Commented May 12, 2011 at 3:29
@san: for * in Arr was designed to be very good at going though the array. It would seem then, that one should use it. I do not know how much of it difference it could make though, honestly, as 10^5 times is not all that much (on a standard 1.66 ghz computer, one can go through well over 330 million interations a second through a loop) and so a better question would be: how can you speed up your "do stuff" — soandos
– soandos, Commented May 12, 2011 at 3:36
Have you played with Cython? It won't fix your algorithm, but it will strip out a lot of the python overhead. — matt
– matt, Commented May 12, 2011 at 4:53
@san - maybe you even don't have to iterate over the array. If you show us more details from your do stuff, we can find a better solution without a loop. — eumiro
– eumiro, Commented May 12, 2011 at 6:07

JoshAdel · Accepted Answer · 2011-05-12 12:35:00Z

If you can't 'vectorize' the entire operation and looping is indeed the bottleneck, then I highly recommend using Cython. I've been dabbling around with it recently and it is straightforward to work with and has a decent interface with numpy. For something like a langevin integrator I saw a 115x speedup over a decent implementation in numpy. See the documentation here:

http://docs.cython.org/src/tutorial/numpy.html

and I also recommend looking at the following paper

You may see satisfactory speedups by just typing the input array and the loop counter, but if you want to leverage the full potential of cython, then you are going to have to hardcode the equivalent broadcasting.

srean · Accepted Answer · 2011-05-18 21:59:44Z

San you can take a look at scipy.weave. You can use scipy.weave.blitz to transparently translate your expression into C++ code and run it. It will handle slicing automatically and get rid of temporaries, but you claim that the body of your for loop does not create temporaries so your milage may vary.

However if you want to replace your entire for loop with something more efficient then you could make use of scipy.inline. The drawback is that you have to write C++ code. This should not be too hard because you can use Blitz++ syntax which is very close to numpy array expressions. Slicing is directly supported, broadcasting however is not.

There are two work arounds:

is to use the numpy-C api and use multi-dimensional iterators. They transparently handle broadcasting. However you are invoking the Numpy runtime so there might be some overhead. The other option, and possibly the simpler option is to use the usual matrix notation for broadcasting. Broadcast operations can be written as outer-products with vector of all ones. The good thing is that Blitz++ will not actually create this temporary broadcasted arrays in memory, it will figure out how to wrap it into an equivalent loop.
For the second option take a look at http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC88 for index place holders. As long as your matrix has less than 11 dimensions you are fine. This link shows how they can be used to form outer products http://www.oonumerics.org/blitz/docs/blitz_3.html#SEC99 (search for outer products to go to the relevant part of the document).

riza · Accepted Answer · 2011-05-17 09:27:59Z

1

Besides using Cython, you can write the bottle-neck part(s) in Fortran. Then use f2py to compile it to Python .pyd file.

answered May 17, 2011 at 9:27

riza

17.3k7 gold badges31 silver badges29 bronze badges

Collectives™ on Stack Overflow

Speeding up loops over a Numpy array

3 Answers 3

Comments

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

Comments

Linked

Related