For loop speed with Numpy

Question

I am trying to get this code to run fast in Python, but I am having trouble getting it anywhere near the speed it runs at in MATLAB. The problem seems to be this for loop, which takes about 2 seconds to run when "SRpixels" is approximately 25000.

I can't seem to find any way to trim this down any further, and I am looking for suggestions.

The datatypes for the numpy arrays below are float32 for all except the **_Location[] which are uint32.

for j in range (0,SRpixels):
    #Skip data if outside valid range
    if (abs(SR_pointCloud[j,0]) > SR_xMax or SR_pointCloud[j,2] > SR_zMax or SR_pointCloud[j,2] < 0):
        pass
    else:           
        RIGrid1_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid1Center) / gridSize)
        RIGrid1_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid1LowerBound) / gridSize)

        RIGrid1_Count[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += 1
        RIGrid1_Sum[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid1_SumofSquares[RIGrid1_Location[j,0],RIGrid1_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]

        RIGrid2_Location[j,0] = np.floor(((SR_pointCloud[j,0] + xPosition + 5) - xGrid2Center) / gridSize)
        RIGrid2_Location[j,1] = np.floor(((SR_pointCloud[j,2] + yPosition) - yGrid2LowerBound) / gridSize)

        RIGrid2_Count[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += 1 
        RIGrid2_Sum[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += SR_pointCloud[j,1]
        RIGrid2_SumofSquares[RIGrid2_Location[j,0],RIGrid2_Location[j,1]] += SR_pointCloud[j,1] * SR_pointCloud[j,1]

I did attempt to use Cython, where I replaced j with a cdef int j and compiled. There was no noticeable performance gain. Anyone have suggestions?

If you are going to use Cython, you will need to type more than just the loop index to reap the full benefit. However, it looks like you could vectorize many of the operations over the entire array without needing a for loop at all.
– JoshAdel, Commented Jun 7, 2013 at 20:40
Python array access is not very fast. You could store SR_pointCloud[j,2] in a variable, for example, since you access it several times.
– njzk2, Commented Jun 7, 2013 at 20:42

cge · Accepted Answer · 2013-06-07 21:22:25Z

Vectorization is almost always the best way to speed up numpy code, and much of this seems vectorizable. To start, for example, the location arrays seem quite simple to do:

# these are all of your j values
inds = np.arange(0,SRpixels)

# these are the j values you don't want to skip
sel = np.invert((abs(SR_pointCloud[inds,0]) > SR_xMax) | (SR_pointCloud[inds,2] > SR_zMax) | (SR_pointCloud[inds,2] < 0))

RIGrid1_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid1Center) / gridSize)
RIGrid1_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid1LowerBound) / gridSize)
RIGrid2_Location[sel,0] = np.floor(((SR_pointCloud[sel,0] + xPosition + 5) - xGrid2Center) / gridSize)
RIGrid2_Location[sel,1] = np.floor(((SR_pointCloud[sel,2] + yPosition) - yGrid2LowerBound) / gridSize)

This has no Python loop.

The rest are trickier and will depend upon what you are doing, but should also be vectorizable if you think about them in this way.
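As a rough sketch of how the remaining accumulations might be vectorized: one thing to watch for is that `grid[rows, cols] += values` applies each repeated (row, col) pair only once, so it cannot replace the loop directly. One option, assuming the location indices are already computed and in range, is np.bincount on flattened cell indices (np.add.at is another). The helper name `accumulate_grid` and the sample data are hypothetical, not from the question.

```python
import numpy as np

def accumulate_grid(rows, cols, values, shape):
    """Per-cell count, sum and sum of squares (hypothetical helper)."""
    # Flatten each (row, col) pair into a single cell index.
    flat = np.ravel_multi_index((rows, cols), shape)
    n_cells = shape[0] * shape[1]
    # bincount accumulates repeated indices, unlike grid[rows, cols] += values.
    count = np.bincount(flat, minlength=n_cells).reshape(shape)
    total = np.bincount(flat, weights=values, minlength=n_cells).reshape(shape)
    total_sq = np.bincount(flat, weights=values * values,
                           minlength=n_cells).reshape(shape)
    return count, total, total_sq

# Made-up sample: three points fall in cell (0, 1), one in cell (1, 2).
rows = np.array([0, 0, 1, 0])
cols = np.array([1, 1, 2, 1])
values = np.array([1.0, 2.0, 3.0, 4.0])
count, total, total_sq = accumulate_grid(rows, cols, values, (2, 3))
```

In the question's terms, `rows` and `cols` would be the two columns of a `_Location` array restricted to `sel`, and `values` would be `SR_pointCloud[sel,1]`. The results would then be added to the existing Count, Sum and SumofSquares grids.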

If you really have something that can't be vectorized and must be done with a loop—I've only had this happen a few times—I'd suggest Weave over Cython. It's harder to use, but should give speeds comparable to C.

HYRY · Accepted Answer · 2013-06-07 23:08:32Z

Try vectorizing the calculation first. If you must calculate element by element, here are some speedup hints:

Calculations with NumPy scalars are much slower than with builtin scalars. array[i, j] returns a NumPy scalar, whereas array.item(i, j) returns a builtin scalar.
Functions in the math module are faster than their NumPy counterparts for scalar calculations.

Here is an example:

import numpy as np
import math
a = np.array([[1.1, 2.2, 3.3],[4.4, 5.5, 6.6]])
%timeit np.floor(a[0,0]*2)
%timeit math.floor(a[0,0]*2)
%timeit np.floor(a.item(0,0)*2)
%timeit math.floor(a.item(0,0)*2)

output:

100000 loops, best of 3: 10.2 µs per loop
100000 loops, best of 3: 3.49 µs per loop
100000 loops, best of 3: 6.49 µs per loop
1000000 loops, best of 3: 851 ns per loop

So changing np.floor to math.floor and SR_pointCloud[j,0] to SR_pointCloud.item(j,0) will speed up the loop a lot.
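Applied to a loop like the one in the question, these hints might look roughly like the sketch below. It assumes Python 3, where math.floor returns an int that can be used directly as an index. The function name `bin_points_scalar`, its parameters, and the sample points are hypothetical simplifications, not the question's actual variables.

```python
import math
import numpy as np

def bin_points_scalar(points, x_offset, y_offset, grid_size, shape):
    """Element-by-element binning with builtin scalars (hypothetical helper)."""
    count = np.zeros(shape, dtype=np.int64)
    item = points.item  # look up the bound method once, outside the loop
    for j in range(points.shape[0]):
        x = item(j, 0)  # builtin float, not a NumPy scalar
        z = item(j, 2)
        row = math.floor((x - x_offset) / grid_size)
        col = math.floor((z - y_offset) / grid_size)
        count[row, col] += 1
    return count

# Made-up sample: two points land in cell (0, 0), one in cell (1, 0).
points = np.array([[0.5, 9.0, 0.5],
                   [1.5, 9.0, 0.5],
                   [0.6, 9.0, 0.4]], dtype=np.float32)
count = bin_points_scalar(points, 0.0, 0.0, 1.0, (2, 2))
```

Reading each coordinate into a local variable once also avoids the repeated array lookups of the original loop.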
