
Is there a smart way to vectorize a nested for loop of inner products, where the inner index is bounded below by the outer index?

Here's a simple example. Say that arr1 and arr2 are numpy arrays each containing N vectors of length 3, i.e. they are of shape (N,3). N is usually in the range between 500 and 2000. I want to compute all possible combinations of inner products, but I also know that the mathematical problem is designed in such a way that the inner product of the i-th vector in arr1 and j-th vector in arr2 is equal to the inner product of the j-th vector in arr1 and i-th vector in arr2.

In an explicit nested for-loop this would look something like this.

import numpy as np

N = 2000
# For simplicity, I choose arr1 and arr2 to be the same here, but this need not always be the case.
arr1 = np.arange(N*3).reshape((N,3))
arr2 = arr1.copy()

inner_products = np.zeros((arr1.shape[0],arr2.shape[0]))
for idx1 in range(arr1.shape[0]):  
    vec1 = arr1[idx1]
    for idx2 in range(idx1, arr2.shape[0]):
        vec2 = arr2[idx2]
        inner_products[idx1,idx2] = np.dot(vec1,vec2)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2

Is there a smart way to vectorize this in a single command, without carrying out twice as many calculations as needed?

So far, I vectorized the inner loop with

inner_products = np.zeros((arr1.shape[0],arr2.shape[0]))
for idx1 in range(arr1.shape[0]):  
    vec1 = arr1[idx1]
    inner_products[idx1,idx1:] = np.dot(vec1.reshape(1,3), arr2[idx1:].T)
inner_products = inner_products + inner_products.T
inner_products[np.diag_indices(arr1.shape[0])] /= 2

But this variant is still slower than computing all N*N inner products with a single np.dot call, even though that call computes twice as many inner products.

inner_products = np.dot(arr1, arr2.T)
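For reference, a small self-contained check (a sketch with a reduced N, assuming arr1 == arr2 as in the example above) that the triangular construction really matches the full matrix product:

```python
import numpy as np

# Build the product from the upper triangle only, symmetrize,
# then compare with the full np.dot result.
N = 5
arr1 = np.arange(N * 3).reshape((N, 3))
arr2 = arr1.copy()

tri = np.zeros((N, N))
for idx1 in range(N):
    tri[idx1, idx1:] = arr1[idx1] @ arr2[idx1:].T

sym = tri + tri.T
sym[np.diag_indices(N)] /= 2  # the diagonal was doubled by the addition

assert np.array_equal(sym, arr1 @ arr2.T)
```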

Numba compilation also does not change the outcome; the single np.dot command remains the fastest.

Is there a smart way to vectorize this sort of problem while keeping the amount of computation to a minimum?

1 Answer


np.dot is clearly not optimal here because the input arrays are integer-typed and very thin. This is a pathological case. Besides, note that NumPy is also not optimized for arrays with very few items along the contiguous axis (a (3,N) shape should be preferred). Fortunately, we can write a faster Numba code for this specific use-case:

import numpy as np
import numba as nb

# Same as np.dot(arr1,arr2.T) for arrays with a shape (N,3)
@nb.njit('(int64[:,::1], int64[:,::1])', parallel=True)
def inner_product(arr1, arr2):
    n = arr1.shape[0]
    assert arr1.shape[1] == 3 and arr2.shape == (n, 3)
    res = np.empty((n, n), dtype=arr1.dtype)
    for i in nb.prange(n):
        for j in range(n):
            s = 0
            if j < i: # Transposed case
                for k in range(3):
                    s += arr1[j, k] * arr2[i, k]
            else:
                for k in range(3):
                    s += arr1[i, k] * arr2[j, k]
            res[i, j] = s
    return res

Benchmark

Here are results on my i5-9600KF CPU (6 cores) on the input arrays provided in the question (so with N = 2000):

%timeit -n 10 first_op_code(arr1, arr2)
# 3 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 second_op_code(arr1, arr2)
# 54.5 ms ± 291 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 inner_product(arr1, arr2)
# 5.54 ms ± 362 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 np.dot(arr1, arr2.T)
# 26.1 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can see that the Numba code is about 10 times faster than second_op_code (and also faster than np.dot, which produces different results in some cases). I suspect the function is rather memory-bound (or even bound by page faults on Windows) since it does not scale well (the sequential version is nearly as fast). This makes sense since there is not much computation to do compared to the memory accesses. Pre-allocating the output array might help (especially on Windows).

Note that enforcing the matrix symmetry by performing res[j, i] = res[i, j] is simple but not efficient (due to the non-contiguous access pattern). This is actually about twice as slow. Transposing the matrix is also rather slow for the same reason (though it can be optimized). Thus, I just recomputed the bottom-left part by transposing the input indices. This is significantly faster than the alternative solution. Re-computing values is cheap here because the whole operation is actually rather memory-bound.
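As a plain-NumPy illustration (not the Numba kernel itself), recomputing an entry below the diagonal with swapped input indices reproduces the transposed entry of the full product:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
arr1 = rng.integers(-20, 20, (N, 3))
arr2 = rng.integers(-20, 20, (N, 3))

full = arr1 @ arr2.T

# For j < i the kernel computes sum_k arr1[j, k] * arr2[i, k],
# i.e. the (j, i) entry of the full product:
i, j = np.tril_indices(N, k=-1)
recomputed = (arr1[j] * arr2[i]).sum(axis=1)
assert np.array_equal(recomputed, full[j, i])
```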

Note that using smaller data types can also help to make the operation faster. You should make sure there is no possible overflow, though.
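A quick sketch of such an overflow check, using the question's example data (the variable names here are just illustrative):

```python
import numpy as np

# Check that int32 is safe for this data before downcasting.
N = 2000
arr = np.arange(N * 3).reshape((N, 3))

max_abs = int(np.abs(arr).max())
worst_case = 3 * max_abs * max_abs        # upper bound on |sum of 3 products|
assert worst_case <= np.iinfo(np.int32).max

arr32 = arr.astype(np.int32)
assert np.array_equal(arr32 @ arr32.T, arr @ arr.T)  # no overflow occurred
```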


7 Comments

How large are your test arrays? Also, OP didn't say that their arrays are only integers.
I used the array in the provided example (with N = 2000) and I assumed the example is representative. If it is not, then this is a bad example. That being said, note that the code can be adapted by just replacing the three int64 array dtypes in the signature with other native types (e.g. float64), so it supports other types. I modified it so there is just the function signature to adapt. This code is still about 2 times faster with np.float64 (it takes about the same time, while np.dot gets faster since BLAS functions can be used, though they are inefficient for the reason mentioned in the answer).
OP said that arr1 and arr2 aren't necessarily the same. Try rng = np.random.default_rng(1); arr1 = rng.integers(-20, 20, (N,3)); arr2 = rng.integers(-20, 20, (N,3)). The results are then different.
If you change for j in range(n): to for j in range(i, n): and res[i, j] = s to res[i, j] = res[j, i] = s then your code is fine.
@jared Indeed. I chose another strategy than simply using res[i, j] = res[j, i] = s, since that was significantly slower (due to the inefficient memory access pattern). I explained this in the modified answer. The new code is just a bit slower than the previous one now. Thank you for the clarification.
