How to optimize math operations on matrix in python

Question

I am trying to reduce the time of a function that performs a serie of calculations with two matrix. Searching for this, I've heard of numpy, but I really do not know how apply it to my problem. Also, I Think one of the things is making my function slow is having many dots operators (I heard of that in this this page ).

The math correspond with a factorization for the Quadratic assignment problem:

QAP Factorization

My code is:

    delta = 0
    for k in xrange(self._tam):
        if k != r and k != s:
            delta +=
                self._data.stream_matrix[r][k] \
                * (self._data.distance_matrix[sol[s]][sol[k]] - self._data.distance_matrix[sol[r]][sol[k]]) + \
                self._data.stream_matrix[s][k] \
                * (self._data.distance_matrix[sol[r]][sol[k]] - self._data.distance_matrix[sol[s]][sol[k]]) + \
                self._data.stream_matrix[k][r] \
                * (self._data.distance_matrix[sol[k]][sol[s]] - self._data.distance_matrix[sol[k]][sol[r]]) + \
                self._data.stream_matrix[k][s] \
                * (self._data.distance_matrix[sol[k]][sol[r]] - self._data.distance_matrix[sol[k]][sol[s]])
    return delta

Running this on a problem of size 20 (Matrix of 20x20) take about 20 segs, the bottleneck is in this function

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
303878   15.712    0.000   15.712    0.000 Heuristic.py:66(deltaC)

I tried to apply map to the for loop, but because the loop body isn't a function call, it is not possible.

How could I reduce the time?

EDIT1

To answer eickenberg comment:

sol is a permutation, for example [1,2,3,4]. the function is called when I am generating neighbor solutions, so, a neighbor of [1,2,3,4] is [2,1,3,4]. I am changing only two positions in the original permutation and then call deltaC, which calculates a factorization of the solution with positions r,s swaped (In the example above r,s = 0,1). This permutation is made to avoid calculate the entire cost of the neighbor solution. I suppose I can store the values of sol[k,r,s] in a local variable to avoid looking up its value in each iteration. I do not know if this is what you was asking in your comment.

EDIT2

A minimal working example:

import random


distance_matrix = [[0, 12, 6, 4], [12, 0, 6, 8], [6, 6, 0, 7], [4, 8, 7, 0]]
stream_matrix = [[0, 3, 8, 3], [3, 0, 2, 4], [8, 2, 0, 5], [3, 4, 5, 0]]

def deltaC(r, s, S=None):
    '''
    Difference between C with values i and j swapped
    '''

    S = [0,1,2,3]

    if S is not None:
        sol = S
    else:
        sol = S

    delta = 0

    sol_r, sol_s = sol[r], sol[s]

    for k in xrange(4):
        if k != r and k != s:
            delta += (stream_matrix[r][k] \
                * (distance_matrix[sol_s][sol[k]] - distance_matrix[sol_r][sol[k]]) + \
                stream_matrix[s][k] \
                * (distance_matrix[sol_r][sol[k]] - distance_matrix[sol_s][sol[k]]) + \
                stream_matrix[k][r] \
                * (distance_matrix[sol[k]][sol_s] - distance_matrix[sol[k]][sol_r]) + \
                stream_matrix[k][s] \
                * (distance_matrix[sol[k]][sol_r] - distance_matrix[sol[k]][sol_s]))
    return delta


for _ in xrange(303878):
    d = deltaC(random.randint(0,3), random.randint(0,3))
print d

Now I think the better option is use NumPy. I tried with Matrix(), but did not improve the performance.

Best solution found

Well, Finally I was able to reduce the time a bit more combining @TooTone's solution and storing the indexes in a set to avoid the if. The time has dropped from about 18 seconds to 8 seconds. Here is the code:

def deltaC(self, r, s, sol=None):
    delta = 0
    sol = self.S if sol is None else self.S
    sol_r, sol_s = sol[r], sol[s]

    stream_matrix = self._data.stream_matrix
    distance_matrix = self._data.distance_matrix

    indexes = set(xrange(self._tam)) - set([r, s])

    for k in indexes:
        sol_k = sol[k]
        delta += \
            (stream_matrix[r][k] - stream_matrix[s][k]) \
            * (distance_matrix[sol_s][sol_k] - distance_matrix[sol_r][sol_k]) \
            + \
            (stream_matrix[k][r] - stream_matrix[k][s]) \
            * (distance_matrix[sol_k][sol_s] - distance_matrix[sol_k][sol_r])
    return delta

In order to reduce the time even more, I think the best way would be write a module.

You should take a look at numpy for optimized numerical computations. It is a mature library that has been designed exactly for this reason. And almost always their code will be more optimized than one you write yourself. — Can Ibanoglu
– Can Ibanoglu, Commented Apr 12, 2014 at 17:19
A first attempt should always be to try to vectorize your numpy operations. As of now, your code is as suboptimal as it can conceivably be for numpy: using a for loop and looking up e.g. sol[s] every single iteration although it stays the same. Before trying to present a solution, it would be great if you could tell us whether this operation has to be done for all r, s and whether sol is a fixed permutation of indices or not. If vectorisation doesn't work (it should, though), then you can look into compiling numerical expressions, using e.g. numexpr, but I'd keep that for much later — eickenberg
– eickenberg, Commented Apr 12, 2014 at 17:36
Also, could you tell us the dimensions r, s and k on which you would like to use this? — eickenberg
– eickenberg, Commented Apr 12, 2014 at 17:38
If self._data.stream_matrix and self._data.distance_matrix were a numpy matrix class it would be more efficent? — Alejandro Alcalde
– Alejandro Alcalde, Commented Apr 12, 2014 at 17:59

TooTone · Accepted Answer · 2014-04-13 00:53:19Z

In the simple example you've given, with for k in xrange(4): the loop body only executes twice (if r==s), or three times (if r!=s) and an initial numpy implementation, below, is slower by a large factor. Numpy is optimized for performing calculations over long vectors and if the vectors are short the overheads can outweigh the benefits. (And note in this formula, the matrices are being sliced in different dimensions, and indexed non-contiguously, which can only make things more complicated for a vectorizing implementation).

import numpy as np

distance_matrix_np = np.array(distance_matrix)
stream_matrix_np = np.array(stream_matrix)
n = 4

def deltaC_np(r, s, sol):
    delta = 0
    sol_r, sol_s = sol[r], sol[s]

    K = np.array([i for i in xrange(n) if i!=r and i!=s])

    return np.sum(
        (stream_matrix_np[r,K] - stream_matrix_np[s,K]) \
        *  (distance_matrix_np[sol_s,sol[K]] - distance_matrix_np[sol_r,sol[K]]) + \
        (stream_matrix_np[K,r] - stream_matrix_np[K,s]) \
        * (distance_matrix_np[sol[K],sol_s] - distance_matrix_np[sol[K],sol_r]))

In this numpy implementation, rather than a for loop over the elements in K, the operations are applied across all the elements in K within numpy. Also, note that your mathematical expression can be simplified. Each term in brackets on the left is the negative of the term in brackets on the right. enter image description here

This applies to your original code too. For example, (self._data.distance_matrix[sol[s]][sol[k]] - self._data.distance_matrix[sol[r]][sol[k]]) is equal to -1 times (self._data.distance_matrix[sol[r]][sol[k]] - self._data.distance_matrix[sol[s]][sol[k]]), so you were doing unnecessary computation, and your original code can be optimized without using numpy.

It turns out that the bottleneck in the numpy function is the innocent-looking list comprehension

K = np.array([i for i in xrange(n) if i!=r and i!=s])

Once this is replaced with vectorizing code

if r==s:
    K=np.arange(n-1)
    K[r:] += 1
else:
    K=np.arange(n-2)
    if r<s:
        K[r:] += 1
        K[s-1:] += 1
    else:
        K[s:] += 1
        K[r-1:] += 1

the numpy function is much faster.

A graph of run times is shown immediately below (right at the bottom of this answer is the original graph before optimizing the numpy function). You can see that it either makes sense to use your optimized original code or the numpy code, depending on how large the matrix is.

enter image description here

The full code is below for reference, partly in case someone else can take it further. (The function deltaC2 is your original code optimized to take account of the way the mathematical expression can be simplified.)

def deltaC(r, s, sol):
    delta = 0
    sol_r, sol_s = sol[r], sol[s]
    for k in xrange(n):
        if k != r and k != s:
            delta += \
                stream_matrix[r][k] \
                * (distance_matrix[sol_s][sol[k]] - distance_matrix[sol_r][sol[k]]) + \
                stream_matrix[s][k] \
                * (distance_matrix[sol_r][sol[k]] - distance_matrix[sol_s][sol[k]]) + \
                stream_matrix[k][r] \
                * (distance_matrix[sol[k]][sol_s] - distance_matrix[sol[k]][sol_r]) + \
                stream_matrix[k][s] \
                * (distance_matrix[sol[k]][sol_r] - distance_matrix[sol[k]][sol_s])
    return delta

import numpy as np

def deltaC_np(r, s, sol):
    delta = 0
    sol_r, sol_s = sol[r], sol[s]

    if r==s:
        K=np.arange(n-1)
        K[r:] += 1
    else:
        K=np.arange(n-2)
        if r<s:
            K[r:] += 1
            K[s-1:] += 1
        else:
            K[s:] += 1
            K[r-1:] += 1
    #K = np.array([i for i in xrange(n) if i!=r and i!=s]) #TOO SLOW

    return np.sum(
        (stream_matrix_np[r,K] - stream_matrix_np[s,K]) \
        *  (distance_matrix_np[sol_s,sol[K]] - distance_matrix_np[sol_r,sol[K]]) + \
        (stream_matrix_np[K,r] - stream_matrix_np[K,s]) \
        * (distance_matrix_np[sol[K],sol_s] - distance_matrix_np[sol[K],sol_r]))

def deltaC2(r, s, sol):
    delta = 0
    sol_r, sol_s = sol[r], sol[s]
    for k in xrange(n):
        if k != r and k != s:
            sol_k = sol[k]
            delta += \
                (stream_matrix[r][k] - stream_matrix[s][k]) \
                * (distance_matrix[sol_s][sol_k] - distance_matrix[sol_r][sol_k]) \
                + \
                (stream_matrix[k][r] - stream_matrix[k][s]) \
                * (distance_matrix[sol_k][sol_s] - distance_matrix[sol_k][sol_r])
    return delta


import time

N=200

elapsed1s = []
elapsed2s = []
elapsed3s = []
ns = range(10,410,10)
for n in ns:
    distance_matrix_np=np.random.uniform(0,n**2,size=(n,n))
    stream_matrix_np=np.random.uniform(0,n**2,size=(n,n))
    distance_matrix=distance_matrix_np.tolist()
    stream_matrix=stream_matrix_np.tolist()
    sol  = range(n-1,-1,-1)
    sol_np  = np.array(range(n-1,-1,-1))

    Is = np.random.randint(0,n-1,4)
    Js = np.random.randint(0,n-1,4)

    total1 = 0
    start = time.clock()
    for reps in xrange(N):
        for i in Is:
            for j in Js:
                total1 += deltaC(i,j, sol)
    elapsed1 = (time.clock() - start)
    start = time.clock()

    total2 = 0
    start = time.clock()
    for reps in xrange(N):
        for i in Is:
            for j in Js:
                total2 += deltaC_np(i,j, sol_np)
    elapsed2 = (time.clock() - start)

    total3 = 0
    start = time.clock()
    for reps in xrange(N):
        for i in Is:
            for j in Js:
                total3 += deltaC2(i,j, sol_np)
    elapsed3 = (time.clock() - start)

    print n, elapsed1, elapsed2, elapsed3, total1, total2, total3
    elapsed1s.append(elapsed1)
    elapsed2s.append(elapsed2)
    elapsed3s.append(elapsed3)

    #Check errors of one method against another
    #err = 0
    #for i in range(min(n,50)):
    #    for j in range(min(n,50)):
    #        err += np.abs(deltaC(i,j,sol)-deltaC_np(i,j,sol_np))
    #print err
import matplotlib.pyplot as plt

plt.plot(ns, elapsed1s, label='Original',lw=2)
plt.plot(ns, elapsed3s, label='Optimized',lw=2)
plt.plot(ns, elapsed2s, label='numpy',lw=2)
plt.legend(loc='upper left', prop={'size':16})
plt.xlabel('matrix size')
plt.ylabel('time')
plt.show()

And here is the original graph before optimizing out the list comprehension in deltaC_np

enter image description here

Thank you very much for your great work. I am trying to implement delta_np, but I get only integer arrays with one element can be converted to an index in sol[k]. With my minimal example, after calling delta_np, here is a picture while debugging : postimg.org/image/p276kbcxf
Maybe I have confused you, sol must be a permutation of the solution, if the problem is of size 4, sol must be a list of len 4.
@algui91 check that sol is a numpy array. I got a ton of errors like that when I was developing this (get my full program at the bottom of the answer working first if you are having serious problems). Try to use numpy arrays throughout if you can -- witness the speedup when I removed the list comprehension.
@algui91 I may have confused myself by insisting on a minimal example! The code I've given should work for arbitrary matrix sizes. (I found for very large matrices however that I started to get overflow messages with integer elements so I switched to double precision.) Out of interest what's a typical matrix size?

Collectives™ on Stack Overflow

How to optimize math operations on matrix in python

EDIT1

EDIT2

Best solution found

1 Answer 1

9 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

EDIT1

EDIT2

Best solution found

1 Answer 1

9 Comments

Linked

Related