I am investigating caching behavior in different languages. I create two matrices in Python using nested lists (yes, I know a Python list is not a contiguous block of numbers but an array of pointers to boxed objects; bear with me here). I then multiply these matrices together in three ways:
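The functions below call a `zero_matrix` helper that isn't shown in the post; a minimal version, assuming plain nested lists, might look like this:

```python
def zero_matrix(n):
    # n x n matrix of zeros as a list of row lists.
    # Note: [[0] * n] * n would alias the SAME row object n times,
    # so a list comprehension is used to get independent rows.
    return [[0] * n for _ in range(n)]
```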
```python
def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```
```python
def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```
```python
def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    # largest multiple of block not exceeding n
    # (no fringe handling: assumes n is divisible by block)
    en = block * (n // block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    total = c[i][j]  # renamed from sum, which shadows the builtin
                    for k in range(kk, kk + block):
                        total += a[i][k] * b[k][j]
                    c[i][j] = total
    return c
```
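For reference, a harness along these lines can reproduce the baseline-versus-flip comparison (the matrix size `n = 100` and the use of random inputs are my assumptions; the post does not state how the timings were taken):

```python
import random
import time

def zero_matrix(n):
    # n x n matrix of zeros, as nested Python lists
    return [[0] * n for _ in range(n)]

def baseline_matrix_multiply(a, b, n):
    # i-j-k loop order
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk(a, b, n):
    # i-k-j loop order: b is walked row-wise in the inner loop
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

n = 100  # assumption: original post does not state n
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

t0 = time.perf_counter()
c1 = baseline_matrix_multiply(a, b, n)
t_base = time.perf_counter() - t0

t0 = time.perf_counter()
c2 = baseline_matrix_multiply_flipjk(a, b, n)
t_flip = time.perf_counter() - t0

print(f"baseline: {t_base:.3f}s  flipjk: {t_flip:.3f}s  "
      f"({100 * t_flip / t_base:.2f}% of baseline)")
```

Both orderings accumulate each `c[i][j]` over `k` in the same order, so the results are bit-for-bit identical even with floats.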
My timings are as follows:

- Baseline: 3.440004294627216 s
- Flip j and k: 3.4685347505603144 s (100.83% of baseline)
- Blocking: 2.729924394035205 s (79.36% of baseline)
Some things to note:

- I am familiar with CPU caching behavior. To see my experiment in C, see here (though I haven't gotten any reviews for it).
- I've done this in JavaScript and C#, and the flip-j-k function provides significant performance gains when using arrays (the JS was run in the Chrome browser).
- The Python implementation is Python 3.5, by way of Anaconda.
- Please don't tell me about NumPy. My experiment is not about absolute performance but about understanding caching behavior.
Question: does anyone know what is going on here? Why does flipping the j and k loops not provide a speedup? Is it because a Python list is an array of pointers to boxed objects rather than a contiguous block of values? But then why does blocking still provide a non-marginal improvement in performance?