I am investigating caching behavior in different languages. I create two matrices in Python using nested lists (yes, I know a Python list is not a contiguous block of numbers but an array of pointers to boxed objects; bear with me here). I then multiply these matrices together in three ways:
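The functions below call a `zero_matrix` helper that isn't shown in the post; a minimal version, assuming plain nested lists, might look like this:

```python
def zero_matrix(n):
    # n x n matrix of zeros as a list of row lists.
    # Note: [[0] * n] * n would alias the SAME row object n times,
    # so a list comprehension is used to get independent rows.
    return [[0] * n for _ in range(n)]
```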
```python
def baseline_matrix_multiply(a, b, n):
    '''
    baseline multiply
    '''
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```
```python
def baseline_matrix_multiply_flipjk(a, b, n):
    '''
    same as baseline but switch j and k loops
    '''
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c
```
```python
def fast_matrix_multiply_blocking(a, b, n):
    '''
    use blocking
    '''
    c = zero_matrix(n)
    block = 25
    # largest multiple of block not exceeding n
    # (no fringe handling: assumes n is divisible by block)
    en = block * (n // block)
    for kk in range(0, en, block):
        for jj in range(0, en, block):
            for i in range(n):
                for j in range(jj, jj + block):
                    total = c[i][j]  # renamed from sum, which shadows the builtin
                    for k in range(kk, kk + block):
                        total += a[i][k] * b[k][j]
                    c[i][j] = total
    return c
```
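For reference, a harness along these lines can reproduce the baseline-versus-flip comparison (the matrix size `n = 100` and the use of random inputs are my assumptions; the post does not state how the timings were taken):

```python
import random
import time

def zero_matrix(n):
    # n x n matrix of zeros, as nested Python lists
    return [[0] * n for _ in range(n)]

def baseline_matrix_multiply(a, b, n):
    # i-j-k loop order
    c = zero_matrix(n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def baseline_matrix_multiply_flipjk(a, b, n):
    # i-k-j loop order: b is walked row-wise in the inner loop
    c = zero_matrix(n)
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

n = 100  # assumption: original post does not state n
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

t0 = time.perf_counter()
c1 = baseline_matrix_multiply(a, b, n)
t_base = time.perf_counter() - t0

t0 = time.perf_counter()
c2 = baseline_matrix_multiply_flipjk(a, b, n)
t_flip = time.perf_counter() - t0

print(f"baseline: {t_base:.3f}s  flipjk: {t_flip:.3f}s  "
      f"({100 * t_flip / t_base:.2f}% of baseline)")
```

Both orderings accumulate each `c[i][j]` over `k` in the same order, so the results are bit-for-bit identical even with floats.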
My timings are as follows:

- Baseline: 3.440004294627216 s
- Flip j and k: 3.4685347505603144 s (100.83% of baseline)
- Blocking: 2.729924394035205 s (79.36% of baseline)
Some things to note:

- I am familiar with CPU caching behavior. To see my experiment in C, see here (though I haven't gotten any reviews for it).
- I've done this in JavaScript and C#, and the flip-j-k function provides significant performance gains when using arrays (the JS was run in the Chrome browser).
- The Python implementation is Python 3.5, by way of Anaconda.
- Please don't tell me about NumPy. My experiment is not about absolute performance but about understanding caching behavior.
Question: does anyone know what is going on here? Why does flipping the j and k loops not provide a speedup? Is it because a Python list is an array of pointers to boxed objects rather than a contiguous block of values? But then why does blocking still provide a non-marginal improvement in performance?