I'm going to simulate the diffraction pattern of a normally incident Gaussian-profile beam from a 2D array of point scatterers with a distribution of heights.
The scatterer position arrays X, Y and Z each have size N x N, and they are summed over in each call to E_ab(a, b, positions, w_beam, k0). This is done M x M times to build up the diffraction pattern.
If I estimate ten floating-point operations per scatterer site per pixel and a nanosecond per flop (what my laptop does for small NumPy arrays), I'd expect a run time of 10 M^2 N^2 1E-09 seconds. For small N this runs a factor of 50 or 100 slower than that, and for large N (larger than, say, 2000) it slows down even further. I am guessing this has something to do with how the large arrays are paged in memory.
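One structural saving that is independent of any memory effects: in E_ab, the Gaussian envelope np.exp(-Rsq/w_beam**2) / w_beam**2 depends only on w_beam, not on the pixel (a, b), so it can be hoisted out of the M x M pixel loop and computed once per w_beam. A minimal sketch of the idea (the helper names `make_envelope` and `E_ab_hoisted` are mine, not from the script below):

```python
import numpy as np

def make_envelope(X, Y, w_beam):
    # The Gaussian weight and the 1/w_beam**2 normalization depend only
    # on w_beam, so compute them once instead of M*M times.
    return np.exp(-(X**2 + Y**2) / w_beam**2) / w_beam**2

def E_ab_hoisted(a, b, X, Y, Z, envelope, k0):
    # Only the phase factor depends on the pixel (a, b).
    phases = k0 * (a*X + b*Y + (1 + np.sqrt(1 - a**2 - b**2))*Z)
    return (envelope * np.exp(-1j * phases)).sum()
```

This removes one real exponential and the R-squared computation per scatterer from every pixel evaluation, at no change in the result.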
What can I do to increase the speed for large N?
note: Right now the height variation Z is random; in the future I plan to include an additional systematic height-variation term as well, so even though purely Gaussian variation might have an analytical solution, I need to do this numerically.
Since I'm distributing the Z heights randomly here, the plots will look a little different each time. My output (run on a laptop) is as follows, and I cannot even begin to understand why it takes longer (~16 seconds) when w_beam is small than when it is large (~6 seconds). (When I run this on an older Python 2 that came with IDLE, all four are between 10 and 10.5 seconds.)
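One hedged guess at the small-w_beam slowdown (an assumption on my part, not something I've profiled): when w_beam is small, many of the arguments to np.exp(-Rsq/w_beam**2) are hugely negative, so the results underflow to subnormal floats or to zero. Subnormal arithmetic and the libm underflow path are much slower than the normal-range path on many CPUs, and different libm builds (e.g. the Python 2 vs. Python 3 installs) can handle it differently. A quick check that such underflow actually occurs on this grid:

```python
import numpy as np

# With w_beam = 1 and a grid out to |x| = N = 100, Rsq/w_beam**2 reaches
# 2*N**2 = 20000, far past the double-precision exponent range.
x = np.arange(-100, 101)
X, Y = np.meshgrid(x, x)
env = np.exp(-(X**2 + Y**2) / 1.0**2)

tiny = np.finfo(float).tiny           # smallest *normal* double, ~2.2e-308
subnormal = (env > 0) & (env < tiny)  # underflowed but not flushed to zero
print(subnormal.any(), (env == 0).any())  # both occur on this grid
```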
My estimator 10 M^2 N^2 1E-09 suggests 0.25 seconds; these runs are roughly 50 times slower, so there may be room for substantial improvement.
1 16.460583925247192
2 14.861294031143188
4 8.405776023864746
8 6.4988932609558105
Total time: about 46 seconds on 2012 MacBook and recent Anaconda Python 3 installation.
Python script:
import numpy as np
import matplotlib.pyplot as plt
import time

def E_ab(a, b, positions, w_beam, k0):
    X, Y, Z = positions
    Rsq = X**2 + Y**2
    phases = k0 * (a*X + b*Y + (1 + np.sqrt(1 - a**2 - b**2))*Z)
    E = np.exp(-Rsq/w_beam**2) * np.exp(-1j*phases)
    return E.sum() / w_beam**2  # rough normalization

twopi = 2*np.pi
wavelength = 0.08
k0 = twopi/wavelength
z_noise = 0.05 * wavelength

N, M = 100, 50
x = np.arange(-N, N+1)
X, Y = np.meshgrid(x, x)
Z = z_noise * np.random.normal(size=X.shape)  # use random Z noise for now
positions = [X, Y, Z]

A = np.linspace(0, 0.2, M)

answers = []
for w_beam in (1, 2, 4, 8):
    E = []
    tstart = time.time()
    for a in A:
        EE = []
        for b in A:
            e = E_ab(a, b, positions, w_beam, k0)
            EE.append(e)
        E.append(EE)
    print(w_beam, time.time() - tstart)
    answers.append(np.array(E))

if True:
    plt.figure()
    for i, E in enumerate(answers):
        plt.subplot(2, 2, i+1)
        plt.imshow(np.log10(np.abs(E)), vmin=0.0001)
        plt.colorbar()
    plt.show()
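As for the direct speed question, the first thing I'd try is removing most of the per-pixel Python and ufunc-dispatch overhead by broadcasting one of the pixel axes. A sketch of that idea (my rewrite, equivalent to the double loop above up to floating-point roundoff; for very large N you would chunk the b axis so the (M, N, N) temporary fits in RAM):

```python
import numpy as np

def pattern(A, X, Y, Z, w_beam, k0):
    """All M x M pixels with one Python loop: broadcast the b axis."""
    # Envelope is (a, b)-independent: compute it once per w_beam.
    envelope = np.exp(-(X**2 + Y**2) / w_beam**2) / w_beam**2
    b = A[:, None, None]                      # shape (M, 1, 1)
    out = np.empty((len(A), len(A)), dtype=complex)
    for i, a in enumerate(A):
        c = 1 + np.sqrt(1 - a**2 - b**2)      # per-pixel constant, (M, 1, 1)
        phases = k0 * (a*X + b*Y + c*Z)       # (M, N, N) via broadcasting
        out[i] = (envelope * np.exp(-1j * phases)).sum(axis=(1, 2))
    return out
```

Beyond that, single precision (np.complex64) would halve the memory traffic if ~1e-7 relative accuracy is acceptable, and a tool like numexpr or numba could fuse the exp and multiply to avoid the large temporaries entirely.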
