
Is there a performant way to convert a numpy array to a FORTRAN ordered ctypes array, ideally without a copy being required, and without triggering problems related to strides?

import numpy as np

# Sample data
n = 10000
A = np.zeros((n,n), dtype=np.int8)
A[0,1] = 1

def slow_conversion(A):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(A.T))

assert slow_conversion(A)[1][0] == 1

Performance analysis of just the call to as_ctypes:

%%timeit
np.ctypeslib.as_ctypes(A)

3.35 µs ± 10.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Performance analysis of the supplied (slow) conversion

%%timeit
slow_conversion(A)

206 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The ideal outcome would be to get performance similar to that of just the as_ctypes call.

4 Comments

  • How and when to make a transpose for speedup is not a trivial question and depends a lot on the underlying hardware. Please have a look at this similar question stackoverflow.com/questions/16737298/… Commented Aug 28, 2022 at 11:32
  • Do you want to convert a C-ordered numpy array to a FORTRAN-ordered ctypes array? Because if you already have an F-ordered numpy array, then this conversion can be made fast (see my comment on @Stephan's answer). Commented Aug 29, 2022 at 7:47
  • Can it be assumed that the input is square? Commented Aug 29, 2022 at 8:18
  • @norok2 both the square and non-square cases are relevant. Regardless, I am still interested to see if there is a special case that handles a square input. Commented Aug 29, 2022 at 11:26

3 Answers


By default, Numpy creates C-ordered (row-major) arrays, because it is written in C. Doing a transposition with A.T creates a view of the array in which the strides are reversed (i.e. no copy). However, np.ascontiguousarray then performs a copy, because the transposed view is no longer C-contiguous, and copies are expensive. This is why slow_conversion is slow. Note that contiguity can be checked with yourarray.flags['F_CONTIGUOUS'] and by inspecting yourarray.strides. yourarray.flags and yourarray.__array_interface__ also indicate whether the array has been copied, as well as providing stride information.
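As a quick illustration of the flags and strides mentioned above (a minimal sketch using a small array):

```python
import numpy as np

A = np.zeros((3, 4), dtype=np.int8)   # C-ordered: rows are contiguous
print(A.flags['C_CONTIGUOUS'], A.flags['F_CONTIGUOUS'])  # True False
print(A.strides)                  # (4, 1): 4 bytes to the next row, 1 to the next column
print(A.T.flags['F_CONTIGUOUS'])  # True: the view just reverses the strides
print(A.T.strides)                # (1, 4)
```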

According to the documentation, np.asfortranarray returns an array laid out in Fortran order in memory, performing a copy if needed. In fact, np.asfortranarray(A) makes a copy, while np.asfortranarray(A.T) does not (you can check the C code of the function for more information about this behaviour). Since both results are FORTRAN-contiguous, it is better to use np.asfortranarray(A.T), which does not make any copy in this case.
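This copy/no-copy behaviour can be verified with np.shares_memory, for example:

```python
import numpy as np

A = np.zeros((1000, 1000), dtype=np.int8)
F1 = np.asfortranarray(A)    # A is C-ordered, so this must copy
F2 = np.asfortranarray(A.T)  # A.T is already F-contiguous: no copy, just a view
print(np.shares_memory(A, F1))  # False
print(np.shares_memory(A, F2))  # True
```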

Regarding ctypes, it deals with C arrays, which are stored in row-major order, as opposed to FORTRAN arrays, which are stored in column-major order. Furthermore, C arrays do not natively support strides, unlike FORTRAN ones, so a row is simply a view of contiguous data in memory. Since slow_conversion(A)[1][0] == 1 is required to be true, the first item of the second row of the returned object must be 1, and thus the columns of the input must end up stored contiguously in memory. The problem is that the initial array is C-contiguous, not FORTRAN-contiguous, so a transposition is required, and transpositions are fairly expensive (although the Numpy implementation is suboptimal).
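The lack of stride support on the ctypes side can be seen directly: np.ctypeslib.as_ctypes rejects strided views, which is why a contiguous copy is needed first. A minimal check (relying on NumPy's current behaviour):

```python
import numpy as np

A = np.zeros((4, 4), dtype=np.int8)

# A C-contiguous array converts directly...
c_arr = np.ctypeslib.as_ctypes(A)

# ...but a transposed view is strided, and as_ctypes rejects it.
try:
    np.ctypeslib.as_ctypes(A.T)
except TypeError as e:
    print(e)  # strided arrays not supported
```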

Assuming you do not want to pay the overhead of a copy/transposition, the problem needs to be relaxed. There are a few possible options:

  • Create a FORTRAN-ordered array directly with Numpy, for example with np.zeros((n,n), dtype=np.int8, order='F'). This creates a C array with transposed strides so that it behaves like a FORTRAN array, where computations operating on columns are fast (remember that Numpy is written in C, so row-major arrays are the reference). With this, the first row in ctypes is actually a column. Note that you should be very careful when mixing C- and FORTRAN-ordered arrays for the sake of performance, as non-contiguous accesses are much slower.
  • Use a strided FORTRAN array. This basically means that basic column-based computations will be much slower, and one needs to write row-based computations, which are quite unusual in FORTRAN. You can extract the pointer to the C-contiguous array with A.ctypes.data_as(POINTER(c_int8)) (c_int8 matching the int8 dtype used here), the strides with A.strides and the shape with A.shape. That being said, this solution does not appear to be really portable/standard. The standard way appears to be to use C bindings in FORTRAN. I am not very familiar with this, but you can find a complete example in this answer.
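As a minimal sketch of the Python side of the second option (the consumer of the pointer, e.g. FORTRAN code using the ISO C binding, is left out; it would have to index using the explicit strides rather than assume contiguity):

```python
import ctypes
import numpy as np

A = np.zeros((3, 5), dtype=np.int8)
A[0, 1] = 1

# Raw pointer plus explicit strides/shape, to be passed to the callee.
ptr = A.ctypes.data_as(ctypes.POINTER(ctypes.c_int8))
shape = A.shape       # (3, 5)
strides = A.strides   # (5, 1), in bytes, for this int8 array

# Element [i, j] lives at byte offset i*strides[0] + j*strides[1]
assert ptr[0 * strides[0] + 1 * strides[1]] == 1
```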

A last solution is to transpose the data manually using a fast transposition algorithm. An in-place transposition is faster than an out-of-place one (typically about 2 times), but it requires a square array and mutates the input, which should then not be used afterwards (unless it is fine to operate on a transposed array). A fast transposition cannot be written directly in Numpy. One solution is to do it in Numba (a sub-optimal out-of-place solution, but already pretty fast), or in C or directly in FORTRAN (using a wrapper function in all cases). This should be significantly faster than what Numpy does, but still far slower than a basic ctypes wrapping.
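To make the in-place idea concrete, here is a pure-NumPy sketch of a blocked in-place transpose for square arrays. This plain-Python version is for clarity only, not speed; to approach the performance discussed above, the loop body would need to be compiled (e.g. with Numba's njit, operating element-wise inside each block):

```python
import numpy as np

def transpose_inplace_blocked(A, block=64):
    """In-place blocked transpose of a square array (sketch)."""
    n = A.shape[0]
    assert A.shape[0] == A.shape[1], "in-place transpose needs a square array"
    for i in range(0, n, block):
        for j in range(i, n, block):
            ib, jb = min(i + block, n), min(j + block, n)
            if i == j:
                # Diagonal block: transpose it in place.
                A[i:ib, j:jb] = A[i:ib, j:jb].T.copy()
            else:
                # Off-diagonal blocks: swap (i,j) and (j,i), transposing each.
                tmp = A[i:ib, j:jb].T.copy()
                A[i:ib, j:jb] = A[j:jb, i:ib].T
                A[j:jb, i:ib] = tmp
    return A
```

The blocking keeps both the read and write working sets inside the cache, which is exactly the effect discussed in the last answer below.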


5 Comments

I tried the Numba solution you propose in the link, but I did not manage to improve over the plain np.ascontiguousarray(A.T).
@norok2 This is very surprising, though it is machine-dependent. On a 10-core Xeon Skylake server, ascontiguousarray takes 605 ms compared to 14 ms for the Numba code. Did you test with a C-ordered array? Did Numba print warnings about optimization/multithreading, and is it actually using multiple cores? What are your timings? Also note that the provided Numba code is an out-of-place solution that should be about twice as slow on most machines (especially x86-64 processors) as the in-place one.
I just tried in Google Colab on a C-ordered array. Not sure what the underlying architecture is. I got ~100 ms with np.ascontiguousarray(A.T) and at best ~110 ms with the Numba code by tweaking the sizes (I can't remember which gave the best result, but it was more or less casual sampling).
@norok2 Colab is clearly not a good platform to measure performance. Code runs in a virtual machine shared with other users, so results can vary a lot over time. Additionally, it has only 2 vCPUs, which is very unusual on modern machines (even my ~10-year-old mid-range notebook had 4 cores). The memory is slow (8-12 GiB/s, while modern PCs reach at least 20 GiB/s, 35-40 GiB/s for mine, up to 60 GiB/s for recent high-end ones, or even more on M1 Macs). See this.
@norok2 I tried this on a Colab instance: ascontiguousarray sometimes ran in 300 ms and sometimes took up to 2 seconds. The Numba code ran in 120-180 ms, so it was about twice as fast. I tried on my own Colab instance and got 1-1.3 seconds for ascontiguousarray as opposed to 140-260 ms for Numba. While the results are very unstable, it is a clear win for the Numba code. Here is the Colab instance.

If your starting array must be C-ordered:

As you have pointed out, np.ctypeslib.as_ctypes(...) is fast.

The bottleneck of your computation is np.ascontiguousarray(A.T), which is equivalent to np.asfortranarray(A) and is equally slow.


This leads me to believe that this can't be made faster using only numpy: since a whole dedicated function for it already exists (np.asfortranarray), we can assume it has been optimized for the best possible performance.

If you can start with an F-ordered array:

In this case, problem solved! You can get the performance just like that of the as_ctypes call because the array is already in the desired order in memory:

n = 10000
A = np.zeros((n, n), dtype=np.int8, order='F')
A[0, 1] = 1

def conversion(A):
    return np.ctypeslib.as_ctypes(np.ascontiguousarray(A.T))

assert conversion(A)[1][0] == 1
%%timeit
conversion(A)

# 4.76 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Thanks to @JérômeRichard and @StephanSchlecht for inspiring this idea. Read @JérômeRichard's answer for more insights!

2 Comments

I'm just curious, isn't a Fortran-ordered array a transpose of a contiguous array?
@Kevin, @jkr, print(np.asfortranarray(A)) prints the same as print(A). What differs between them is the order in which their elements are laid out in memory.
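A small demonstration of that last point: the logical values are identical, but the strides, and the raw bytes in memory, differ.

```python
import numpy as np

A = np.arange(4, dtype=np.int8).reshape(2, 2)  # [[0, 1], [2, 3]]
F = np.asfortranarray(A)
print(np.array_equal(A, F))     # True: same logical values
print(A.strides, F.strides)     # (2, 1) (1, 2): different memory layout
# order='A' reads the buffer in its own (C or F) memory order:
print(A.tobytes(order='A'))     # b'\x00\x01\x02\x03'
print(F.tobytes(order='A'))     # b'\x00\x02\x01\x03'
```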

There is one aspect of the operation that can be improved: it is not only making a copy, it is also loading from and storing to main memory instead of working from the cache.

Processors usually access memory in blocks of multiple contiguous bytes and serve subsequent accesses from the cache; when the cache runs out of space, some old block is evicted.

For the sake of argument, say your CPU works on blocks of 8 bytes and rows are contiguous. In one matrix you access columns, and in the other you access rows. When you write down a column, each block you load spans multiple columns, but you update only one of them. This overhead can be seen by copying a few columns:

n = 2**14
A = np.random.randint(0, 100, (n,n), dtype=np.int8)
B = np.empty_like(A)
%%timeit
B[:, :1] = A[:, :1]
%%timeit
B[:, :4] = A[:, :4]

If you do the same on rows, you should notice roughly linear scaling. If you copy columns, the cost of copying one column is very close to the cost of copying two, or even 8 or 16, depending on the hardware.

I will use n=2**14 to make things easier, but the principle applies to any dimension.

  • If you have a small enough matrix, say 8 x 8, the entire matrix fits in the cache, so you can transpose it without touching main memory.
  • If you are copying large contiguous blocks of data, then even if you can't do the entire operation in cache, you reduce the number of times a given piece of data is loaded from or stored to memory.

Based on this, what I tried is to rearrange the matrix into a matrix of smaller contiguous blocks: first I transpose the elements within a block, then the blocks within the matrix.

For the baseline

B = np.ascontiguousarray(A.T)
3.12 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using 8x8 blocks

T0 = A.reshape(2048,8,2048,8)
T1 = np.ascontiguousarray(T0.transpose(0,2,3,1))
T2 = np.ascontiguousarray(T1.transpose(1,0,2,3))
T3 = np.ascontiguousarray(T2.transpose(0,2,1,3))
B = T3.reshape(A.shape)
786 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
assert np.all(B == A.T) # 2.8s

It is still 200x slower than a simple copy, but it is already 4x faster than the original approach.

Allocating only two instead of three temporary arrays helps as well:

T0 = np.empty_like(A)
T1 = np.empty_like(A)
T0.reshape(2048,2048,8,8)[:] = A.reshape(2048,8,2048,8).transpose(0,2,3,1)
T1.reshape(2048,2048,8,8)[:] = T0.reshape(2048,2048,8,8).transpose(1,0,2,3)
T0.reshape(2048,8,2048,8)[:] = T1.reshape(2048,2048,8,8).transpose(0,2,1,3)
B = T0
686 ms ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments
