There is one aspect of it that can be improved. The operation is slow not only because it makes a copy, but because it loads from and stores to main memory instead of working from the cache.
Processors usually fetch a block of contiguous bytes (a cache line) whenever they access memory, and serve subsequent accesses to those bytes from the cache. When the cache runs out of space, some old block is evicted.
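A quick way to observe this (a sketch; `bench` is just an illustrative helper, and the absolute numbers depend entirely on your machine): summing a row reads contiguous bytes and uses every byte of each cache line, while summing a column uses only one byte per cache line loaded.

```python
import time
import numpy as np

n = 2**13
A = np.random.randint(0, 100, (n, n), dtype=np.int8)

def bench(f):
    t0 = time.perf_counter()
    result = f()
    return result, time.perf_counter() - t0

# Row: contiguous access, every byte of each cache line is used.
row, t_row = bench(lambda: A[0, :].sum())
# Column: strided access, one byte used per cache line loaded.
col, t_col = bench(lambda: A[:, 0].sum())
print(f"row: {t_row*1e6:.0f} us, column: {t_col*1e6:.0f} us")
```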
For the sake of argument, let's say that your CPU works on blocks of 8 bytes and that the rows are contiguous. In one matrix you access columns and in the other you access rows. When you write down a column you load multiple columns but update only one. This overhead can be seen by comparing the cost of copying a few columns:
import numpy as np

n = 2**14
A = np.random.randint(0, 100, (n, n), dtype=np.int8)
B = np.empty_like(A)
%%timeit
B[:, :1] = A[:, :1]
%%timeit
B[:, :4] = A[:, :4]
If you do the same with rows, the time should scale roughly linearly with the number of rows. If you copy columns, the cost of copying one column is very close to the cost of copying two, or even 8 or 16, depending on the hardware.
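Outside IPython, the same experiment can be scripted with the standard `timeit` module (a sketch; `copy_rows`/`copy_cols` are just illustrative names, and a smaller n is used here to keep the run short):

```python
import timeit
import numpy as np

# Smaller n = 2**13 just to keep the run quick; the behaviour is the same
# with n = 2**14.
n = 2**13
A = np.random.randint(0, 100, (n, n), dtype=np.int8)
B = np.empty_like(A)

def copy_rows(k):
    B[:k, :] = A[:k, :]   # contiguous: touches k * n bytes, all of them used

def copy_cols(k):
    B[:, :k] = A[:, :k]   # strided: loads a full cache line per row, uses k bytes of it

for k in (1, 2, 4, 8):
    t_rows = timeit.timeit(lambda: copy_rows(k), number=10)
    t_cols = timeit.timeit(lambda: copy_cols(k), number=10)
    print(f"k={k}: rows {t_rows*1e3:.2f} ms, columns {t_cols*1e3:.2f} ms")
```

On typical hardware the row timings grow roughly with k, while the column timings stay almost flat for small k.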
I will use n = 2**14 to keep things simple, but the principle applies to any dimension.
- If the matrix is small enough, say 8 x 8, it fits entirely in the cache, so you can transpose it without going back to main memory.
- If you copy large contiguous blocks of data, even when the whole operation does not fit in the cache, you reduce the number of times a given piece of data has to be loaded from or stored to memory again.
Based on this, what I tried is to rearrange the matrix into a matrix of smaller contiguous blocks: first transpose the elements within each block, then transpose the blocks within the matrix.
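The idea is easy to verify on a small example first (a sketch with n = 16 and 4 x 4 blocks; the sizes are chosen only for illustration):

```python
import numpy as np

n, b = 16, 4
m = n // b
A = np.random.randint(0, 100, (n, n), dtype=np.int8)

# View A as an m x m grid of b x b blocks: T0[i, a, j, c] = A[b*i + a, b*j + c]
T0 = A.reshape(m, b, m, b)
# Transpose the elements inside each block ...
T1 = T0.transpose(0, 2, 3, 1)      # T1[i, j, c, a]
# ... then transpose the grid of blocks ...
T2 = T1.transpose(1, 0, 2, 3)      # T2[j, i, c, a]
# ... and put the axes back in row-major order.
T3 = T2.transpose(0, 2, 1, 3)      # T3[j, c, i, a]
B = np.ascontiguousarray(T3).reshape(n, n)

assert np.array_equal(B, A.T)
```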
For the baseline:
B = np.ascontiguousarray(A.T)
3.12 s ± 446 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using 8x8 blocks:
T0 = A.reshape(2048,8,2048,8)
T1 = np.ascontiguousarray(T0.transpose(0,2,3,1))  # transpose elements within each 8x8 block
T2 = np.ascontiguousarray(T1.transpose(1,0,2,3))  # transpose the grid of blocks
T3 = np.ascontiguousarray(T2.transpose(0,2,1,3))  # restore row-major axis order
B = T3.reshape(A.shape)
786 ms ± 54.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
assert np.all(B == A.T) # 2.8s
It is still 200x slower than a simple copy, but it is already 4x faster than the original approach.
Allocating only two temporary arrays instead of three helps as well:
T0 = np.empty_like(A)
T1 = np.empty_like(A)
# Transpose elements within each 8x8 block.
T0.reshape(2048,2048,8,8)[:] = A.reshape(2048,8,2048,8).transpose(0,2,3,1)
# Transpose the grid of blocks.
T1.reshape(2048,2048,8,8)[:] = T0.reshape(2048,2048,8,8).transpose(1,0,2,3)
# Restore row-major axis order.
T0.reshape(2048,8,2048,8)[:] = T1.reshape(2048,2048,8,8).transpose(0,2,1,3)
B = T0
686 ms ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
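For reuse, the two-buffer version can be wrapped in a function (a sketch; `blocked_transpose` is a hypothetical name, and it assumes a square matrix whose side is divisible by the block size):

```python
import numpy as np

def blocked_transpose(A, b=8):
    """Transpose a square matrix using b x b blocks and two temporary buffers.

    Assumes A.shape == (n, n) with n divisible by b.
    """
    n = A.shape[0]
    m = n // b
    T0 = np.empty_like(A)
    T1 = np.empty_like(A)
    # Transpose elements within each b x b block.
    T0.reshape(m, m, b, b)[:] = A.reshape(m, b, m, b).transpose(0, 2, 3, 1)
    # Transpose the grid of blocks.
    T1.reshape(m, m, b, b)[:] = T0.reshape(m, m, b, b).transpose(1, 0, 2, 3)
    # Restore row-major axis order.
    T0.reshape(m, b, m, b)[:] = T1.reshape(m, m, b, b).transpose(0, 2, 1, 3)
    return T0

A = np.random.randint(0, 100, (2**10, 2**10), dtype=np.int8)
B = blocked_transpose(A)
assert np.array_equal(B, A.T)
```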