It can be done simply with 8 ANDs, 8 SHIFTs, and 6 ORs.
A naive counting of the delta-swap's operations is 1 AND, 2 SHIFTs and 3 XORs which if called 4 times yields 4 ANDs, 8 SHIFTs, and 12 XORs for the whole rotation. So the optimum may well depend upon the particular instructions' speeds on a particular hardware implementation. More ANDs and less (X)ORs or vice versa?
uint64_t
rotateXY_CW( uint64_t b ){
uint64_t c =
(b & 0x0033003300330033) << 2 |
(b & 0x00cc00cc00cc00cc) << 8 |
(b & 0xcc00cc00cc00cc00) >> 2 |
(b & 0x3300330033003300) >> 8
;
uint64_t d =
(c & 0x0505050505050505) << 1 |
(c & 0x0a0a0a0a0a0a0a0a) << 4 |
(c & 0xa0a0a0a0a0a0a0a0) >> 1 |
(c & 0x5050505050505050) >> 4
;
return d;
}
There is literature on this subject but it may not be immediately obvious that the results are applicable here. A bit-cube can be considered as a stack of bit-planes. And rotation of a bitmap is a topic that underwent much study at Xerox PARC when they invented WIMP and GUI. An algorithm is presented in the [Byte Magazine 1981 SmallTalk Issue] (https://archive.org/details/byte-magazine-1981-08) on page 188.
As the image at the bottom of that page illustrates, the steps of the algorithm are to:
- rotate the quadrants
- rotate the quadrants of the quadrants
- rotate the quadrants of the quadrants of the quadrants
- ..... until the quadrants just moved were pixel-sized, then we're done.
More details
So, assume for simplicity that we have a 4x4 bitmap to rotate, mapped into 16 bits.
With 4 bits in a row, we only need 2 steps of taking quadrants before we're down to pixels.
uint16_t x = 0x1234;
x = (x & 0x0033) << 2
| (x & 0x00cc) << 8
| (x & 0xcc00) >> 2
| (x & 0x3300) >> 8;
x = (x & 0x0505) << 1
| (x & 0x0a0a) << 4
| (x & 0xa0a0) >> 1
| (x & 0x5050) >> 4;
Notice that in the second step, operations parallelize over all sub-quadrants by selecting more pieces with the mask.
0 1 2 3
>> > >
0 0 0 1 0 0 1 0 0 1 0V 1 0V 1 0 1 0
1 1 1 0 0 1 0 1 1 ^0<1 ^1<0 0 1 1 0
^^ VV
2 0 1 0 0 0 0 1 0 0>0V 0>1V 0 0 0 1
3 1 0 0 0 0 0 0 0 ^0 0 ^0 0 0 0 0 0
<< < <
Masks for each quadrant:
1100 0011 0000 0000
1100 0011 0000 0000
0000 0000 0011 1100
0000 = 0033 0000 = 00cc 0011 = cc00 1100 = 3300
Shifting each mask yields the next mask
0033 << 2 == 00cc
00cc << 8 == cc00
cc00 >> 2 == 3300
3300 >> 8 == 0033
Masks for each sub-quadrant
1010 0101 0000 0000
0000 0000 0101 1010
1010 0101 0000 0000
0000 = 0505 0000 = 0a0a 0101 = a0a0 1010 = 5050
Now watch what happens if we extend this to a stack of 2 bit-planes. 2 4x4 bitmaps in 32 bits.
uint32_t x = 0x00001234;
x = (x & 0x00330033) << 2
| (x & 0x00cc00cc) << 8
| (x & 0xcc00cc00) >> 2
| (x & 0x33003300) >> 8;
x = (x & 0x05050505) << 1
| (x & 0x0a0a0a0a) << 4
| (x & 0xa0a0a0a0) >> 1
| (x & 0x50505050) >> 4;
The number of operations doesn't increase! only the lengths of the masks increase.
And by replicating the mask from the 4x4 code 4 times it applies to a 4x4x4 cube.
Reversing the direction
Rotating in the other direction can be done by changing the directions of all the shifts.
uint64_t
rotateXY_CCW( uint64_t b ){
uint64_t c =
(b & 0x0033003300330033) << 8 |
(b & 0x3300330033003300) << 2 |
(b & 0xcc00cc00cc00cc00) >> 8 |
(b & 0x00cc00cc00cc00cc) >> 2 |
;
uint64_t d =
(c & 0x0505050505050505) << 4 |
(c & 0x5050505050505050) << 1 |
(c & 0xa0a0a0a0a0a0a0a0) >> 4 |
(c & 0x0a0a0a0a0a0a0a0a) >> 1
;
return d;
}
Rotating on a different axis
uint64_t
rotateXZ_CW( uint64_t b ){
uint64_t c =
(b & 0x0000000033333333) << 2 |
(b & 0x00000000cccccccc) << 32 |
(b & 0xcccccccc00000000) >> 2 |
(b & 0x3333333300000000) >> 32
;
uint64_t d =
(b & 0x0000555500005555) << 1 |
(b & 0x0000aaaa0000aaaa) << 16 |
(b & 0xaaaa0000aaaa0000) >> 1 |
(b & 0x5555000055550000) >> 16
;
return d;
}
uint64_t
rotateXZ_CCW( uint64_t b ){
uint64_t c =
(b & 0x0000000033333333) << 32 |
(b & 0x3333333300000000) << 2 |
(b & 0xcccccccc00000000) >> 32 |
(b & 0x00000000cccccccc) >> 2
;
uint64_t d =
(b & 0x0000555500005555) << 16 |
(b & 0x5555000055550000) << 1 |
(b & 0xaaaa0000aaaa0000) >> 16 |
(b & 0x0000aaaa0000aaaa) >> 1
;
return d;
}
quadrant masks
1100 1100 0000 0000
1100 1100 0000 0000
1100 1100 0000 0000
1100 1100 0000 0000
0011 0011 0000 0000
...
0000 0000 0011 0011
...
0000 0000 1100 1100
...
subquadrant masks
1010 0000 1010 0000
0101 0000 0101 0000
0000 0101 0000 0101
0000 1010 0000 1010
Rotating on the other, other axis
This one was harder for me to wrap my head around, but my ascii art of the masks seems to make sense.
uint64_t
rotateYZ_CW( uint64_t b ){
uint64_t c =
(b & 0x0000000000ff00ff) << 32 |
(b & 0x00ff00ff00000000) << 8 |
(b & 0xff00ff0000000000) >> 32 |
(b & 0x00000000ff00ff00) >> 8
;
uint64_t d =
(c & 0x00000f0f00000f0f) << 16 |
(c & 0x0f0f00000f0f0000) << 4 |
(c & 0xf0f00000f0f00000) >> 16 |
(c & 0x0000f0f00000f0f0) >> 4
;
return d;
}
uint64_t
rotateYZ_CCW( uint64_t b ){
uint64_t c =
(b & 0x0000000000ff00ff) << 8 |
(b & 0x00000000ff00ff00) << 32 |
(b & 0xff00ff0000000000) >> 8 |
(b & 0x00ff00ff00000000) >> 32
;
uint64_t d =
(c & 0x00000f0f00000f0f) << 4 |
(c & 0x0000f0f00000f0f0) << 16 |
(c & 0xf0f00000f0f00000) >> 4 |
(c & 0x0f0f00000f0f0000) >> 16
;
return d;
}
YZ quadrant masks
1111 1111 0000 0000
1111 1111 0000 0000
0000 0000 0000 0000
0000 0000 0000 0000
0000 0000 1111 1111
0000 0000 1111 1111
0000 0000 0000 0000
0000 0000 0000 0000
0000 0000 0000 0000
0000 0000 0000 0000
0000 0000 1111 1111
0000 0000 1111 1111
0000 0000 0000 0000
0000 0000 0000 0000
1111 1111 0000 0000
1111 1111 0000 0000
subquadrant masks
1111 0000 1111 0000
0000 0000 0000 0000
1111 0000 1111 0000
0000 0000 0000 0000
0000 1111 0000 1111
0000 0000 0000 0000
0000 1111 0000 1111
0000 0000 0000 0000
0000 0000 0000 0000
0000 1111 0000 1111
0000 0000 0000 0000
0000 1111 0000 1111
0000 0000 0000 0000
1111 0000 1111 0000
0000 0000 0000 0000
1111 0000 1111 0000