A naive count of the delta-swap's operations gives 1 AND, 2 SHIFTs and 3 XORs, so calling it 4 times costs 4 ANDs, 8 SHIFTs and 12 XORs for the whole rotation. The optimum may therefore well depend on the relative speeds of those instructions on a particular hardware implementation: more ANDs and fewer (X)ORs, or vice versa?
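For reference, here is a minimal sketch of the usual delta-swap formulation that this count corresponds to. The function name `delta_swap`, the 64-bit word size and the parameter names are assumptions for illustration and are not taken from the post above; the mask selects the lower bit of each pair to be exchanged, and `delta` is the distance between the two bits.

```c
#include <stdint.h>

/* Swap each bit selected by `mask` with the bit `delta` positions above it.
 * Operation count: 1 AND, 2 shifts, 3 XORs. */
static uint64_t delta_swap(uint64_t x, uint64_t mask, unsigned delta)
{
    /* 1 shift, 1 XOR, 1 AND: mark positions where the paired bits differ */
    uint64_t t = ((x >> delta) ^ x) & mask;
    /* 2 XORs, 1 shift: flip both bits of every differing pair */
    return x ^ t ^ (t << delta);
}
```

As a usage example, `delta_swap(0xAB, 0x0F, 4)` exchanges the two nibbles of the low byte and yields `0xBA`. Whether a variant that trades XORs for ANDs (or the reverse) is faster would have to be measured on the target hardware.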