I've done some very basic benchmarking of the following program:
#include <stddef.h>
#include <stdlib.h>

#define LOOP_CNT 1000000000

int main(void) {
    int a = 1, b = 2, c = 0; /* initialized so the swaps don't read indeterminate values */
    for (size_t i = 0; i < LOOP_CNT; i++) {
#ifdef TMP
        /* swap via a temporary */
        c = a;
        a = b;
        b = c;
#endif
#ifdef XOR
        /* XOR swap, no temporary */
        a ^= b;
        b ^= a;
        a ^= b;
#endif
    }
    return EXIT_SUCCESS;
}
I compile it with either -DTMP or -DXOR, plus -g and -O0, using both gcc and icx.
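For reference, the builds look roughly like this (the file name swap.c is a placeholder; the flags and output names match what I describe above and the binaries timed below):
$ gcc -DTMP -g -O0 swap.c -o swap_gcc_tmp
$ gcc -DXOR -g -O0 swap.c -o swap_gcc_xor
$ icx -DTMP -g -O0 swap.c -o swap_icx_tmp
$ icx -DXOR -g -O0 swap.c -o swap_icx_xor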
Here are the results:
$ time ./swap_gcc_xor
real 0m3.721s
user 0m3.721s
sys 0m0.000s
$ time ./swap_icx_xor
real 0m0.974s
user 0m0.974s
sys 0m0.000s
$ time ./swap_gcc_tmp
real 0m0.889s
user 0m0.889s
sys 0m0.001s
$ time ./swap_icx_tmp
real 0m0.836s
user 0m0.832s
sys 0m0.005s
Why is the TMP version faster in both cases, and dramatically so with gcc? Both versions have data hazards between the three statements, and an XOR instruction is trivial for the ALU.
I have self-studied CPU micro-architecture, so please answer technically.
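If it helps, the generated loop bodies can be inspected with something like the following (a sketch of the commands I'd use; I haven't pasted the disassembly here):
$ gcc -DXOR -g -O0 -S -o swap_gcc_xor.s swap.c
$ objdump -d --source ./swap_gcc_xor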