I've done some very basic benchmarking of the following program:
#include <stddef.h>
#include <stdlib.h>

#define LOOP_CNT 1000000000

int main(void) {
    int a = 1, b = 2, c = 0; /* initialized so the swaps don't read indeterminate values */
    for (size_t i = 0; i < LOOP_CNT; i++) {
#ifdef TMP
        /* swap via a temporary */
        c = a;
        a = b;
        b = c;
#endif
#ifdef XOR
        /* XOR swap, no temporary */
        a ^= b;
        b ^= a;
        a ^= b;
#endif
    }
    return EXIT_SUCCESS;
}
I compile it with either -DTMP or -DXOR, plus -g and -O0, using both gcc and icx.
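For reference, the builds look roughly like this (the file name swap.c is a placeholder; the flags and output names match what I describe above and the binaries timed below):
$ gcc -DTMP -g -O0 swap.c -o swap_gcc_tmp
$ gcc -DXOR -g -O0 swap.c -o swap_gcc_xor
$ icx -DTMP -g -O0 swap.c -o swap_icx_tmp
$ icx -DXOR -g -O0 swap.c -o swap_icx_xor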
Here are the results:
$ time ./swap_gcc_xor
real 0m3.721s
user 0m3.721s
sys 0m0.000s
$ time ./swap_icx_xor
real 0m0.974s
user 0m0.974s
sys 0m0.000s
$ time ./swap_gcc_tmp
real 0m0.889s
user 0m0.889s
sys 0m0.001s
$ time ./swap_icx_tmp
real 0m0.836s
user 0m0.832s
sys 0m0.005s
Why is the TMP version faster in both cases, and dramatically so with gcc? Both versions have data hazards between the three statements, and an XOR instruction is trivial for the ALU.
I have self-studied CPU micro-architecture, so please answer technically.
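If it helps, the generated loop bodies can be inspected with something like the following (a sketch of the commands I'd use; I haven't pasted the disassembly here):
$ gcc -DXOR -g -O0 -S -o swap_gcc_xor.s swap.c
$ objdump -d --source ./swap_gcc_xor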