added 403 characters in body

edited May 7, 2024 at 15:47

Which I think is quite nice, for a compiler. (From a human, I'd probably expect that long dependency chain through all the imul instructions to be chopped up a bit to lower the latency; there's no downside to doing so.)
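To illustrate what chopping up the chain means, here is a small C sketch. It is not the code under review: the names product_chain and product_tree are made up for this example, and uint64_t is an assumption chosen so that wraparound is well defined. The first function multiplies in one serial chain, like the imul sequence GCC emitted; the second groups the factors so that the longest path is three multiplies deep instead of seven.

```c
#include <stdint.h>

/* One serial chain: every multiply has to wait for the previous result,
   so the critical path is seven multiplies long. */
uint64_t product_chain(const uint64_t v[8])
{
    uint64_t p = v[0];
    for (int i = 1; i < 8; i++)
        p *= v[i];
    return p;
}

/* Tree shape: the first four multiplies are independent of each other,
   then two, then one, so the critical path is three multiplies long. */
uint64_t product_tree(const uint64_t v[8])
{
    uint64_t a = v[0] * v[1];
    uint64_t b = v[2] * v[3];
    uint64_t c = v[4] * v[5];
    uint64_t d = v[6] * v[7];
    return (a * b) * (c * d);
}
```

Both functions return the same value for all inputs, because multiplication modulo 2^64 is associative and commutative. A compiler is free to reassociate either form, so if you want a guaranteed shape you would write the grouping out by hand in the assembly.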

You don't need to save r12 and r13 if you don't modify them, and you can avoid modifying them. For xx and yy, it seems you need them twice, so I understand why you save them (though you could have copied them to other registers, which is generally preferred over saving them to the stack). Writing xx + yy + zz in terms of xx, we get (xx + (xx + 1)) + ((xx + 1) * 2), which simplifies to 4 * xx + 3. That is what GCC has done with the lea on the second-to-last line. It avoids some complexity, and as a bonus it moves the result to rax for free.
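If you want to convince yourself of that simplification, a quick C check could look like the following. This is only a sketch: the definitions of yy and zz are inferred from the expression above rather than copied from the reviewed code, the function names are hypothetical, and uint64_t is used so that overflow wraps the same way the registers do.

```c
#include <stdint.h>

/* Hypothetical tail of f, written out step by step.
   Assumption: yy = xx + 1 and zz = yy * 2, as implied by
   (xx + (xx + 1)) + ((xx + 1) * 2). */
uint64_t tail_direct(uint64_t xx)
{
    uint64_t yy = xx + 1;
    uint64_t zz = yy * 2;
    return xx + yy + zz;
}

/* The simplified form, which is what lea rax, [3+rdi*4] computes. */
uint64_t tail_simplified(uint64_t xx)
{
    return 4 * xx + 3;
}
```

The two agree even when the intermediate values wrap around, since both are the same polynomial in xx modulo 2^64.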

I ended the program by making call to exit function. I am not sure if it is the best way to do it.

That's OK. Alternatively, since you're in main (as opposed to writing your own entrypoint, commonly called _start) you can also just return.

answered May 7, 2024 at 13:48

user555045

I am not sure whether I aligned the stack pointer correctly before each call instruction. I believe I did so.

For the call to f, I believe you did. But that's the least important one: it would have worked regardless.

By the way, instead of sub rsp, 8 you could do a dummy push (just push any register). That's smaller in terms of code size and may avoid some stack-synchronization μops. To quote Agner Fog's microarchitecture document:

The modification of the stack pointer by PUSH, POP, CALL and RET instructions is done by a special stack engine, which is placed immediately after the decoding stage in the pipeline and probably before the μop cache. This relieves the pipeline from the burden of μops that modify the stack pointer. This mechanism saves two copies of the stack pointer: one in the stack engine and another one in the register file and the out-of-order core. These two stack pointers may need to be synchronized if a sequence of PUSH, POP, CALL and RET instructions is followed by an instruction that reads or modifies the stack pointer directly, such as ADD ESP,4 or MOV EAX,[ESP+8]. The stack engine inserts an extra stack-synchronization μop in every case where synchronization of the two stack pointers is needed.

Those stack-synchronization μops are usually not a big deal, but if you can avoid them for free, you may as well.

For printf the alignment may actually matter, but after add rsp, 24 the alignment is off again, and there is nothing after that to restore it. By the way, it is more usual in x64 code to allocate some stack space once in the function prologue and then not change rsp in the function body (except implicitly, through the call instruction). That's unlike 32-bit x86, where it was common to do pushes, pops, add esp, 24 and so on seemingly at random throughout the function body (not really random, but you get the point).

For f you have some stack manipulation. That's fine, but it's a leaf function, so (depending on the registers used etc.) it may be possible to avoid that. As a point of comparison, GCC managed to produce this code:

f(long, long, long, long, long, long, long, long):
        imul    rdi, rsi
        imul    rdi, rdx
        imul    rdi, rcx
        imul    rdi, r8
        imul    rdi, r9
        imul    rdi, QWORD PTR [rsp+8]
        imul    rdi, QWORD PTR [rsp+16]
        lea     rax, [3+rdi*4]
        ret

Which I think is quite nice, for a compiler.