I am not sure whether I aligned the stack pointer correctly before each call instruction. I believe I did so.
For the call to f, I believe you did. But that's the least important one: it would have worked regardless.
By the way, instead of sub rsp, 8 you could do a dummy push of any register. That is smaller in terms of code size and may avoid some stack pointer synchronization µops. To quote Agner Fog's microarchitecture document:
The modification of the stack pointer by PUSH, POP, CALL and RET instructions is done by a special stack engine, which is placed immediately after the decoding stage in the pipeline and probably before the μop cache. This relieves the pipeline from the burden of μops that modify the stack pointer. This mechanism saves two copies of the stack pointer: one in the stack engine and another one in the register file and the out-of-order core. These two stack pointers may need to be synchronized if a sequence of PUSH, POP, CALL and RET instructions is followed by an instruction that reads or modifies the stack pointer directly, such as ADD ESP,4 or MOV EAX,[ESP+8]. The stack engine inserts an extra stack-synchronization μop in every case where synchronization of the two stack pointers is needed.
Those stack-synchronization μops are usually not a big deal, but if you can avoid them for free, you may as well.
For printf the alignment may actually matter, but after add rsp, 24 the alignment is off again, and nothing after that restores it. By the way, in x64 code it is more usual to allocate some stack space once in the function prologue and then leave rsp alone in the function body (except for the call instructions themselves), unlike 32-bit x86, where it was common to scatter pushes, pops, add esp, 24 and the like throughout the function body.
For f you have some stack manipulation. That's fine, but since it's a leaf function it may be possible to avoid it (depending on the registers used, etc.). As a point of comparison, GCC managed to produce this code:
f(long, long, long, long, long, long, long, long):
    imul rdi, rsi
    imul rdi, rdx
    imul rdi, rcx
    imul rdi, r8
    imul rdi, r9
    imul rdi, QWORD PTR [rsp+8]
    imul rdi, QWORD PTR [rsp+16]
    lea rax, [3+rdi*4]
    ret
Which I think is quite nice, for a compiler.
You don't need to save r12 and r13 if you don't modify them, and you can avoid modifying them. It seems you need xx and yy twice, so I understand why you save them, though copying them to other registers is generally preferred over saving them to the stack. Writing xx + yy + zz in terms of xx, we get (xx + (xx + 1)) + ((xx + 1) * 2), which simplifies to 4 * xx + 3. That is what GCC computes with the lea on the second-to-last line. It avoids some complexity and, as a bonus, puts the result in rax for free.
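To make the simplification concrete, here is a rough C sketch of the two forms. It is only a reconstruction under assumptions: the names f_direct and f_simplified are made up for illustration, and the definitions yy = xx + 1 and zz = yy * 2 are inferred from the expression above rather than taken from your source. The eight long parameters follow the signature GCC printed.

```c
/* Hypothetical reconstruction: xx is the product of the eight arguments,
   yy = xx + 1 and zz = yy * 2 are inferred from the expression in the text. */
long f_direct(long a, long b, long c, long d,
              long e, long g, long h, long i)
{
    long xx = a * b * c * d * e * g * h * i;
    long yy = xx + 1;
    long zz = yy * 2;
    return xx + yy + zz;   /* (xx + (xx + 1)) + ((xx + 1) * 2) */
}

/* Same value after simplifying by hand: xx + xx + 1 + 2*xx + 2 = 4*xx + 3.
   This is the form that a single lea rax, [3+rdi*4] can compute. */
long f_simplified(long a, long b, long c, long d,
                  long e, long g, long h, long i)
{
    long xx = a * b * c * d * e * g * h * i;
    return 4 * xx + 3;
}
```

Both functions should agree for any inputs that don't overflow a long. For the arguments 1 through 8, for example, xx is 40320 and both return 161283.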