Slow indirect function calls with 16-byte by-pointer enum argument #143050

Open
@y21

(The title is intentionally rather general even though this does seem somewhat specific. I'm not 100% sure whether this is something that could be fixed on the rustc side or the LLVM side, or whether my analysis is actually correct. Regardless, it was a bit unexpected to me that a function taking a 16-byte enum would have any more overhead than one taking another 16-byte type, so I filed it as a bug.)

While I was profiling a program that makes somewhat heavy use of indirect function calls that can't be inlined, I noticed that a non-trivial amount of time was spent on movups/movaps instructions to copy a 16-byte enum argument that was being passed by pointer.

I believe I reduced the slowdown to a function that takes a 16-byte enum as an argument (which rustc passes by pointer) and a caller that constructs that enum on the stack.

#[inline(never)]
pub fn byptr(val: Result<u64, u32>) {
  std::hint::black_box(val); // force a load of the parameter which is passed by pointer
}

pub fn test() {
  byptr(Ok(1));
}

Godbolt

Here's a full reproducer where the call is wrapped in a tight loop; it can be run and shows the slowdown compared to another function that takes a different 16-byte type, also passed by pointer, which doesn't have this issue.
Running that locally, case 1 (the enum) consistently performs about 5x worse than case 2 (the other 16-byte type): 375ms vs 70ms on an AMD Ryzen 3 PRO 3200G.
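For readers without access to the linked reproducer, here is a minimal sketch of what such a timing loop could look like. It is not the linked code: the function names, the `time_both` helper, and the choice of `(u32, u32, u32, u32)` as the second 16-byte type are assumptions made for illustration, and the timings will vary by CPU and compiler version.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Case 1: the 16-byte enum from the issue.
#[inline(never)]
pub fn by_ptr_enum(val: Result<u64, u32>) {
    black_box(val); // force the callee to read the argument
}

// Case 2: a different 16-byte type (assumed here to be a 4-tuple of u32).
#[inline(never)]
pub fn by_ptr_tuple(val: (u32, u32, u32, u32)) {
    black_box(val);
}

/// Calls each function `iters` times and returns (enum_time, tuple_time).
pub fn time_both(iters: u64) -> (Duration, Duration) {
    let start = Instant::now();
    for _ in 0..iters {
        by_ptr_enum(Ok(1));
    }
    let enum_time = start.elapsed();

    let start = Instant::now();
    for _ in 0..iters {
        by_ptr_tuple((1, 2, 3, 4));
    }
    let tuple_time = start.elapsed();

    (enum_time, tuple_time)
}
```

Calling `time_both` with a large iteration count from `main` in a release build and printing both durations should be enough to compare the two cases.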


Now this is more or less an educated guess at what could be the cause (I don't have a machine at hand that can run perf with hardware counters), but looking at the asm, as mentioned above, the callee uses a movups to load the 16-byte enum into xmm0:

example::byptr::hf156bd35b01c5a6e:
        movups  xmm0, xmmword ptr [rdi]

At the call site, however, to initialize that Ok(1) value on the stack, it does a 32-bit store at [rdi] for the discriminant and a separate 64-bit store at [rdi + 8] for the payload (rdi ends up as rsp + 8 here):

example::test::h8fec37f4e1b51032:
...
        mov     qword ptr [rsp + 16], 1
        mov     dword ptr [rsp + 8], 0
        lea     rdi, [rsp + 8]

Could the slowdown come from a failed store-to-load forward? As far as I know, this pattern of making two smaller stores to build a large value and then loading the large value in one go is a case that store-to-load forwarding can't handle: there's no entry in the store buffer with a matching start address and a greater or equal size, so the load has to wait for the stores to retire, which degrades performance.
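To make the rule in that paragraph concrete, here is a toy model of it. This is only the simplified condition as phrased above (same start address, store at least as large as the load); real CPUs have more nuanced forwarding rules, and the `Access` type and both function names are made up for this sketch.

```rust
/// A memory access: start address and size in bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Access {
    pub addr: u64,
    pub size: u64,
}

/// Toy rule: a load can be forwarded only if some pending store starts at
/// the same address and is at least as large as the load.
pub fn can_forward(pending_stores: &[Access], load: Access) -> bool {
    pending_stores
        .iter()
        .any(|s| s.addr == load.addr && s.size >= load.size)
}

/// The two stores seen at the call site, relative to the argument's address:
/// an 8-byte payload at +8 and a 4-byte discriminant at +0.
pub fn enum_call_site_stores(base: u64) -> [Access; 2] {
    [
        Access { addr: base + 8, size: 8 },
        Access { addr: base, size: 4 },
    ]
}
```

Under this model, a 16-byte load at `base` is not forwardable from `enum_call_site_stores(base)`, since neither store is both at `base` and 16 bytes wide, whereas a single 16-byte store at `base` would satisfy the rule.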

If I use (u32, u32, u32, u32) instead of Result<u64, u32>, I see that it uses a 128-bit store/load via movups/movaps on both sides, where store-to-load forwarding presumably succeeds, and as the repro above showed this is significantly faster.
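As a quick sanity check that the types being compared really are the same size, so that size alone doesn't explain the difference, something like the following can be used. The `Quad` alias and the `layouts` helper are names invented for this sketch, and Rust does not guarantee the layout of these types, so the numbers are only what I'd expect on x86_64.

```rust
use std::mem::{align_of, size_of};

// Hypothetical alias for the tuple type used in the comparison.
pub type Quad = (u32, u32, u32, u32);

/// Returns (type name, size in bytes, alignment in bytes) for the types
/// discussed in this issue.
pub fn layouts() -> [(&'static str, usize, usize); 3] {
    [
        (
            "Result<u64, u32>",
            size_of::<Result<u64, u32>>(),
            align_of::<Result<u64, u32>>(),
        ),
        ("(u32, u32, u32, u32)", size_of::<Quad>(), align_of::<Quad>()),
        (
            "Result<u64, u64>",
            size_of::<Result<u64, u64>>(),
            align_of::<Result<u64, u64>>(),
        ),
    ]
}
```

On x86_64 all three should report a size of 16 bytes, which is consistent with the difference coming from how the value is stored and loaded rather than from how large it is.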

Likewise, if I change the function to explicitly take a &Result<u64, u32> and initialize the value manually with a movups, it becomes fast again. The same happens if I change the function to load the value with two movs instead. So basically, making sure the stores and loads match up in their sizes makes the perf degradation go away.

So my question would be, is there something that's preventing Result<u64, u32> from just being passed by-value as an i128 instead of by memory? A very quick bisection on godbolt shows that it did do that up until 1.60 for the code above. In 1.61 it started passing it by pointer. Interestingly, Result<u64, u64> does get passed by-value.
Or is this something that would be better fixed on the LLVM side, like not splitting up stores/loads like that? Assuming that this actually is the issue.
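Given the observation that Result<u64, u64> is passed by value, one conceivable user-side workaround is to widen the error type at the call boundary. This is only a sketch of that idea: the `widen`, `by_val_wide`, and `test_wide` names are hypothetical, and I have not measured whether it actually avoids the slowdown.

```rust
use std::hint::black_box;

/// Hypothetical helper: widen the error type so the argument becomes
/// Result<u64, u64>, which was observed to be passed by value.
pub fn widen(val: Result<u64, u32>) -> Result<u64, u64> {
    val.map_err(u64::from)
}

/// Same shape as `byptr` from the issue, but taking the widened type.
#[inline(never)]
pub fn by_val_wide(val: Result<u64, u64>) -> Result<u64, u64> {
    black_box(val)
}

/// Caller that mirrors `test()` from the issue.
pub fn test_wide() -> Result<u64, u64> {
    by_val_wide(widen(Ok(1)))
}
```

This changes the function signature, so it is only an option when the callee is under your control; it does not address the underlying question of why the narrower enum is passed by pointer.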

Meta

rustc 1.89.0-nightly (cdd545be1 2025-06-07)
binary: rustc
commit-hash: cdd545be1b4f024d38360aa9f000dcb782fbc81b
commit-date: 2025-06-07
host: x86_64-unknown-linux-gnu
release: 1.89.0-nightly
LLVM version: 20.1.5

(but it really reproduces with any rustc after 1.61)

Labels: A-ABI (Area: Concerning the application binary interface (ABI)), C-bug (Category: This is a bug.), needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.)
