
I'm writing a JIT runtime, and I've started thinking about shorter variants of some instructions.

In x64 we can do at least two kinds of jumps: rel8 and rel32. The first takes 2 bytes (opcode + 8-bit offset) while the second takes 5 bytes (opcode + 32-bit offset). Of course the second one can handle a wider range of offsets.
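To make the size difference concrete, here is a sketch (mine, not part of the question) of an encoder that picks the form from the displacement. It assumes the displacement has already been computed relative to the end of the instruction, and it glosses over the fact that the choice itself changes the instruction length, which is exactly the complication I get into below:

```rust
// 0xEB is `jmp rel8`, 0xE9 is `jmp rel32`; in both cases the stored
// displacement is relative to the byte *after* the instruction.
fn encode_jmp(offset: i32) -> Vec<u8> {
    if let Ok(rel8) = i8::try_from(offset) {
        vec![0xEB, rel8 as u8] // 2 bytes
    } else {
        let mut v = vec![0xE9]; // 5 bytes
        v.extend_from_slice(&offset.to_le_bytes());
        v
    }
}
```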

Btw, I'm talking here about plain unconditional jumps, but this applies to conditional jumps as well (where the long form is even 6 bytes, due to the two-byte 0F 8x opcode).

The short jump is supposedly faster, or at least that's what I've been told, and it certainly makes the code shorter, which helps instruction caching. However, we may still need long jumps in some places, and deciding when to choose which requires additional processing. That processing is either complicated or inefficient (in terms of compile time, which matters in the context of JITting), or both. The reason is that consecutive forward jumps affect each other depending on the size each one ends up taking.

An alternative is to always use long jumps, which makes code generation fairly simple.

So the question is: is using long jumps, even when a valid short jump could be used, always inefficient? In theory it should be. However, I've done some tests and I couldn't find any performance benefit from using short jumps. But maybe my tests were flawed.

Another idea is to choose short vs long, but keep the buffer as it is, without reprocessing: generate all jumps as long first, and when a short jump turns out to suffice, emit it and fill the remaining 3 bytes with nop instructions. Would that bring any benefit over long jumps? Again, my tests seem to show that long jumps are just as effective.
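That patching step can be sketched like this (function name and layout are mine, not from the post): it rewrites an already-emitted 5-byte `jmp rel32` in place as a 2-byte `jmp rel8` followed by three nops, where `at` is the index of the 0xE9 opcode in the buffer:

```rust
// Rewrite `jmp rel32` (E9 xx xx xx xx) in place as `jmp rel8` (EB xx)
// plus three NOPs, without moving any other bytes. Returns false if the
// displacement doesn't fit rel8, leaving the long form untouched.
fn shorten_jmp_in_place(buf: &mut [u8], at: usize) -> bool {
    assert_eq!(buf[at], 0xE9, "expected a jmp rel32 at `at`");
    let rel32 = i32::from_le_bytes(buf[at + 1..at + 5].try_into().unwrap());
    // Both displacements are relative to the end of their own instruction.
    // The short jmp ends 3 bytes earlier, but the NOPs keep the overall
    // length at 5 bytes, so the short displacement is rel32 + 3.
    if let Ok(rel8) = i8::try_from(rel32 + 3) {
        buf[at] = 0xEB;
        buf[at + 1] = rel8 as u8;
        buf[at + 2..at + 5].copy_from_slice(&[0x90, 0x90, 0x90]);
        true
    } else {
        false
    }
}
```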

Maybe it depends on hardware? I've been testing this on my personal machine, a Ryzen 7 CPU, although I'm mostly interested in modern CPUs in general.

Can anyone shed light on this?


// EDIT: Here's an explanation of why "just pick the shorter version when a short jump fits" is actually not trivial at all. Consider the following sequence of instructions:

B:
...
jmp A
...
jmp B
...
A:
...

Say the distance from jmp A to A is 100 bytes, so it fits in a 2-byte short jump; it is a forward jump. Now say the distance from B to jmp B is 120 bytes, so it also fits in a 2-byte short jump; it is a backward jump. Everything's fine, we write those relative distances into the machine code, and so jmp A is translated into jmp 100. Remember that machine code operates on byte offsets, not labels. But no sane language operates on bytes, not even assemblers do.

But what if jmp B is actually 500 bytes away from B? Then it has to become a 5-byte jump. But that puts 3 additional bytes between jmp A and A, so we need to patch jmp A in the machine code: we previously emitted jmp 100, but now we need jmp 103.

Moreover, if we put 10 such jmp B instructions between jmp A and A, then we need to patch jmp A into jmp 130. But 130 doesn't fit in an 8-bit signed integer, and so jmp A has to be widened to 5 bytes. But then all those jmp B instructions need to be patched as well (because jmp A lies between B and jmp B). Oh, what a headache... Note that none of these problems exist if we emit long jumps only.

Solving that is either complicated or inefficient, or both. Here's an article that explains a potential solution and how LLVM deals with it: https://eli.thegreenplace.net/2013/01/03/assembler-relaxation I might actually implement it, but honestly I'm not sure it's worth it. If I always emit long (or rather: fixed-size) jumps, then emitting machine code stays simple; I don't need to deal with resizing and patching at all.
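For contrast, here is why the fixed-size approach keeps emission trivial: since every jump is exactly 5 bytes, a forward reference can be emitted as a placeholder and patched once its label is bound, and nothing ever moves. A sketch (types and names are mine, not from any real JIT):

```rust
// Fixed-size (always rel32) jump emission with backpatching. Because
// every jump is 5 bytes, recording the patch site is all the
// bookkeeping we need; no resizing pass, no cascading patches.
struct Emitter {
    buf: Vec<u8>,
    fixups: Vec<(usize, usize)>, // (offset of rel32 field, label id)
    labels: Vec<Option<usize>>,  // label id -> resolved buffer position
}

impl Emitter {
    fn new() -> Self {
        Emitter { buf: Vec::new(), fixups: Vec::new(), labels: Vec::new() }
    }
    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }
    fn bind(&mut self, label: usize) {
        self.labels[label] = Some(self.buf.len());
    }
    fn jmp(&mut self, label: usize) {
        self.buf.push(0xE9);
        self.fixups.push((self.buf.len(), label));
        self.buf.extend_from_slice(&[0; 4]); // placeholder rel32
    }
    fn finish(mut self) -> Vec<u8> {
        for (site, label) in self.fixups {
            let target = self.labels[label].expect("unbound label");
            // rel32 is relative to the end of the 5-byte instruction
            let rel = target as i32 - (site as i32 + 4);
            self.buf[site..site + 4].copy_from_slice(&rel.to_le_bytes());
        }
        self.buf
    }
}
```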


// EDIT 2:

Here's how I tested both variants. The code is in Rust; I use the region crate to handle the executable buffer. The full example needs these imports:

use std::io::Write;
use std::mem::forget;
use std::time::SystemTime;

Then I do:

fn fill_buffer<T: Write>(
    buffer: &mut T,
    number_of_nops: i32,
    add: &[u8]) -> usize
{
    let mut write = |slice| {
        buffer.write_all(slice).unwrap();
    };

    write(&[0x48, 0x31, 0xC0]);  // xor rax, rax

    let mut counter = 0;

    write(add);  // jmp -> add
    counter += add.len();

    for _ in 0..number_of_nops {
        write(&[0x90]);  // nop
        counter += 1;
    }

    write(add);  // again jmp -> add
    counter += add.len();

    write(&[0x48, 0x39, 0xf8]);  // cmp rax, rdi
    counter += 3;

    counter += 6;  // account for the 6-byte jl below
    let neg = (-(counter as i32)).to_le_bytes();
    write(&[0x0f, 0x8c]);  // jl rel32 (back to the initial jmp, i.e. loop)
    write(&neg);

    write(&[0xc3]);  // ret
    return counter + 4;  // +3 for the initial xor, +1 for the ret
}

This function is called with one of the two following byte sequences as the add argument:

const SHORT_ADD: &[u8] = &[
    0xeb, 0x02,  // jmp (rel8) +2
    0x90,  // nop
    0x90,  // nop
    0x48, 0x83, 0xc0, 0x01,  // add rax, 1
];

const LONG_ADD: &[u8] = &[
    0xe9, 0x02, 0x00, 0x00, 0x00,  // jmp (rel32) +2
    0x90,  // nop
    0x90,  // nop
    0x48, 0x83, 0xc0, 0x01,  // add rax, 1
];

So basically I create the following function:

fn func(rdi: i64) -> i64 {
    let mut result = 0;
    while result < rdi {
        result += 1;  // fancy jmp->add here
        // nops here
        result += 1;  // fancy jmp->add here
    }
    return result;
}

and the difference between the variants is that the second one takes 6 more bytes (+3 bytes for each of the two long jumps). And then main:

fn main() {
    const SIZE: usize = 1 << 14;
    for i in 80..400 {
        unsafe {
            let number_of_nops = i;
            // Allocate executable memory
            let mut alloc = region::alloc(SIZE, region::Protection::READ_WRITE_EXECUTE).unwrap();

            // Convert to vec, which implements Write
            let buffer = alloc.as_mut_ptr::<u8>();
            let mut vec = Vec::from_raw_parts(buffer, 0, SIZE);
            let result = fill_buffer(&mut vec, number_of_nops, LONG_ADD);  // or SHORT_ADD
            forget(vec);  // Don't run vec's destructor

            // Convert to function pointer
            let func: extern "sysv64" fn(i64)->i64 = core::mem::transmute(buffer);

            // and time it
            let rdi = 50000000;
            let now = SystemTime::now();
            let a = func(rdi);
            let now2 = SystemTime::now();

            // Display results
            let diff = now2.duration_since(now).unwrap().as_millis();
            println!("Result: {a}, RDI: {rdi}, number_of_nops: {number_of_nops}, took me: {diff}");
        }
    }
}

Finally, I run these functions with RDI set to 50000000 so they loop enough times, then I measure. The code increases in size, but not in complexity, and at some point it crosses the cache size limit.
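One caveat about the measurement itself: SystemTime is a wall clock and can be adjusted while the program runs, whereas std::time::Instant is the monotonic clock intended for durations, and taking the minimum over several runs filters out scheduler noise. A sketch of such a timing helper (my code, not part of the benchmark above):

```rust
use std::hint::black_box;
use std::time::Instant;

// Time a call on the monotonic clock, repeating it and keeping the
// minimum of all runs to reduce scheduling and frequency-scaling noise.
fn time_min_ms(f: impl Fn(i64) -> i64, rdi: i64, runs: u32) -> u128 {
    let mut best = u128::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        black_box(f(black_box(rdi))); // keep the call from being optimized out
        best = best.min(start.elapsed().as_millis());
    }
    best
}
```

In main this could replace the single SystemTime pair, e.g. as `time_min_ms(|x| func(x), 50000000, 10)` with the transmuted function pointer wrapped in a closure.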

The results are inconclusive. In fact, the long variant is sometimes faster. The difference doesn't seem to exceed 5%, in either direction.

  • At first approximation, smaller code = better cache utilisation = faster. But beyond that, this is going to depend entirely on microarchitectural details. Agner's optimization manuals are a helpful guide to x86-based microarchitectures: agner.org/optimize Commented Dec 24, 2024 at 20:29
  • @amon yes, that's what I thought: smaller code = better cache utilisation. But in practice I just couldn't create sophisticated enough code where that actually mattered. And so I started doubting that claim. Commented Dec 24, 2024 at 20:49
  • "Worth it" is subjective, no? Especially as you have JIT compile performance and generated code quality to balance. Commented Dec 24, 2024 at 22:33
  • On older CPUs you may not have a choice. Many of the 8-bit processors had only short conditional branch instructions. I'm unsure if this applies to modern processors. Commented Dec 25, 2024 at 1:00
  • Before you wring your hands over this too much, be sure you understand this. Commented Dec 25, 2024 at 4:42

1 Answer


Java bytecode has variants of some instructions, like (from memory) "load variable at index 1", "at index 2", "at index 10", "at index n". Making your bytecode smaller may have small performance benefits.

Whether that benefit is worth the complexity is up to you.

For jump instructions: usually you produce single-byte-offset jumps first, and if a jump distance turns out too long, you make that instruction longer instead. Repeat until everything fits.
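The loop described above can be sketched over abstract instruction records (names are mine, not the poster's). Branches start short and are only ever widened, never re-shortened, so the iteration terminates:

```rust
// Widen-until-fixpoint branch relaxation. The buffer is kept as
// instruction records, not bytes, so "resizing" a branch is just
// flipping a flag; bytes would only be emitted once sizes are stable.
enum Insn {
    Plain(usize),                       // any non-branch, with its byte length
    Jump { target: usize, long: bool }, // branch to an instruction index
}

fn relax(insns: &mut [Insn]) {
    loop {
        // Byte offset of each instruction under the current size guesses
        // (plus one extra entry so a branch may target the end).
        let mut offsets = Vec::with_capacity(insns.len() + 1);
        let mut pos = 0usize;
        for insn in insns.iter() {
            offsets.push(pos);
            pos += match insn {
                Insn::Plain(len) => *len,
                Insn::Jump { long: true, .. } => 5,
                Insn::Jump { long: false, .. } => 2,
            };
        }
        offsets.push(pos);

        // Widen every short branch whose displacement no longer fits rel8.
        let mut changed = false;
        for i in 0..insns.len() {
            if let Insn::Jump { target, long } = &mut insns[i] {
                if !*long {
                    let end = offsets[i] as i64 + 2; // rel8 is from insn end
                    let disp = offsets[*target] as i64 - end;
                    if i8::try_from(disp).is_err() {
                        *long = true;
                        changed = true;
                    }
                }
            }
        }
        if !changed {
            return; // fixpoint reached: every branch fits its encoding
        }
    }
}
```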

  • The problem is that "making the instruction longer instead" is not a simple task. When generating code we don't know the exact place we need to jump to. This is especially true for forward jumps, and especially when multiple jumps are involved, each affecting the others (if I decide to make jump X longer, then a previous jump may have to be altered as well, if it points at something beyond X). I have a rough idea of an algorithm that calculates that, but it requires allocations. With fixed-length jumps I don't allocate much, which makes code generation very fast. Again, important in the context of a JIT. Commented Dec 24, 2024 at 19:27
  • It's easy. Generate the code with all jumps and lengths unknown. Initially make all jumps short. Find all those that jump too far for a short branch and make them long branches. Repeat, in case some branch distances are longer now, until all branches fit. Commented Dec 24, 2024 at 21:14
  • You just record "instructions 3, 12 and 25 are branches". Then you guess they all fit into two bytes. Then you check all branch distances under this assumption and mark those that don't fit as four bytes. Finally you generate the code with all branches at the correct size. The buffer is instructions, not bytes. Commented Dec 24, 2024 at 21:21
  • Resizing one instruction affects others, so they require patching. It is not simple to implement, and it likely requires additional allocations. I've updated my question; it addresses this issue now. Commented Dec 25, 2024 at 15:16
  • ... there is no easy answer telling you "this will save more cycles than it costs", except "try it out, maybe make it optional for the user, or maybe find some nifty algorithm capable of deciding by itself at runtime whether this optimization could be worth it". Commented Dec 26, 2024 at 8:50
