I'm writing a JIT runtime, and I've started thinking about shorter variants of some instructions.
In x64 we can encode at least two kinds of jumps: rel8 and rel32. The first takes 2 bytes (opcode + 8-bit offset) while the second takes 5 bytes (opcode + 32-bit offset). Of course the second one can handle a much wider range of offsets.
Btw, I'm talking here about plain unconditional jumps, but the same applies to conditional jumps as well.
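To be concrete, these are the encodings I mean, written out the same way as the byte arrays further down (the constant names and the +5 offsets are just placeholders; note that the long form of a conditional jump is actually 6 bytes because of the two-byte opcode):
const JMP_SHORT: &[u8] = &[0xeb, 0x05];                         // jmp rel8  (2 bytes)
const JMP_LONG:  &[u8] = &[0xe9, 0x05, 0x00, 0x00, 0x00];       // jmp rel32 (5 bytes)
const JL_SHORT:  &[u8] = &[0x7c, 0x05];                         // jl  rel8  (2 bytes)
const JL_LONG:   &[u8] = &[0x0f, 0x8c, 0x05, 0x00, 0x00, 0x00]; // jl  rel32 (6 bytes)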
The short jump seems to be faster, or at least that's what I've been told, and it makes the code smaller, which should help caching. However, we may still need long jumps, and deciding when to choose which requires additional processing, which is either complicated or inefficient (in terms of compile time, which matters in the context of JITting), or both. The reason is that consecutive forward jumps affect each other depending on the size each one ends up taking.
An alternative is to always use long jumps, which makes code generation fairly simple.
So the question is: is using long jumps, even when a valid short jump could be used, always less efficient? In theory it should be. However, I've done some tests and I couldn't find any performance benefit from using short jumps. But maybe my tests were flawed.
Another idea is to choose between short and long, but keep the buffer layout fixed, without any reprocessing: reserve 5 bytes for every jump, and when it turns out a short jump suffices, emit it and fill the remaining 3 bytes with nops. Would that bring any benefit over plain long jumps? Again, my tests seem to show that long jumps are just as effective.
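To make the idea concrete, here is a rough sketch (the function is made up for illustration; it assumes a jmp rel32 was already emitted at offset at in code and that the target displacement is already known, e.g. after backpatching a forward jump):
/// Sketch only: rewrite an already-emitted `jmp rel32` (E9 xx xx xx xx) in place
/// into `jmp rel8` plus a 3-byte nop, if the displacement fits in a signed byte.
/// `at` is the offset of the E9 opcode inside `code`.
fn shorten_jmp_in_place(code: &mut [u8], at: usize) -> bool {
    assert_eq!(code[at], 0xe9, "expected a jmp rel32 here");
    let rel32 = i32::from_le_bytes(code[at + 1..at + 5].try_into().unwrap());
    // Displacements are measured from the end of the instruction. The rel8 form
    // ends 3 bytes earlier, and the target itself doesn't move (the buffer
    // layout stays fixed), so the encoded displacement grows by 3.
    let rel8 = rel32 + 3;
    if let Ok(rel8) = i8::try_from(rel8) {
        code[at] = 0xeb; // jmp rel8
        code[at + 1] = rel8 as u8;
        code[at + 2..at + 5].copy_from_slice(&[0x0f, 0x1f, 0x00]); // 3-byte nop
        true
    } else {
        false
    }
}
Since nothing ever moves, no other jump needs re-patching, which is the whole point of this variant.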
Maybe it depends on hardware? I've been testing this on my personal machine, some Ryzen 7 CPU, although I'm mostly interested in modern CPUs in general.
Can anyone shed light on this?
// EDIT: Here's an explanation of why "just pick the shorter version when the jump is short enough" is actually not trivial at all. Consider the following sequence of instructions:
B:
...
jmp A
...
jmp B
...
A:
...
Say the distance from jmp A to A is 100 bytes, so it fits in a 2-byte short jump, and it is a forward jump. Now say the distance from B to jmp B is 120 bytes, so it also fits in a 2-byte short jump, and it is a backward jump. Everything's fine: we write those relative distances into the machine code, and so jmp A is translated into jmp 100. Remember that machine code operates on byte offsets, not labels; but no sane language works in byte offsets, not even assemblers do.
But what if jmp B is actually 500 bytes away from B? Then it has to become a 5-byte jump. But that puts 3 additional bytes between jmp A and A, so we need to patch jmp A in the machine code: previously we emitted jmp 100, but now we need jmp 103.
Moreover, if we put 10 such jmp B instructions between jmp A and A, we now need to patch jmp A into jmp 130. But 130 doesn't fit in a signed 8-bit integer, so jmp A has to be widened to 5 bytes. But then all those jmp B instructions need to be patched as well (because jmp A lies between B and jmp B). Oh, what a headache... Note that none of these problems exist if we only ever emit long jumps.
Solving that is either complicated or inefficient, or both. Here's an article that explains a potential solution and how LLVM deals with it: https://eli.thegreenplace.net/2013/01/03/assembler-relaxation I might actually implement it, but honestly I'm not even sure it's worth it. If I always emit long (or rather: fixed-size) jumps, emitting machine code stays simple; I don't need to deal with resizing and patching at all.
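For reference, the relaxation fix-point from that article boils down to something like the sketch below (the data structures are mine, invented for illustration, not LLVM's; it only decides which jumps end up short or long, actually encoding the displacements would be a second pass):
/// Sketch of iterative relaxation: start with every jump short,
/// widen the ones that don't fit, and repeat until nothing changes.
struct Jump {
    pos: usize,    // index of this jump in `sizes`
    target: usize, // index of the instruction it jumps to
    long: bool,    // current size guess
}

/// `sizes[i]` is the current byte size of instruction `i`
/// (jumps start out at 2, their short size).
fn relax(sizes: &mut [usize], jumps: &mut [Jump]) {
    loop {
        // Recompute the byte offset of every instruction under the current guesses.
        let mut offsets = Vec::with_capacity(sizes.len() + 1);
        let mut off = 0;
        for &s in sizes.iter() {
            offsets.push(off);
            off += s;
        }
        offsets.push(off); // one past the end, so a label at the very end works

        let mut changed = false;
        for j in jumps.iter_mut() {
            if j.long {
                continue; // once widened, a jump never shrinks back
            }
            // rel8 is measured from the end of the 2-byte short form.
            let end_of_jump = (offsets[j.pos] + 2) as i64;
            let disp = offsets[j.target] as i64 - end_of_jump;
            if i8::try_from(disp).is_err() {
                j.long = true;
                sizes[j.pos] = 5; // jmp: 2 -> 5 (a conditional jump would go 2 -> 6)
                changed = true;
            }
        }
        if !changed {
            return; // fix-point reached: every remaining short jump still fits
        }
    }
}
Sizes only ever grow, so this terminates, but in the worst case it needs a full pass per widened jump, and that kind of compile-time cost is exactly what I'd like to avoid in a JIT.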
// EDIT 2:
Here's how I tested both variants. The code is in Rust. I use the region crate to get an executable buffer, and then I do:
use std::io::Write;

fn fill_buffer<T: Write>(
    buffer: &mut T,
    number_of_nops: i32,
    add: &[u8],
) -> usize {
    // The explicit `&[u8]` is needed: otherwise the closure's parameter type
    // is inferred as `&[u8; 3]` from the first call and the later calls fail.
    let mut write = |slice: &[u8]| {
        buffer.write_all(slice).unwrap();
    };
    write(&[0x48, 0x31, 0xC0]); // xor rax, rax
    // `counter` measures the distance the final jl has to jump back over:
    // everything from right after the xor up to and including the jl itself.
    let mut counter = 0;
    write(add); // jmp -> add
    counter += add.len();
    for _ in 0..number_of_nops {
        write(&[0x90]); // nop
        counter += 1;
    }
    write(add); // again jmp -> add
    counter += add.len();
    write(&[0x48, 0x39, 0xf8]); // cmp rax, rdi
    counter += 3;
    counter += 6; // the jl rel32 itself: 2-byte opcode + 4-byte offset
    let neg = (-(counter as i32)).to_le_bytes();
    write(&[0x0f, 0x8c]); // jl rel32 (back to the first jmp -> add, i.e. loop)
    write(&neg);
    write(&[0xc3]); // ret
    counter + 4 // total bytes written: counter plus the xor (3) and the ret (1)
}
This function is called with one of the following two byte sequences as the add argument:
const SHORT_ADD: &[u8] = &[
    0xeb, 0x02,             // jmp (rel8) +2
    0x90,                   // nop
    0x90,                   // nop
    0x48, 0x83, 0xc0, 0x01, // add rax, 1
];

const LONG_ADD: &[u8] = &[
    0xe9, 0x02, 0x00, 0x00, 0x00, // jmp (rel32) +2
    0x90,                         // nop
    0x90,                         // nop
    0x48, 0x83, 0xc0, 0x01,       // add rax, 1
];
So basically I create the following function:
fn func(rdi: i64) -> i64 {
    let mut result = 0;
    while result < rdi {
        result += 1; // fancy jmp->add here
        // nops here
        result += 1; // fancy jmp->add here
    }
    return result;
}
and the difference between the variants is that the second one is 6 bytes bigger (twice +3 bytes due to the long jump). And then main:
use std::mem::forget;
use std::time::SystemTime;

fn main() {
    const SIZE: usize = 1 << 14;
    for i in 80..400 {
        unsafe {
            let number_of_nops = i;
            // Allocate executable memory
            let mut alloc = region::alloc(SIZE, region::Protection::READ_WRITE_EXECUTE).unwrap();
            // Wrap it in a Vec, which implements Write
            let buffer = alloc.as_mut_ptr::<u8>();
            let mut vec = Vec::from_raw_parts(buffer, 0, SIZE);
            let _size = fill_buffer(&mut vec, number_of_nops, LONG_ADD); // or SHORT_ADD
            forget(vec); // Don't run vec's destructor; `alloc` still owns the memory
            // Convert to a function pointer
            let func: extern "sysv64" fn(i64) -> i64 = core::mem::transmute(buffer);
            // ...and time it
            let rdi = 50000000;
            let now = SystemTime::now();
            let a = func(rdi);
            let now2 = SystemTime::now();
            // Display results
            let diff = now2.duration_since(now).unwrap().as_millis();
            println!("Result: {a}, RDI: {rdi}, number_of_nops: {number_of_nops}, took: {diff} ms");
        }
    }
}
Finally I run those functions with RDI set to 50000000 so they loop enough times, and I measure. As number_of_nops grows, the code increases in size but not in complexity, and at some point it crosses the cache size limit.
The results are inconclusive. In fact the long variant is sometimes faster. The difference doesn't seem to exceed 5%, and it goes in both directions.