Why is 'H' / 72 / 0x48 the second most common byte in executables?

Question

(If the score of this question is 72, please don't upvote!)

I ran this:

cat /usr/bin/* |
  perl -ne 'map {$a{$_}++} split//; END{print map { "$a{$_}\t$_\n" } keys %a}' |
  grep --text . | sort -n | plotpipe --log y {1}

and got this:

(Even with a log y-axis it still looks exponential! There is more than 100x between the top and the bottom)

Looking at the numbers:

:
31919597        ^H
32983719        ^B
33943030        ^O
39130281        \213
39893389        $
52237360        \211
53229196        ^A
76884442        \377
100776756       H
746405320       ^@

It is hardly surprising that ^@ (NUL) is the most common byte in executables. \377 (255) and ^A (1) also make intuitive sense to me.

But what causes 'H' (72) to be the second most common byte in executables - far more common than 255 and 1?

Background

For a Perl script, I needed to find the least common byte in Perl scripts. By accident, I didn't grep out only Perl scripts but ran the command on all binaries. I expected a few bytes to stand out, such as NUL, 1, and 255, but never 'H'.

The input for the graph is the count of each byte, sorted. The y-axis represents the count, and the x-axis represents the line number (1-256, as a byte can only take on 256 different values). The y-axis is log scale, so the difference is bigger than exponential.
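For readers who would rather not decode the Perl one-liner, the same count can be sketched in Python. This is only an approximation of the pipeline above, not the command that produced the numbers; the helper names `count_bytes`, `read_files` and `top` are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_bytes(chunks):
    """Tally byte values (0-255) over an iterable of bytes objects."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk)  # iterating a bytes object yields ints
    return counts


def read_files(directory):
    """Yield the contents of each readable regular file in a directory."""
    for path in Path(directory).iterdir():
        try:
            if path.is_file():
                yield path.read_bytes()  # whole file in memory; fine for a sketch
        except OSError:
            continue  # unreadable file: skip it


def top(counts, n=10):
    """Return the n most common bytes as (hex label, count) pairs."""
    return [(f"0x{value:02x}", count) for value, count in counts.most_common(n)]
```

Calling `top(count_bytes(read_files("/usr/bin")))` should list the most frequent bytes first, which is the reverse of the `sort -n` order shown above.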

Personally I find this question interesting, but the help center is clear: "You should only ask practical, answerable questions based on actual problems that you face". Now I'm curious: what is the actual problem here?
– Kamil Maciorowski, Commented Nov 5, 2023 at 12:58
@KamilMaciorowski, I think that's mostly in reference to open-ended opinion-based questions and such. Even if oddities like this weren't that relevant to any single issue at hand, they may still have some clear technical reason in the background. And the answer may be interesting.
– ilkkachu, Commented Nov 5, 2023 at 19:07
anyway, if we took that restriction strictly, pretty much all questions about history would be forbidden. ("why was this designed like this?" - "do you have a problem about that?")
– ilkkachu, Commented Nov 5, 2023 at 19:08
@ilkkachu In good faith I assume all such questions are based on actual problems; and I assume the same here. The difference is here I'm more curious what the actual problem is; curious enough to ask. I suspect the answer to this (if my assumption is right and the answer exists) may be more interesting to me than any technical answer below. Of course it's Ole's courtesy if I get my answer.
– Kamil Maciorowski, Commented Nov 5, 2023 at 19:38
@JustineKrejcha, heh, since you mention it, it's amusing to note that e.g. both codegolf and retrocomputing (rife with history questions) have the exact same text. So fixed boilerplate it indeed appears to be.
– ilkkachu, Commented Nov 5, 2023 at 20:50

Stéphane Chazelas · Accepted Answer · 2023-11-08 10:56:31Z

That would be the 64 bit operand size prefix of amd64 machine code instructions.

You'll notice it only happens on amd64 executables.

If you compare on the /bin/* of http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_arm64.deb, http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_amd64.deb and http://ftp.debian.org/debian/pool/main/c/coreutils/coreutils_9.1-1_i386.deb, you'll see:

$ for f (coreutils_9.1-1_*.deb) bsdtar xOf $f da\* | bsdtar xO ./bin/\* | xxd -p -c1 | sort | uniq -c | sort -rn | head -n 5 | grep -H --label="${${f:r}##*_}" .
amd64: 692417 00
amd64: 145689 ff
amd64:  81911 48
amd64:  48006 89
amd64:  45331 0f
arm64:1409826 00
arm64:  70391 ff
arm64:  67915 03
arm64:  49380 20
arm64:  41655 40
i386: 515346 00
i386: 171643 ff
i386:  78361 0e
i386:  69317 24
i386:  50497 83

0x48 (72, 'H') is only in the top 3 on amd64.

On ls on my amd64 Debian system:

$ xxd -p -c1 =ls | sort | uniq -c | sort -rn | head -n 5
  39187 00
   7827 ff
   5565 48
   4181 20
   3393 0f

If we disassemble the code in that executable, we find a lot of 0x48 bytes in the instructions:

$ objdump -d =ls | grep -cw 48
5353

Most of them in first position:

$ objdump -d =ls | grep -wm10 48
    4000:       48 83 ec 08             sub    $0x8,%rsp
    4004:       48 8b 05 ad ff 01 00    mov    0x1ffad(%rip),%rax        # 23fb8 <__gmon_start__@Base>
    400b:       48 85 c0                test   %rax,%rax
    4012:       48 83 c4 08             add    $0x8,%rsp
    44b6:       68 48 00 00 00          push   $0x48
    4751:       48 89 f3                mov    %rsi,%rbx
    4754:       48 83 ec 68             sub    $0x68,%rsp
    4758:       48 8b 3e                mov    (%rsi),%rdi
    475b:       64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
    4764:       48 89 44 24 58          mov    %rax,0x58(%rsp)

$ objdump -d =ls | grep -Pc '^\s*[\da-f]+:\s+48'
5113

According to http://ref.x86asm.net/geek.html#x48, that 0x48 is the 64-bit operand size REX.W opcode prefix, which specifies that the operation is to be performed on 64-bit operands instead of whatever the default operand size would be.
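To see why the byte is 0x48 specifically: in 64-bit mode, the bytes 0x40 to 0x4f are REX prefixes with the bit layout `0100WRXB`, and 0x48 is the one with only W set. A minimal Python sketch of that decoding follows; the function name `decode_rex` is made up for illustration, and it looks at the bit pattern only, so it cannot tell a real prefix from a 0x48 that happens to be an immediate or data byte.

```python
def decode_rex(byte):
    """Return the W, R, X, B flags of a REX prefix byte, or None if not REX.

    Only meaningful for amd64 code in 64-bit mode; in 32-bit code the
    bytes 0x40-0x4f are one-byte inc/dec instructions instead.
    """
    if (byte & 0xF0) != 0x40:
        return None  # high nibble is not 0100, so this is not a REX prefix
    return {
        "W": bool(byte & 0x08),  # 1 = 64-bit operand size
        "R": bool(byte & 0x04),  # extends the ModRM reg field
        "X": bool(byte & 0x02),  # extends the SIB index field
        "B": bool(byte & 0x01),  # extends ModRM r/m, SIB base or opcode reg
    }


print(decode_rex(0x48))
```

With R, X and B all clear, 0x48 is what a 64-bit operation needs when it only touches the eight legacy registers (rax, rsp, rsi and so on), which is consistent with how often it shows up in the disassembly above.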

$ objdump -d =ls | pcregrep -o1 -o2 '^\s*[\da-f]+:\s+(48 .. ).*?\t(\S+)' | sort | uniq -c  | sort -rn | head
   1512 48 89 mov
   1040 48 8b mov
    630 48 8d lea
    372 48 85 test
    326 48 83 add
    198 48 39 cmp
    158 48 83 sub
     79 48 01 add
     72 48 83 cmp
     69 48 c7 movq

All of these are instructions operating on 64-bit operands.
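The tally above can also be approximated without pcregrep. The sketch below is an illustration in Python, not the command used in this answer: it assumes objdump's usual tab-separated layout (`address:`, hex bytes, then mnemonic and operands), and the function name `tally_rex_w` is hypothetical.

```python
from collections import Counter


def tally_rex_w(disassembly):
    """Count (opcode byte, mnemonic) pairs for instructions starting with 48."""
    counts = Counter()
    for line in disassembly.splitlines():
        fields = line.split("\t")
        # Instruction lines have three tab-separated fields; section headers
        # and continuation lines (extra bytes, no mnemonic) have fewer.
        if len(fields) < 3 or not fields[0].strip().endswith(":"):
            continue
        raw = fields[1].split()
        words = fields[2].split()
        if len(raw) >= 2 and raw[0] == "48" and words:
            counts[(raw[1], words[0])] += 1
    return counts


sample = "\n".join([
    "    4000:\t48 83 ec 08          \tsub    $0x8,%rsp",
    "    400b:\t48 85 c0             \ttest   %rax,%rax",
    "    44b6:\t68 48 00 00 00       \tpush   $0x48",
    "    4751:\t48 89 f3             \tmov    %rsi,%rbx",
    "    4764:\t48 89 44 24 58       \tmov    %rax,0x58(%rsp)",
])
for (opcode, mnemonic), count in tally_rex_w(sample).most_common():
    print(count, "48", opcode, mnemonic)
```

Like the regular expression it imitates, this only counts a 0x48 in first position, so an instruction such as `64 48 8b 04 25 ...` (segment prefix before the REX prefix) is not included, and neither is the `push $0x48` whose 0x48 is an immediate.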

Wow that 0E as the third most common on i386 is baffling. As an opcode it's push cs (which is a 1 byte opcode). 24 is and, which is sensible, 83 is adc with a memory address and immediate operands, which is unexpected.
– Joshua, Commented Nov 5, 2023 at 23:25
@Joshua 83 is any alu r/m32, imm8, with the operation depending on the ModRM byte that follows, it's not specifically adc. 24 is also a useful SIB byte (most memory operands that have esp as the base), probably more common in that usage than as and al, imm8 (which is not a useless operation either but using 8-bit registers in that way is not super common)
– user555045, Commented Nov 6, 2023 at 4:12
Just when I didn't think Intel instruction sets and encodings could be any stupider.
– Kaz, Commented Nov 8, 2023 at 4:13
