
Because that's what some of them are doing.

> echo echo Hallo, Baby! | iconv -f utf-8 -t utf-16le > /tmp/hallo
> chmod 755 /tmp/hallo
> dash /tmp/hallo
Hallo, Baby!
> bash /tmp/hallo
/tmp/hallo: /tmp/hallo: cannot execute binary file
> (echo '#'; echo echo Hallo, Baby! | iconv -f utf-8 -t utf-16le) > /tmp/hallo
> bash /tmp/hallo
Hallo, Baby!
> mksh /tmp/hallo
Hallo, Baby!
> cat -v /tmp/hallo
#
e^@c^@h^@o^@ ^@H^@a^@l^@l^@o^@,^@ ^@B^@a^@b^@y^@!^@
^@

Is this some compatibility nuisance actually required by the standard? Because it looks quite dangerous and unexpected.

  • The standard doesn't allow NULs in scripts; see here, and here. Commented Nov 26, 2019 at 8:04
  • I also didn't understand why this is quite dangerous. Commented Nov 26, 2019 at 10:14
  • In using the phrase "NUL bytes" this question is conflating two different things. NUL is a character name. The text files in question have characters whose multiple-byte encodings contain bytes with the value zero, but those are not NUL characters in UTF16; and the text files in question contain no NUL characters at all. Better questions would be whether this behaviour is conformant in the "POSIX" locale, what locales in practice allow text files to be encoded as UTF16, and why cat -v is not showing the zero bytes after the 0x23 and 0x0A bytes in the first line. Commented Nov 26, 2019 at 15:39
  • @JdeBP no, it has nothing to do with locales, utf-16 or unicode; I've used iconv just for convenience; the shell will ignore any number of NUL bytes: (printf ec; dd if=/dev/zero count=1024; echo ho doh) | dash. The misleading [unicode] tag wasn't by me. Commented Nov 27, 2019 at 0:41
  • @oguzismail if you're searching for evil_shit in some scripts, you'll have to search for e\0*v\0*i\0*l\0*_\0*s\0*h\0*i\0*t\0* instead. And that's just one way of looking at it. Commented Nov 27, 2019 at 6:20

2 Answers


As per POSIX,

input file shall be a text file, except that line lengths shall be unlimited¹

NUL characters² in the input make it non-text, so as far as POSIX is concerned the behaviour is unspecified and sh implementations can do whatever they want (and a POSIX-compliant script must not contain NULs).

There are some shells that scan the first few bytes for 0s and refuse to run the script on the assumption that you tried to execute a non-script file by mistake.

That check is useful because the exec*p() functions, the env command, sh, find -exec... are required to call a shell to interpret a command when the system returns ENOEXEC upon execve(). So, if you try to execute a command built for the wrong architecture, it's better to get a cannot execute binary file error from your shell than to have the shell try to make sense of it as a shell script.
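The fallback is easy to observe; the /tmp/noshebang path and echoed message below are just for illustration:

```shell
# A plain text file with no '#!' line: the kernel's execve() fails with
# ENOEXEC, and the invoking shell falls back to interpreting it with sh.
printf 'echo hello from the fallback\n' > /tmp/noshebang
chmod +x /tmp/noshebang
/tmp/noshebang
```

Adding a '#!' line would instead make the kernel execute the named interpreter directly, without relying on the ENOEXEC fallback.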

That is allowed by POSIX:

If the executable file is not a text file, the shell may bypass this command execution.

Which in the next revision of the standard will be changed to:

The shell may apply a heuristic check to determine if the file to be executed could be a script and may bypass this command execution if it determines that the file cannot be a script. In this case, it shall write an error message, and shall return an exit status of 126.
Note: A common heuristic for rejecting files that cannot be a script is locating a NUL byte prior to a <newline> byte within a fixed-length prefix of the file. Since sh is required to accept input files with unlimited line lengths, the heuristic check cannot be based on line length.
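bash already implements such a heuristic, including the 126 exit status; a quick sketch (file path hypothetical):

```shell
# A NUL byte before any newline in the first line trips bash's
# binary-file heuristic: it refuses to interpret the file and exits 126.
printf '\0not a script\n' > /tmp/notascript
bash /tmp/notascript
echo "exit status: $?"
```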

That behaviour can get in the way of self-extracting shell archives, though, which consist of a shell header followed by binary data¹.

The zsh shell supports NUL in its input, though note that NULs can't be passed in the arguments of execve(), so you can only use them in the arguments or names of builtin commands or functions:

$ printf '\0() echo zero; \0\necho \0\n' | zsh | hd
00000000  7a 65 72 6f 0a 00 0a                              |zero...|
00000007

(here defining and calling a function with NUL as its name and passing a NUL character as argument to the builtin echo command).

Some shells will strip them, which is also a sensible thing to do, as NULs are sometimes used as padding. Terminals, for instance, ignore them (NULs were sometimes sent to terminals to give them time to process complex control sequences, such as a carriage return (literally)). Holes in files appear as being filled with NULs, and so on.
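That stripping can be observed without iconv by splicing NULs into the middle of a command word, a variant of the example from the question (the '#' first line keeps bash's binary-file check out of the way):

```shell
# NUL bytes inside the word "echo" are silently skipped when the
# script is read, so the command still runs.
{ printf '#\n'; printf 'ec\0\0ho Hallo\n'; } > /tmp/nulword
sh /tmp/nulword
```

As in the question's transcript, dash, bash and mksh all print Hallo here, the NUL bytes being ignored.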

Note that non-text is not limited to NUL bytes. It also includes sequences of bytes that don't form valid characters in the locale. For instance, the 0xc1 byte value cannot occur in UTF-8 encoded text. So in locales using UTF-8 as the character encoding, a file that contains such a byte is not a valid text file and therefore not a valid sh script³.

In practice, yash is the only shell I know that will complain about such invalid input.
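As a sketch, such a non-text file can be built without any NUL byte at all (path hypothetical); most shells run it regardless, while yash in a UTF-8 locale reports an error:

```shell
# 0xc1 (octal 301) can never occur in well-formed UTF-8, so in a UTF-8
# locale this file is not a text file even though it contains no NULs.
printf 'echo hi \301\n' > /tmp/invalid-utf8
od -An -tx1 /tmp/invalid-utf8
```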


¹ In the next revision of the standard, it is going to change to

The input file may be of any type, but the initial portion of the file intended to be parsed according to the shell grammar (XREF to XSH 2.10.2 Shell Grammar Rules) shall consist of characters and shall not contain the NUL character. The shell shall not enforce any line length limits.

explicitly requiring shells to support input that starts with a syntactically valid section without NUL bytes, even if the rest contains NULs, to account for self-extracting archives.

² and characters are meant to be decoded as per the locale's character encoding (see the output of locale charmap); on POSIX systems, the NUL character (whose encoding is always byte 0) is the only character whose encoding contains the byte 0. In other words, UTF-16 is not among the character encodings that can be used in a POSIX locale.

³ There is however the question of the locale changing within the script (like when the LANG/LC_CTYPE/LC_ALL/LOCPATH variables are assigned) and at which point the change takes effect for the shell interpreting the input.

  • Hmm, I wonder why it's 'NUL byte prior to a newline'? Unless it's exactly for self-extracting files, which probably have a newline first and the binary data only after that. Commented Nov 26, 2019 at 12:11
  • @ilkkachu, yes see edit. Commented Nov 26, 2019 at 13:34
  • See my answer, the behavior is not related to POSIX but rather to an implementation detail. Commented Nov 26, 2019 at 14:20
  • @JdeBP, on POSIX systems no system locale can use a charset where characters other than NUL have byte 0 in their encoding, UTF-16 cannot be used as a POSIX locale charset. So here, byte 0 or NUL character is the same thing, though I agree that the mention of UTF16 is bringing some confusion in this Q&A. Commented Nov 26, 2019 at 16:50
  • @JdeBP, pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/… requires NUL to be encoded as a single byte 0, and all characters from the portable charset to be encoded as a positive char value. POSIX bytes are 8 bit (pubs.opengroup.org/onlinepubs/9699919799.2018edition/basedefs/…). That precludes UTF-16. There may not be explicit text that says some other multibyte characters can't contain 0, but such a charset would be impractical as for instance, you couldn't pass strings encoded in that charset to execve() or most of the Unix API. Commented Nov 27, 2019 at 9:21

The reason for this behavior is a bit complex...

First, modern shells include a check for potentially binary files (files that contain NUL bytes), but this check only examines the first line of the file. This is why the '#' in the first line changes the behavior. The historical Bourne shell does not have that binary check and does not even need the '#' to behave the way you describe.

Second, the specific method the Bourne shell uses to support multi-byte characters via mbtowc() simply skips all NUL bytes: mbtowc() returns a character length of 0 for a NUL byte, which causes the read loop to retry with the next character.

The Bourne Shell introduced this kind of code around 1988 and it may be that other shells copied the behavior.

  • That can't apply to dash though as dash is not multi-byte aware. Commented Nov 26, 2019 at 14:26
  • You are correct, but this is the reason why Bourne Shell and ksh88 work this way. Commented Nov 26, 2019 at 14:41
  • What we did intend with the new wording in POSIX is to permit binary content in a shell script in order to be able to implement self extracting scripts that contain e.g. a compressed TAR archive at the end. Commented Nov 26, 2019 at 14:44
  • ksh93 may check more than the first line, like when the first line contains an unterminated statement (like a line containing (). Note that some more modern shells like fish, es, zsh don't do that check (zsh can work with NULs) Commented Nov 26, 2019 at 14:44
  • FWIW, it seems true that this exact behaviour was introduced by the original bourne shell (the pre-bourne shell won't allow NUL bytes within words; I have just tried both with the apout pdp11 user-land simulator). It's not at all clear that it was intentional, though ;-) Commented Nov 27, 2019 at 1:52
