
There seems to be limited or inconsistent support for unusual but legal characters in zsh (and sh, bash) shell variable names on macOS. Is there any way to get full, or at least better, support? Perhaps this would be a better fit for https://apple.stackexchange.com, but I'm not sure.

This question shows support for

# unicode value
BANANAS=バナナス

# unicode name
バナナス=BANANAS

on some Linux systems, but in zsh on macOS the assignment after # unicode name fails:

zsh: command not found: バナナス=BANANAS

However, the following works with no errors:

# other non ascii chars
çølór=purple

# command names
alias 말=echo

バナナス() {
  말 "turn $1 into a banana"
}

From a program such as a Python script, I can also create environment variables that I cannot create directly in the shell:

import os
os.environ['バナナス'] = 'BANANAS'
print(f"$バナナス={os.environ['バナナス']}")
# $バナナス=BANANAS

Environment

key              val
OS               macOS 10.15.7
zsh --version    zsh 5.7.1 (x86_64-apple-darwin19.0), later upgraded to zsh 5.9 (x86_64-apple-darwin19.6.0)
$LANG            ko_KR.UTF-8
setopt           combiningchars interactive login monitor shinstdin zle
clang --version  11.0.0 (clang-1100.0.33.17), later updated to 12.0.0 (clang-1200.0.32.29); Target: x86_64-apple-darwin19.6.0
  • I'm using en_GB.UTF-8 on 26.0.1 (25A362) (zsh: 5.9 (arm-apple-darwin22.1.0)) and バナナス=BANANAS, echo $バナナス works fine. Commented yesterday
  • I tried upgrading shell to zsh 5.9 (x86_64-apple-darwin19.6.0), but still seeing same behavior. Commented yesterday
  • It would be helpful to know the oldest macOS version (corresponding to the oldest libc version) on which wide characters like バ are correctly classified by iswalpha. I'm guessing that can be answered by looking at source files near libc/include/_ctype.h. Commented 2 hours ago

1 Answer


Zsh variable names have to be made of alphanumeric characters (or underscores) only, and the first one can't be an ASCII digit, which is reserved for positional parameters. When the posixidentifiers option is enabled (as in sh or ksh emulation), that's restricted to ASCII ones.

So you need a locale where iswalpha() returns true for バ and iswalnum() returns true for ナ and ス.

Those are functions from the C library zsh is linked against (generally the system's one) that use character classification information from the system's locale (as determined by the LC_ALL, LC_CTYPE and LANG environment variables).

If [[ 'バ' = [[:alpha:]] ]] or print -r -- 'バ' | grep -xq '[[:alpha:]]' doesn't succeed in your locale, then you can't use that character in a variable name.
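If you would rather ask the C library directly without compiling anything, here is a rough Python sketch using ctypes. It assumes a Linux or macOS system where ctypes.CDLL(None) exposes the libc symbols, and it assumes that wchar_t values are Unicode code points in the active locale (true on glibc, and as far as I know for UTF-8 locales on macOS). The helper name libc_classify is made up for this example.

```python
import ctypes
import locale


def libc_classify(text):
    """Return (char, iswalpha, iswalnum) for each character, per the current locale.

    Assumption: wchar_t values equal Unicode code points in the active locale.
    """
    try:
        # Pick up LC_ALL / LC_CTYPE / LANG from the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # unsupported locale name: keep whatever locale is already active

    libc = ctypes.CDLL(None)  # symbols already loaded into this process
    for fn in (libc.iswalpha, libc.iswalnum):
        fn.argtypes = [ctypes.c_uint]
        fn.restype = ctypes.c_int

    return [
        (ch, bool(libc.iswalpha(ord(ch))), bool(libc.iswalnum(ord(ch))))
        for ch in text
    ]


if __name__ == "__main__":
    # The result for the katakana depends on your system's locale data.
    for ch, is_alpha, is_alnum in libc_classify("バナナス"):
        print(ch, "alpha" if is_alpha else "not alpha",
              "alnum" if is_alnum else "not alnum")
```

Run it with different LC_ALL values to compare locales; if any character comes back as not alnum, that name cannot be a zsh variable in that locale.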

On a GNU system:

$ print -r -- 'バナナス' | LC_ALL=C.UTF-8 grep -o '[[:alpha:]]'
バ
ナ
ナ
ス

They're all classified as alpha (a subset of alnum), even in the C.UTF-8 locale.

Note that alpha doesn't imply only letters in alphabetical scripts. Per ISO/IEC TR 14652 at least, that's characters to be classified as used to spell out the words for natural languages; such as letters, syllabic or ideographic characters.

So バ (U+30D0 KATAKANA LETTER BA) should be classified as alpha. It is definitely classified as a letter by Unicode.
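The Unicode side of that claim is easy to check with Python's unicodedata module. Note that this reads Python's bundled Unicode database, not the system locale, so it only tells you what Unicode says, not what your libc says. The describe helper is just an illustrative name.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


if __name__ == "__main__":
    for code_point, name, category in describe("バナナス"):
        # Categories starting with "L" are letters; "Lo" means "Letter, other".
        print(code_point, name, category)
```

All four characters come out in general category Lo, which is a letter category.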

Note that on most systems, environment variable names (as opposed to the variable names of many languages, including shells) can contain any sequence of bytes except for 0 and the encoding of =. Beware, though, that some tools, including some shells, can remove the ones they don't like.

For instance, mksh removes all those it can't map to shell variables, which for that shell are limited to ASCII alnums and underscores. It will even remove bash's exported functions, which since Shellshock have names like BASH_FUNC_funcname%%.

So, in general, it's a bad idea to export shell variables to the environment if their names contain characters other than ASCII letters, digits and underscores.
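If you want to audit an environment for names that such tools might drop, a small sketch along these lines could help. The pattern encodes the conservative "ASCII letters, digits and underscores, not starting with a digit" rule from the advice above; the function names are hypothetical.

```python
import os
import re

# Conservative rule: ASCII letters, digits and underscores, not starting with a digit.
PORTABLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_portable_env_name(name):
    """True if the name sticks to the ASCII-only convention."""
    return PORTABLE_NAME.fullmatch(name) is not None


def non_portable_names(environ=None):
    """List the environment variable names that break the convention."""
    environ = os.environ if environ is None else environ
    return sorted(name for name in environ if not is_portable_env_name(name))


if __name__ == "__main__":
    for name in non_portable_names():
        print("may be dropped by some shells:", name)
```

Both バナナス and BASH_FUNC_funcname%% would be flagged by this check, while BANANAS would not.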

Also, while characters in the ASCII set (Unicode characters U+0000 to U+007F) have an encoding that is invariant across locales on most systems, that's not the case for the other ones (which is what I think you meant by Unicode characters), so you may find that if your script contains:

バナナス=BANANAS

It may be treated as a variable assignment in one locale but invoke a バナナス=BANANAS command in another.

So I would also advise not to use variable names with non-ASCII characters, even if you don't export them to the environment.
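To make the locale dependence concrete, here is a small Python sketch. It shows that the ASCII name has the same bytes in several encodings while the katakana name does not, and that the UTF-8 bytes of the script, when read as ISO-8859-1, include characters that are not alphanumeric. Python's str.isalnum uses the Unicode database rather than a real ISO-8859-1 locale, so treat the last part as an approximation of what a shell in such a locale would see.

```python
def encoded_forms(text, encodings=("utf-8", "euc_jp", "shift_jis")):
    """Encode the same text in several character encodings."""
    return {enc: text.encode(enc) for enc in encodings}


def all_alnum_when_read_as(raw, encoding):
    """Decode raw bytes with the given encoding and test every character."""
    return all(ch.isalnum() for ch in raw.decode(encoding))


if __name__ == "__main__":
    for name in ("BANANAS", "バナナス"):
        for enc, raw in encoded_forms(name).items():
            print(f"{name!r} in {enc}: {raw.hex(' ')}")

    script_bytes = "バナナス".encode("utf-8")  # what a UTF-8 script file contains
    print("UTF-8 view is all alnum:  ", all_alnum_when_read_as(script_bytes, "utf-8"))
    print("Latin-1 view is all alnum:", all_alnum_when_read_as(script_bytes, "latin-1"))
```

In the Latin-1 view, bytes such as 0x83 map to control characters, so the word before = no longer looks like an identifier and the line would be parsed as a command.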

For reference, in the rc shell, variables can have any name (you can even assign to the one with the empty name in the original Plan 9 implementation), and they're all exported to the environment.

In the original one, they're exported as-is (which causes problems at least on Unix for the ones whose name contains =) while with the public domain clone by Byron Rakitzis and derivatives, they're encoded there using ASCII only alnums and underscores:

; '++' = zzz
; 'バナナス' = zzz
; env | grep zzz
__2b__2b=zzz
__e3__83__90__e3__83__8a__e3__83__8a__e3__82__b9=zzz

Which of course only other instances of rc or derivatives executed in that environment decode back into the original variable names.
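For illustration, the encoding visible in that env output can be mimicked in a few lines of Python. This is a guess reconstructed only from the two examples above (each byte that is not an ASCII alnum or underscore becomes __ followed by two lowercase hex digits); it is not rc's actual code, and how rc treats names that already contain __ is not covered here.

```python
import re


def rc_style_encode(name):
    """Encode a variable name using only ASCII alnums and underscores."""
    out = []
    for byte in name.encode("utf-8"):
        ch = chr(byte)
        if byte < 128 and (ch.isalnum() or ch == "_"):
            out.append(ch)
        else:
            out.append(f"__{byte:02x}")
    return "".join(out)


def rc_style_decode(encoded):
    """Reverse rc_style_encode (ambiguous if the original name contained '__')."""
    raw = re.sub(
        rb"__([0-9a-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        encoded.encode("ascii"),
    )
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(rc_style_encode("++"))  # __2b__2b
    print(rc_style_encode("バナナス"))
```

The second call produces the same __e3__83__90... string as in the env output above.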

Functions are another matter. Some shells put the same restrictions on function names as on variable names (when functions were added to the Bourne shell, they shared the same namespace, so you couldn't have both a variable and a function by the same name), and some, like bash, allow a few extra characters, but those are rather confusing and unnecessary restrictions.

Functions share the namespace of command arguments, so it would seem normal that they can have the same values as those. In zsh, a function name can be any sequence of bytes, including the empty one and including ones that contain 0 (which zsh allows in command arguments, though that won't work for external commands because of a limitation of the execve() system call), regardless of whether the bytes form valid characters in any locale.

$ ''() echo empty
$ $'\0'() echo NUL
$ $'\xde\xad\xbe\xef'() echo Dead Beef
$ $'\xDE\xAD\xBE\xEF'
Dead Beef
$ $'\u0000'
NUL
$ ""
empty
  • Since char classifier methods like iswalpha are defined in the standard C library, perhaps I can pursue updating my C compiler version? Wondering whether updating mac cli dev tools with xcode-select could achieve this. Commented 23 hours ago
  • @owengall, the libc and the compiler are generally not tightly coupled: no matter which compiler you use, it generally links against the libc provided by the operating system vendor. (There are exceptions and nuances, but they don't much apply on macOS.) Commented 22 hours ago
  • Indeed, no change after updating C compiler to 12.0.0 (clang-1200.0.32.29). Commented 2 hours ago
  • As of now, this doesn't seem feasible for my OS version, since zsh is defining valid identifier chars with C library methods like iswalpha, which are built into OS, and I doubt another locale exists with better support than ko_KR.UTF-8 or ja_JP.UTF-8 for these wide characters. I plan to close as resolved after running a compiled c/cpp program that calls iswalpha directly. Commented 2 hours ago
  • @owengall, just try print -r -- 'バナナス' | grep -x '[[:alpha:]][[:alnum:]_]*', no need to compile a C program. grep should also use the same locale information (even if not directly using iswalpha()). Commented 2 hours ago
