4

Short version: I am at a bash command line. In some directory, I can ls various files and see their names. I would like to get something like the ASCII character code for each character in a file's name.

For example, if I see the file name file 1.txt, I would like Linux to report the internal representation of each character (e.g. the space is represented by this code) so there is no ambiguity about what is being shown.

Desired output:

> lsWithCodes
file[ASCII/UTF code for space, or similar].txt

Longer (not essential but perhaps entertaining) explanation: If I run ls, I see two subdirectories:

subdirectory 1/
subdirectory 2/

Both subdirectory names appear to contain spaces, but I suspect one of them is a different character. When I autocomplete, it completes up to subdirectory, not subdirectory . When I autocomplete from subdirectory\ , the only completion is subdirectory 1. In a program with a GUI, I can select and access subdirectory 2, but cd subdirectory\ 2/ returns an error that the directory does not exist.

So I suspect I am seeing something that appears exactly like a space but is not. I would like a way to determine exactly what character it is so I can access the directory in scripts.

  • This is usually something one could use od or hexdump for, unless you have some restrictions on what format you want the output in. Does the question include how to convert the numeric representation of each character into a string as well? Commented Oct 17 at 16:16
  • Thank you. I did not think ahead that far, but I think I can handle it. Or, at least, it would be a separate question someone has already asked. Commented Oct 17 at 16:31
  • please update the question with the expected output for your sample inputs (file names, dir names); also, do you need to worry about multi-byte characters and if so then consider adding an example to the question (for both sample input and expected output); without expected outputs we're left wondering about the format of the desired output (octal? hex? something else?) Commented Oct 17 at 16:55
  • Thanks. I've updated the question with desired output. I do not think I'm sophisticated enough to answer the rest of your question. I'm not sure what multi-byte characters are. As my response to the accepted answer says, I found out my problem was a no break space. Commented Oct 17 at 17:50
  • Good question... I've had similar questions, and I learned something here! Commented Oct 17 at 19:39

3 Answers

3

If you have a recent enough coreutils, you can use ls with the C locale:

% export LC_ALL=C.UTF-8
% ls -b
foo bar
% export LC_ALL=C
% ls -b
foo\343\200\200bar
% ls --quoting-style=c
"foo\343\200\200bar"

The above example was from an Arch Linux system (ls (GNU coreutils) 9.7).
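Where no suitable ls is at hand, the od approach suggested in the comments gives a similar byte-level view. The following is only a sketch: the function name name_bytes_hex is made up for illustration, and it prints hexadecimal rather than the octal that ls -b uses.

```shell
# Print every byte of a string as two-digit hex, separated by single spaces.
# od reads raw bytes, so the result does not depend on the current locale.
name_bytes_hex() {
  # The unquoted expansion deliberately splits od's output into one word
  # per byte; "$*" then joins the words with single spaces.
  set -- $(printf '%s' "$1" | od -An -v -tx1)
  printf '%s\n' "$*"
}

# The name is built with printf octal escapes so that the no-break space
# is unambiguous in this listing.
name_bytes_hex "$(printf 'foo\302\240bar')"
# prints: 66 6f 6f c2 a0 62 61 72
```

The c2 a0 pair is the same two bytes that ls -b would show as \302\240.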

Similarly, on macOS Tahoe 26.0.1 (25A362):

% /bin/ls -B
foo bar
% export LC_ALL=C
% /bin/ls -B
foo\343\200\200bar

Again on the Arch Linux system, using the %q format of various printf implementations to get shell-friendly representations:

% zsh -c 'printf "%q\n" foo*'
foo$'\302'$'\240'bar
% bash -c 'printf "%q\n" foo*'
$'foo\302\240bar'
% /bin/printf "%q\n" foo*
'foo'$'\302\240''bar'
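Once the bytes are known, a script can address the directory by spelling them out. The sketch below assumes the name from the question (subdirectory, a no-break space, then 2) and works in a throwaway directory from mktemp; the function names nbsp_name and demo_nbsp_dir are hypothetical.

```shell
# Emit the problematic name; \302\240 is the UTF-8 encoding of U+00A0.
# In bash or zsh the literal $'subdirectory\302\2402' is equivalent.
nbsp_name() {
  printf 'subdirectory\302\2402'
}

# Create the directory in a scratch location, show that the ASCII-space
# spelling does not match, then enter it by its exact byte sequence.
# The body runs in a subshell, so the caller's directory is untouched.
demo_nbsp_dir() (
  scratch=$(mktemp -d) || exit 1
  mkdir -- "$scratch/$(nbsp_name)" || { rm -rf -- "$scratch"; exit 1; }
  [ -d "$scratch/subdirectory 2" ] || echo 'ASCII-space name: not found'
  (cd -- "$scratch/$(nbsp_name)" && echo 'no-break-space name: entered')
  status=$?
  rm -rf -- "$scratch"
  exit "$status"
)

demo_nbsp_dir
```

Building the name with printf keeps the sketch usable in a plain POSIX sh, where $'...' may not be available.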
  • This is perfect, thank you! Now I can see the character is \302\240, which I understand to be a no break space. Commented Oct 17 at 16:32
  • Ah. Depending on the shell you use for your script, that could be written as foo$'\302\240'bar. printf "%q\n" subdirectory* might help there. Commented Oct 17 at 16:41
  • Using foo$'\302\240'bar did the job for me. Thanks again! Commented Oct 17 at 17:25
  • Note you don't need the -b flag with GNU ls. You can just set LC_ALL=C and it will output the shell representation by default. I.e. 'foo'$'\302\240''bar' in your example. The idea is you can always copy and paste the output from ls, and with LC_ALL=C it will be the most compatible output Commented Oct 18 at 11:39
1

To get the Unicode code point for characters other than the graphical ASCII ones (U+0021 (!) to U+007E (~); that is ASCII characters other than space and control ones), you could pipe the output of find -print0, or printf '%s\0' * or ls --zero to something like:

perl -Mcharnames=full -Mopen=locale -l -0pe '
   s/[^!-~]/sprintf "<U+%04X %s>", ord($&), charnames::viacode(ord($&))/ge'

Example:

$ ls | cat
Stéphane
Stéphane
a
b
a b 
abc
cba
cb​a
foo bar
‮abc

What's going on there? Using LC_ALL=C ls -1b gives you the byte values of the encoding of those characters:

$ LC_ALL=C ls -1b
Ste\314\201phane
St\303\251phane
a\nb
a\ b\
abc
cba
cb\342\200\213a
foo\343\200\200bar
\342\200\256abc

But unless you can decode UTF-8 in your head, that doesn't really help you figure out what's going on.

$ ls --zero | perl -Mcharnames=full -Mopen=locale -l -0pe 's/[^!-~]/sprintf "<U+%04X %s>", ord($&), charnames::viacode(ord($&))/ge'
Ste<U+0301 COMBINING ACUTE ACCENT>phane
St<U+00E9 LATIN SMALL LETTER E WITH ACUTE>phane
a<U+000A LINE FEED>b
a<U+0020 SPACE>b<U+0020 SPACE>
abc
cba
cb<U+200B ZERO WIDTH SPACE>a
foo<U+3000 IDEOGRAPHIC SPACE>bar
<U+202E RIGHT-TO-LEFT OVERRIDE>abc

That explains:

  • the two Stéphanes: one with a pre-composed é, one with a decomposed é (e followed by a combining acute accent).
  • the 3 cbas: a plain one, one with a zero width space, and one that is abc after a right-to-left override.
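If perl or its charnames module is not available, the UTF-8 decoding step itself can be sketched with od and shell arithmetic. This is an illustration rather than a replacement for the perl command: the function name reveal_codepoints is made up, it prints code points without character names, and it assumes the input is well-formed UTF-8 whatever the locale.

```shell
# Replace every character outside the printable ASCII range !..~ with its
# code point written as <U+XXXX>. Assumes well-formed UTF-8 input;
# malformed sequences are not detected.
reveal_codepoints() (
  out= cp=0 need=0
  # od prints each byte as a decimal number; the unquoted expansion
  # splits that output into one word per byte.
  for b in $(printf '%s' "$1" | od -An -v -tu1); do
    if [ "$need" -gt 0 ]; then            # continuation byte 10xxxxxx
      cp=$(( (cp << 6) | (b & 63) ))
      need=$((need - 1))
    elif [ "$b" -lt 128 ]; then           # ASCII byte 0xxxxxxx
      cp=$b
    elif [ "$b" -ge 240 ]; then           # lead byte 11110xxx, 3 more follow
      cp=$((b & 7)); need=3
    elif [ "$b" -ge 224 ]; then           # lead byte 1110xxxx, 2 more follow
      cp=$((b & 15)); need=2
    else                                  # lead byte 110xxxxx, 1 more follows
      cp=$((b & 31)); need=1
    fi
    [ "$need" -eq 0 ] || continue         # character not complete yet
    if [ "$cp" -ge 33 ] && [ "$cp" -le 126 ]; then
      # Printable ASCII: turn the number back into the character.
      out=$out$(printf "\\$(printf '%03o' "$cp")")
    else
      out=$out$(printf '<U+%04X>' "$cp")
    fi
  done
  printf '%s\n' "$out"
)

reveal_codepoints "$(printf 'foo\343\200\200bar')"
# prints: foo<U+3000>bar
```

For \343\200\200 the arithmetic is (0xE3 & 0x0F) = 3, then 3 × 64 + 0 = 192, then 192 × 64 + 0 = 12288, which is 0x3000.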

With the zsh shell (and the extendedglob and cbases options enabled), you can reveal the wide character value¹ of characters other than !..~ in a string with something like:

${string//(#m)[^!-~]/<$(([#16] #MATCH))>}

For instance with a helper function:

reveal() {
  set -o localoptions -o extendedglob -o cbases
  print -rC1 -- ${@//(#m)[^!-~]/<$(([#16] #MATCH))>}
}
$ reveal *
Ste<0x301>phane
St<0xE9>phane
a<0xA>b
a<0x20>b<0x20>
abc
cba
cb<0x200B>a
foo<0x3000>bar
<0x202E>abc

In zsh, #var in an arithmetic expression yields the code of the first character in the value of $var. See also ##x for the character code of the literal x character. [#16] sets the output base, here 16 for hexadecimal.

$ echo $(( ##€ ))
8364
$ echo $(( [#16] ##€ ))
16#20AC
$ echo $(( [##16] ##€ ))
20AC
$ set -o cbases
$ echo $(( [#16] ##€ ))
0x20AC

The uconv utility from the International Components for Unicode (ICU) (in the icu-devtools package on Debian and derivatives) can also be useful to give various representations of Unicode characters, including:

  • Unicode's U+10FFFF (with Any-Hex/Unicode transliterator),
  • C's (and zsh's and now most shells in their $'...') \u00E9/\U0010FFFF (Any-Hex/C)
  • perl's \x{10FFFF} (Any-Hex/perl) or \N{LATIN SMALL LETTER E WITH ACUTE} (Any-Name).
  • java/JSON's \uDBFF\uDFFF (for U+10FFFF; Any-Hex/Java)
  • XML's &#x10FFFF; (Any-Hex/XML) or &#1114111; (Any-Hex/XML10).

(the Any- prefix can be omitted).

Here, to apply to characters other than !..~ on a NUL-delimited list of file paths (also translating NUL to NL and \ to \\):

$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Name($1)'
a\N{<control-000A>}b
a\N{SPACE}b\N{SPACE}
\N{RIGHT-TO-LEFT OVERRIDE}abc
abc
cb\N{ZERO WIDTH SPACE}a
cba
foo\N{IDEOGRAPHIC SPACE}bar
Ste\N{COMBINING ACUTE ACCENT}phane
St\N{LATIN SMALL LETTER E WITH ACUTE}phane

(which you could use in perl).

$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Hex/C($1)'
a\u000Ab
a\u0020b\u0020
\u202Eabc
abc
cb\u200Ba
cba
foo\u3000bar
Ste\u0301phane
St\u00E9phane

Which you could use inside $'...' in many shells. Or pipe to:

sed "s/'/\\\\'/g; s/.*/\$'&'/"

To get those $'...' strings directly.
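To see what that sed step does on its own, it can be fed a couple of fixed lines without involving uconv. The wrapper name to_ansi_c_quoted is hypothetical; the sed program is the one shown above.

```shell
# Wrap each input line in $'...', escaping embedded single quotes first.
to_ansi_c_quoted() {
  sed "s/'/\\\\'/g; s/.*/\$'&'/"
}

# %s prints its arguments verbatim, so \u3000 stays as six characters.
printf '%s\n' 'foo\u3000bar' "it's" | to_ansi_c_quoted
# prints:
#   $'foo\u3000bar'
#   $'it\'s'
```

The first substitution matters because a bare single quote inside the name would otherwise end the $'...' string early.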

Example:

$ ls -1
'subdir 1'/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Hex/C($1)' | sed "s/'/\\\\'/g; s/.*/\$'&'/"
$'subdir\u00201'
$'subdir\u20001'
$'subdir\u20011'
$'subdir\u20021'
$'subdir\u20031'
$'subdir\u20041'
$'subdir\u20051'
$'subdir\u20061'
$'subdir\u20081'
$'subdir\u20091'
$'subdir\u200A1'
$'subdir\u205F1'
$ ls -ld $'subdir\u20041'
drwxrwxr-x 1 chazelas chazelas 0 Oct 19 11:45 subdir 1/

Copy-pasting the one with the U+2004 THREE-PER-EM SPACE character:

$ LC_ALL=C ls -ld $'subdir\u20041'
drwxrwxr-x 1 chazelas chazelas 0 Oct 19 11:45 'subdir'$'\342\200\204''1'/

GNU ls in the C locale shows the bytes of that character's UTF-8 encoding as octal escapes: they are outside the ASCII set, so the C locale does not consider them printable.
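Without GNU ls, a comparable octal view could be approximated with od. This is a rough sketch under a made-up name, octal_escape. It is not identical to ls -b: it writes a space as \040 and a backslash as \134, and it expects a shell that reads a leading 0 in arithmetic as octal, as sh and bash do.

```shell
# Rewrite a string so that every byte outside !..~, and the backslash
# itself, appears as a three-digit octal escape \ooo.
octal_escape() (
  out=
  # od prints each byte as three octal digits; the unquoted expansion
  # splits that output into one word per byte.
  for o in $(printf '%s' "$1" | od -An -v -to1); do
    d=$((0$o))                      # leading 0 makes the shell read octal
    if [ "$d" -ge 33 ] && [ "$d" -le 126 ] && [ "$d" -ne 92 ]; then
      out=$out$(printf "\\$o")      # printable ASCII: emit the character
    else
      out=$out\\$o                  # anything else: keep the escape
    fi
  done
  printf '%s\n' "$out"
)

octal_escape "$(printf 'subdir\342\200\2041')"
# prints: subdir\342\200\2041
```

The result can be pasted between $' and ' in shells that support that quoting.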

Note that while the perl or zsh approaches will decode the input as per the locale's character encoding, uconv assumes UTF-8 encoding regardless of the locale (though you can specify a different encoding with -f).

If there are bytes that can't be decoded into characters, perl handily represents them as \xHH, uconv reports errors, and zsh gives you the value of those bytes, so for instance it gives 0xE9 both for the U+00E9 character and for 0xE9 bytes that are not part of valid characters.
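To check up front whether a name decodes cleanly, one option not used above is to round-trip it through iconv, which exits with a non-zero status on an invalid sequence. A minimal sketch, assuming an iconv that knows the UTF-8 encoding name; is_valid_utf8 is a hypothetical helper.

```shell
# Succeed only if the argument is well-formed UTF-8.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# First name: é encoded as UTF-8 (\303\251).
# Second name: a lone \351 byte, which is é in ISO-8859-1 but not valid UTF-8.
for name in "$(printf 'St\303\251phane')" "$(printf 'St\351phane')"; do
  if is_valid_utf8 "$name"; then
    echo 'valid UTF-8'
  else
    echo 'not valid UTF-8'
  fi
done
```

A name that fails this check is a candidate for the byte-level views rather than the code point ones.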


¹ On GNU systems, that corresponds to the Unicode code point, but YMMV on other systems.

0

Most shells have a completion mode in which you can cycle through completions.

For instance with zsh and complist enabled:

screencast showing walking through completions

Above, pressing Tab and navigating with arrow keys.

That allows you to select each of the different instances of subdir 1, but not to see the differences between them.

In my example, those spaces are non-ASCII whitespace characters encoded in UTF-8.

In the C locale, all the bytes in the UTF-8 encoding of those end up corresponding to undefined characters, so like GNU ls, zsh completion renders them using a \ooo representation of each byte:

screencast showing walking through completions in C locale

That means we see the differences at byte level, but that doesn't really explain why they display the same.

Ideally, here, we'd want the completion to show us a description of those characters like with the uconv -x any-name of my other answer.

To do that, we could hijack the _list_files function which is otherwise used to show a long listing when the file-list zstyle is enabled:

screencast showing default _list_files behaviour

If we replace it with:

_list_files() {
  (( NUMERIC )) || return
  local -a files=(${(P)1})
  (( $#files )) || return
  listfiles=(
    ${(0)"$(
      print -rN -- ${(Q)files} |
        uconv -x '( [^!-~\0] ) > &Name($1)'
    )"}
  )
  files=( $files:gs/\\/\\\\/:gs/:/\\: )
  printf -v listfiles %s:%s ${files:^listfiles}
  zformat -a listfiles " -- " $listfiles
  listopts=(-d listfiles -l -o match)
}

Then invoking the completion with a NUMERIC argument, like with Alt+1, Tab shows something like:

screencast with modified _list_files

Where we see those space characters in those various subdir 1 directories are actually EN QUAD, EM QUAD, EN SPACE typographical spacing characters and not plain U+0020 ASCII space characters.

A minimal ~/.zshrc to achieve that would look like:

zstyle ':completion:*' format 'Completing %d'
zstyle ':completion:*' group-name ''
eval "$(dircolors -b)"
zstyle ':completion:*' list-colors ${(s.:.)LS_COLORS}
zstyle ':completion:*' menu select=2
autoload -Uz compinit
compinit -i
_list_files() {
  (( NUMERIC )) || return
  local -a files=(${(P)1})
  (( $#files )) || return
  listfiles=(
    ${(0)"$(
      print -rN -- ${(Q)files} |
        uconv -x '( [^!-~\0] ) > &Name($1)'
    )"}
  )
  files=( $files:gs/\\/\\\\/:gs/:/\\: )
  printf -v listfiles %s:%s ${files:^listfiles}
  zformat -a listfiles " -- " $listfiles
  listopts=(-d listfiles -l -o match)
}
