To get the Unicode code point of characters other than the graphical ASCII ones (U+0021 (!) to U+007E (~), that is, ASCII characters other than space and the control characters), you could pipe the output of find -print0, printf '%s\0' * or ls --zero to something like:
perl -Mcharnames=full -Mopen=locale -l -0pe '
s/[^!-~]/sprintf "<U+%04X %s>", ord($&), charnames::viacode(ord($&))/ge'
Example:
$ ls | cat
Stéphane
Stéphane
a
b
a b
abc
cba
cba
foo bar
abc
What's going on there? Using LC_ALL=C ls -1b gives you the byte values of the encoding of those characters:
$ LC_ALL=C ls -1b
Ste\314\201phane
St\303\251phane
a\nb
a\ b\
abc
cba
cb\342\200\213a
foo\343\200\200bar
\342\200\256abc
But unless you can decode UTF-8 in your head, that doesn't really help you figure out what's going on.
$ ls --zero | perl -Mcharnames=full -Mopen=locale -l -0pe 's/[^!-~]/sprintf "<U+%04X %s>", ord($&), charnames::viacode(ord($&))/ge'
Ste<U+0301 COMBINING ACUTE ACCENT>phane
St<U+00E9 LATIN SMALL LETTER E WITH ACUTE>phane
a<U+000A LINE FEED>b
a<U+0020 SPACE>b<U+0020 SPACE>
abc
cba
cb<U+200B ZERO WIDTH SPACE>a
foo<U+3000 IDEOGRAPHIC SPACE>bar
<U+202E RIGHT-TO-LEFT OVERRIDE>abc
That explains why:
- the two Stéphanes: one with pre-composed é, one with decomposed é.
- the 3 cbas: one with a zero width space, one that is abc after a right-to-left override.
With the zsh shell (and the extendedglob and cbases options enabled), you can reveal the wide character value¹ of characters other than !..~ in a string with something like:
${string//(#m)[^!-~]/<$(([#16] #MATCH))>}
For instance with a helper function:
reveal() {
set -o localoptions -o extendedglob -o cbases
print -rC1 -- ${@//(#m)[^!-~]/<$(([#16] #MATCH))>}
}
$ reveal *
Ste<0x301>phane
St<0xE9>phane
a<0xA>b
a<0x20>b<0x20>
abc
cba
cb<0x200B>a
foo<0x3000>bar
<0x202E>abc
In zsh, #var in an arithmetic expression yields the code of the first character in the value of $var. See also ##x for the character code of the literal x character. [#16] sets the output base, here 16 for hexadecimal.
$ echo $(( ##€ ))
8364
$ echo $(( [#16] ##€ ))
16#20AC
$ echo $(( [##16] ##€ ))
20AC
$ set -o cbases
$ echo $(( [#16] ##€ ))
0x20AC
The uconv utility from the International Components for Unicode (ICU) (in the icu-devtools package on Debian and derivatives) can also be useful to give various representations of Unicode characters, including:
- Unicode's U+10FFFF (with the Any-Hex/Unicode transliterator),
- C's (and zsh's, and now most shells' in their $'...') \u00E9/\U0010FFFF (Any-Hex/C),
- perl's \x{10FFFF} (Any-Hex/perl) or \N{LATIN SMALL LETTER E WITH ACUTE} (Any-Name),
- java/JSON's \uDBFF\uDFFF (for U+10FFFF; Any-Hex/Java),
- XML's &#x10FFFF; (Any-Hex/XML) or &#1114111; (Any-Hex/XML10).
(the Any- prefix can be omitted).
Here, to apply that to characters other than !..~ in a NUL-delimited list of file paths (also translating NUL to NL and \ to \\):
$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Name($1)'
a\N{<control-000A>}b
a\N{SPACE}b\N{SPACE}
\N{RIGHT-TO-LEFT OVERRIDE}abc
abc
cb\N{ZERO WIDTH SPACE}a
cba
foo\N{IDEOGRAPHIC SPACE}bar
Ste\N{COMBINING ACUTE ACCENT}phane
St\N{LATIN SMALL LETTER E WITH ACUTE}phane
(which you could use in perl).
$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Hex/C($1)'
a\u000Ab
a\u0020b\u0020
\u202Eabc
abc
cb\u200Ba
cba
foo\u3000bar
Ste\u0301phane
St\u00E9phane
That output can be used inside $'...' in many shells, or piped to:
sed "s/'/\\\\'/g; s/.*/\$'&'/"
to get those $'...' strings directly.
Example:
$ ls -1
'subdir 1'/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
subdir 1/
$ ls --zero | uconv -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Hex/C($1)' | sed "s/'/\\\\'/g; s/.*/\$'&'/"
$'subdir\u00201'
$'subdir\u20001'
$'subdir\u20011'
$'subdir\u20021'
$'subdir\u20031'
$'subdir\u20041'
$'subdir\u20051'
$'subdir\u20061'
$'subdir\u20081'
$'subdir\u20091'
$'subdir\u200A1'
$'subdir\u205F1'
$ ls -ld $'subdir\u20041'
drwxrwxr-x 1 chazelas chazelas 0 Oct 19 11:45 subdir 1/
Copy-pasting the one with the U+2004 THREE-PER-EM SPACE character:
$ LC_ALL=C ls -ld $'subdir\u20041'
drwxrwxr-x 1 chazelas chazelas 0 Oct 19 11:45 'subdir'$'\342\200\204''1'/
GNU ls in the C locale shows the bytes of that character's UTF-8 encoding in octal: they are outside the ASCII set, so they are not considered printable.
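If you do want to turn those octal bytes back into a code point without perl, zsh or uconv, here is a minimal sketch using iconv, od and awk. The codepoints function name is made up for illustration, and it assumes the input is valid UTF-8 and that your iconv knows the UTF-32BE encoding name (GNU iconv does):

```shell
# Hypothetical helper, not part of the commands above: read UTF-8 on stdin
# and print one U+XXXX code point per character.
# Assumes valid UTF-8 input and an iconv that supports UTF-32BE.
codepoints() {
  iconv -f UTF-8 -t UTF-32BE |   # every character becomes 4 big-endian bytes
    od -An -v -tx1 |             # dump those bytes as 2-digit hex, 16 per line
    awk '{
      for (i = 1; i <= NF; i += 4) {
        cp = toupper($(i) $(i+1) $(i+2) $(i+3))
        sub(/^00?0?0?/, "", cp)  # drop leading zeros, keeping at least 4 digits
        print "U+" cp
      }
    }'
}

# The three bytes GNU ls displayed as \342\200\204:
printf '\342\200\204' | codepoints
```

That prints U+2004 for the bytes above. It only gives you the numbers, not the character names, so it is mostly useful for a quick check on a system where the other tools are not installed.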
Note that while the perl or zsh approaches will decode the input as per the locale's character encoding, uconv assumes UTF-8 encoding regardless of the locale (though you can specify a different encoding with -f).
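For instance, a sketch of feeding uconv a file name encoded in ISO-8859-1 rather than UTF-8, reusing the same rules as earlier. The reveal_latin1 name is invented here, and iso-8859-1 is assumed to be an encoding name your ICU build recognises (uconv -l lists the ones it knows):

```shell
# Hypothetical wrapper: same transliteration rules as earlier, but decoding
# the input as ISO-8859-1 (-f) rather than the UTF-8 uconv otherwise assumes.
reveal_latin1() {
  uconv -f iso-8859-1 -x '\0 > \n; \\ > \\\\; ( [^!-~] ) > &Any-Hex/C($1)'
}

# Byte 0xE9 (octal 351) is é in ISO-8859-1 but is not valid UTF-8 on its own.
printf 'St\351phane\0' | reveal_latin1
```

Without -f, that lone 0xE9 byte would not decode as UTF-8, so the encoding needs to be stated explicitly for such input.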
If there are bytes that can't be decoded into characters, perl handily represents them as \xHH, uconv reports errors, and zsh gives you the value of those bytes, so for instance it gives 0xE9 both for the U+00E9 character and for 0xE9 bytes that are not part of valid characters.
¹ On GNU systems, that corresponds to the Unicode code point, but YMMV on other systems.
od or hexdump for, unless you have some restrictions on what format you want the output in. Does the question include how to convert the numeric representation of each character into a string as well?