60
votes
Accepted
Why did {} start appearing as äå in Terminal.app?
I can reproduce it with the xterm terminal emulator (version 366), if I do:
$ printf '\e[?42h\e(H'; cat chars.txt; printf '\e(B\e[?42l'
!É#$%Ü&*()_+äå
Where:
\e[?42h. Enables National ...
27
votes
Accepted
Strange character in a file
This file contains bytes C2 96, which are the UTF-8 encoding of codepoint U+0096. That codepoint is one of the C1 control characters commonly called SPA "Start of Guarded Area" (or "Protected Area"). ...
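As a worked example of the byte arithmetic behind that statement, here is a minimal sketch in POSIX shell. The helper name `decode_utf8_2byte` is hypothetical; it handles only two-byte sequences and does no validation.

```shell
# Hypothetical helper: decode a two-byte UTF-8 sequence to its code point.
# $1 is the lead byte (110xxxxx), $2 the continuation byte (10xxxxxx).
decode_utf8_2byte() {
    # Keep the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, then join them into one number.
    printf 'U+%04X\n' "$(( (($1 & 0x1F) << 6) | ($2 & 0x3F) ))"
}

decode_utf8_2byte 0xC2 0x96   # prints U+0096
```

The lead byte 0xC2 contributes the bits 00010 and the continuation byte 0x96 contributes 010110, which together give 0x96.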
26
votes
Accepted
UTF-8 characters in POSIX shell script *comments* - anything against it?
POSIX specifies how tokens should be recognised, including comments:
If the current character is a '#', it and all subsequent characters up to, but excluding, the next <newline> shall be ...
17
votes
How can I correctly decompress a ZIP archive of files with Hebrew names?
I had success with the command 7z x <source.zip>.
Version:
p7zip Version 16.02 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,[...])
Potentially relevant environment:
LANG=en_US.UTF-8
LC_ALL=...
13
votes
Can not use `cut -c` (`--characters`) with UTF-8?
colrm (part of util-linux, should be already installed on most distributions) seems to handle internationalization much better:
$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α
Beware of the ...
13
votes
Why do some characters show as squares in Chrome?
There's a better way to determine what font you're missing instead of blindly installing font packages.
For example I did the following to resolve missing fonts:
I received an email with two unknown ...
12
votes
Accepted
How to translate Unicode characters?
Both GNU and BSD sed are multibyte-aware in appropriate locales, and the y command is analogous to tr:
$ echo hello | sed -e 'y/abcdefghijklmnopqrstuvwxyz/abcdefghijklmnopqrstuvwxyz/'
hello
This ...
12
votes
Accepted
Problems with UTF-8 when attaching to a tmux session over ssh
You're checking the locale settings inside the tmux sessions, but not those that tmux itself receives.
server.ltd likely doesn't have AcceptEnv LANG LC_* in its sshd_config and/or you don't have ...
12
votes
UTF-8 characters in POSIX shell script *comments* - anything against it?
The accepted answer is fine, but let me explain the same with a slightly different angle:
POSIX is very exact and complete in its handling of character encodings. That is, any conceivable effect of ...
12
votes
Accepted
Revert filenames after they were garbled by using different encoding
Those look like file names that were initially encoded in CP866 but were incorrectly converted to UTF-8 assuming they were encoded in MAC-CYRILLIC instead.
$ echo СМП структура | iconv -t cp866 | # ...
11
votes
Accepted
View file names in hex?
Pipe the file names to od or a similar tool:
printf '%s\n' * | od -t x1 -a
$ ls
Accentué bar foo
$ printf '%s\n' * | od -t x1 -a
0000000 41 63 63 65 6e 74 75 c3 a9 0a 62 61 72 0a 66 ...
10
votes
How to print a variable that contains unprintable characters?
Various approaches to giving visual representations of strings:
POSIX
$ printf %s "$IFS" | od -vtc -to1
0000000      \t  \n  \0
        040 011 012 000
0000004
$ printf '%s\n' "$IFS" | LC_ALL=C ...
10
votes
Accepted
What's the right way to base64 encode a binary file on CentOS 7?
$ echo foo |base64
Zm9vCg==
$ echo foo |base64 |wc -c
9
Note the trailing newline in the output of base64; it's the ninth character here.
For longer input, it'll produce more than one line, as it ...
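The count of nine follows from how base64 sizes its output: every group of 3 input bytes (rounded up) becomes 4 output characters. A small sketch of that arithmetic, where the helper name `b64_len` is an assumption made for illustration:

```shell
# Hypothetical helper: number of base64 characters (newlines excluded)
# produced for an input of $1 bytes.
b64_len() {
    echo "$(( ($1 + 2) / 3 * 4 ))"
}

# "foo\n" is 4 bytes, so 8 base64 characters plus the trailing newline = 9.
b64_len 4                        # prints 8
echo "$(( $(b64_len 4) + 1 ))"   # prints 9
```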
10
votes
How is data encoded in pipes/STDOUT/STDIN?
I'll address each of your points below:
Pipes deal with binary, and are agnostic to the encoding
Correct.
Applications on each side of the pipe (including STDOUT/STDIN) should have consensus on the ...
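To illustrate the first point, this sketch pushes bytes that are not valid UTF-8 through a pipe and dumps them on the other side. The helper name `dump_hex` is hypothetical, and its `tr` call only normalizes the spacing of od's output.

```shell
# Hypothetical helper: print stdin as one run of lowercase hex digits.
dump_hex() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# 0xC3 0xA9 is "é" in UTF-8; 0xFF never occurs in valid UTF-8.
# The pipe passes all three bytes through untouched.
printf '\303\251\377' | cat | dump_hex   # prints c3a9ff
```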
10
votes
What is the difference between a byte and a character (at least *nixwise)?
POSIXly, emphasis mine:
3.87 Character
A sequence of one or more bytes representing a single graphic symbol or control code.
In practice, the exact meaning depends on the locale in effect, e.g. ...
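A quick way to see the distinction is to count the same string both ways. This is a sketch under assumptions: the helper names are hypothetical, and `C.UTF-8` is assumed to be an installed UTF-8 locale (substitute any UTF-8 locale your system has).

```shell
# "é" written with octal escapes so the script itself stays ASCII.
s=$(printf '\303\251')

# Hypothetical helpers; tr strips the padding some wc implementations add.
count_bytes() { printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '; }
count_chars() { printf '%s' "$1" | LC_ALL="${2:-C.UTF-8}" wc -m | tr -d ' '; }

count_bytes "$s"      # prints 2
count_chars "$s"      # 1 when the C.UTF-8 locale is available
count_chars "$s" C    # prints 2: in the C locale every byte is a character
```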
10
votes
What is "modifier" in locale name?
There is no single unified meaning for the modifier. For example, in the early 2000s, when parts of the EU transitioned from their own national currencies to the Euro, the @euro modifier was used to ...
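For orientation, the modifier is the part after `@` in a locale name of the general form `language[_territory][.codeset][@modifier]`. A minimal sketch that extracts it with parameter expansion; `locale_modifier` is a hypothetical helper name, and the locale names are only sample values.

```shell
# Hypothetical helper: print the @modifier of a locale name, or an
# empty line when the name has none.
locale_modifier() {
    case $1 in
        *@*) printf '%s\n' "${1#*@}" ;;   # drop everything up to the "@"
        *)   printf '\n' ;;
    esac
}

locale_modifier de_DE@euro          # prints euro
locale_modifier en_US.UTF-8         # prints an empty line
```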
9
votes
Why do some characters show as squares in Chrome?
Installing the Noto fonts from Google did it for me.
yay -S noto-fonts
Now, reload the font cache:
fc-cache -vf
9
votes
Accepted
Unexpected non-null encoding of /proc/<pid>/cmdline
at least one uses spaces for delimiters
Incorrect.
If you look at the end of the pseudo-file on FreeBSD/TrueOS, where you can encounter exactly the same behaviour with Chromium, you will find a ...
9
votes
Accepted
What is `â<80><98>` and how to avoid it?
Your distribution uses UTF-8 character encoding. This is normal for most current distributions.
What you see is the effect of UTF-8 coded characters displayed as another encoding.
Many GNU utilities ...
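As a worked example, the bytes from the title (E2 80 98) form a single three-byte UTF-8 sequence. Read one byte at a time as ISO-8859-1, the first byte 0xE2 is â and the other two are non-printable, which is why they show up as <80><98>. The sketch below decodes the sequence arithmetically; `decode_utf8_3byte` is a hypothetical helper with no validation.

```shell
# Hypothetical helper: decode a three-byte UTF-8 sequence.
# $1 is the lead byte (1110xxxx), $2 and $3 are continuation bytes.
decode_utf8_3byte() {
    printf 'U+%04X\n' \
        "$(( (($1 & 0x0F) << 12) | (($2 & 0x3F) << 6) | ($3 & 0x3F) ))"
}

decode_utf8_3byte 0xE2 0x80 0x98   # prints U+2018 (left single quotation mark)
```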
8
votes
How can I correctly decompress a ZIP archive of files with Hebrew names?
I have just had the same problem, and it turns out that my version of unzip that is available from Ubuntu repositories (UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.) can handle ...
8
votes
Accepted
Bash convert \xC3\x89 to É?
Hexadecimal numeric constants are usually represented with the 0x prefix. Character and string constants may express character codes in hexadecimal with the prefix \x followed by two hex digits.
echo -ne '\...
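A portable variant of the same conversion, as a sketch: POSIX only requires `printf` to understand octal escapes in its format string, so the hypothetical helper `hex_bytes` below converts each hex pair to octal first.

```shell
# Hypothetical helper: emit the bytes named by hex pairs given as arguments.
hex_bytes() {
    for h in "$@"; do
        # The inner printf turns the hex pair into three octal digits
        # (C3 -> 303); the outer printf then emits \303 as a single byte.
        printf "\\$(printf '%03o' "0x$h")"
    done
}

hex_bytes C3 89; echo    # shows É on a UTF-8 terminal
```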
8
votes
Generate collating order of a list of individual characters
There are several aspects to that. We need to list all the characters in the locale's charset, select the graphical ones (like your 33 to 126 ASCII ones) and sort them.
And there's also the question ...
7
votes
How to print a variable that contains unprintable characters?
Especially with IFS, you absolutely want to quote it, since otherwise it, well, turns to nothing. You did that already, so no problem there.
As for echo, it depends on the shell. Some versions of echo ...
7
votes
Accepted
How do I properly convert the file to UTF-16LE encoding without strange characters appearing in the file?
Your vim hasn't recognised the encoding, and is showing the 16-bit characters as 8-bit characters. The ^@ markers represent the higher order 8-bits, which for common Latin characters will be zero ...
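To see where the `^@` markers come from, this sketch converts a short ASCII string to UTF-16LE and dumps the bytes. It assumes an `iconv` that recognizes the name UTF-16LE (glibc and GNU libiconv both do); `utf16le_hex` is a hypothetical helper.

```shell
# Hypothetical helper: show the UTF-16LE bytes of a string as hex digits.
utf16le_hex() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n'
    echo
}

utf16le_hex Hi   # prints 48006900: each letter is followed by a zero byte
```

An editor that reads those bytes as an 8-bit encoding displays every 00 byte as `^@`.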
7
votes
What is "modifier" in locale name?
The @modifier setting specifies a variant: a minor addition to the encoding set. As an example:
European countries have long relied on ISO definitions.
Some French, for example (language fr, ...
7
votes
Accepted
Collect chars from strings and print their unicode
With perl:
perl -C -lne '
if (/=(.*)/) {$c{$_}++ for split //, $1}
END{print join ",", map {sprintf "0x%X", ord$_} sort keys %c}
' your-file
Gives:
0x42,0x46,0x61,0x63,0x64,...
7
votes
Accepted
Converting from ISO-IR-87 to UTF-8 encoding
GNU recode seems to support it:
$ recode -l | grep -i ISO-IR-87
JIS_X0208 csISO87JISX0208 ISO-IR-87 JIS0208 JISX0208.1983-0 JISX0208.1990-0 JIS_X0208-1983 JIS_X0208-1990 X0208
So:
recode ISO-IR-87.....
7
votes
Accepted
Allow unicode characters in zsh shell variable names on MacOS
Zsh variable names have to be made of alphanumeric characters only, and the first one can't be an ASCII digit, which is reserved for positional parameters. When the posixidentifiers option is enabled (...
6
votes
Accepted
How to make `less` understand codepage?
Running less as
LC_ALL=ru_RU.CP1251 less file
provided that ru_RU.CP1251 locale exists on your system (see if LC_ALL=ru_RU.CP1251 locale charmap returns CP1251) tells less that you are in that locale,...