The cut command has an option -c to select characters, as opposed to -b, which selects bytes. But -c does not seem to work in the en_US.UTF-8 locale:
The second byte gives the second ASCII character (which is encoded just the same in UTF-8):
$ printf 'ABC' | cut -b 2
B
but does not give the second of three Greek non-ASCII characters in a UTF-8 locale:
$ printf 'αβγ' | cut -b 2
�
That's alright - it's the second byte.
So we look at the second character instead:
$ printf 'αβγ' | cut -c 2
�
That looks broken.
After some experimenting, it turns out that the range 3-4 shows the second character:
$ printf 'αβγ' | cut -c 3-4
β
But that's just the same as the bytes 3 to 4:
$ printf 'αβγ' | cut -b 3-4
β
So -c does nothing more than -b for UTF-8.
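For comparison, here is roughly what I would expect cut -c 2 to print. This is only a minimal sketch in POSIX sh that assumes the input is valid UTF-8; nth_char is a hypothetical helper name of my own, not part of cut or any other tool. It treats every byte that is not a continuation byte (10xxxxxx, octal 200 to 277) as the start of a new character:

```shell
# nth_char N: print the Nth character (1-based) of the UTF-8 text on stdin.
# Hypothetical helper for illustration; assumes valid UTF-8 input.
nth_char() (
    n=$1
    idx=0
    out=''
    # od prints every input byte as a three-digit octal number.
    for byte in $(od -An -v -to1); do
        case $byte in
            2??) ;;                   # 200..277: continuation byte, same character
            *)   idx=$((idx + 1)) ;;  # any other byte starts a new character
        esac
        if [ "$idx" -gt "$n" ]; then break; fi
        # Collect the bytes of the wanted character as \0ddd escapes for printf %b.
        if [ "$idx" -eq "$n" ]; then out="$out\\0$byte"; fi
    done
    printf '%b' "$out"
)

printf 'αβγ' | nth_char 2; echo    # prints β
```

Because it looks only at the byte values, this does not depend on the locale at all, which is also why it is no substitute for a locale-aware cut -c.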
I'd suspect that the locale setup is not right for UTF-8, but by comparison, wc works as expected.
It is often used to count bytes, with option -c (--bytes).
(Note the confusing option names: -c means bytes for wc, but characters for cut.)
$ printf 'αβγ' | wc -c
6
But it can also count characters with option -m (--chars), which just works:
$ printf 'αβγ' | wc -m
3
So my configuration seems to be ok - but something is special about cut.
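To double-check that wc -m really follows the locale, the count can be compared across locales. A small sketch; chars_in is a hypothetical helper name of mine, and the second call assumes that en_US.UTF-8 is installed on the system:

```shell
# chars_in LOCALE: count the characters on stdin as wc -m sees them under LOCALE.
# Hypothetical helper for illustration; tr strips the padding some wc versions add.
chars_in() {
    env LC_ALL="$1" wc -m | tr -d ' '
}

printf 'αβγ' | chars_in C              # 6: in the C locale every byte is a character
printf 'αβγ' | chars_in en_US.UTF-8    # 3, provided that locale is installed
```

If the two numbers differ, wc is honoring LC_CTYPE, so the locale itself is usable and the difference must lie in cut.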
Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support -b and -c.
So, what's wrong? And why?
The locale setup looks right for UTF-8, as far as I can tell:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
The input, byte by byte:
$ printf 'αβγ' | hd
00000000 ce b1 ce b2 ce b3 |......|
00000006
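Reading the dump: ce is the lead byte of a two-byte sequence (110xxxxx), and b1, b2, b3 are continuation bytes (10xxxxxx), so the six bytes form three characters. A rough sketch that labels each byte accordingly, assuming valid UTF-8 input; classify_bytes is a name I made up for illustration:

```shell
# classify_bytes: print one line per input byte, saying whether the byte
# starts a character or continues the previous one.
# Hypothetical helper for illustration; assumes valid UTF-8 on stdin.
classify_bytes() (
    # od prints every input byte as two lowercase hex digits.
    for byte in $(od -An -v -tx1); do
        case $byte in
            [89ab]?) echo "$byte continuation" ;;  # 0x80..0xbf is 10xxxxxx
            *)       echo "$byte start" ;;         # ASCII or multi-byte lead byte
        esac
    done
)

printf 'αβγ' | classify_bytes
```

For the input above this labels ce as start and b1, b2, b3 as continuation, which matches cut -b 3-4 returning exactly the bytes ce b2 of β.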
-c is using the same code as -b. Did you have a look at the source code? Maybe you can find a hint what -c is actually meant for.