Can not use `cut -c` (`--characters`) with UTF-8?

Question

The cut command has an option -c that works on characters, as opposed to -b, which works on bytes. But -c does not seem to work in the en_US.UTF-8 locale:

The second byte gives the second ASCII character (which is encoded just the same in UTF-8):

$ printf 'ABC' | cut -b 2          
B

but does not give the second of three Greek non-ASCII characters in a UTF-8 locale:

$ printf 'αβγ' | cut -b 2         
�

That's alright - it's the second byte.
So we look at the second character instead:

$ printf 'αβγ' | cut -c 2 
�

That looks broken.
With some experiments, it turns out that the range 3-4 shows the second character:

$ printf 'αβγ' | cut -c 3-4
β

But that's just the same as the bytes 3 to 4:

$ printf 'αβγ' | cut -b 3-4
β

So -c does nothing more than -b for UTF-8.
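
The difference between byte positions and character positions can be mirrored outside the shell. The following is a minimal Python sketch, assuming the input is UTF-8 encoded; it only illustrates the indexing and says nothing about how any cut implementation works internally.

```python
# Illustration only: byte positions vs. character positions in UTF-8 text.
text = "αβγ"
data = text.encode("utf-8")               # the six bytes ce b1 ce b2 ce b3

# Byte 2 (1-based) is the second half of α, not a complete character.
second_byte = data[1:2]                   # b'\xb1'

# Bytes 3-4 (1-based) happen to be the complete encoding of β.
bytes_3_to_4 = data[2:4].decode("utf-8")  # 'β'

# Character 2 (1-based) is what `-c 2` was expected to select.
second_char = text[1]                     # 'β'
```

A lone byte such as `b'\xb1'` is not valid UTF-8 on its own, which is why the terminal shows a replacement character for it.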

I'd suspect that the locale setup is not right for UTF-8, but by comparison, wc works as expected.
It is often used to count bytes, with option -c (--bytes). (Note the confusing option names.)

$ printf 'αβγ' | wc -c
6

But it can also count characters with option -m (--chars), which just works:

$ printf 'αβγ' | wc -m
3
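
The two counts correspond to the length of the encoded byte string and the length of the decoded text. A small Python sketch, assuming UTF-8 input and treating one Unicode code point as one character:

```python
text = "αβγ"

byte_count = len(text.encode("utf-8"))  # 6, like `wc -c`
char_count = len(text)                  # 3, like `wc -m` in a UTF-8 locale
```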

So my configuration seems to be ok - but something is special about cut.

Maybe it does not support UTF-8 at all? But it does seem to support multi-byte characters, otherwise it would not need to support -b and -c.

So, what's wrong? And why?

The locale setup looks right for UTF-8, as far as I can tell:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

The input, byte by byte:

$ printf 'αβγ' | hd 
00000000  ce b1 ce b2 ce b3                                 |......|
00000006

Interesting! It looks like -c is using the same code as -b. Did you have a look at the source code? Maybe you can find a hint about what -c is actually meant for.
– michas, Commented Oct 23, 2014 at 6:11

Michael Homer · Accepted Answer · 2014-10-23 06:28:27Z

22

You haven't said which cut you're using, but since you've mentioned the GNU long option --characters I'll assume it's that one. In that case, note this passage from info coreutils 'cut invocation':

‘-c character-list’
‘--characters=character-list’
Select for printing only the characters in positions listed in character-list. The same as -b for now, but internationalization will change that.

(emphasis added)

For the moment, GNU cut always works in terms of single-byte "characters", so the behaviour you see is expected.

POSIX requires cut to support both the -b and -c options. GNU cut did not add them because it had working multi-byte support; it added them to avoid giving errors on POSIX-compliant input. Some other cut implementations treat -c the same way, although not FreeBSD's or OS X's, at least.

This is the historic behaviour of -c. -b was newly added to take over the byte role so that -c can work with multi-byte characters. Maybe in a few years it will work as desired consistently, although progress hasn't exactly been quick (it's been over a decade already). GNU cut doesn't even implement the -n option yet, even though it is orthogonal and intended to help the transition. There are potential compatibility problems with old scripts, which may be a concern, although I don't know definitively what the reason is.

edited Oct 23, 2014 at 6:28

answered Oct 23, 2014 at 6:16

Michael Homer


1

Good work. You'll find the same kind of comments in GNU's tr docs as well, and even tar's unless I misremember. I guess it's a big project.
– mikeserv, Commented Oct 23, 2014 at 8:42

Is there any workaround for the Unicode problem in cut? For example, is there somewhere to download the sources for a patched cut? Or would it be easier to use another utility? (The grep solution below does not work smoothly with ranges, e.g. 5-8,44-49.)
– dma_k, Commented Jan 31, 2018 at 0:11

See this 2017 article, subtitled "Random notes and pointers regarding the on-going effort to add multibyte and unicode support in GNU Coreutils": crashcourse.housegordon.org/coreutils-multibyte-support.html
– myrdd, Commented Dec 12, 2018 at 14:29

You can find some alternatives to cut -c here: superuser.com/questions/506164/…
– myrdd, Commented Dec 12, 2018 at 14:32


Skippy le Grand Gourou · Accepted Answer · 2019-03-22 14:13:07Z

13

colrm (part of util-linux, so it should already be installed on most distributions) seems to handle internationalization much better:

$ echo 'αβγ' | colrm 3
αβ
$ echo 'αβγ' | colrm 2
α

Beware of the numbering: colrm N removes columns from N onward, printing characters up to N-1.
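
The numbering can be modelled in a few lines. This is a simplified Python sketch, not util-linux code: the function name `colrm_model` is hypothetical, and it assumes every character occupies exactly one column, which may not hold for wide characters.

```python
def colrm_model(line, start, stop=None):
    """Remove 1-based columns start..stop; without stop, remove to end of line.

    Hypothetical helper for illustration; assumes one column per character.
    """
    kept_head = line[:start - 1]                     # columns before start survive
    kept_tail = "" if stop is None else line[stop:]  # columns after stop survive
    return kept_head + kept_tail
```

Under these assumptions, `colrm_model("αβγ", 3)` keeps the first two characters and `colrm_model("αβγ", 2)` keeps only the first, matching the output shown above.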

(credits)

answered Mar 22, 2019 at 14:13

Skippy le Grand Gourou


colrm doesn't seem to handle emojis well: echo '😀removethis' | colrm 2 returns nothing for me.
– frabjous, Commented Jun 13, 2022 at 14:43

@frabjous They seem to count as two characters; try echo '😀removethis' | colrm 3. ;)
– Skippy le Grand Gourou, Commented Jun 13, 2022 at 15:48

1

@SkippyleGrandGourou No, that's wrong. UTF-8, UTF-16 and UTF-32 are just different encodings of Unicode, and all of them can represent characters up to U+10FFFF. Characters outside the BMP take 4 bytes in both UTF-8 and UTF-16.
– phuclv, Commented Jul 11, 2023 at 3:51

1

@phuclv Right, comment removed. Please keep yours as it's informative (hopefully readers will understand it refers to a deleted comment and not to the answer…).
– Skippy le Grand Gourou, Commented Jul 11, 2023 at 13:30
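
The encoding sizes mentioned in the comments above can be checked directly. A short Python sketch, using the little-endian codec names so that no byte order mark is added to the counts:

```python
emoji = "\U0001F600"  # 😀, a character outside the BMP

utf8_len = len(emoji.encode("utf-8"))       # 4 bytes
utf16_len = len(emoji.encode("utf-16-le"))  # 4 bytes (a surrogate pair)
utf32_len = len(emoji.encode("utf-32-le"))  # 4 bytes
```

The byte count alone therefore does not explain why a tool would treat this emoji as two columns.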


Royce Williams · Accepted Answer · 2023-07-10 15:13:27Z

8

Since many grep implementations are multibyte-aware, you can also use grep -o to simulate some uses of cut -c.

First two characters:

$ echo Τηεοδ29 | grep -o '^..'
Τη

Last three characters:

$ echo Τηεοδ29 | grep -o '...$'
δ29

Second character:

$ echo Τηεοδ29 | grep -o '^..' | grep -o '.$'
η

Adjust the number of periods, or use {x,y} syntax, to simulate cut ranges.
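
For list-style selections such as 5-8,44-49, a character-based equivalent is also easy to sketch in a scripting language. The Python below is a minimal illustration, not a drop-in replacement for cut: the names `parse_list` and `cut_chars` are hypothetical, positions are counted in Unicode code points (not grapheme clusters), and malformed lists are not validated.

```python
def parse_list(spec):
    """Parse a cut-style list ('5-8,44-49', '3-', '-2', '7') into (start, end) pairs."""
    ranges = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            start = int(lo) if lo else 1     # '-M' means from position 1
            end = int(hi) if hi else None    # 'N-' means to end of line
        else:
            start = end = int(part)          # single position
        ranges.append((start, end))
    return ranges


def cut_chars(line, spec):
    """Keep characters whose 1-based position falls in any listed range.

    Characters are emitted in input order and at most once each.
    """
    ranges = parse_list(spec)
    return "".join(
        ch
        for pos, ch in enumerate(line, start=1)
        if any(start <= pos and (end is None or pos <= end) for start, end in ranges)
    )
```

With the sample string from this answer, `cut_chars("Τηεοδ29", "1-2")` selects the first two characters and `cut_chars("Τηεοδ29", "5-")` selects the last three.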

edited Jul 10, 2023 at 15:13

answered Aug 20, 2016 at 14:48

Royce Williams


1

There's no need for such complex solutions to get the second character: echo Τηεοδ29 | grep -Po '(?<=^.).' or echo Τηεοδ29 | grep -Po '^.\K.' will suffice.
– phuclv, Commented Jul 11, 2023 at 4:16

@phuclv Very cool - though I'd argue that it's trading one form of complexity for another, it's definitely a big improvement for many use cases!
– Royce Williams, Commented Dec 2, 2023 at 21:55


jubilatious1 · Accepted Answer · 2023-07-12 18:03:09Z

Eight+ years later, I can't reproduce the OP's issue (macOS 13.4 Ventura):

~$ printf 'ABC' | cut -b 2
B
~$ printf 'αβγ' | cut -b 2
�
~$ printf 'αβγ' | cut -c 2
β
~$ printf 'αβγ' | cut -c 3-4
γ
~$ printf 'αβγ' | cut -b 3-4
β
~$ printf 'αβγ' | wc -c
       6
~$ printf 'αβγ' | wc -m
       3

The above seems to be the answer the OP was hoping for. Note that the cut -c 3-4 line actually returns γ% under zsh, indicating a partial line (more characters requested than could be returned).

man cut doesn't give me a version other than "macOS 13.4, August 3, 2017". It cites IEEE Std 1003.2-1992 ("POSIX.2"), with an additional -w flag as an extension to the specification. "HISTORY: A cut command appeared in AT&T System III UNIX."
