Why are wc -m and wc -c different?

Question

As a C programmer, I was surprised to see that wc -c (which counts the number of bytes) and wc -m (which counts the number of characters) output very different results for a long text file of mine. I had always been told that sizeof(char) is 1 byte.

qdii@nomada ~/Documents $ wc -c sentences.csv
102990983 sentences.csv
qdii@nomada ~/Documents $ wc -m sentences.csv
89023123 sentences.csv

Any explanation?

See @rici's answer below... you've got your -m and -c flags backwards in your question (c = bytes, m = characters)... your example output is correct, though.
– Dan, Commented Oct 16, 2012 at 19:16

Michael Mrozek · Accepted Answer · 2012-10-16 00:59:50Z

24

The char type in C is one byte, but it was designed with single-byte character sets such as ASCII in mind; variable-width encodings like UTF-8 can take up several bytes per character (up to four in UTF-8). wc uses the mbrtowc(3) function to decode multibyte sequences according to the locale set by the LC_CTYPE environment variable. If you select a single-byte locale such as C, every byte is treated as one character, so wc -m reports the same number as wc -c. For example:

qdii@nomada ~/Documents $ LC_CTYPE="C" wc -m sentences.csv
102990983 sentences.csv
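One caveat when overriding the locale this way: LC_ALL, if set, takes precedence over LC_CTYPE. The following is a minimal sketch of that precedence, assuming a POSIX shell; the sample string 'café' and the variable name count are only illustrative. Whether en_US.UTF-8 is installed does not matter here, because LC_ALL wins either way.

```shell
# 'café' in UTF-8: c, a, f are one byte each; é is the two bytes 0xC3 0xA9.
# Octal escapes (\303\251) are used because POSIX printf does not
# guarantee support for \x escapes.

# LC_ALL overrides LC_CTYPE, so the UTF-8 setting below is ignored and
# wc -m counts one character per byte in the C locale.
count=$(printf 'caf\303\251' | LC_ALL=C LC_CTYPE=en_US.UTF-8 wc -m)

# Prints the byte count (5), not the character count (4).
echo "$count"
```

So if LC_CTYPE=C appears to have no effect on wc -m, it is worth checking whether LC_ALL is set in the environment.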


rici · Accepted Answer · 2012-10-16 00:57:41Z

18

At a guess:

1. Your locale uses UTF-8 encoding, and
2. About 10% of your file consists of characters which require more than one octet to encode into UTF-8.
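
The figures from the question allow a rough check of this guess. In valid UTF-8 every character has exactly one lead byte, so the difference between the byte count and the character count is the number of continuation bytes. A sketch using shell arithmetic (the variable names are only illustrative):

```shell
bytes=102990983   # wc -c output from the question
chars=89023123    # wc -m output from the question

# Every extra byte is a continuation byte of some multibyte character.
extra=$((bytes - chars))
echo "$extra"     # 13967860

# Upper bound on the share of multibyte characters, in percent. It is
# reached only if every multibyte character is exactly two bytes long
# (integer division, so the result is rounded down).
pct=$((extra * 100 / chars))
echo "$pct"       # 15
```

If many of the multibyte characters take three or four bytes instead, the true share is lower, so the result is in line with the rough estimate above.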

By the way, from man wc:

   -c, --bytes
          print the byte counts

   -m, --chars
          print the character counts


Ciro Santilli OurBigBook.com · Accepted Answer · 2021-07-11 09:54:07Z

Minimal example

Consider the Unicode character "é" known as "LATIN SMALL LETTER E WITH ACUTE", which is an 'e' with an acute accent used in several European languages.

Its UTF-8 encoding is two bytes long: 0xc3 0xa9.

With that in mind we see:

printf '\xc3\xa9' | LC_CTYPE=en_US.UTF-8 wc -c
printf '\xc3\xa9' | LC_CTYPE=en_US.UTF-8 wc -m
printf '\xc3\xa9' | LC_CTYPE=C wc -c
printf '\xc3\xa9' | LC_CTYPE=C wc -m

outputs:

2
1
2
2

So, as explained at https://unix.stackexchange.com/a/51948/32558, to count UTF-8 characters rather than bytes we need both wc -m and a UTF-8 locale such as LC_CTYPE=en_US.UTF-8.

On my system, the outcome is the same if I use the input method to type a literal é:

printf 'é' | LC_CTYPE=en_US.UTF-8 wc -c

Tested on Ubuntu 21.04.
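
The difference can also be reproduced without relying on any UTF-8 locale being installed. The following is a sketch of a different technique from what wc does internally: it deletes UTF-8 continuation bytes (0x80 to 0xBF) with tr and counts what remains, which equals the number of characters for valid UTF-8 input. The sample string and variable names are assumptions for illustration.

```shell
# 'é' followed by 'e': three bytes, two characters.
# Octal escapes (\303\251 = 0xC3 0xA9) are used because POSIX printf
# does not guarantee support for \x escapes.
bytes=$(printf '\303\251e' | wc -c)

# In the C locale tr works on single bytes. Removing the continuation
# bytes (octal 200-277) leaves exactly one byte per character.
chars=$(printf '\303\251e' | LC_ALL=C tr -d '\200-\277' | wc -c)

echo "bytes=$bytes chars=$chars"
```

This only holds for well-formed UTF-8; on malformed input the result may differ from what wc -m reports.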
