18

As a C programmer, I was surprised to see that wc -c (which counts the number of bytes) and wc -m (which counts the number of characters) output very different results for a long text file of mine. I had always been told that sizeof(char) is 1 byte.

qdii@nomada ~/Documents $ wc -c sentences.csv
102990983 sentences.csv
qdii@nomada ~/Documents $ wc -m sentences.csv
89023123 sentences.csv

Any explanation?

1
  • See @rici's answer below... you've got your -m and -c flags backwards in your question (c = bytes, m = characters)... your example output is correct, though. Commented Oct 16, 2012 at 19:16

3 Answers

24

The char type in C is always one byte (sizeof(char) is 1 by definition), but a char holds one byte of encoded text, not necessarily a whole character. Variable-width encodings like UTF-8 use more than one byte for many characters (up to four in UTF-8). wc uses the mbrtowc(3) function to decode multibyte sequences according to the locale, which is selected by the LC_CTYPE environment variable (LC_ALL, if set, overrides it). In a single-byte locale such as C, every byte counts as one character, so wc -m reports the same number as wc -c. For example:

qdii@nomada ~/Documents $ LC_CTYPE="C" wc -m sentences.csv
102990983 sentences.csv
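If setting LC_CTYPE appears to have no effect, another locale variable may be taking precedence. The following is a minimal sketch, not part of wc itself, of the POSIX lookup order for the character-classification locale: LC_ALL first, then LC_CTYPE, then LANG. The function name effective_ctype is made up for illustration, and the sketch only inspects environment variables; it does not check whether the named locale is actually installed.

```shell
# Print the locale name that governs character handling (LC_CTYPE),
# following the POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
# An empty value is treated the same as an unset variable.
# effective_ctype is a hypothetical helper name, not a standard command.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the implementation default, which behaves like POSIX/C.
        printf '%s\n' "POSIX"
    fi
}

effective_ctype
```

If this prints a UTF-8 locale even though LC_CTYPE is set to C, LC_ALL is probably set, and LC_ALL=C wc -m would be the way to force byte-per-character counting.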
18

At a guess,

  1. Your locale uses UTF-8 encoding, and

  2. About 10% of your file consists of characters which require more than one octet to encode into UTF-8.
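The estimate can be checked against the two counts from the question. This is a rough sketch under a simplifying assumption that is not in the original: every multibyte character is taken to occupy the same number of bytes (2 or 3). The function name multibyte_share is hypothetical, and the result is an integer percentage, rounded down.

```shell
# Integer percentage of characters that must be multibyte, assuming
# (hypothetically) that every multibyte character is exactly $3 bytes wide.
#   $1 = byte count (wc -c), $2 = character count (wc -m), $3 = assumed width
multibyte_share() {
    bytes=$1
    chars=$2
    width=$3
    extra=$(( bytes - chars ))            # bytes beyond one per character
    multibyte=$(( extra / (width - 1) ))  # each such character adds width-1 bytes
    echo $(( multibyte * 100 / chars ))
}

echo $(( 102990983 - 89023123 ))          # 13967860 extra bytes
multibyte_share 102990983 89023123 2      # 15 (percent, if all are 2-byte)
multibyte_share 102990983 89023123 3      # 7  (percent, if all are 3-byte)
```

So somewhere between roughly 7% and 15% of the characters would be multibyte under these assumptions, which is consistent with the "about 10%" guess.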

By the way, from man wc:

   -c, --bytes
          print the byte counts

   -m, --chars
          print the character counts
2

Minimal example

Consider the Unicode character "é" known as "LATIN SMALL LETTER E WITH ACUTE", which is an 'e' with an acute accent used in several European languages.

Its UTF-8 encoding is two bytes long: 0xc3 0xa9.

With that in mind we see:

printf '\xc3\xa9' | LC_CTYPE=en_US.UTF-8 wc -c
printf '\xc3\xa9' | LC_CTYPE=en_US.UTF-8 wc -m
printf '\xc3\xa9' | LC_CTYPE=C wc -c
printf '\xc3\xa9' | LC_CTYPE=C wc -m

outputs:

2
1
2
2

So, as explained at https://unix.stackexchange.com/a/51948/32558, getting the correct character count for UTF-8 text requires both wc -m and a UTF-8 locale such as LC_CTYPE=en_US.UTF-8.

On my system, the outcome is the same if I use the input method to type a literal é:
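When no UTF-8 locale is installed, the character count can still be approximated without relying on wc -m. The sketch below is a swapped-in technique, not what wc does: it deletes UTF-8 continuation bytes (0x80 to 0xBF) and counts the bytes that remain, so each character contributes exactly one byte. It assumes the input is valid UTF-8; the function name count_utf8_chars is made up for illustration. Octal escapes are used with printf because \x escapes are not supported by every shell.

```shell
# Count UTF-8 characters independently of the installed locales.
# Every UTF-8 character has exactly one non-continuation byte, so
# dropping continuation bytes (octal 200-277, i.e. 0x80-0xBF) and
# counting what is left gives the character count for valid UTF-8.
# count_utf8_chars is a hypothetical helper name.
count_utf8_chars() {
    LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c | tr -d ' '
}

printf '\303\251' | count_utf8_chars           # é (0xc3 0xa9): prints 1
printf 'a\303\251\342\202\254' | count_utf8_chars   # "aé€": prints 3
```

On malformed input the result may differ from wc -m, since this approach never validates the byte sequences.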

printf 'é' | LC_CTYPE=en_US.UTF-8 wc -c

Tested on Ubuntu 21.04.
