edited Jun 16, 2014 at 14:59

LC_ALL is the environment variable that overrides all the other localisation settings (except $LANGUAGE under some circumstances).

Different aspects of localisation (like the thousands separator or decimal point character, character set, sorting order, month and day names, currency symbol) can be set using a few variables.

You'll typically set $LANG to your preference. The individual LC_xxx variables each override a certain aspect. LC_ALL overrides them all. The locale command, when called without arguments, gives a summary of the current settings.

For instance, on a GNU system, I get:

$ locale
LANG=en_GB.UTF-8
LANGUAGE=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

I can override an individual setting with, for instance:

$ LC_TIME=fr_FR.UTF-8 date
jeudi 22 août 2013, 10:41:30 (UTC+0100)

Or:

$ LC_MONETARY=fr_FR.UTF-8 locale currency_symbol
€

Or override everything with LC_ALL.

$ LC_ALL=C LANG=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 cat /
cat: /: Is a directory

In a script, if you want to force a specific setting, as you don't know what settings the user has forced (possibly LC_ALL as well), your best, safest and generally only option is to force LC_ALL.
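
As a minimal sketch of that advice (the `machine_month` variable name is made up for illustration), a script can pin the locale once at the top; anything narrower that is set later, here LC_TIME, is still overridden:

```shell
#! /bin/sh -
# Sketch: force the C locale for everything this script runs,
# whatever LANG, LC_xxx or LC_ALL the caller had set.
LC_ALL=C
export LC_ALL

# LC_ALL still wins over an individual LC_xxx variable set afterwards,
# so this is the English abbreviation (Jan, Feb, ...).
machine_month=$(LC_TIME=fr_FR.UTF-8 date +%b)
printf '%s\n' "$machine_month"
```

If only one command needs it, prefixing that command with LC_ALL=C keeps the rest of the script in the user's locale.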

The C locale is a special locale that is meant to be the simplest locale. You could also say that while the other locales are for humans, the C locale is for computers. In the C locale, characters are single bytes, the charset is ASCII, the sorting order is based on the byte values, the language is US English and things like currency symbols are not defined.

On some systems, there's a difference with the POSIX locale where for instance the sort order for non-ASCII characters is not defined.

You generally run a command with LC_ALL=C to prevent the user's settings from interfering with your script. For instance, if you want [a-z] to match the 26 ASCII characters from a to z, you have to set LC_ALL=C.
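
A small sketch of the [a-z] case (the function name is hypothetical): under LC_ALL=C the range is exactly the 26 ASCII lowercase letters, so neither uppercase letters nor the bytes of a UTF-8 encoded é can match.

```shell
# Sketch: count the input lines containing an ASCII lowercase letter.
# count_ascii_lower_lines is an illustrative name, not a standard tool.
count_ascii_lower_lines() {
    LC_ALL=C grep -c '[a-z]'
}

# Three lines: "abc", "ABC", and a UTF-8 encoded "é" (bytes 0xC3 0xA9).
# Only "abc" matches in the C locale, so this prints 1.
printf 'abc\nABC\n\303\251\n' | count_ascii_lower_lines
```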

On GNU systems, LC_ALL=C and LC_ALL=POSIX (or LC_MESSAGES=C|POSIX) override $LANGUAGE, while LC_ALL=anything-else wouldn't.
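
To illustrate, here is a sketch that assumes a GNU system (elsewhere $LANGUAGE is simply ignored); the `err` variable is only there to capture the message:

```shell
# Sketch: $LANGUAGE asks for French messages, but LC_ALL=C overrides it,
# so the error message comes out untranslated.
# "|| true" only keeps the expected cat failure from aborting
# a script run with "set -e".
err=$(LANGUAGE=fr LC_ALL=C cat / 2>&1) || true
printf '%s\n' "$err"
```

With LC_ALL set to some other locale instead, a GNU system with the French translations installed would be expected to honour $LANGUAGE and print the message in French.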

A few cases where you typically need to set LC_ALL=C:

sort -u or sort ... | uniq.... In many locales other than C, some characters have the same sorting order. sort -u doesn't report unique lines, but one of each group of lines that have equal sorting order. So if you do want unique lines, you need a locale where characters are bytes and all characters have different sorting order (which the C locale guarantees).
Character ranges like in grep. If you mean to match a letter in the user's language, use grep '[[:alpha:]]' and don't modify LC_ALL. But if you want to match the a-zA-Z ASCII characters, you need either LC_ALL=C grep '[[:alpha:]]' or LC_ALL=C grep '[a-zA-Z]'. [a-z] matches the characters that sort after a and before z. In other locales, you generally don't know what those are. For instance, some locales ignore case for sorting, so [a-z] could include [B-Z] or [A-Y]. In many UTF-8 locales (including en_US.UTF-8 on most systems), [a-z] will include the Latin letters from a to y with diacritics but not those of z (since z sorts before them), which I can't imagine would be what you want (why would you want to include é and not ź?).
Floating point arithmetic in ksh93. ksh93 honours the decimal_point setting in LC_NUMERIC. If you write a script that contains a=$((1.2/7)), it will stop working when run by a user whose locale has the comma as the decimal separator:
```
 $ ksh93 -c 'echo $((1.1/2))'
 0.55
 $ LANG=fr_FR.UTF-8  ksh93 -c 'echo $((1.1/2))'
 ksh93: 1.1/2: arithmetic syntax error
```

Then you need things like:

    #! /bin/ksh93 -
    float input="$1" # get it as input from the user in his locale
    float output
    arith() { typeset LC_ALL=C; (($@)); }
    arith output=input/1.2 # use the dot here as it will be interpreted
                           # under LC_ALL=C
    echo "$output" # output in the user's locale

As a side note: the , decimal separator conflicts with the , arithmetic operator which can cause even more confusion.

When you need characters to be bytes. Nowadays, most locales are UTF-8 based, which means characters can take from 1 to 4 bytes (up to 6 in the original UTF-8 definition). When dealing with data that is meant to be bytes, with text utilities, you'll want to set LC_ALL=C. It will also improve performance significantly because parsing UTF-8 data has a cost.
A corollary of the previous point: when processing text where you don't know what character set the input is written in, but can assume it's compatible with ASCII (as virtually all charsets are). For instance, grep '<.*>' to look for lines containing a <, > pair will not work if you're in a UTF-8 locale and the input is encoded in a single-byte 8-bit character set like iso8859-15. That's because . only matches characters, and non-ASCII characters in iso8859-15 are likely not to form a valid character in UTF-8. On the other hand, LC_ALL=C grep '<.*>' will work because any byte value forms a valid character in the C locale.
Any time you process input data that doesn't come from a human or produce output data that isn't intended for one. If you're talking to a user, you may want to use their convention and language, but for instance, if you generate some numbers to feed some other application that expects English style decimal points, or English month names, you'll want to set LC_ALL=C:
```
 $ printf '%g\n' 1e-2
 0,01
 $ LC_ALL=C printf '%g\n' 1e-2
 0.01
 $ date +%b
 août
 $ LC_ALL=C date +%b
 Aug
```
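
Going back to the sort -u and bytes-as-characters cases in the list above, here is a rough sketch of both (the `unique_lines` helper name is made up for illustration):

```shell
# Sketch: byte-wise unique lines. In the C locale, lines compare by byte
# value, so two different lines can never be treated as equal.
unique_lines() {
    LC_ALL=C sort -u
}

# Prints B, a, b (one per line): uppercase sorts first in byte order.
printf '%s\n' b a B a | unique_lines

# Byte 0xE9 is "é" in iso8859-15 but is not valid UTF-8 on its own.
# In the C locale every byte is a character, so "." matches it and
# the count printed is 1.
printf '<\351>\n' | LC_ALL=C grep -c '<.*>'
```

Without LC_ALL=C, the result of both commands depends on the locale of whoever runs the script.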

answered Aug 22, 2013 at 9:50

Stéphane Chazelas
