Python 'ascii' encode problems in print statement

Question

System: python 3.4.2 on linux.

I'm working on a Django application (irrelevant), and I ran into a problem where it throws

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

when print is called (!). After quite a bit of digging, I discovered I should check

>>> sys.getdefaultencoding()
'utf-8'

but it was as expected, utf8. I noticed also that os.path.exists throws the same exception when used with a unicode string. So I checked

>>> sys.getfilesystemencoding()
'ascii'

When I used LANG=en_US.UTF-8 the issue disappeared. I understand now why os.path.exists had problems with that. But I have absolutely no clue why print is affected by the filesystem setting. Is there a third setting I'm missing? Or does it just assume the LANG environment variable is to be trusted for everything?

Also... I don't get the reasoning here. LANG does not tell what encoding is supported by the filenames. It has nothing to do with that. It's set separately for the current environment, not for the filesystem. Why is python using this setting for filesystem filenames? It makes applications very fragile, as all the file operations just break when run in an environment where LANG is not set or set to C (not uncommon, especially when a web-app is run as root or a new user created specifically for the daemon).

Test code (no actual unicode input needed to avoid terminal encoding pitfalls):

x=b'\xc4\x8c\xc5\xbd'
y=x.decode('utf-8')
print(y)

Questions:

Is there a good and accepted way of making the application robust to the LANG setting?
Is there any real-world reason to guess the filesystem capabilities from the environment instead of the filesystem driver?
Why is print affected?

@MartijnPieters oh, that's part of the answer I was looking for. It was 'ANSI_X3.4-1968'. Terrifying. However, it still begs the question of why the filename encoding is guessed from LANG.
– orion, Commented Dec 12, 2014 at 16:47
LANG dictates everything in a POSIX locale: Debian thinks my file system is encoded as ISO-8859-1
– Martijn Pieters, Commented Dec 12, 2014 at 16:49

Martijn Pieters · Accepted Answer · 2014-12-12 17:01:50Z

LANG is used to determine your locale; if you don't set specific LC_ variables the LANG variable is used as the default.
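As a rough illustration of that fallback (a sketch of the usual POSIX lookup order, not how Python or libc actually implement it): LC_ALL is checked first, then the category-specific variable, then LANG. The helper name `effective_locale` is made up for this example, and falling back to "C" when nothing is set is an assumption about typical systems.

```python
import os


def effective_locale(category="LC_CTYPE", env=None):
    """Return the locale name that would apply to one locale category.

    Hypothetical helper for illustration only. It mirrors the usual POSIX
    lookup order: LC_ALL, then the category variable, then LANG.
    """
    if env is None:
        env = os.environ
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # an unset and an empty variable are treated the same
            return value
    # Assumption: with nothing set, the system default is the "C" locale.
    return "C"


if __name__ == "__main__":
    print(effective_locale())
```

With only LANG=en_US.UTF-8 in the environment this returns 'en_US.UTF-8' for every category; a daemon started with an empty environment ends up with "C", which is where the ASCII codec comes from.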

The filesystem encoding is determined by the LC_CTYPE variable, but if you haven't set that variable specifically, the LANG environment variable is used instead.
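To see why os.path.exists fails with an 'ascii' filesystem encoding, here is a minimal sketch that imitates what os.fsencode() does on POSIX: encode the str path with the filesystem encoding and the surrogateescape error handler. The function name `encode_filename` is hypothetical, and the encoding is passed in explicitly so the result does not depend on your own environment.

```python
def encode_filename(name, fs_encoding):
    """Encode a str path the way os.fsencode() does on POSIX.

    Illustrative stand-in: the real function takes the encoding from
    sys.getfilesystemencoding() instead of a parameter.
    """
    return name.encode(fs_encoding, "surrogateescape")


if __name__ == "__main__":
    filename = "\u010c\u017d.txt"  # the two characters from the question, plus ".txt"

    print(encode_filename(filename, "utf-8"))

    try:
        encode_filename(filename, "ascii")
    except UnicodeEncodeError as exc:
        # Same class of error the question saw from os.path.exists
        print("ascii filesystem encoding fails:", exc)
```

The surrogateescape handler only helps with undecodable bytes that came from the OS; genuine non-ASCII characters still cannot be encoded to ASCII, so the call raises.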

Printing uses sys.stdout, a text file configured with the codec your terminal uses. Your terminal settings are also locale-specific; your LANG variable should really reflect what locale your terminal is set to. If that is UTF-8, you need to make sure your LANG variable reflects that. sys.stdout uses locale.getpreferredencoding(False) (like all text streams opened without an explicit encoding set) and on POSIX systems that'll use LC_CTYPE too.
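If you want to check all of these settings in one go, something like the following sketch may help. The function names `report_encodings` and `try_print` are made up for this example. `try_print` reproduces the failure without touching your real terminal by printing to an in-memory text stream with an explicit codec. Note that newer Python versions (3.7 and later) can coerce the C locale to UTF-8, so the values reported there may differ from what 3.4 shows.

```python
import io
import locale
import sys


def report_encodings():
    """Collect the encodings discussed above into one dict."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
        # stdout may have been replaced by an object without .encoding
        "stdout": getattr(sys.stdout, "encoding", None),
    }


def try_print(text, encoding):
    """Print text to an in-memory stream that uses the given codec.

    Returns the bytes written, or a message if encoding failed.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    try:
        print(text, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return "failed: {}".format(exc)
    return raw.getvalue()


if __name__ == "__main__":
    for name, value in sorted(report_encodings().items()):
        print("{:>10}: {}".format(name, value))

    y = b"\xc4\x8c\xc5\xbd".decode("utf-8")  # test string from the question
    print(try_print(y, "utf-8"))
    print(try_print(y, "ascii"))
```

If "preferred" and "stdout" come back as 'ANSI_X3.4-1968' (an alias for ASCII), print will fail on non-ASCII text in the same way the "ascii" stream does here.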
