
Before you go telling me to read PEP 0263, keep reading...

I can't find any documentation that details which file encodings are supported for Python 3 source files.

I've found hundreds (thousands?) of questions, answers, posts, emails, etc. about how to declare - at the top of your source file - the encoding of that source file, but none of them answer my question. Bear with me and imagine doing (or actually try) the following:

  1. Open Notepad (I'm using regular old Notepad on Windows 7, but I doubt it matters; I'm sure your superior editor can do something similar.)
  2. Type your favorite line of Python code ( I used print( 'Hello, world!' ) )
  3. Select "File" -> "Save"
  4. Select a folder and file name ( I used "E:\Temp\hello.py" )
  5. Change the "Encoding:" setting from the default "ANSI" to "Unicode"
  6. Press "Save"
  7. Open a command prompt, change to the folder containing your new file, and try to run it

Here's the output I get:

E:\Temp>python --version
Python 3.4.1

E:\Temp>python "hello.py"
  File "hello.py", line 1
SyntaxError: Non-UTF-8 code starting with '\xff' in file hello.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Now, when I open this same file in Notepad++ and look at the "Encoding" menu, it has the option "Encode in UCS-2 Little Endian" selected. Wikipedia tells me that this is basically UTF-16 encoding. Whatever. I don't really care. More research reveals that my editor has inserted a two-byte BOM (Byte Order Mark) with a value of '\xff\xfe' at the front of the file to indicate the file encoding. So at least I know where the '\xff' code that Python is complaining about comes from.
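To see where the '\xff' comes from, here is a minimal sketch that rebuilds the bytes Notepad most likely wrote for the one-line file and checks whether they decode as UTF-8. It assumes that "Unicode" in Notepad means UTF-16 little endian with a BOM, as Notepad++ reports; the variable names are only illustrative.

```python
import codecs

source = "print( 'Hello, world!' )\r\n"

# Assumption: Notepad's "Unicode" option writes UTF-16 LE preceded by a BOM.
notepad_bytes = codecs.BOM_UTF16_LE + source.encode("utf-16-le")

# The first two bytes are the BOM, then each ASCII character takes two bytes.
print(notepad_bytes[:6])  # b'\xff\xfep\x00r\x00'

# Python 3 reads source as UTF-8 by default, and 0xff is never valid UTF-8.
try:
    notepad_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(decodes_as_utf8)  # False
```

The very first byte already fails to decode, which matches the "Non-UTF-8 code starting with '\xff'" message.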

So I go and read PEP 0263 - and everything else regarding it - on the web, and I try adding a comment like this to the first line of the file

# coding: utf-16

with all sorts of different values for the encoding, and nothing helps. But it can't help, right? Because Python isn't even getting as far as my encoding declaration; it's choking on the first byte of the source file!
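PEP 263 describes the declaration as a regular expression that the first or second line must match. A rough sketch of why such a declaration is invisible inside a UTF-16 file: the snippet applies that regular expression to raw bytes. It is only an illustration of the idea, not the interpreter's actual tokenizer code, and the variable names are made up.

```python
import codecs
import re

# Regular expression given in PEP 263, applied here to raw bytes.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

declaration = "# coding: utf-16\n"

# Saved as ASCII/UTF-8, the declaration is found.
ascii_line = declaration.encode("ascii")
ascii_match = CODING_RE.match(ascii_line)
print(ascii_match.group(1))  # b'utf-16'

# Saved as UTF-16 LE with a BOM, every character is followed by a zero
# byte and the line starts with 0xff 0xfe, so the pattern cannot match.
utf16_line = codecs.BOM_UTF16_LE + declaration.encode("utf-16-le")
utf16_match = CODING_RE.match(utf16_line)
print(utf16_match)  # None
```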

So what I really want to know is...

  1. Why can't the Python 3 interpreter read this file?
  2. If "Unicode" or "UCS-2 Little Endian" or "UTF-16" or whatever isn't supported, what is???

P.S. I even found another question on StackOverflow which seems to be the exact issue I'm having, but it was closed - erroneously in my opinion - as a duplicate. :(

--- EDIT ---

Someone asked for my "compiled options". Here's some output. Maybe it will help?

E:\Temp>python
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sysconfig
>>> print( sysconfig.get_config_vars() )
{'EXT_SUFFIX': '.pyd', 'srcdir': 'C:\\Python34', 'py_version_short': '3.4', 'base': 'C:\\Python34', 'prefix': 'C:\\Python34', 'projectbase': 'C:\\Python34', 'INCLUDEPY': 'C:\\Python34\\Include', 'platbase': 'C:\\Python34', 'py_version_nodot': '34', 'exec_prefix': 'C:\\Python34', 'EXE': '.exe', 'installed_base': 'C:\\Python34', 'SO': '.pyd', 'installed_platbase': 'C:\\Python34', 'VERSION': '34', 'BINLIBDEST': 'C:\\Python34\\Lib', 'LIBDEST': 'C:\\Python34\\Lib', 'userbase': 'C:\\Users\\alonghi\\AppData\\Roaming\\Python', 'py_version': '3.4.1', 'abiflags': '', 'BINDIR': 'C:\\Python34'}
>>>
  • Can you post your entire hello.py file, from top to bottom, including the "shebang" #!/bin/env python or whatever. Also, your compiled options may help: import sysconfig; print(sysconfig.get_config_vars()) Commented Oct 1, 2014 at 0:13
  • @jedwards The file contains a single line of code, as stated. Commented Oct 1, 2014 at 0:25
  • @also, thanks for the "clarification", but it doesn't help much. That being said, maybe consult this. I have no idea whether it's the list you're interested in, but it seems plausible. Good luck with your question ... Commented Oct 1, 2014 at 0:30
  • "But it can't help, right? Because Python isn't even getting as far as my encoding declaration; It's choking on the first byte of the source file!" Yes, because UTF-16 encoding uses bytes that can't be understood using the default encoding (ASCII in Python 2; UTF-8 in Python 3). "Why can't the Python 3 interpreter read this file?" Because it has to be able to read the encoding declaration before it could switch that encoding. Commented Mar 30, 2023 at 11:14
  • "If "Unicode" or "UCS-2 Little Endian" or "UTF-16" or whatever isn't supported, what is???" Ones in which the coding declaration would match the byte regex described in PEP 263, as described by the text of PEP 263. This falls out automatically from the fact that the PEP was authored wayyyyyy back in 2001, when str meant a sequence of bytes that was only pretending to be a string. Commented Mar 30, 2023 at 11:16

1 Answer


A source encoding must be:

  1. An encoding supported by the version of Python in question. (This varies by version and platform; for example, you only get mbcs on Windows.)

  2. Loosely ASCII-compatible: enough that the # coding: declaration itself can be read as ASCII, because the parser has to find that declaration before it knows which encoding the rest of the file uses. See PEP 0263, ‘Concepts’, item 1.
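The two conditions can be approximated with a small check. This is a sketch under assumptions, not how the interpreter decides: the helper name is_ascii_compatible is hypothetical, and it only tests whether a codec is known to this Python and whether it encodes the declaration line to the same bytes as ASCII.

```python
def is_ascii_compatible(encoding_name):
    """Rough check: does this codec exist here, and does it encode a
    '# coding:' line to exactly the same bytes as ASCII would?"""
    declaration = "# coding: " + encoding_name + "\n"
    try:
        encoded = declaration.encode(encoding_name)
        return encoded == declaration.encode("ascii")
    except (LookupError, UnicodeError):
        # Unknown codec, not a text encoding, or characters it cannot encode.
        return False


for name in ("utf-8", "latin-1", "cp1252", "utf-16-le", "utf-16"):
    print(name, is_ascii_compatible(name))
```

Codecs that only exist on some platforms, such as mbcs, simply report False where they are unavailable.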

The encoding that Windows misleadingly calls “Unicode”, UTF-16LE, is not ASCII-compatible (and generally is a barrel of problems you should try to avoid using). Python would need special encoding-specific support to detect UTF-16 source files and this feature has been declined for now.

The # coding: you should use is almost invariably UTF-8.
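If a file has already been saved as UTF-16, one way to fix it is to re-encode it as UTF-8. A minimal sketch follows, assuming the file starts with a BOM (as Notepad's "Unicode" output does); the helper name convert_utf16_to_utf8 is made up for this example, and the demonstration works on a throwaway copy rather than a real script.

```python
import codecs
import tempfile
from pathlib import Path


def convert_utf16_to_utf8(path):
    """Rewrite a UTF-16 file (with BOM) in place as UTF-8 without a BOM."""
    raw = Path(path).read_bytes()
    text = raw.decode("utf-16")  # the "utf-16" codec reads and drops the BOM
    Path(path).write_bytes(text.encode("utf-8"))


# Demonstration on a temporary file standing in for hello.py.
with tempfile.TemporaryDirectory() as folder:
    script = Path(folder) / "hello.py"
    script.write_bytes(
        codecs.BOM_UTF16_LE + "print( 'Hello, world!' )\r\n".encode("utf-16-le")
    )
    convert_utf16_to_utf8(script)
    converted = script.read_bytes()

print(converted)
```

Working on bytes rather than text-mode file objects keeps the original line endings untouched.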


2 Comments

So the answer was there in PEP0263 ('Concepts' item 1): "It does not include encodings which use two or more bytes for all characters like e.g. UTF-16." Thanks for that. This requirement is not spelled out very clearly anywhere that I have found, a complaint repeated in the bug/issue/feature-request you pointed out ("Cannot write source code in UTF16"). Thanks for that reference, too. Much appreciated!
Python 3 code is Unicode. When reading bytes from an external source, the interpreter assumes UTF-8 encoding unless the first line after an optional #! line says otherwise. Similarly, IDLE writes with UTF-8 encoding unless directed otherwise. So an explicit UTF-8 declaration should not be needed.
