Python encoding over unicode strings

Question

In a Python terminal I type the following:

>>> s = "γειά"       ## it just means 'hi' in Greek
>>> s
'\x9a\x9c\xa0\xe1'   ## What is this? - Is it utf-encoding? Is it ascii escaped?
>>> print s
γειά

and now the fun part:

>>> a = u"γειά"
>>> a
u'\u03b3\u03b5\u03b9\u03ac'    # Again what is this? utf-8 encoded? If so, how?
>>> print a
γειά

I am totally confused about encodings, particularly UTF-8 encoded strings versus ASCII encoded strings. What is the difference between the two snippets above, and how do they tie in with the unicode function?

>>> result = unicode(s)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 0: ordinal not in range(128)

>>> result = unicode(s, 'utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

Could someone explain to me what's happening here? Thanks in advance.

See also: Pragmatic Unicode, or How Do I Stop the Pain?, which covers many of the same points as the Joel article, plus more on how to actually solve these problems.
– user395760, Commented Feb 26, 2014 at 12:02

Paulo Bu · Accepted Answer · 2014-02-26 14:31:54Z

In your first example, s is a byte string: you're seeing the raw encoded bytes of the text, and they are not UTF-8 at all:

>>> s='\x9a\x9c\xa0\xe1'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

The bytes are encoded in whatever encoding your shell is using.
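As a rough sketch of what that means: the four bytes happen to match the Greek DOS code page 737, which would be a plausible Windows console encoding here. That is an assumption on my part, since the question never states the console's code page. Under that assumption, decoding with the `cp737` codec recovers the text:

```python
# Assumption: the console uses code page 737 (Greek DOS); the question does not say so.
raw = b'\x9a\x9c\xa0\xe1'      # the bytes Python echoed for s

text = raw.decode('cp737')     # bytes -> unicode, using the (assumed) shell encoding

# text now holds the same four code points that the u"..." literal produced
matches = (text == u'\u03b3\u03b5\u03b9\u03ac')
```

If your console reports a different code page, substitute its codec name for `cp737`.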

In your second example, you're creating a unicode string. Python, knowing your shell's encoding, decodes the input and stores it as Unicode code points (\u03b3\u03b5\u03b9\u03ac). Later, when you print it, Python again uses your shell's encoding to turn the code points back into actual bytes.
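To illustrate the difference between code points and bytes, here is a small sketch that encodes the same unicode string two ways. The choice of `cp737` is again an assumption about the console; `utf-8` is included only for comparison:

```python
# The unicode string from the question, written with escapes so the
# example does not depend on the source file's own encoding.
text = u'\u03b3\u03b5\u03b9\u03ac'

# Code points are abstract numbers, independent of any byte encoding.
code_points = [hex(ord(ch)) for ch in text]

# The same text becomes different bytes under different encodings.
as_cp737 = text.encode('cp737')   # one byte per character here
as_utf8 = text.encode('utf-8')    # two bytes per character here
```

Here `as_cp737` comes out as the same four bytes the question shows for `s`, while `as_utf8` is twice as long, which is why decoding those four bytes as UTF-8 fails.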

In your third example, you're calling the unicode function explicitly. When called without an encoding argument, it uses ascii as the default. Since ASCII cannot represent Greek characters, Python raises a UnicodeDecodeError. Passing 'utf-8' fails for the reason shown at the top: the bytes are simply not UTF-8.
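A minimal sketch of that behaviour, using `bytes.decode`, which is roughly what `unicode(s, encoding)` does in Python 2. The helper name `try_decode` is hypothetical, and `cp737` is an assumed console encoding rather than something the question confirms:

```python
raw = b'\x9a\x9c\xa0\xe1'

def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Try the two codecs from the question plus the assumed console codec.
results = dict((enc, try_decode(raw, enc)) for enc in ('ascii', 'utf-8', 'cp737'))
```

Only the codec that matches how the bytes were produced gives back text; the other two entries in `results` are `None`.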

Bottom line: you need to know what encoding your console is using to figure out exactly what Python is doing with your code. On Windows you can check this with the chcp command; on Linux you can use the locale command.
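You can also ask Python itself, as a complement to those shell commands. This is a sketch using only the standard `sys` and `locale` modules; the values depend entirely on your environment:

```python
import locale
import sys

# Encoding Python associates with the terminal. It may be None when
# output is piped or redirected, so read it defensively.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

# Encoding preferred by the current locale settings.
preferred = locale.getpreferredencoding()
```

In an interactive session, `stdout_encoding` is usually the one Python uses when you `print` a unicode string.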

Of course, I forgot the most important advice :P. As @thg435 suggested, this is a must-read: Unicode by Joel

It's also worth mentioning that a lot of this changes dramatically in Python 3.
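As a brief sketch of the main difference, assuming a Python 3 interpreter (this block will not behave the same way on Python 2): plain string literals are unicode text, bytes are a separate type, and the two are never mixed implicitly.

```python
# Python 3 only: str is text, bytes is binary data.
s = "\u03b3\u03b5\u03b9\u03ac"      # a plain literal is already unicode text
is_text = isinstance(s, str)

encoded = s.encode("utf-8")         # explicit step from text to bytes
is_bytes = isinstance(encoded, bytes)

# Mixing the two raises TypeError instead of silently decoding via ASCII.
try:
    s + encoded
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

The implicit ASCII decode behind the `unicode(s)` error in the question has no counterpart here; you always name the encoding yourself.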

"It is probably encoding in whatever encode your shell is using." It most definitely is.
Yes haha, sometimes I find it hard to be definitive about my conclusions :)
