In a Python terminal I type the following:

>>> s = "γειά"       ## it just means 'hi' in Greek
>>> s
'\x9a\x9c\xa0\xe1'   ## What is this? - Is it utf-encoding? Is it ascii escaped?
>>> print s
γειά

and now the fun part:

>>> a = u"γειά"
>>> a
u'\u03b3\u03b5\u03b9\u03ac'    # Again what is this? utf-8 encoded? If so, how?
>>> print a
γειά

I am totally confused about encodings, particularly UTF-8 encoded strings versus ASCII encoded strings. What is the difference between the two snippets above, and how do they tie in with the unicode function?

>>> result = unicode(s)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 0: ordinal not in range(128)

>>> result = unicode(s, 'utf-8')
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

Could someone explain to me what's happening here? Thanks in advance.

  • An obligatory Joel link. Commented Feb 26, 2014 at 11:59
  • See also: Pragmatic Unicode or How Do I Stop the Pain?, which covers many of the same points as the Joel article, plus some more on how to actually solve these problems. Commented Feb 26, 2014 at 12:02

1 Answer

In your first attempt you're seeing the encoded (byte) form of the string, and those bytes are not UTF-8 at all:

>>> s='\x9a\x9c\xa0\xe1'
>>> s.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 0: invalid start byte

It is encoded in whatever encoding your shell is using.

In your second example, you're creating a unicode string. Python, armed with your shell's encoding, is able to decode the input and store it as Unicode code points (\u03b3\u03b5\u03b9\u03ac). Later, when you print it, Python again uses your shell's encoding to encode it from Unicode back into actual bytes.
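A minimal sketch of that round trip. It assumes the console uses code page 737 (the Greek DOS code page), which the question does not state but which is consistent with the four bytes shown above. The b'' and u'' prefixes keep it runnable on both Python 2.7 and Python 3.

```python
# Assumption: the console encoding is cp737 (Greek DOS code page).
CONSOLE_ENCODING = 'cp737'

raw = b'\x9a\x9c\xa0\xe1'            # bytes the shell handed to Python
text = raw.decode(CONSOLE_ENCODING)  # bytes -> Unicode code points

# The decoded text equals the u"..." literal from the question.
assert text == u'\u03b3\u03b5\u03b9\u03ac'

# Printing does the reverse: code points -> bytes in the console encoding.
assert text.encode(CONSOLE_ENCODING) == raw

# The same text has a different byte representation in UTF-8.
utf8_bytes = text.encode('utf-8')
```

The code points stay the same whichever encoding is used; only the bytes differ.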

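In your third example, you're calling the unicode function explicitly. When called without an encoding argument, it decodes using ascii by default. Since ascii only covers bytes below 0x80 and cannot represent Greek characters, Python raises a UnicodeDecodeError. Passing 'utf-8' fails for the reason shown at the top: these bytes are not valid UTF-8.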
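The failures can be reproduced side by side with a small sketch. Here try_decode is a hypothetical helper written for illustration, .decode() stands in for Python 2's unicode(s, encoding), and cp737 is again an assumption about the console, not something the question confirms.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

raw = b'\x9a\x9c\xa0\xe1'

ascii_result = try_decode(raw, 'ascii')   # fails: 0x9a is outside range(128)
utf8_result = try_decode(raw, 'utf-8')    # fails: 0x9a is not a valid start byte
cp737_result = try_decode(raw, 'cp737')   # assumption: cp737 is the console encoding
```

Only the codec that matches how the bytes were produced gives back the original text.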

Bottom line: you need to know what encoding your console is using to figure out exactly what Python is doing with your code. On Windows you can check this with the chcp command. On Linux you can use the locale command.
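Python can also report what it thinks the encodings are. A small sketch using only the standard library; console_encodings is a hypothetical helper name, and the values may be None when a stream is not attached to a terminal.

```python
import locale
import sys

def console_encodings():
    """Collect the encodings Python reports for the current session."""
    return {
        'stdin': getattr(sys.stdin, 'encoding', None),
        'stdout': getattr(sys.stdout, 'encoding', None),
        'locale': locale.getpreferredencoding(),
    }

info = console_encodings()
for name in sorted(info):
    print('%s: %s' % (name, info[name]))
```

If the stdout value matches what chcp or locale reports, that is the encoding Python uses when you print a unicode string.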

Of course I forgot the most important advice ever :P. As @thg435 suggested, this is a must-read: Unicode by Joel

It is also worth mentioning that a lot of this changes dramatically in Python 3, where str holds Unicode text and bytes is a separate type.
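For a rough idea of the difference, here is a sketch that is meant for Python 3 only. The variable names are illustrative, and the escapes spell the same Greek word as in the question.

```python
# Python 3 only.
text = '\u03b3\u03b5\u03b9\u03ac'   # str: already Unicode, no u prefix needed
data = text.encode('utf-8')         # bytes: an explicit, separate type

try:
    text + data                     # Python 3 refuses to mix the two implicitly
    mixed_ok = True
except TypeError:
    mixed_ok = False

round_trip = data.decode('utf-8')   # decoding must always be requested explicitly
```

There is no implicit ascii decoding step in Python 3, so the kind of error seen with unicode(s) shows up as a TypeError at the point where text and bytes are mixed.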

2 Comments

"It is probably encoding in whatever encode your shell is using." It most definitely is.
Yes haha, sometimes I find it hard to be definitive about my conclusions :)
