Python unicode encoding using UTF-8

Question

I was working through Python's tutorial on Unicode and have a simple question. When I open a Python shell and type:

>>> unicode('\x80abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

I get the above error as expected, since Python tries to decode the byte \x80 to unicode using the ASCII codec, which only covers values up to 127 (\x80 is 128).

However, if I try again using the UTF-8 encoding, I get an error again, although a somewhat different one:

>>> unicode('\x80abc', 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

What is going on here and how should I properly go about it?

What is the encoding in your console?
– Aida Paul, Commented Feb 26, 2014 at 9:48

It's a Windows cmd using codepage 737.
– stratis, Commented Feb 26, 2014 at 9:49

Paulo Bu · Accepted Answer · 2014-02-26 09:59:49Z

3

It just so happens that \x80 is not a valid start byte in UTF-8 either.

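Take a look at the encoding table for UTF-8 and you will see that the one-byte codes end at \x7f. Bytes from \x80 to \xbf can only appear as continuation bytes inside a multi-byte sequence, never at its start.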
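A minimal sketch of that point, written with b'' and u'' literals so that it should behave the same on Python 2.7 and Python 3. It uses .decode() rather than unicode(), which is an assumption made for portability (unicode() does not exist on Python 3), and the variable names are only illustrative.

```python
data = b'\x80abc'

# Decoding fails because 0x80 cannot begin a UTF-8 sequence.
try:
    data.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# 0x80 is 0b10000000; the leading bits "10" mark a continuation byte.
is_continuation = (0x80 & 0xC0) == 0x80

# The last one-byte code and the first code point that needs two bytes.
last_one_byte = u'\x7f'.encode('utf-8')    # b'\x7f'
first_two_byte = u'\x80'.encode('utf-8')   # b'\xc2\x80'

print(utf8_error)
```

So the code point U+0080 exists in UTF-8, but it is written as the two bytes \xc2\x80, not as a lone \x80.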

If you want to see your example work, try it with latin1 and the ñ character: unicode('\xf1abc', 'latin1'). Without the encoding argument it will fail, and with it, it will pass.
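The same experiment as a small portable sketch. Here the explicit 'ascii' codec stands in for the default that Python 2 applies when unicode() is called without an encoding, and .decode() is used instead of unicode() so the snippet also runs on Python 3; both choices are assumptions of this sketch.

```python
raw = b'\xf1abc'

# With Latin-1, byte 0xF1 maps to U+00F1 (the letter ñ).
text = raw.decode('latin-1')

# With ASCII, any byte above 0x7F is rejected.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(repr(text), ascii_ok)
```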

edited Feb 26, 2014 at 9:59

answered Feb 26, 2014 at 9:48


Alfe · Accepted Answer · 2014-02-26 09:57:36Z

First, '\x80abc' is a byte string (in Python < 3). If you want to convert a byte string to a unicode string, you have two options. Either you reinterpret each byte as the Unicode code point with the same value (you can simply prepend a u to the string literal: u'\x80abc'), or you treat the byte string as text encoded with a particular codec (ASCII, Latin1, UTF-8, etc.) and decode it, as you attempted.
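The two options can be compared side by side. This is only a sketch: it uses b'' and u'' prefixes so the literals mean the same thing on Python 2.7 and Python 3, and it picks Latin-1 for the decoding option because that codec maps every byte to the code point with the same value.

```python
raw = b'\x80abc'

# Option 1: a unicode literal, where \x80 names the code point U+0080.
literal = u'\x80abc'

# Option 2: decode the bytes with an explicitly chosen codec.
decoded = raw.decode('latin-1')

# For Latin-1 the two results coincide.
same = (literal == decoded)
code_points = [ord(ch) for ch in literal]

print(same, code_points)
```

With any other codec the second option may give a different string, or fail, which is why the codec has to match how the bytes were produced.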

Calling unicode() is an explicit decoding. As Paulo pointed out, a leading \x80 is not valid in UTF-8, just as it is invalid in ASCII. You might try Latin1, though; that will work, because Latin1 allows a \x80 byte in its stream.
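Which codec is "right" depends on where the bytes came from. As a hedged illustration, the sketch below decodes the same bytes with Latin-1 and with code page 737, the console code page mentioned in the comments on the question. Whether the original bytes were actually produced in that code page is an assumption.

```python
raw = b'\x80abc'

# Latin-1 never fails: byte 0x80 becomes the control character U+0080.
as_latin1 = raw.decode('latin-1')

# In code page 737, byte 0x80 is a Greek capital alpha instead.
as_cp737 = raw.decode('cp737')

print(repr(as_latin1), repr(as_cp737))
```

Both calls succeed, but they give different text, so a decode that does not raise is not by itself evidence that the codec was the intended one.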

+1 I didn't want to upvote you because I liked that we had the same reputation LOL but it is a good answer.
