I was working through Python's Unicode tutorial and have a simple question. When I open a Python shell and type:

>>> unicode('\x80abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

I get the above error as expected, since Python tries to decode the byte \x80 with the ASCII codec, which only covers values 0 through 127 (\x80 is 128).

However, if I try again using the UTF-8 encoding, I still get an error, although a somewhat different one:

>>> unicode('\x80abc', 'utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

What is going on here, and how should I handle it properly?

  • what is the encoding in your console? Commented Feb 26, 2014 at 9:48
  • It's a Windows cmd using codepage 737. Commented Feb 26, 2014 at 9:49

2 Answers

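As it happens, \x80 is not a valid start byte in UTF-8 either: bytes 0x80 through 0xBF can only appear as continuation bytes inside a multi-byte sequence.

Take a look at the UTF-8 encoding table and you will see that the single-byte codes end at \x7f.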

To see the difference an encoding makes, try Latin-1 with the ñ character: unicode('\xf1abc', 'latin1'). Without the encoding argument it fails; with it, it succeeds.
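A minimal sketch of the points above. It uses bytes.decode() rather than unicode() so that it runs on both Python 2.7 and Python 3; that substitution is an assumption made for illustration, and the variable names are hypothetical.

```python
data = b'\xf1abc'  # byte 0xF1 is the letter n-tilde in Latin-1

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number,
# so decoding never fails.
text = data.decode('latin-1')
print(repr(text))

# ASCII only covers 0x00-0x7F, so the same bytes are rejected.
try:
    data.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# In UTF-8, 0x80-0xBF are continuation bytes and cannot start a character.
try:
    b'\x80abc'.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The code point U+0080 itself is written as two bytes in UTF-8.
encoded = u'\x80'.encode('utf-8')
print(repr(encoded))
```

The last line shows why a lone \x80 is rejected: in UTF-8 that byte only ever appears after a lead byte such as \xc2.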

First, '\x80abc' is a byte string (in Python < 3). To convert a byte string to a unicode string you have two options. Either you reinterpret every byte as the Unicode code point with the same value (for a literal you can simply prepend a u: u'\x80abc'), or you treat the byte string as text encoded with a particular codec (ASCII, Latin-1, UTF-8, etc.) and decode it, as you attempted.

Calling unicode() is an explicit decode. As Paulo pointed out, a lone \x80 is not valid in UTF-8, just as it is not valid in ASCII. Latin-1 will work, though, because it accepts every byte value, including \x80.
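The two options can be compared side by side. This is only a sketch: it uses bytes.decode() so it runs on Python 2.7 and Python 3 alike, and it picks cp737 for the second option solely because that is the console code page mentioned in the question's comments. Whether the bytes were really produced in cp737 is an assumption.

```python
raw = b'\x80abc'

# Option 1: treat each byte as the code point with the same value.
# Decoding with Latin-1 gives the same result as the literal u'\x80abc'.
as_code_points = raw.decode('latin-1')
same_as_literal = (as_code_points == u'\x80abc')

# Option 2: decode with the codec the bytes were actually written in.
# cp737 is assumed here; substitute the real source encoding of your data.
as_cp737 = raw.decode('cp737')

print(repr(as_code_points))
print(repr(as_cp737))
```

The same byte yields different characters under different codecs, so the right choice depends on where the bytes came from, not on which codec happens to avoid an error.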

2 Comments

+1 I didn't want to upvote you because I liked that we had the same reputation LOL but it is a good answer.
Let's see how long we can keep that same reputation :-D
