I'm getting back from a library what looks to be an incorrect unicode string:

>>> title
u'Sopet\xc3\xb3n'

Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. As far as I understand, a unicode string in Python should contain the actual character, not the UTF-8 encoding of the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?

The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?

I'm stumped on (a), as there's nothing wrong, encoding-wise, with that original string (i.e., u'\xc3' (Ã) and u'\xb3' (³) are both valid characters in their own right, but they're not the ó that's supposed to be there).

It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:

>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón

But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?

2 Answers

11

a) Try running it through the method below. If the string does not hold UTF-8 bytes, the decode step will raise an exception.

b)

>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
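
As a rough sketch of how (a) and (b) could be combined, the encode-then-decode trick above can be wrapped in a helper that falls back to the original string when either step fails. This is written for Python 3 (where `str` is the unicode type), and the name `fix_mojibake` is made up for illustration. It is a heuristic, not a reliable detector: a string that legitimately contains something like "Ã³" would be converted too.

```python
def fix_mojibake(text):
    """Re-decode text as UTF-8 if it looks like UTF-8 bytes stored as code points.

    Hypothetical helper: returns the input unchanged when the trick does not apply.
    """
    try:
        # Code points 0-255 map one-to-one onto Latin-1 bytes.
        raw = text.encode("latin-1")
    except UnicodeEncodeError:
        # Contains code points above 255, so it cannot be raw bytes in disguise.
        return text
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8; assume the string was already correct.
        return text


print(fix_mojibake("Sopet\xc3\xb3n"))
```

An already-correct string such as `"Sopet\xf3n"` passes through untouched, because the lone byte `0xF3` is not a valid UTF-8 sequence.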
3 Comments

Note 1) there is not a general way to recognize utf-8; this will recognize it because the UTF-8 decoder will check that all the multiple-byte sequences it's given are valid, and will raise an exception if any are not, 2) the encode-to-Latin-1 trick works because your code points are all less than 256, and Unicode's code points 0-255 correspond exactly to Latin-1's representation.
I'm not sure I completely understand your comment. Perhaps a specific counterexample would help. So far as I understand, the ".encode('latin-1')" is a no-op except that the result is a str rather than a unicode. Is there a string for which that will not be the case? I agree that there won't be a general way to detect UTF-8 inside a unicode string, as the UTF-8 encoded bytes will have a valid (if incorrect) interpretation inside a unicode string. For my purposes, I'm really only interested in latin-1 (for now), so this is sufficient.
@Watts: u'\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac means greek'.encode('latin1')
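
To make the counterexample in the comments concrete, here is a small Python 3 sketch (the variable names are illustrative only). It checks that encoding to Latin-1 fails for the Greek text, and that for a string whose code points are all below 256 the resulting bytes equal the code points one for one.

```python
greek = "\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac means greek"

# Code points above 255 have no Latin-1 representation, so this raises.
try:
    greek.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# For code points 0-255, each Latin-1 byte equals the code point itself.
low = "Sopet\xc3\xb3n"
round_trip = [ord(ch) for ch in low] == list(low.encode("latin-1"))

print(encodable, round_trip)
```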
8

You should use:

>>> title.encode('raw_unicode_escape')

Python2:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))

Python3:

print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
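
For comparison, a short Python 3 sketch of how `raw_unicode_escape` behaves next to `latin-1` (variable names are illustrative). For code points below 256 the two codecs produce the same bytes; for higher code points `raw_unicode_escape` does not fail but emits a literal `\uXXXX` escape sequence, which may or may not be what you want.

```python
mojibake = "\xd0\xbf\xd1\x80\xd0\xb8"

# Both routes recover the same Cyrillic text when all code points are < 256.
via_raw = mojibake.encode("raw_unicode_escape").decode("utf-8")
via_latin1 = mojibake.encode("latin-1").decode("utf-8")

# A code point above 255 becomes a six-byte ASCII escape, not an error.
escaped = "\u03b5".encode("raw_unicode_escape")

print(via_raw, via_latin1, escaped)
```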

1 Comment

You saved my day. I had a unicode object with UTF-8 bytes inside, and had to decode it back to 'normal' unicode. This solved it for me: my_str.encode('raw_unicode_escape').decode('utf-8'). I think this is a more general solution than the accepted answer, because it decodes strings not just in the 'latin-1' range. Thanks! :)