Is there a function in Python that is equivalent to prefixing a string with 'u'?

Let's say I have a string:

a = 'C\xc3\xa9dric Roger'

and I want to convert it to:

b = u'C\xc3\xa9dric Roger'

so that I can compare it to other unicode objects. How can I do this? My first instinct was to try:

>>> b = unicode(a)
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

But that seems to be trying to decode the string. Is there a function for casting to unicode without doing any kind of decoding? (Is that what the 'u' prefix does or have I misunderstood?)

1 Answer

You need to specify an encoding:

unicode(a, 'utf8')

or, using str.decode():

a.decode('utf8')

but do pick the right codec for your input; you clearly have UTF-8 data here but that may not always be the case.
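As a minimal sketch of the difference between the two types (not part of the original answer): the snippet below uses explicit `b'...'` and `u'...'` literals so it should behave the same on Python 2.7 and Python 3, where `bytes` plays the role of Python 2's `str`. The variable `ascii_error_position` is just an illustrative name.

```python
# Raw UTF-8 bytes, as in the question (a plain str in Python 2).
a = b'C\xc3\xa9dric Roger'

# Decoding turns the two bytes C3 A9 into the single code point U+00E9.
b = a.decode('utf8')
assert b == u'C\xe9dric Roger'
assert len(a) == 13   # 13 bytes
assert len(b) == 12   # 12 characters

# Decoding with the ASCII codec (what unicode(a) does implicitly in
# Python 2) fails on the first non-ASCII byte, as in the traceback above.
ascii_error_position = None
try:
    a.decode('ascii')
except UnicodeDecodeError as exc:
    ascii_error_position = exc.start   # index of the offending byte
```

Once decoded, `b` can be compared directly with other Unicode values.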

To understand what this does, I urge you to read:

12 Comments

Sorry if I'm being stupid here but unicode('C\xc3\xa9dric Roger','utf8') doesn't yield u'C\xc3\xa9dric Roger'...
@JohnGreenall: No, because you now have a Unicode value; C3 A9 is the UTF-8 encoding for the U+00E9 codepoint in the Unicode standard, a.k.a. LATIN SMALL LETTER E WITH ACUTE. Python will display that as u'\xe9' when representing the unicode string.
@JohnGreenall: Again, please do read the links included in my answer, there are some fundamental concepts you need to understand here.
If you really want to get u'C\xc3\xa9dric Roger' then the encoding would be iso-8859-1, but as Martijn says that seems unlikely to be the right thing, unless the guy's name really is CÃ©dric (I'm glad I'm not called that).
@JohnGreenall: Yes, if the Mongo driver is returning a Unicode value with UTF-8 bytes in it, then that is a bug in that driver, or someone inserted the value that way. You can encode to Latin 1 (which encodes the first 256 Unicode codepoints one-on-one to bytes), then decode from UTF-8: mongovalue.encode('latin1').decode('utf8'). That'll 'repair' the value.
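The points made in these comments can be sketched roughly as follows. Here `mongovalue` is only a hypothetical stand-in for whatever the driver returns, and the literals are written so the snippet should run on both Python 2.7 and Python 3.

```python
import unicodedata

raw = b'C\xc3\xa9dric Roger'          # the UTF-8 bytes from the question

# Decoding as Latin-1 maps every byte to the code point with the same
# number, which gives the "mis-decoded" value the question asked for.
mongovalue = raw.decode('latin1')     # hypothetical broken driver value
assert mongovalue == u'C\xc3\xa9dric Roger'

# The repair: encode back to the original bytes, then decode as UTF-8.
repaired = mongovalue.encode('latin1').decode('utf8')
assert repaired == u'C\xe9dric Roger'

# The second character is now the single code point U+00E9.
name = unicodedata.name(repaired[1])
assert name == 'LATIN SMALL LETTER E WITH ACUTE'
```

This round trip only works because Latin-1 maps all 256 byte values to code points, so no information is lost in the first, wrong decode.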