Convert a UTF-8 String to a string in Python

Question

If I have a unicode string such as:

s = u'c\r\x8f\x02\x00\x00\x02\u201d'

how can I convert this to just a regular string that isn't in unicode format; i.e. I want to extract:

f = '\x00\x00\x02\u201d'

and I do not want it in unicode format. The reason why I need to do this is because I need to convert the unicode in s to an integer value, but if I try it with just s:

int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)

Traceback (most recent call last):
  File "<pyshell#48>", line 1, in <module>
    int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
  File "C:\Python27\lib\encodings\hex_codec.py", line 24, in hex_encode
    output = binascii.b2a_hex(input)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 3: ordinal not in range(128)

yet if I do it with f:

int(f.encode('hex'), 16)
664608376369508L

And this is the correct integer value I want to extract from s. Is there a method where I can do this?

If you want the character \u201d in there then by definition you want a Unicode string. You should review your requirements and probably update your question with an unambiguous problem statement.
– tripleee, Commented Feb 1, 2016 at 18:06
Why do you discard the c\r\x8f\x02? Also, s is not UTF-8, and \u201d in a bytestring literal produces an actual backslash and the characters u201d, so if you really want that result (and 664608376369508L would seem to indicate you do), you've got a really weird conversion in mind. Maybe you messed up your data somewhere upstream, and you should fix it there.
– user2357112, Commented Feb 1, 2016 at 18:11
I don't fully understand what the \u201d character is. This protocol talks to a device that sends back s. In s, only what's listed in f contains data. I need to decode f into an integer. (The 664608376369508L I listed is not correct). Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972, but when I receive something like: \u201d or similar, I don't know how to handle it.
– Mink, Commented Feb 2, 2016 at 18:13

bobince · Accepted Answer · 2016-02-02 21:34:13Z

Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972

OK, so I think what's happening here is you're trying to read four bytes from a byte-oriented device, and decode that to an integer, interpreting the bytes as a 32-bit word in big-endian order.

To do this, use the struct module and byte strings:

>>> import struct
>>> struct.unpack('>i', '\x00\x00\x03\xCC')[0]
972
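As a slightly fuller sketch of the same idea, the snippet below wraps the unpack call in a small helper and contrasts the signed ('>i') and unsigned ('>I') format codes. The helper name be_int32 is made up for illustration, and the bytes literals (b'...') are an assumption so that the code runs on both Python 2.7 and Python 3; whether the device's value is meant to be signed is not something the question states.

```python
import struct

def be_int32(raw):
    """Interpret exactly four raw bytes as a signed big-endian 32-bit integer.

    be_int32 is a hypothetical helper name used only for this example.
    """
    if len(raw) != 4:
        raise ValueError("expected 4 bytes, got %d" % len(raw))
    return struct.unpack('>i', raw)[0]

# The reply quoted in the question: 0x000003cc
value = be_int32(b'\x00\x00\x03\xcc')

# '>i' is signed and '>I' is unsigned; they only differ when the top bit is set.
signed = struct.unpack('>i', b'\xff\xff\xff\xff')[0]
unsigned = struct.unpack('>I', b'\xff\xff\xff\xff')[0]

print(value, signed, unsigned)
```

If the device never sends negative values, '>I' is probably the safer choice, since it avoids large readings wrapping around to negative numbers.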

(I'm not sure why you were trying to reverse the string then hex-encode; that would put the bytes in the wrong order and give much too large output.)

I don't know how you're reading from the device, but at some point you've decoded the bytes into a text (Unicode) string. Judging from the U+201D character in there I would guess that the device originally gave you a byte 0x94 and you decoded it using code page 1252 or another similar Windows default (‘ANSI’) code page.

>>> struct.unpack('>i', '\x00\x00\x02\x94')[0]
660

It may be possible to reverse the incorrect decoding step by encoding back to bytes using the same mapping, but this is dicey and depends on which encodings are involved (not all bytes are mapped to anything usable in all encodings). Better would be to look at where the input is coming from, find where that decode step is happening, and get rid of it so you keep hold of the raw bytes the device sent you.
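To make the "reverse the decoding" workaround concrete, here is a minimal sketch that assumes the bad decode really was code page 1252 (a guess, as noted above). The helper name undo_cp1252 is invented for this example. It also shows why the approach is fragile: the u'\x8f' character that appears earlier in s has no mapping in Python's cp1252 codec, so re-encoding it fails.

```python
import struct

def undo_cp1252(text):
    """Re-encode a mis-decoded string back to bytes.

    Hypothetical helper; it assumes the original decode used cp1252.
    """
    return text.encode('cp1252')

# The last four characters of s from the question.
tail = u'\x00\x00\x02\u201d'
raw = undo_cp1252(tail)              # U+201D maps back to byte 0x94
value = struct.unpack('>i', raw)[0]  # big-endian signed 32-bit integer

# Not every character can be mapped back: 0x8f is undefined in
# Python's cp1252 codec, so this round trip is not reliable in general.
try:
    undo_cp1252(u'\x8f')
    reversible = True
except UnicodeEncodeError:
    reversible = False

print(repr(raw), value, reversible)
```

This is only a stopgap. Keeping the raw bytes from the device and never decoding them as text avoids the problem entirely.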

Ok, I will look into exactly how the device is sending me back the bytes and if I can get the raw bytes. Thank you.
