Python failing to encode bad unicode to ascii

Question

I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It throws

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

Are you sure you don't want s.decode('utf8', 'ignore') instead?
– Dan, Commented May 25, 2011 at 13:08

Sven Marnach · Accepted Answer · 2011-05-25 13:09:40Z

10

In Python 2.x, converting a byte string to a unicode instance is done with str.decode(), not encode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'
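
To see why the original call raises a decode error even though 'ignore' was passed, here is a rough sketch of what Python 2 does behind the scenes. It is written in Python 3 syntax so that it runs on a current interpreter (that is an assumption, not the asker's setup): the b'' literal stands in for a Python 2 str, and the explicit .decode('ascii') stands in for the conversion Python 2 performs implicitly.

```python
# Python 3 syntax: the b'' literal plays the role of a Python 2 str (byte string).
s = b'ad\xc2-ven\xc2-ture'

# Python 2's s.encode('utf8', 'ignore') behaves roughly like
# s.decode('ascii').encode('utf8', 'ignore'). The 'ignore' handler is only
# handed to the encode step, so the hidden ASCII decode is still strict.
try:
    s.decode('ascii').encode('utf8', 'ignore')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding explicitly puts the error handler where the bad bytes actually are.
cleaned = s.decode('ascii', 'ignore')
print(cleaned)  # ad-ven-ture
```

The error handler has to be given to the step that meets the bad bytes, which is the decode.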


2 Comments

Ben Hoyt Over a year ago

Note that with the OP's encoding (utf-8) instead of ASCII you'll get u'adventure'. I actually prefer unicode(utf8_string, 'utf-8', 'ignore') as it's clearer you're creating a unicode string.
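
For anyone reading this on Python 3, where unicode() no longer exists, the constructor form mentioned here would presumably be spelled str(data, encoding, errors). A small sketch, assuming the same input bytes as the question (the variable names are only for illustration):

```python
# Python 3 counterpart of unicode(utf8_string, 'utf-8', 'ignore').
data = b'ad\xc2-ven\xc2-ture'

via_constructor = str(data, 'utf-8', 'ignore')   # constructor form
via_method = data.decode('utf-8', 'ignore')      # method form

# Both spellings go through the same codec and error handler.
same = via_constructor == via_method
print(same)
```

Which one reads better is a matter of taste; the constructor makes the target type explicit, as the comment argues.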

Wernight Over a year ago

There is also s.decode('ascii', 'replace') which can be used to get an idea of the issues.
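
A short sketch of how 'replace' can be used for diagnosis, again in Python 3 syntax with b'' standing in for a Python 2 str. The bad_positions name is made up for this example:

```python
s = b'ad\xc2-ven\xc2-ture'

ignored = s.decode('ascii', 'ignore')     # bad bytes silently dropped
replaced = s.decode('ascii', 'replace')   # each bad byte becomes U+FFFD

# With a single-byte codec such as ASCII, character offsets line up with
# byte offsets, so the markers show where the offending bytes were.
bad_positions = [i for i, ch in enumerate(replaced) if ch == u'\ufffd']

print(ignored)
print(bad_positions)
```

Once the positions are known, it is easier to decide whether the data is really in some other encoding rather than merely damaged.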

Thomas Wouters · Accepted Answer · 2011-05-25 13:09:54Z

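You are confusing "unicode" and "utf-8". Your string s is not unicode; it is a bytestring in some particular encoding (not valid UTF-8; more likely iso-8859-1 or similar). Going from a bytestring to unicode is done by decoding the data, not encoding it; going from unicode to a bytestring is encoding. When you call encode() on a bytestring, Python 2 first decodes it implicitly with the default ASCII codec, and that hidden step is what raises the UnicodeDecodeError, regardless of the 'ignore' you passed to encode(). Perhaps you meant to make s a unicode string: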

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
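
One caveat if you rerun this today: the u'adventure' result above reflects the asker's Python 2.6. On the Python 3 interpreters I am aware of, the decoder drops only the stray 0xc2 byte and keeps the hyphen after it, so check on your own version instead of relying on the transcript. A minimal sketch in Python 3 syntax (b'' stands in for a Python 2 str):

```python
s = b'ad\xc2-ven\xc2-ture'

# 0xc2 announces a two-byte UTF-8 sequence, but '-' is not a valid
# continuation byte, so each 0xc2 is an error that 'ignore' skips.
u = s.decode('utf-8', 'ignore')
round_trip = u.encode('utf-8')

print(repr(u))
print(repr(round_trip))

# For contrast, a well-formed pair (0xc3 0x82) decodes to U+00C2 without
# needing any error handler.
valid = b'ad\xc3\x82-ven'.decode('utf-8')
print(repr(valid))
```

Either way, the clean-up happens at decode time, and the encode afterwards cannot fail because the text no longer contains anything UTF-8 cannot represent.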
