I have some Python code that's receiving a string with bad unicode in it. When I try to ignore the bad characters, Python still chokes (version 2.6.1). Here's how to reproduce it:

s = 'ad\xc2-ven\xc2-ture'
s.encode('utf8', 'ignore')

It throws

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

What am I doing wrong?

  • Are you sure you don't want s.decode('utf8','ignore') instead? Commented May 25, 2011 at 13:08
  • Yup, you're right. Whoops :) Commented May 25, 2011 at 13:24

2 Answers

In Python 2.x, converting a byte string to a unicode instance is done with str.decode():

 >>> s.decode("ascii", "ignore")
 u'ad-ven-ture'

2 Comments

Note that with the OP's encoding (utf-8) instead of ASCII you'll get u'adventure'. I actually prefer unicode(utf8_string, 'utf-8', 'ignore') as it's clearer you're creating a unicode string.
There is also s.decode('ascii', 'replace') which can be used to get an idea of the issues.
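
To illustrate the 'replace' handler mentioned above, here is a minimal sketch. It uses a b'' literal so that the same lines should run on Python 2.6+ and on Python 3; the variable names are made up for illustration.

```python
# Byte string from the question. The b prefix keeps it a byte string on Python 3 as well.
raw = b'ad\xc2-ven\xc2-ture'

# 'replace' substitutes U+FFFD (the replacement character) for each
# undecodable byte, so the positions of the bad bytes stay visible.
marked = raw.decode('ascii', 'replace')

# 'ignore' silently drops the same bytes instead.
dropped = raw.decode('ascii', 'ignore')

print(repr(marked))
print(repr(dropped))
```

Comparing the two results shows where data would be lost before you commit to 'ignore'.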

You are confusing "unicode" and "utf-8". Your string s is not unicode; it's a bytestring in a particular encoding (not UTF-8, more likely ISO-8859-1 or similar). Going from a bytestring to unicode is done by decoding the data, not encoding. Going from unicode to a bytestring is encoding. When you call encode() on a bytestring, Python 2 first decodes it implicitly with the default ASCII codec, which is why you get a UnicodeDecodeError. Perhaps you meant to make s a unicode string:

>>> s = u'ad\xc2-ven\xc2-ture'
>>> s.encode('utf8', 'ignore')
'ad\xc3\x82-ven\xc3\x82-ture'

Or perhaps you want to treat the bytestring as UTF-8 but ignore invalid sequences, in which case you would decode the bytestring with 'ignore' as the error handler:

>>> s = 'ad\xc2-ven\xc2-ture'
>>> u = s.decode('utf-8', 'ignore')
>>> u
u'adventure'
>>> u.encode('utf-8')
'adventure'
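
The direction of decode and encode described above, and the origin of the error in the question, can be sketched as follows. This is only an illustration: it spells out the ASCII decode that Python 2 performs implicitly when encode() is called on a bytestring, and it uses b'' literals so that it should also run on Python 3. It decodes with 'ascii' for simplicity; swap in 'utf-8' to match the example above. The variable names are made up.

```python
raw = b'ad\xc2-ven\xc2-ture'

# Step that Python 2 performs implicitly before str.encode() can run:
# a strict ASCII decode, which fails on the first non-ASCII byte.
try:
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding: bytestring -> unicode, dropping the undecodable bytes.
text = raw.decode('ascii', 'ignore')

# Encoding: unicode -> bytestring. No error handler is needed here,
# because every unicode character can be represented in UTF-8.
encoded = text.encode('utf-8')

print(error_message)
print(repr(text))
print(repr(encoded))
```

The error handler belongs on the decode step, since that is where the invalid bytes are encountered.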
