UnicodeEncodeError for Accented Characters with json.loads in Python

Question

I am scraping a JSON feed and then trying to save certain fields to a file.

First, I'm getting the JSON feed using urllib2:

html = urllib2.urlopen(url).read()

I'm then using json.loads:

data = json.loads(html)

I'm then trying to grab the 'Name' field for each item:

for i in range(len(data["body"]["events"])):
    Name = str(data["body"]["events"][i]["Name"])

Whenever there is an accented character in the "Name" field, Python throws a UnicodeEncodeError.

Don't call str on it; that forces ASCII, which has no accented characters.
– Joran Beasley, Commented Nov 20, 2013 at 22:02

VooDooNOFX · Accepted Answer · 2014-05-31 03:08:51Z

This is a complex issue that is worth taking a few moments to understand: unicode vs. byte strings in Python 2.x. I have found Ned Batchelder's Unicode Pain talk at PyCon 2012 to be endlessly helpful in understanding this.

As the pyvideo site is endlessly having troubles keeping videos online, here's a couple links to it:

This is especially important when scraping websites from unknown sources, and unknown encodings!

Edit: To summarize some information from nedbat's talk: you really should know which encoding the target site uses for its data. urllib2 returns bytes, which may or may not decode cleanly to unicode. In this case, your Name field may contain an accented character, which has no representation in ASCII (the basic A-Z, a-z, 0-9, and punctuation set), so any implicit conversion through ASCII fails.

The solution is to decode these bytes from UTF-8 (or whichever encoding the page actually uses) into unicode, like so:

url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'  # A page containing UTF-8 encoded text
# Decode the page's UTF-8 bytes into unicode; bytes that cannot be decoded become U+FFFD (the replacement character).
html = urllib2.urlopen(url).read().decode(u'utf-8', u'replace')

Here, you can compare the output of these 2 methods:

# Look at one of the last lines of the page as raw bytes. Note the \xef\xbc\x81 escapes and the missing u prefix: this is a byte string, not unicode.
>>> urllib2.urlopen(url).read().splitlines()[-11]
'<dd>\xef\xbc\x81 \xef\xbc\x82 \xef\xbc\x83 \xef\xbc\x84 \xef\xbc\x85 \xef\xbc\x86 ... '

# Now, convert that data into unicode as you open the site.
>>> urllib2.urlopen(url).read().decode(u'utf-8').splitlines()[-11]
u'<dd>\uff01 \uff02 \uff03 \uff04 \uff05 \uff06 \uff07 \uff08 \uff09 \uff0a \uff0b ... '

In the first example, the data comes back as bytes; in the second, it is all unicode.

There is a caveat: not every page is encoded as UTF-8, so the decode can fail (or, with 'replace', silently substitute characters), though it would be rare for this to happen.
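One way to cope with pages that are not valid UTF-8 is to try a short list of candidate encodings before falling back to replacement. The following is only a sketch: the helper name decode_body and the cp1252 fallback are assumptions for illustration, not something prescribed above, and the right candidates depend on the sites you scrape.

```python
def decode_body(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes to text, trying each candidate encoding in order.

    Hypothetical helper: the candidate list is an assumption; adjust it
    to the encodings your target sites actually use.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what we can, marking bad bytes with U+FFFD.
    return raw.decode("utf-8", "replace")


utf8_bytes = u"Caf\u00e9".encode("utf-8")     # valid UTF-8
legacy_bytes = u"Caf\u00e9".encode("cp1252")  # not valid UTF-8

assert decode_body(utf8_bytes) == u"Caf\u00e9"
assert decode_body(legacy_bytes) == u"Caf\u00e9"
```

Order matters here: single-byte encodings such as cp1252 accept almost any input, so the strictest candidate (UTF-8) should come first.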

A last piece of advice: switch to the third-party requests library, which decodes the response body to unicode for you, using the encoding it detects for the response. An example:

>>> import requests
>>> url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'
>>> response = requests.get(url)

# You can get bytes out of the response:
>>> type(response.content)  # Returns bytes
<type 'str'>

# Or, you can get unicode out of it:
>>> type(response.text)  # Returns unicode
<type 'unicode'>

You can now pass response.text to json.loads(response.text), which returns unicode strings in its results. Then remove your str() wrapper, since that is what triggers the implicit ASCII encode.
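Putting the pieces together, here is a sketch of extracting the names and saving them without str(). It assumes the data["body"]["events"][i]["Name"] layout from the question; the inline payload stands in for the downloaded feed, and the output file name names.txt is hypothetical. io.open is used so the text is encoded as UTF-8 when it is written.

```python
import io
import json
import os
import tempfile

# Hypothetical payload standing in for the bytes returned by the feed.
raw = u'{"body": {"events": [{"Name": "Caf\\u00e9"}, {"Name": "Z\\u00fcrich"}]}}'.encode("utf-8")

# Decode the bytes first, then parse; the parsed strings are unicode text.
data = json.loads(raw.decode("utf-8"))
names = [event["Name"] for event in data["body"]["events"]]

# Write the text out, letting io.open do the UTF-8 encoding (no str() call).
path = os.path.join(tempfile.mkdtemp(), "names.txt")
with io.open(path, "w", encoding="utf-8") as fh:
    for name in names:
        fh.write(name + u"\n")
```

Iterating over the list directly also avoids the range(len(...)) indexing from the question.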

Here's a link to the requests method reference used above.

-1, link-only answers are not welcome on StackOverflow. Please include a summary of the relevant information from the link, so that your answer can stand on its own.
@user4815162342. Thanks for the info. Updated my answer to be much much more detailed.

Joran Beasley · Accepted Answer · 2013-11-20 22:03:34Z

Name = data["body"]["events"][i]["Name"].decode('utf8')

is most likely what you want.

The problem is that you are calling str(my_variable), and str forces it to be ASCII, which does not support accents.
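A small sketch of what goes wrong: in Python 2, str() on a unicode value implicitly encodes it as ASCII, which is the same operation as the explicit encode below. The sample name is made up for illustration.

```python
name = u"Caf\u00e9"  # hypothetical value, as json.loads would return it

try:
    name.encode("ascii")  # what Python 2's str(name) attempts implicitly
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly as UTF-8 succeeds and yields bytes that can be written out.
utf8_bytes = name.encode("utf-8")
```

The ASCII encode fails on the accented character, while the UTF-8 encode does not.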

I have tried this, and unfortunately it still throws the same UnicodeEncodeError.
– alpswd, Commented over a year ago
