
I am scraping a JSON feed and then trying to save certain fields to a file.

First, I'm getting the JSON feed using urllib2:

html = urllib2.urlopen(url).read()

I'm then using json.loads:

data = json.loads(html)

I'm then trying to grab the 'Name' field for each item:

for i in range(len(data["body"]["events"])):
    Name = str(data["body"]["events"][i]["Name"])

Whenever there is an accented character in the "Name" field, Python throws a UnicodeEncodeError.

  • Don't call str on it ... that forces ASCII, which has no accented characters. Commented Nov 20, 2013 at 22:02
  • what is Name and why is it camelcased Commented Nov 20, 2013 at 22:39

2 Answers


This is a complex issue that is worth taking a few moments to understand: unicode vs. byte strings in Python 2.x. I have found Ned Batchelder's Unicode Pain talk at PyCon 2012 to be endlessly helpful in understanding this.

Since the pyvideo site often has trouble keeping videos online, here are a couple of links to it:

  1. http://pyvideo.org/video/948/
  2. http://www.youtube.com/watch?feature=player_embedded&v=sgHbC6udIqc

This is especially important when scraping websites from unknown sources, and unknown encodings!


Edit: To summarize some information from nedbat's talk: You really should know what encoding your data arrives in from the target site. urllib2 returns bytes, which may or may not be decodable into unicode with a given encoding. In this case, your Name field may contain an accented character, which cannot be represented in ASCII (which covers only A-Z, a-z, 0-9, basic punctuation, etc). Calling str() on a unicode value makes Python 2 encode it with the default ASCII codec, which is where the UnicodeEncodeError comes from.

The solution is to decode these bytes from utf-8 (or whatever other encoding the page actually uses) into unicode, like so:

import urllib2

url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'  # A page containing raw unicode!
# Decode the page's bytes as utf-8 into unicode; bytes that can't be decoded become U+FFFD (the replacement character).
html = urllib2.urlopen(url).read().decode(u'utf-8', u'replace')

Here, you can compare the output of these 2 methods:

# Look at the last section of unicode data as bytes. Notice the \xef, signifying bytes, not unicode.
>>> urllib2.urlopen(url).read().splitlines()[-11]
'<dd>\xef\xbc\x81 \xef\xbc\x82 \xef\xbc\x83 \xef\xbc\x84 \xef\xbc\x85 \xef\xbc\x86 ... '

# Now, convert that data into unicode as you open the site.
>>> urllib2.urlopen(url).read().decode(u'utf-8').splitlines()[-11]
u'<dd>\uff01 \uff02 \uff03 \uff04 \uff05 \uff06 \uff07 \uff08 \uff09 \uff0a \uff0b ... '

In the first example, you can see that the data comes back as bytes; in the second, it's all unicode data.

This has a few caveats. Not every page is encoded as utf-8, so decoding can fail, though it would be rare for this to happen.
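For pages that turn out not to be utf-8, one possible approach is to try a short list of encodings in order. This is only a minimal sketch: the helper name `decode_page` and the choice of latin-1 as a fallback are my own assumptions for illustration, not something from the talk. Knowing the real encoding of the site is still the better fix.

```python
def decode_page(raw_bytes, encodings=('utf-8', 'latin-1')):
    """Hypothetical helper: decode with the first encoding that works."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing worked: decode as utf-8 anyway, marking bad bytes with U+FFFD.
    return raw_bytes.decode('utf-8', 'replace')


utf8_bytes = u'caf\xe9'.encode('utf-8')      # valid utf-8
latin1_bytes = u'caf\xe9'.encode('latin-1')  # not valid utf-8

print(decode_page(utf8_bytes) == u'caf\xe9')
print(decode_page(latin1_bytes) == u'caf\xe9')
```

Note that latin-1 can decode any byte sequence, so it never raises; it may silently give you the wrong characters if the page was really in some other encoding.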


My last piece of advice would be to switch to the third-party requests library, which handles decoding to unicode for you. An example:

>>> import requests
>>> url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'
>>> response = requests.get(url)

# You can get bytes out of the response:
>>> type(response.content)  # Returns bytes
<type 'str'>

# Or, you can get unicode out of it:
>>> type(response.text)  # Returns unicode
<type 'unicode'>

You can now pass response.text to json.loads(response.text) to get unicode out of the results. Then remove your str() wrapper.
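Putting that together with the "save to file" part of the question, a rough sketch could look like the following. The `body`/`events`/`Name` layout is taken from the question; the sample feed text, the `out_path` location, and the use of `io.open` for writing are my assumptions, so adapt them to your real feed.

```python
import io
import json
import os
import tempfile

# Stand-in for response.text; a real feed would come from requests.get(url).text.
feed_text = u'{"body": {"events": [{"Name": "Caf\\u00e9"}, {"Name": "Z\\u00fcrich"}]}}'

data = json.loads(feed_text)

# Keep the names as text: no str() wrapper, so no implicit ASCII encoding.
names = [event["Name"] for event in data["body"]["events"]]

# Assumed output location, just for the example.
out_path = os.path.join(tempfile.mkdtemp(), "names.txt")

# io.open encodes to utf-8 on write, so accented characters are saved as-is.
with io.open(out_path, "w", encoding="utf-8") as out:
    for name in names:
        out.write(name + u"\n")
```

The idea is to keep everything as unicode inside the program and only encode at the point where the data leaves it (here, when writing the file).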

Here's a link to the requests method reference used above.


2 Comments

-1, link-only answers are not welcome on StackOverflow. Please include a summary of the relevant information from the link, so that your answer can stand on its own.
@user4815162342. Thanks for the info. Updated my answer to be much much more detailed.
0
Name = data["body"]["events"][i]["Name"].decode('utf8')

is most likely what you want.

The problem is that you are calling str(my_variable), and str will force it to ASCII, which does not support accents.
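To see the failure in isolation, here is a small sketch. The sample value `name` is made up; the point is that in Python 2, str() on a unicode value encodes it with the default codec (normally ASCII), so doing that encode explicitly reproduces the same error, while an explicit utf-8 encode works.

```python
name = u'Jos\xe9'  # made-up value, like what json.loads returns for an accented Name

# What str(name) does implicitly in Python 2 (with the default ASCII codec).
try:
    name.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Explicitly encoding with a codec that can represent the accent succeeds.
encoded = name.encode('utf-8')

print(failed)
print(repr(encoded))
```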

1 Comment

I have tried this, and unfortunately it still throws the same UnicodeEncodeError
