This is a complex issue that is worth taking a few moments to understand: unicode vs. bytestrings in Python 2.x. I have found Ned Batchelder's Unicode Pain talk from PyCon 2012 endlessly helpful in understanding this.
Since the pyvideo site frequently has trouble keeping videos online, here are a couple of links to it:
- http://pyvideo.org/video/948/
- http://www.youtube.com/watch?feature=player_embedded&v=sgHbC6udIqc
This is especially important when scraping websites from unknown sources, with unknown encodings!
Edit:
To summarize some information from nedbat's talk: you really should know what encoding your data arrives in from the target site. urllib2 will return bytes to you, which may or may not be cleanly decodable into unicode. In this case, your Name field may contain an accented character, i.e. bytes that fall outside the ASCII range (A-Z, a-z, 0-9, and basic punctuation).
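To make the distinction concrete, here is a small sketch (the accented name is made up for illustration) showing the same accented character as raw utf-8 bytes and as a decoded unicode string:

```python
# The same text, first as raw utf-8 bytes, then decoded into unicode.
raw = b'Caf\xc3\xa9'        # utf-8 bytes: \xc3\xa9 is the two-byte encoding of e-acute
text = raw.decode('utf-8')  # Decode the bytes into a proper unicode string
print(repr(text))           # u'Caf\xe9' on Python 2; 'Café' on Python 3
```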
The solution is to decode these bytes using utf-8 (or whatever encoding actually covers your accented characters), like so:
url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html' # A page containing raw unicode!
html = urllib2.urlopen(url).read().decode('utf-8', 'replace') # Decode the page's bytes as utf-8, substituting the U+FFFD replacement character for any bytes that can't be decoded.
Here, you can compare the output of these 2 methods:
# Look at the last section of unicode data as bytes. Notice the \xef escapes, signifying raw bytes, not unicode.
>>> urllib2.urlopen(url).read().splitlines()[-11]
'<dd>\xef\xbc\x81 \xef\xbc\x82 \xef\xbc\x83 \xef\xbc\x84 \xef\xbc\x85 \xef\xbc\x86 ... '
# Now, convert that data into unicode as you open the site.
>>> urllib2.urlopen(url).read().decode(u'utf-8').splitlines()[-11]
u'<dd>\uff01 \uff02 \uff03 \uff04 \uff05 \uff06 \uff07 \uff08 \uff09 \uff0a \uff0b ... '
In the first example, you can see that the data comes back as bytes, in the second, it's all unicode data.
This has a few caveats. Not every page can be decoded as utf-8; if the decode fails or produces garbage, check the charset declared in the page's Content-Type header or its meta tag.
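As a sketch of that caveat (the byte values here are made up for illustration): a byte string that is valid latin-1 but not valid utf-8 will raise under the default 'strict' error handler, 'replace' substitutes U+FFFD for the bad byte, and decoding with the correct codec recovers the text:

```python
data = b'Caf\xe9'  # latin-1 bytes: \xe9 is e-acute in latin-1, but invalid as utf-8

try:
    data.decode('utf-8')  # 'strict' is the default error handler
except UnicodeDecodeError as e:
    print('strict utf-8 decode failed:', e.reason)

print(repr(data.decode('utf-8', 'replace')))  # u'Caf\ufffd': the bad byte becomes U+FFFD
print(repr(data.decode('latin-1')))           # u'Caf\xe9': the right codec recovers the text
```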
My last piece of advice would be to switch to the 3rd-party requests library, which handles decoding to unicode for you automatically. An example:
>>> import requests
>>> url = 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html'
>>> response = requests.get(url)
# You can get bytes out of the response:
>>> type(response.content) # Returns bytes
<type 'str'>
# Or, you can get unicode out of it:
>>> response.text # Returns unicode
<type 'unicode'>
You can now pass response.text to json.loads() and get unicode out of the results. Then, remove your str() wrapper.
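As a sketch (the JSON payload here is made up to stand in for a response.text value), json.loads on a unicode string gives you unicode values directly, so no str() conversion is needed:

```python
import json

payload = u'{"Name": "Caf\\u00e9"}'  # A hypothetical response.text value
result = json.loads(payload)
print(result['Name'])  # Café: already unicode, safe to use as-is
```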
Here's a link to the requests method reference used above.
Avoid calling str() on the result; in Python 2 that forces an implicit ascii encode, and ascii has no accented characters, so it will raise a UnicodeEncodeError.
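A sketch of why: Python 2's str() on a unicode value implicitly encodes it with the ascii codec, and the equivalent explicit call fails as soon as it hits an accented character:

```python
name = u'Caf\xe9'  # unicode data containing an accented character

try:
    name.encode('ascii')  # What Python 2's str() does implicitly
except UnicodeEncodeError as e:
    print('ascii encode failed:', e.reason)  # ascii cannot represent \xe9
```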