While parsing XML from a website, I've run into a number of UTF-8 encoding issues. Specifically, I have strings that look like:

u'PA_g\xc3\xa9p7'

When I print this I get:

>> PA_gÃ©p7

What I want instead is the output of the following:

print('PA_g\xc3\xa9p7')
>> PA_gép7

Here is my code:

def get_api_xml_response(base_url, query_str):
    """gets xml from api @ base_url using query_str"""
    res = requests.get(u'{}{}'.format(base_url, query_str))
    xmlstring = clean_up_xml(res.content).encode(u'utf-8')
    return ET.XML(xmlstring)

My function clean_up_xml exists to remove the namespace and other chars that were causing me problems.

def clean_up_xml(xml_string):
    """remove the namespace and invalid chars from an xml-string"""
    return re.sub(' xmlns="[^"]+"', '', xml_string, count=1).replace('&', '&amp;')

1 Answer

res.content is a byte string, most probably already encoded in UTF-8, and your code encodes it into UTF-8 once again. Byte strings should only be decode()'d and Unicode strings should only be encode()'d, apart from a few special cases.
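
To make the double encoding concrete, here is a minimal sketch of how a string like the one in the question can arise and how it can be undone. It is written in Python 3 syntax (an assumption, since the question uses Python 2), and the variable names are illustrative only.

```python
# Python 3 sketch; variable names are illustrative, not from the question.
original = 'PA_g\u00e9p7'  # the intended text, with U+00E9 (e with acute accent)

# Correct: encode text to bytes exactly once.
utf8_bytes = original.encode('utf-8')  # b'PA_g\xc3\xa9p7'

# Mistake: each UTF-8 byte is treated as one character (which is what a
# Latin-1 decode does), producing the string shown in the question.
mojibake = utf8_bytes.decode('latin-1')  # 'PA_g\xc3\xa9p7'

# Repair: reverse the wrong step, then decode with the right codec.
repaired = mojibake.encode('latin-1').decode('utf-8')

print(repr(mojibake))
print(repaired)
```

The repair step is only a workaround for data that is already damaged; removing the extra encode() is the actual fix.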

Since clean_up_xml() works with byte strings, it is better to pass the bytes straight to ElementTree, which will handle the decoding correctly:

xmlstring = clean_up_xml(res.content)
# let ElementTree decode content using information from the XML itself
# e.g. <?xml version="1.0" encoding="UTF-8"?>
return ET.XML(xmlstring)
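
As a small self-contained check of this approach, the sketch below feeds UTF-8 bytes directly to ElementTree. The sample payload and the element name are made up for illustration, and the code assumes Python 3.

```python
import xml.etree.ElementTree as ET

# Made-up sample payload standing in for res.content (bytes, not text).
raw = b'<?xml version="1.0" encoding="UTF-8"?><item>PA_g\xc3\xa9p7</item>'

# ElementTree reads the encoding declaration and decodes the bytes itself,
# so no explicit encode() or decode() call is needed here.
root = ET.XML(raw)

print(root.text)
```

The parsed text comes back as a proper Unicode string with a single accented character rather than two separate ones.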

If you decide to refactor the code to work with Unicode, then all binary inputs should be decoded as early as possible:

# let requests decode response using information from HTTP header
# e.g. Content-Type: text/xml; charset=utf-16
xmlstring = clean_up_xml(res.text)
return ET.XML(xmlstring)
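
The "decode early" idea can be sketched without a network call. In the example below the body bytes and the declared charset are made-up stand-ins for an HTTP response, and the regular expression only mirrors the namespace-stripping step of clean_up_xml(); it is an illustration in Python 3, not a drop-in replacement.

```python
import re
import xml.etree.ElementTree as ET

# Stand-ins for an HTTP response: body bytes plus the charset that a
# Content-Type header might declare (both values are made up).
body = '<root xmlns="http://example.com/ns"><item>PA_g\u00e9p7</item></root>'.encode('utf-16')
declared_charset = 'utf-16'

# Decode once, at the boundary; everything after this line handles text.
text = body.decode(declared_charset)

# Same namespace-stripping idea as clean_up_xml(), applied to text.
text = re.sub(r' xmlns="[^"]+"', '', text, count=1)

root = ET.XML(text)
print(root.find('item').text)
```

With requests, res.text performs roughly the decoding step shown here, using the charset from the response headers when one is given.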

When asking a question related to Unicode, it is important to specify the Python version; in this case it is Python 2 with print_function imported from __future__. In Python 3 you would see the following:

>>> print('PA_g\xc3\xa9p7')
PA_gÃ©p7
>>> 'PA_g\xc3\xa9p7' == u'PA_g\xc3\xa9p7'
True

1 Comment

Thanks so much for your answer! You were right, I was encoding where I shouldn't have been!