in the process of parsing xml from a website, I've managed to get lost in a bunch of utf-8 encoding issues. Specifically, I have strings that look like:
u'PA_g\xc3\xa9p7'
When I print this I get:
>> PA_gép7
What I want instead comes from the following
print('PA_g\xc3\xa9p7')
>> PA_gép7
Here is my code:
def get_api_xml_response(base_url, query_str):
"""gets xml from api @ base_url using query_str"""
res = requests.get(u'{}{}'.format(base_url, query_str))
xmlstring = clean_up_xml(res.content).encode(u'utf-8')
return ET.XML(xmlstring)
My function clean_up_xml exists to remove the namespace and other chars that were causing me problems.
def clean_up_xml(xml_string):
"""remove the namespace and invalid chars from an xml-string"""
return re.sub(' xmlns="[^"]+"', '', xml_string, count=1).replace('&', '&')