I am trying to parse a very ugly XML file with Python. I can get fairly far into it, but it fails at the npdoc element. What am I doing wrong?
XML:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<npexchange xmlns="http://www.example.com/npexchange/3.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.5">
  <article id="123" refType="Article">
    <articleparts>
      <articlepart id="1234" refType="ArticlePart">
        <data>
          <npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1" xml:lang="sv_SE">
            <body>
              <p>Lorem ipsum some random text here.</p>
              <p>
                <b>Yes this is HTML markup, and I would like to keep that.</b>
              </p>
            </body>
            <headline>
              <p>I am a headline</p>
            </headline>
            <leadin>
              <p>I am some other text</p>
            </leadin>
          </npdoc>
        </data>
      </articlepart>
    </articleparts>
  </article>
</npexchange>
This is the Python code I have so far:
from xml.etree.ElementTree import ElementTree

def parse(self):
    tree = ElementTree(file=filename)
    for item in tree.iter("article"):
        articleParts = item.find("articleparts")
        for articlepart in articleParts.iter("articlepart"):
            data = articlepart.find("data")
            npdoc = data.find("npdoc")  # this is where I get None
            id = item.get("id")
            headline = npdoc.find("headline").text
            leadIn = npdoc.find("leadin").text
            body = npdoc.find("body").text
    return articles
I get the id out, but I cannot access the fields inside the npdoc element. The npdoc variable gets set to None.
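A likely explanation, sketched under the assumption that the file is read with the standard library's xml.etree.ElementTree: each tag is stored together with its namespace URI, so a bare name such as "npdoc" does not match. The SAMPLE string below is a trimmed-down stand-in for the real file, not the file itself.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the real file (same namespaces, fewer elements).
SAMPLE = """<npexchange xmlns="http://www.example.com/npexchange/3.5" version="3.5">
  <article id="123" refType="Article">
    <data>
      <npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1">
        <headline><p>I am a headline</p></headline>
      </npdoc>
    </data>
  </article>
</npexchange>"""

root = ET.fromstring(SAMPLE)

# Tags are stored as "{namespace-uri}localname".
tags = [el.tag for el in root.iter()]
print(tags[0])

# An unqualified name does not match a namespaced element.
unqualified = root.find(".//npdoc")

# A fully qualified name does.
qualified = root.find(".//{http://www.example.com/npdoc/2.1}npdoc")
print(unqualified, qualified is not None)
```

Printing the tags of a real document this way is a quick check of which namespace each element actually belongs to.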
Update: Managed to get the elements into variables by using the namespace in the .find() calls. How do I get the value? As it is HTML it does not come out correctly with the .text attribute.
I would like to get <p>I am a headline</p> in the headline variable, and so on.
Could you add a prefix such as xmlns:n for the http://www.example.com/npdoc/2.1 namespace? Without such a prefix it is difficult to access the elements under this namespace.
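On the prefix question: with xml.etree.ElementTree the prefix does not have to be declared in the file, because find() accepts a prefix-to-URI mapping defined in Python. The sketch below uses that mapping and then serializes the children of an element to keep the markup. The helper name inner_markup, the NS dictionary and the SAMPLE string are illustrative assumptions, and stripping the namespace from the tags is one possible way to get plain <p> output, not the only one.

```python
import copy
import xml.etree.ElementTree as ET

# Stand-in for the npdoc element of the real file.
SAMPLE = """<npdoc xmlns="http://www.example.com/npdoc/2.1" version="2.1">
  <body><p>Lorem ipsum</p><p><b>Bold text</b></p></body>
  <headline><p>I am a headline</p></headline>
  <leadin><p>I am some other text</p></leadin>
</npdoc>"""

# The prefix "n" is chosen here in Python; the XML does not need to declare it.
NS = {"n": "http://www.example.com/npdoc/2.1"}


def inner_markup(element):
    """Return the text and child markup inside element as one string."""
    clone = copy.deepcopy(element)  # leave the parsed tree untouched
    for el in clone.iter():
        if isinstance(el.tag, str):
            el.tag = el.tag.split("}", 1)[-1]  # "{uri}p" -> "p"
    parts = [clone.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in clone)
    return "".join(parts).strip()


npdoc = ET.fromstring(SAMPLE)
headline = inner_markup(npdoc.find("n:headline", NS))
leadin = inner_markup(npdoc.find("n:leadin", NS))
body = inner_markup(npdoc.find("n:body", NS))
print(headline)
```

Without the namespace-stripping loop, ET.tostring would emit generated prefixes such as ns0: on every tag, which is usually not what is wanted when the content is meant to be reused as HTML.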