Taking data from XML with Python

Question

I'm trying to understand how to pull certain data from an XML file with Python.

At the moment I'm pulling information from an API and getting the XML file but I want to get specific information from within the XML directly.

From what I could find it seems like Element Tree is the answer but I have found it very difficult to understand and am really unsure it's the correct way to create a solution.

I've left below the code I've used to get the XML data, and a shortened XML file it gave me (just left the important parts I need to pull).

Thank you.

import requests


#Import routes
routes=[]



class routesClass:
    def __init__(self,name,url):#,start,end,offset,rwe,al):
        self.n=name
        self.u=url
        #self.s=start
        #self.e=end
        #self.o=offset
        #self.r=rwe
        #self.a=al

#Add example route
testRoute1=routesClass("EasternFwy-Hoddle/Johnston","https://api.tomtom.com/routing/1/calculateRoute/-37.79205923474775,145.03010268799338:-37.798883995180496,145.03040309540322:-37.807106781970354,145.02895470253526:-37.80320743019992,145.01021142594075:-37.7999012967757,144.99318476311566:?routeType=shortest&key=SECRETKEY&computeTravelTimeFor=all")
routes.append(testRoute1)
#routes.append(testRoute2)

print(routes[0].u)

And the XML stuff.

<summary>
<lengthInMeters>5144</lengthInMeters>
<travelTimeInSeconds>764</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2017-12-28T14:42:14+11:00</departureTime>
<arrivalTime>2017-12-28T14:54:58+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>478</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>764</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>764</liveTrafficIncidentsTravelTimeInSeconds>
</summary>
<leg>
<summary>
<lengthInMeters>806</lengthInMeters>
<travelTimeInSeconds>67</travelTimeInSeconds>
<trafficDelayInSeconds>0</trafficDelayInSeconds>
<departureTime>2017-12-28T14:42:14+11:00</departureTime>
<arrivalTime>2017-12-28T14:43:21+11:00</arrivalTime>
<noTrafficTravelTimeInSeconds>59</noTrafficTravelTimeInSeconds>
<historicTrafficTravelTimeInSeconds>67</historicTrafficTravelTimeInSeconds>
<liveTrafficIncidentsTravelTimeInSeconds>67</liveTrafficIncidentsTravelTimeInSeconds>
</summary>

Mr Lister · Accepted Answer · 2018-01-01 21:29:50Z

I recommend lxml. In my opinion it’s easier to navigate through an xml tree than Element Tree. . Here is a demo of how to use the module.

An Example
Taking your xml, this is how I'd parse it using lxml. If you save the code for example.xml and xmlparse.py

example.xml - The XML you provided was malformed.

It didn't have a parent xml tag that grouped the two summary sections.
There was a random <leg> tag in the middle of the two summary sections.

Those two issues wouldn't allow it to parse, so I removed the <leg> tag and grouped the two summary sections within a <parent> tag. Here is the XML.

<parent>
    <summary>
        <lengthInMeters>5144</lengthInMeters>
        <travelTimeInSeconds>764</travelTimeInSeconds>
        <trafficDelayInSeconds>0</trafficDelayInSeconds>
        <departureTime>2017-12-28T14:42:14+11:00</departureTime>
        <arrivalTime>2017-12-28T14:54:58+11:00</arrivalTime>
        <noTrafficTravelTimeInSeconds>478</noTrafficTravelTimeInSeconds>
        <historicTrafficTravelTimeInSeconds>764</historicTrafficTravelTimeInSeconds>
        <liveTrafficIncidentsTravelTimeInSeconds>764</liveTrafficIncidentsTravelTimeInSeconds>
    </summary>
    <summary>
        <lengthInMeters>806</lengthInMeters>
        <travelTimeInSeconds>67</travelTimeInSeconds>
        <trafficDelayInSeconds>0</trafficDelayInSeconds>
        <departureTime>2017-12-28T14:42:14+11:00</departureTime>
        <arrivalTime>2017-12-28T14:43:21+11:00</arrivalTime>
        <noTrafficTravelTimeInSeconds>59</noTrafficTravelTimeInSeconds>
        <historicTrafficTravelTimeInSeconds>67</historicTrafficTravelTimeInSeconds>
        <liveTrafficIncidentsTravelTimeInSeconds>67</liveTrafficIncidentsTravelTimeInSeconds>
    </summary>
</parent>

xmlparse.py - In this script I offer you a loop that prints out the keys (elem.text) and values (text) as well as a logic statement that checks if one of the keys exists and if its value is greater than 700,. This was just to help you appreciate how to add a trigger in the loop.

from lxml import etree

def parseXML(xmlFile):
    """
    Parse the xml
    """
    with open(xmlFile) as fobj:
        xml = fobj.read()

    root = etree.fromstring(xml)

    for appt in root.getchildren():
        for elem in appt.getchildren():
            if not elem.text:
                text = "None"
            else:
                text = elem.text

            ##This is doing something with the xml based on it's tag and value.
            if elem.tag == 'travelTimeInSeconds' and int(text) > 700:
                print('******** Do something with ', elem.tag, ' : ', text)
            print(elem.tag + " => " + text)

if __name__ == "__main__":
    parseXML("example.xml")

Output -- If you save the code for xmlparse.py and save the updated xml I provided in an example.xml file you'll receive the following output when you run the script:

lengthInMeters => 5144
******** Do something with  travelTimeInSeconds  :  764
travelTimeInSeconds => 764
trafficDelayInSeconds => 0
departureTime => 2017-12-28T14:42:14+11:00
arrivalTime => 2017-12-28T14:54:58+11:00
noTrafficTravelTimeInSeconds => 478
historicTrafficTravelTimeInSeconds => 764
liveTrafficIncidentsTravelTimeInSeconds => 764
lengthInMeters => 806
travelTimeInSeconds => 67
trafficDelayInSeconds => 0
departureTime => 2017-12-28T14:42:14+11:00
arrivalTime => 2017-12-28T14:43:21+11:00
noTrafficTravelTimeInSeconds => 59
historicTrafficTravelTimeInSeconds => 67
liveTrafficIncidentsTravelTimeInSeconds => 67

How would you write a script in python to pull away this code in particular?
@MichaelHolborn I updated the answer with a working example for you. I hope this helps.
Brilliant work - Looking over it now and working to understand your solution. Also happy holiday!
The only thing that's got me a bit stumped is that I want to directly parse from a HTML link - So I don't have the file downloaded.
Then I would import urllib.request and modify the open statement in the script above to something like ‘with urllib.request.urlopen(‘yoururl’) as fobj:’. Of course this is dependent on the url request library you use, but hopefully that gives you an idea of how to open a remote xml file retrievable via a url.

Collectives™ on Stack Overflow

Taking data from XML with Python

1 Answer 1

6 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Related