Need help parsing XML file with python

Question

I have an XML file that lists jobs that need to be done. I want to be able to parse it with Python. Here is my sample XML file:

XML Code:

<?xml version="1.0" encoding="ISO-8859-1"?>

<Jobs>

<Job name="Leo" type="upload">
    <File name="Leo.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <File name="Leo2.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <Log name="Leo.txt" path="/leegin/leo/OU/log"/>
    <Notify name="Leo Cruz" email="[email protected]"/>
    <ftp port="21" proto="0" pasvmode="0" mode="0"/>
</Job>

<Job name="Manny" type="download">
    <File name="Manny.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <File name="Manny2.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <Log name="Manny.txt" path="/leegin/leo/OU/log"/>
    <Notify name="Manny Caparas" email="[email protected]"/>
    <ftp port="21" proto="0" pasvmode="0" mode="0"/>
</Job>

<Job name="Joe" type="copy">
    <File name="Joe.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <File name="Joe2.csv" source="/leegin/leo/OU" destination="/leegin/leo/OU/scripts" archive="/leegin/leo/OU/history" date="1" del="1" stat="1"/>
    <Log name="Joe.txt" path="/leegin/leo/OU/log"/>
    <Notify name="Joe Gomez" email="[email protected]"/>
    <ftp port="21" proto="0" pasvmode="0" mode="0"/>
</Job>

</Jobs>

Python Code:

#!/usr/bin/python2.6

import sys
import optparse

def main():
    desc="""This script is used to setup and run an Automator job."""
    parser = optparse.OptionParser()
    parser.description = desc
    parser.add_option('-j', dest='jobname', type='str', action='store', help='Name of job to execute', metavar='[JobName]')
    parser.add_option('-v', dest='verbose', action='store_true', default=False, help='Used to view scripts debug information.')
    (options, args) = parser.parse_args()

    mandatory_options = ['jobname']
    for m in mandatory_options:
        if not options.__dict__[m]:
            print 'Option -j is required.'
            parser.print_help()
            sys.exit(-1)

    getjob(options.jobname)

def getjob(task):
    from xml.etree import ElementTree
    from xml.etree.ElementTree import Element
    from xml.etree.ElementTree import SubElement

    doc = ElementTree.parse('/opt/automize/template/jobs.xml')

    Files = doc.findall("./Job/File")
    for File in Files:
        print File.attrib['name']

if __name__ == '__main__':  
    main()

What I am trying to do is give the Python script a job name, then have the script find that job in the XML file and extract only the part that pertains to it.

So far I have been able to build a list of all jobs, or of all files. I have not been able to get it to do this for a specific job though. I would really appreciate some guidance with this matter.

As a side note, IIRC, the stdlib ElementTree in 2.6 is incredibly slow on large files, so if your real data is significantly larger than your sample data, you should use the stdlib's cElementTree instead (or use a non-stdlib implementation).
– abarnert, Commented Jan 11, 2013 at 19:56
@J.F.Sebastian: What does that code example show that just dropping the expression in my answer into the OP's original code doesn't?
– abarnert, Commented Jan 11, 2013 at 20:21
@abarnert: 1. It is a complete minimal example that OP can test on his Python version. 2. your expression fails on Unicode strings (it is unrelated to the question but since you've asked)
– jfs, Commented Jan 11, 2013 at 20:28
@J.F.Sebastian: OK, I can see the point of 1: if his version raises an exception, it'll be easier to figure out why in a 3-liner than in his full program. For 2, given that you've changed the encoding of his XML source from ISO-8859-1 to utf-8 without comment, I think it's more likely to cause confusion than anything, even if the OP does know the different rules for %-formatting vs. {}-formatting and mixed string types.
– abarnert, Commented Jan 11, 2013 at 20:41

abarnert · Accepted Answer · 2013-01-11 19:48:42Z

The findall method you're using takes a pattern argument, which:

can either be a tag name, or a path expression. If a tag name is given, only direct subelements are checked. Path expressions can be used to search the entire subtree.

If you follow the "path expression" link, you'll see that it's a subset of XPath. So you just need to know the right way to specify your query in XPath terms (or, rather, in the subset of XPath that etree supports).

Your query is asking for all File nodes under all Job nodes. To ask for all File nodes under all Job nodes with the attribute name='Manny', just use Job[@name='Manny'] instead of Job.

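So:

doc.findall("./Job[@name='{0}']/File".format(task))

(The explicit {0} matters on Python 2.6, where str.format does not accept the auto-numbered {} form; that only arrived in 2.7.)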
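To see the predicate in action without touching the real jobs file, here is a minimal sketch. The trimmed-down XML string and the helper name files_for_job are made up for illustration, and it assumes an ElementTree that understands attribute predicates (1.3 or later, i.e. Python 2.7+/3.x):

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of the jobs file from the question.
SAMPLE_XML = """
<Jobs>
  <Job name="Leo" type="upload">
    <File name="Leo.csv"/>
    <File name="Leo2.csv"/>
  </Job>
  <Job name="Manny" type="download">
    <File name="Manny.csv"/>
    <File name="Manny2.csv"/>
  </Job>
</Jobs>
"""

def files_for_job(root, task):
    """Return the name attribute of every File under the Job named `task`."""
    # {0} instead of {} so the format call also works on Python 2.6.
    path = "./Job[@name='{0}']/File".format(task)
    return [f.get("name") for f in root.findall(path)]

root = ET.fromstring(SAMPLE_XML)
print(files_for_job(root, "Manny"))  # ['Manny.csv', 'Manny2.csv']
```

One caveat with building the path by string formatting: a job name that itself contains a single quote would break the expression, so this is only reasonable when the names are known to be simple.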

Unfortunately, XPath support in etree 1.2 was much less complete than in 1.3, and I believe Python 2.6 has 1.2 built in, so this may not work for you. (I believe this will be immediately obvious if it's true: the path pattern compiler will raise an exception telling you that you're using a separator or operator it's never heard of, rather than, e.g., seeming to work but not actually matching anything.)

The obvious solutions are:

Use Python 2.7 (or 3.x) instead of 2.6.
Install 1.3 (see here) and use that instead of the built-in implementation.
Download 1.3 (same link), copy its ElementTree.py and ElementPath.py files into your project, and just import them.
Install lxml and use its implementation instead of the reference implementation.
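If lxml is an option on some machines but not others, one common pattern is an import fallback. This is only a sketch: the inline XML and the backend variable are illustrative, and it assumes that either lxml or an ElementTree 1.3+ stdlib is available, since both accept the same attribute predicate in findall:

```python
try:
    # Third-party implementation; assumed to be optional here.
    from lxml import etree as ET
    backend = "lxml"
except ImportError:
    # Reference implementation from the standard library.
    import xml.etree.ElementTree as ET
    backend = "stdlib"

# Hypothetical cut-down sample; bytes input works with both backends.
SAMPLE = b"""<Jobs>
  <Job name="Joe" type="copy">
    <File name="Joe.csv"/>
    <File name="Joe2.csv"/>
  </Job>
  <Job name="Leo" type="upload">
    <File name="Leo.csv"/>
  </Job>
</Jobs>"""

root = ET.fromstring(SAMPLE)
names = [f.get("name") for f in root.findall("./Job[@name='Joe']/File")]
print("{0}: {1}".format(backend, names))
```

The rest of the script only ever talks to ET, so it does not need to know which implementation was picked up.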

If there is not enough XPath support and it is desirable to use only the stdlib, then it is easy to check ./Job/@name manually by iterating over Job elements first.
@J.F.Sebastian: True. And by the same token, it's not that hard to just iterate over all nodes instead of using findall in the first place. But why? Given that etree was designed so that, even if you can't install it, you can just drop two files into your project, it's hard to imagine why you'd need to avoid using etree 1.3.
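For completeness, the manual check mentioned in the comments could look roughly like the sketch below. It uses only plain tag names in findall, which is the part of the path syntax that the older etree is described as supporting; the sample XML and the function name files_for_job_manual are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down sample of the jobs file.
SAMPLE_XML = """
<Jobs>
  <Job name="Leo" type="upload">
    <File name="Leo.csv"/>
  </Job>
  <Job name="Joe" type="copy">
    <File name="Joe.csv"/>
    <File name="Joe2.csv"/>
  </Job>
</Jobs>
"""

def files_for_job_manual(root, task):
    """Collect File names for one job without any XPath predicate."""
    names = []
    for job in root.findall("Job"):        # direct children only
        if job.get("name") == task:        # compare the attribute by hand
            for f in job.findall("File"):
                names.append(f.get("name"))
    return names

root = ET.fromstring(SAMPLE_XML)
print(files_for_job_manual(root, "Joe"))  # ['Joe.csv', 'Joe2.csv']
```

Since the comparison is an ordinary Python equality test, job names containing quotes or other awkward characters need no special handling here.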
