How to parse an XML file into a tree in Python

Question

Note: I must use ElementTree for this project, so please suggest something that uses ElementTree.

I have a file that looks roughly like this (each chunk is separated by a blank line):

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

<a>
    <b>
       ....
    </b>
    <c>
       ....
    </c>
</a>
<d><c></c></d>

I know this is not valid XML, so what I am trying to do is read the whole thing as a string and add a root element to it, which would end up looking like this for each chunk:

<root>
    <a>
        <b>
           ....
        </b>
        <c>
           ....
        </c>
    </a>
    <d><c></c></d>
</root>

I want to know if there is a simple way to read the XML chunks one by one, wrap each one in a parent node, and then do the same for the next chunk, and so on.

Any help would be appreciated, thank you.

It sounds like what you really want is to parse each of the trees, in order, without bundling them into one big tree. Yes?
– abarnert, Commented Jun 25, 2013 at 1:42
Exactly, and I can't think of a way to do that. I have multiple <a> .. </a> <d>...</d> chunks in my input file, separated by blank lines. I need to read each chunk one by one and add a parent node to each. All the functions that Python provides seem to read the whole file, which is not what I am trying to do. Maybe read one line at a time and store everything I read until I hit a blank line?
– Nayana, Commented Jun 25, 2013 at 1:50
Yes, that last sentence is exactly what I was saying is the "easy way" in my (second) answer. I tried to write it up in a way that's clear even to a novice (no two-argument iter or itertools.groupby), even if it's twice as long that way. Hopefully it's what you want.
– abarnert, Commented Jun 25, 2013 at 2:30

abarnert · Accepted Answer · 2013-06-25 02:34:48Z

It sounds like what you really want to do is parse a sequence of XML trees—maybe more than one in the same file, or maybe there are multiple files, or who knows.

ElementTree can't quite do that out of the box… but you can build something out of it that can.

First, there's the easy way: Just put your own parser in front of etree. If your XML documents are really separated by blank lines, and there are no embedded blank lines in any document, this is trivial:

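import xml.etree.ElementTree as ET

# inputFile is an open text file (or sys.stdin)
lines = []
for line in inputFile:
    if not line.strip():
        if lines:  # ignore repeated blank lines
            xml = ET.fromstringlist(lines)
            print(xml)
            lines = []
    else:
        lines.append(line)
if lines:  # last document, when the file doesn't end with a blank line
    xml = ET.fromstringlist(lines)
    print(xml)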
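
For the input shown in the question, each blank-line-separated chunk holds two top-level elements (<a> and <d>), so each chunk needs a wrapper element before it will parse. A minimal sketch that combines the blank-line splitting above with a wrapper root might look like this; the function name iter_wrapped_trees and the default tag "root" are assumptions made for illustration, not part of ElementTree:

```python
import io
import xml.etree.ElementTree as ET


def iter_wrapped_trees(lines, root_tag="root"):
    """Yield one Element per blank-line-separated chunk, wrapped in <root_tag>."""
    chunk = []
    for line in lines:
        if line.strip():
            chunk.append(line)
        elif chunk:
            # A blank line closes the current chunk: wrap it and parse it.
            yield ET.fromstring("<{0}>{1}</{0}>".format(root_tag, "".join(chunk)))
            chunk = []
    if chunk:
        # Last chunk, when the input does not end with a blank line.
        yield ET.fromstring("<{0}>{1}</{0}>".format(root_tag, "".join(chunk)))


sample = io.StringIO(
    "<a><b>1</b></a>\n<d><c></c></d>\n"
    "\n"
    "<a><b>2</b></a>\n<d><c></c></d>\n"
)
for root in iter_wrapped_trees(sample):
    print(root.tag, [child.tag for child in root])
```

Because only one chunk is held in memory at a time, this should also work when reading from sys.stdin instead of the StringIO sample.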

If the "outer structure" is more complicated than this—e.g., if each document begins immediately after the other ends, or if you need stateful information to distinguish within-tree blank lines from between-tree ones—then this solution won't work (or, at least, it will be harder rather than easier).

In that case, things get more fun.

Take a look at iterparse. It lets you parse a document on the fly, yielding each element when it gets to the end of the element (and even trimming the tree as you go along, if the tree is too big to fit into memory).

The problem is that when iterparse gets to the end of the first document and runs into the start of the next one, it will raise a ParseError and abort, instead of going on to the next document.

You can easily detect that by reading the first start element, then stopping as soon as you reach its end. It's a bit more complicated, but not too bad. Instead of this:

for _, elem in ET.iterparse(arg):
    print(elem)

You have to do this:

parser = ET.iterparse(arg, events=('start', 'end'))
_, start = next(parser)
while True:
    event, elem = next(parser)
    if event == 'end':
        print(elem)
        if elem == start:
            break

(You can make that a bit more concise with filter and itertools, but I thought the explicit version would be easier to understand for someone who's never used iterparse.)
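
To see the start/end pattern in action on a single document, here is a small self-contained sketch. The helper name first_document_tags and the sample bytes are made up for illustration; it collects the tag of every element as it ends and stops once the root element itself has ended:

```python
import io
import xml.etree.ElementTree as ET


def first_document_tags(source):
    """Return the tags of elements in the order they end, stopping at the root's end."""
    parser = ET.iterparse(source, events=("start", "end"))
    _, root = next(parser)  # the first event is the root's "start"
    ended = []
    for event, elem in parser:
        if event == "end":
            ended.append(elem.tag)
            if elem is root:  # the whole first document has been seen
                break
    return ended


print(first_document_tags(io.BytesIO(b"<a><b>x</b><c/></a>")))
```

Children end before their parent, so the root's tag comes last in the returned list.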

So, you can just do that in a loop until EOF, right? Well, no. The problem is that iterparse doesn't leave the read pointer at the start of the next document, and there's no way to find out where the next document starts.

So, you will need to control the file, and feed the data to iterparse. There are two ways to do this:

First, you can create your own file wrapper object that provides all the file-like methods that ET needs, and pass that to ET.iterparse. That way, you can keep track of how far into the file iterparse reads, and then start the next parse at that offset.

It isn't exactly documented what file-like methods iterparse needs, but as the source shows, all you need is read(size) (and you're allowed to return fewer than size bytes, just as a real file could) and close(), so that's not hard at all.
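
A rough sketch of such a wrapper follows, assuming (as stated above) that read(size) and close() are enough. The class name CountingReader is hypothetical. Note that it only records how many bytes iterparse has pulled from the file, which happens in large chunks, so on its own it does not tell you where one document ends and the next begins:

```python
import io
import xml.etree.ElementTree as ET


class CountingReader:
    """File-like wrapper that records how many bytes have been read through it."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._fileobj.read(size)
        self.bytes_read += len(data)
        return data

    def close(self):
        self._fileobj.close()


data = b"<a><b>x</b></a>"
reader = CountingReader(io.BytesIO(data))
tags = [elem.tag for _, elem in ET.iterparse(reader)]
print(tags, reader.bytes_read)
```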

Alternatively, you can drop down a level and use an ET.XMLParser directly. That sounds scary, but it's not that bad—look how short iterparse's source is, and how little of what it's doing you actually need.

Anyway, it comes down to something like this (pseudocode, not tested):

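import xml.etree.ElementTree as ET

class Target(object):
    def __init__(self):
        self.depth = 0  # number of currently open elements
        self.builder = ET.TreeBuilder()
        self.tree = None
    def start(self, tag, attrib):
        self.depth += 1
        return self.builder.start(tag, attrib)
    def end(self, tag):
        ret = self.builder.end(tag)  # TreeBuilder.end takes only the tag
        self.depth -= 1
        # Counting depth (rather than comparing tag names) avoids stopping
        # early on a nested element that has the same tag as the root.
        if self.depth == 0:
            self.tree = self.builder.close()
            return self.tree
        return ret
    def data(self, data):
        return self.builder.data(data)
    def close(self):
        if self.tree is None:
            self.tree = self.builder.close()
        return self.tree

# inputFile is an open text file (or sys.stdin);
# do_stuff_with is a placeholder for your own processing
parser = None
for line in inputFile:
    if parser is None:
        target = Target()
        parser = ET.XMLParser(target=target)
    parser.feed(line)
    if target.tree is not None:  # an Element with no children is falsy
        do_stuff_with(target.tree)
        parser = None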

This is a lot more complicated than I expected it to be, but I guess I will just go ahead and follow your instructions. Thank you. You are amazing!
@Nayana: Hold on, there may be a much simpler way, from the way you described your files. See my latest edit.
Fantastic answer - hope the OP fully appreciates the effort here!
I surely do appreciate your help, abarnert. I might need to spend more time to understand your code, but thank you so much. I will make sure to understand your code.

Jon Clements · Accepted Answer · 2013-06-25 01:27:20Z

Just create a string with the root/end root surrounding:

with open('yourfile') as fin:
    xml_data = '<{0}>{1}</{0}>'.format('rootnode', fin.read())

Then use ET.fromstring(xml_data)
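
Put together, and working from a string rather than a file, that might look like the sketch below. The helper name parse_with_root is made up for illustration. One caveat: this breaks if the text starts with an XML declaration (<?xml ...?>), because a declaration is only allowed at the very beginning of a document.

```python
import xml.etree.ElementTree as ET


def parse_with_root(text, root_tag="rootnode"):
    """Wrap text in <root_tag>...</root_tag> and parse it as one document."""
    return ET.fromstring("<{0}>{1}</{0}>".format(root_tag, text))


root = parse_with_root("<a><b/></a>\n<d><c></c></d>\n")
print(root.tag, [child.tag for child in root])
```

With sys.stdin as the input, the text argument would be sys.stdin.read().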

12 Comments

Nayana Over a year ago

My input file name is not constant; it might change. So I am passing in sys.stdin, but when I pass sys.stdin to the parse function (ET.parse(sys.stdin)), it seems to complain. Do you have any ideas of how to deal with that?

Jon Clements Over a year ago

@Nayana If you're passing sys.stdin then you don't need an open, and using sys.stdin.read() should work just fine to return a string.

Nayana Over a year ago

I am sorry for being so demanding, I like the way you do it. But as I said in the question, I have a multiple number of the first chunk of XML code separated by a blank line. If I do it the way you did, isn't that going to read the entire code and add a root node to the whole thing, which ends up being an enormous XML code? Is there any easy way to deal with that? I have a couple ideas, but too complicated.

Jon Clements Over a year ago

@Nayana To clarify - you have multiple inputs, and expect to be able to treat each one as though they were all under a parent node?

Nayana Over a year ago

@Jon Clements Exactly. I need to treat each one as though it were under a different parent node. So, I will have multiple <a> .. </a> <d>...</d> chunks in my input file. I will read them one by one, add a parent node to each, and then do some operations on them.

abarnert · Accepted Answer · 2013-06-25 02:05:14Z

The problem here is pretty simple.

ET.parse takes a filename (or file object). But you're passing it a list of lines. That's not a filename. The reason you get this error:

TypeError: coercing to Unicode: need string or buffer, list found

… is that it's trying to use your list as if it were a string, which doesn't work.

When you've already read the file in, you can use ET.fromstring. However, you have to read it into a string, not a list of strings. For example:

import xml.etree.ElementTree as ET

def readXML(inputFile):  # inputFile is sys.stdin
    f = '<XML>' + inputFile.read() + '</XML>'
    newXML = ET.fromstring(f)  # fromstring returns the root Element, not an ElementTree
    print(newXML.tag)

Or, if you're using Python 3.2 or later, you can use ET.fromstringlist, which takes a sequence of strings—exactly what you have.
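
A small sketch of the fromstringlist variant, using a hand-written list of lines in place of real input (the sample content is made up for illustration):

```python
import xml.etree.ElementTree as ET

# The wrapper tags are added as their own list items around the original lines.
lines = ["<a><b>text</b></a>\n", "<d><c></c></d>\n"]
root = ET.fromstringlist(["<XML>"] + lines + ["</XML>"])
print(root.tag, [child.tag for child in root])
```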

From your side issue:

Another problem that I just realized while typing this is that my input file has multiple inputs. Say, at least more than 10 of the first XML that I wrote. If I do readlines(), isn't that going to read the whole XML ?

Yes, it will. There's never any good reason to use readlines().

But I'm not sure why that's a problem here.

If you're trying to combine a forest of 10 trees into one big tree, you pretty much have to read the whole thing in, right?

Unless you change the way you do things. The easy way to do this is to put your own trivial parser—something that splits the file on blank lines—in front of ET. For example:

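while True:
    # collect lines up to the next blank line (or EOF)
    lines = []
    for line in iter(inputFile.readline, ''):
        if not line.strip():
            break
        lines.append(line)
    if not lines:  # EOF (or two blank lines in a row)
        break
    xml = ET.fromstringlist(lines)
    # do stuff with this tree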
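
The same blank-line splitting can be written more compactly with itertools.groupby. This is only a sketch under the assumption that each chunk is a single well-formed document with one root element; the function name iter_trees is made up for illustration:

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_trees(lines):
    """Yield the root Element of each blank-line-separated document."""
    # groupby collects consecutive lines that share the same "is blank" flag.
    for is_blank, group in itertools.groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield ET.fromstringlist(list(group))


sample = io.StringIO("<a>\n  <b>1</b>\n</a>\n\n\n<d><c/></d>\n")
print([tree.tag for tree in iter_trees(sample)])
```

Runs of several blank lines collapse into one group, so they do not produce empty documents.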

My input file name is not constant; it might change. So I am passing in sys.stdin, but when I pass sys.stdin to the parse function (ET.parse(sys.stdin)), it seems to complain. Do you have any ideas of how to deal with that?
First, why not just pass an input filename as, say, sys.argv[1], instead of requiring the input to be in sys.stdin? Or use fileinput to allow either? Second, you're asking for help debugging code which isn't the code you've posted, with an error that you've only loosely described, and that's impossible to debug. Either edit your question, file a new question, or post everything somewhere like pastebin.com and give us links.
you are AMAZING!!!! This is what I was looking for! I am going to try this now. Thank you so much!!!!
@Nayana: If you're looking at the code at the end here, see the slightly different code at the top of my other answer, which is (a) tested and (b) probably easier to understand.

Community · Accepted Answer · 2017-05-23 12:05:26Z

You have multiple XML fragments that are separated by a blank line. To make each fragment a well-formed XML document you need at least to wrap it in a root element. Building on the fromstringlist code example from @abarnert's answer:

from xml.etree.ElementTree import XMLParser  # cElementTree was removed in Python 3.9

def parse_multiple(lines):
    lines = iter(lines)            # so next() also works when given a list
    for line in lines:
        parser = XMLParser()
        parser.feed("<root>")      # start of xml document
        while line.strip():        # while non-blank line
            parser.feed(line)      # continue xml document
            line = next(lines, "") # get next line
        parser.feed("</root>")     # end of xml document
        yield parser.close() # yield root Element of the xml tree

It yields xml trees (their root elements).

Example:

import sys
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

for root in parse_multiple(sys.stdin):
    etree.dump(root)

answered Jun 25, 2013 at 4:35 by jfs; edited May 23, 2017 at 12:05 by Community Bot

3 Comments

abarnert Over a year ago

I don't think you need to add the <root> element here. The only reason the OP was doing that was to combine all of the trees together into one big tree. If he can parse the trees one by one (and he can), he doesn't need any spurious new elements.

jfs Over a year ago

<a> and <d> go to the same blank-line-separated tree. There are 3 trees in the example input in the question. @Nayana should clarify whether it is indeed the desired outcome.

abarnert Over a year ago

Good point. I assumed that the explanation was right and the initial example was wrong, but the opposite is certainly possible, in which case he's actually got not a forest of blank-line-separated trees, but a forest of blank-line-separated subforests, in which case he needs both workarounds…
