It sounds like what you really want to do is parse a sequence of XML trees—maybe more than one in the same file, or maybe there are multiple files, or who knows.
ElementTree can't quite do that out of the box… but you can build something out of it that can.
First, there's the easy way: just put your own parser in front of etree. If your XML documents are really separated by blank lines, and there are no embedded blank lines in any document, this is trivial:
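import xml.etree.ElementTree as ET

lines = []
for line in inputFile:
    if not line.strip():
        # A blank line ends the current document; ignore runs of blank lines.
        if lines:
            print(lines)
            xml = ET.fromstringlist(lines)
            print(xml)
            lines = []
    else:
        lines.append(line)
# Parse the last document if the file doesn't end with a blank line.
if lines:
    print(lines)
    xml = ET.fromstringlist(lines)
    print(xml)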
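If you'd rather not manage the `lines` buffer by hand, the same idea can be sketched with itertools.groupby. This is only one way to write it: `iter_blank_separated` and the sample data are names I made up for illustration, and it still assumes blank lines appear only between documents.

```python
import io
import itertools
import xml.etree.ElementTree as ET

def iter_blank_separated(fileobj):
    """Yield one root element per blank-line-separated document (hypothetical helper)."""
    # Group consecutive lines by whether they are blank; non-blank groups are documents.
    for is_blank, group in itertools.groupby(fileobj, key=lambda line: not line.strip()):
        if not is_blank:
            yield ET.fromstringlist(list(group))

# Made-up sample input: two documents separated by two blank lines.
sample = io.StringIO("<a/>\n\n\n<b>\n<c/>\n</b>\n")
docs = list(iter_blank_separated(sample))
tags = [doc.tag for doc in docs]
```

Runs of several blank lines collapse into a single separator here, so there's no empty-buffer case to guard against.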
If the "outer structure" is more complicated than this—e.g., if each document begins immediately after the other ends, or if you need stateful information to distinguish within-tree blank lines from between-tree ones—then this solution won't work (or, at least, it will be harder rather than easier).
In that case, things get more fun.
Take a look at iterparse. It lets you parse a document on the fly, yielding each element when it gets to the end of the element (and even trimming the tree as you go along, if the tree is too big to fit into memory).
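To make the "trimming as you go" part concrete, here's a minimal sketch. The `log`/`entry` document is invented for the example; the only ElementTree calls used are iterparse and Element.clear().

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample document; in practice this would be a large file.
source = io.BytesIO(b"<log><entry id='1'/><entry id='2'/><entry id='3'/></log>")

ids = []
for event, elem in ET.iterparse(source, events=("end",)):
    if elem.tag == "entry":
        ids.append(elem.get("id"))
        # Drop this element's attributes, text and children once we're done with it.
        elem.clear()
```

Note that clear() empties each element but leaves the (now empty) element attached to its parent, so for really huge inputs you may also want to remove processed children from the root.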
The problem is that when iterparse gets past the end of the first document and runs into the start of the next one, it will raise a ParseError (expat reports "junk after document element") and abort, instead of going on to the next document.
You can easily detect that by reading the first start element, then stopping as soon as you reach its end. It's a bit more complicated, but not too bad. Instead of this:
for _, elem in ET.iterparse(arg):
    print(elem)
You have to do this:
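parser = ET.iterparse(arg, events=('start', 'end'))
_, start = next(parser)  # the first event is the start of the root element
while True:
    event, elem = next(parser)
    if event == 'end':
        print(elem)
        if elem is start:  # the root just closed, so this document is done
            break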
(You can make that a bit more concise with filter and itertools, but I thought the explicit version would be easier to understand for someone who's never used iterparse.)
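Here's a small self-contained sketch of both halves of that: stopping at the end of the first root, and the ParseError you get if you keep going. The two-document string is made up for the demo.

```python
import io
import xml.etree.ElementTree as ET

# Made-up input: two documents back to back in one stream.
two_docs = io.StringIO("<a><b/></a>\n<c/>\n")

events = ET.iterparse(two_docs, events=("start", "end"))
_, root = next(events)            # start event of the first root element
for event, elem in events:
    if event == "end" and elem is root:
        break                     # first document is complete
first_tag = root.tag

# Asking for more events now surfaces the error caused by the second root.
try:
    next(events)
    error = None
except ET.ParseError as exc:
    error = exc
```

The events for the first document come out before the error does, which is why stopping at the root's end event is enough to get one clean tree.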
So, you can just do that in a loop until EOF, right? Well, no. The problem is that iterparse doesn't leave the read pointer at the start of the next document, and there's no way to find out where the next document starts.
So, you will need to control the file, and feed the data to iterparse. There are two ways to do this:
First, you can create your own file wrapper object that provides all the file-like methods that ET needs, and pass that to ET.iterparse. That way, you can keep track of how far into the file iterparse reads, and then start the next parse at that offset.
It isn't exactly documented what file-like methods iterparse needs, but as the source shows, all you need is read(size) (and you're allowed to return fewer than size bytes, just as a real file could) and close(), so that's not hard at all.
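As a rough sketch of the wrapper idea, here's a simplified variant: instead of tracking byte offsets, the wrapper hands iterparse one line per read() call, so the underlying file never gets ahead of the current line. `LineReader` and `iter_documents` are hypothetical names, and this assumes every document starts on its own line and that your ElementTree version only calls read() on the object you pass in.

```python
import io
import xml.etree.ElementTree as ET

class LineReader:
    """Hypothetical file wrapper that gives the parser one line per read()."""
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.pending = None

    def skip_blank_lines(self):
        """Advance to the next non-blank line; return False at EOF."""
        while True:
            line = self.fileobj.readline()
            if not line:
                return False
            if line.strip():
                self.pending = line
                return True

    def read(self, size=-1):
        # Returning fewer than `size` characters is fine for a file-like object.
        if self.pending is not None:
            line, self.pending = self.pending, None
            return line
        return self.fileobj.readline()

def iter_documents(fileobj):
    reader = LineReader(fileobj)
    while reader.skip_blank_lines():
        # A fresh iterparse per document, all sharing the same reader.
        events = ET.iterparse(reader, events=("start", "end"))
        _, root = next(events)
        for event, elem in events:
            if event == "end" and elem is root:
                break
        yield root

# Made-up sample input.
sample = io.StringIO("<a><b>1</b></a>\n\n<c>\n  <d/>\n</c>\n")
tags = [doc.tag for doc in iter_documents(sample)]
```

If two documents can share a line, this won't do; you'd need a wrapper that really tracks offsets, as described above.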
Alternatively, you can drop down a level and use an ET.XMLParser directly. That sounds scary, but it's not that bad—look how short iterparse's source is, and how little of what it's doing you actually need.
Anyway, it comes down to something like this (a sketch, not heavily tested; it assumes each document starts on its own line):
class Target(object):
    """Parser target that builds one tree and notices when its root closes."""
    def __init__(self):
        self.depth = 0  # nesting depth, so a nested tag named like the root is safe
        self.builder = ET.TreeBuilder()
        self.tree = None
    def start(self, tag, attrib):
        self.depth += 1
        return self.builder.start(tag, attrib)
    def end(self, tag):
        ret = self.builder.end(tag)
        self.depth -= 1
        if self.depth == 0:
            # The root element just closed, so this document is complete.
            self.tree = self.builder.close()
        return ret
    def data(self, data):
        return self.builder.data(data)
    def close(self):
        if self.tree is None:
            self.tree = self.builder.close()
        return self.tree

parser = None
for line in inputFile:
    if parser is None:
        if not line.strip():
            continue  # skip blank lines between documents
        target = Target()
        parser = ET.XMLParser(target=target)
    parser.feed(line)
    if target.tree is not None:
        do_stuff_with(target.tree)
        parser = None  # start a fresh parser for the next document
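If you're on Python 3.4 or later, a variation on the same loop is to let ET.XMLPullParser build the tree and just count nesting depth yourself, which saves writing a target class. To be clear, this swaps in a different API from the one above; `iter_trees` is a made-up name, and the sketch still assumes each document starts on its own line.

```python
import io
import xml.etree.ElementTree as ET

def iter_trees(lines):
    """Yield the root element of each document in an iterable of lines (hypothetical helper)."""
    parser = None
    depth = 0
    for line in lines:
        if parser is None:
            if not line.strip():
                continue  # skip blank lines between documents
            parser = ET.XMLPullParser(events=("start", "end"))
            depth = 0
        parser.feed(line)
        for event, elem in parser.read_events():
            depth += 1 if event == "start" else -1
            if depth == 0:
                # The root element just closed: hand back the finished tree.
                yield elem
                parser = None
                break

# Made-up sample input.
sample = io.StringIO("<a><b/></a>\n\n<c>text</c>\n")
trees = list(iter_trees(sample))
tags = [tree.tag for tree in trees]
```

Anything left on the same line after a root closes is dropped by this sketch, so it's only suitable when documents really are line-aligned.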
I've written the loops out explicitly (rather than using iter or itertools.groupby), even if it's twice as long that way. Hopefully it's what you want.