7

My input file is actually multiple XML documents appended into one file (it's from Google Patents). It has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python's xml.dom.minidom can't parse this non-standard file. What's a better way to parse it? I'm also not sure whether the code below has good performance.

XMLstring = ''
for line in infile:
  if line.startswith('<?xml') and XMLstring:
    xmldoc = minidom.parseString(XMLstring)  # parse the previous document
    XMLstring = ''
  XMLstring += line
if XMLstring:
  xmldoc = minidom.parseString(XMLstring)  # parse the last document
3 Comments
  • I downloaded and extracted the zip archive you linked to. I obtained three files: ipgb20110104.xml, ipgb20110104rpt.html, ipgb20110104lst.txt. I found the above extract in none of these three files. Where does your extract come from? Also, what do you want to do with the extract? Commented Sep 7, 2011 at 16:04
  • @eyquem, it's in the xml file. I just replaced the "us-patent-grant" node with "root_node" to make the structure clearer. Commented Sep 8, 2011 at 1:48
  • Thank you. I had understood that for the line <root_node>...</root_node>, but I wonder how on earth I missed the other one; I think I searched for it too superficially. Commented Sep 8, 2011 at 10:40

3 Answers

6

Here's my take on it, using a generator and lxml.etree. The extracted information is purely an example.

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)

Yields:

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.

3 Comments

I posted a version using generators, but it looks like you beat me to it. +1
@MattH How did you know the address http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip, please?
@MattH I didn't even think to use "Copy shortcut" on the hyperlink! Thank you. Your code is clean, I upvoted.
2

I'd opt for parsing each chunk of XML separately.

You seem to already be doing that in your sample code. Here's my take on your code:

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk

Once you've broken the file down into individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options include lxml, minidom, ElementTree, expat, BeautifulSoup, etc.
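For instance, the parse_xml_buffer() stub above could be filled in with the standard library's xml.etree.ElementTree instead of minidom. This is just a sketch; the doc-number lookup mirrors what the BeautifulSoup example in the update below does, so adjust it to whatever fields you actually need.

import xml.etree.ElementTree as ET

def parse_xml_buffer(buffer):
    # Join the buffered lines into one standalone XML document and parse it.
    root = ET.fromstring("".join(buffer))
    # Example lookup only; replace "doc-number" with the fields you need.
    for num in root.iter("doc-number"):
        print(num.text)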


Update:

Starting from scratch, here's how I would do it (using BeautifulSoup):

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]

This returns:

D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...


0

I don't know about minidom, nor much about XML parsing in general, but I have used XPath to parse XML/HTML, e.g. via the lxml module.

Here you can find some XPath Examples: http://www.w3schools.com/xpath/xpath_examples.asp
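As a rough sketch of what that could look like, assuming the file has already been split into standalone documents (for example with one of the splitters from the other answers): the extract_fields helper is made up for illustration, and the XPath expressions are only guesses at the us-patent-grant schema.

from lxml import etree

def extract_fields(xml_chunk):
    # Hypothetical helper: xml_chunk must be a byte string, since lxml
    # rejects unicode strings that carry an encoding declaration.
    doc = etree.fromstring(xml_chunk)
    # Hypothetical XPath queries; adjust them to the fields you need.
    doc_numbers = doc.xpath('//publication-reference/document-id/doc-number/text()')
    titles = doc.xpath('//invention-title/text()')
    return doc_numbers, titles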

1 Comment

The crucial point is that the input files are non-standard (malformed) XML files; in particular, there are several XML documents in a single file. Is that supported by lxml?
