7

My input file is actually multiple XML documents appended into one file (it's from Google Patents). It has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python's xml.dom.minidom can't parse this non-standard file. What's a better way to parse it? I'm also not sure whether the code below has good performance.

XMLstring = ''
for line in infile:
  if line.startswith('<?xml') and XMLstring:
    xmldoc = minidom.parseString(XMLstring)  # parse the previous document
    XMLstring = ''
  XMLstring += line
if XMLstring:
  xmldoc = minidom.parseString(XMLstring)  # parse the last document
3 Comments
  • I downloaded and extracted the zip archive you linked to. I obtained three files: ipgb20110104.xml, ipgb20110104rpt.html, ipgb20110104lst.txt. I found the above extract in none of these three files. Where does your extract come from? Also, what do you want to do with the extract? Commented Sep 7, 2011 at 16:04
  • @eyquem, it's in the xml file. I just replaced the "us-patent-grant" node with "root_node" to make the structure clearer. Commented Sep 8, 2011 at 1:48
  • Thank you. I had understood that for the line <root_node>...</root_node>, but I wonder how on earth I missed the other one; I think I searched for it too superficially. Commented Sep 8, 2011 at 10:40

3 Answers

6

Here's my take on it, using a generator and lxml.etree. The extracted information is purely an example.

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)

Yields:

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.

3 Comments

I posted a version using generators, but it looks like you beat me to it. +1
@MattH How did you know the address http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip, please?
@MattH I didn't even think to use "Copy shortcut" on the hyperlink! Thank you. Your code is clean, I upvoted.
2

I'd opt for parsing each chunk of XML separately.

You seem to already be doing that in your sample code. Here's my take on your code:

def parse_xml_buffer(buffer):
    dom = minidom.parseString("".join(buffer))  # join list into string of XML
    # .... parse dom ...

buffer = [file.readline()]  # initialise with the first line
for line in file:
    if line.startswith("<?xml "):
        parse_xml_buffer(buffer)
        buffer = []  # reset buffer
    buffer.append(line)  # list operations are faster than concatenating strings
parse_xml_buffer(buffer)  # parse final chunk

Once you've broken the file down into individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options include lxml, minidom, ElementTree, expat, BeautifulSoup, etc.
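For instance, the parse_xml_buffer() stub above could be filled in with the standard library's xml.etree.ElementTree instead of minidom. This is just a sketch; the doc-number lookup mirrors what the BeautifulSoup example in the update below does, so adjust it to whatever fields you actually need.

import xml.etree.ElementTree as ET

def parse_xml_buffer(buffer):
    # Join the buffered lines into one standalone XML document and parse it.
    root = ET.fromstring("".join(buffer))
    # Example lookup only; replace "doc-number" with the fields you need.
    for num in root.iter("doc-number"):
        print(num.text)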


Update:

Starting from scratch, here's how I would do it (using BeautifulSoup):

#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

for xml_string in separated_xml("ipgb20110104.xml"):
    soup = BeautifulSoup(xml_string)
    for num in soup.findAll("doc-number"):
        print num.contents[0]

This returns:

D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...


0

I don't know about minidom, nor much about XML parsing in general, but I have used XPath to parse XML/HTML, e.g. via the lxml module.

Here you can find some XPath Examples: http://www.w3schools.com/xpath/xpath_examples.asp
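As a rough sketch of what that could look like, assuming the file has already been split into standalone documents (for example with one of the splitters from the other answers): the extract_fields helper is made up for illustration, and the XPath expressions are only guesses at the us-patent-grant schema.

from lxml import etree

def extract_fields(xml_chunk):
    # Hypothetical helper: xml_chunk must be a byte string, since lxml
    # rejects unicode strings that carry an encoding declaration.
    doc = etree.fromstring(xml_chunk)
    # Hypothetical XPath queries; adjust them to the fields you need.
    doc_numbers = doc.xpath('//publication-reference/document-id/doc-number/text()')
    titles = doc.xpath('//invention-title/text()')
    return doc_numbers, titles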

1 Comment

The crucial point is that the input files are non-standard (malformed) XML files; in particular, there are several XML documents in a single file. Is that supported by lxml?
