3

I have several large .xml files. I want to parse out the files to do several things.

I want to pull out only:

  • XML-/title1 and save it to list A (for example)
  • XML-/title2 and save it to list B
  • XML-/title3 and save it to list C
  • etc, etc

Using Python 2.x which library would be best to import/use. How would I set this up? Any Suggestions?

For Example:

 <PubmedArticle>
    <MedlineCitation Owner="NLM" Status="MEDLINE">
        <PMID Version="1">8981971</PMID>
        <Article PubModel="Print">
            <Journal>
                <ISSN IssnType="Print">0002-9297</ISSN>
                <JournalIssue CitedMedium="Print">
                    <Volume>60</Volume>
                    <Issue>1</Issue>
                    <PubDate>
                        <Year>1997</Year>
                        <Month>Jan</Month>
                    </PubDate>
                </JournalIssue>
                <Title>American journal of human genetics</Title>
                <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
            </Journal>
            <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
            <Pagination>
                <MedlinePgn>241-4</MedlinePgn>
            </Pagination>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Scozzari</LastName>
                    <ForeName>R</ForeName>
                    <Initials>R</Initials>
                </Author>
            </AuthorList>
        <MeshHeadingList>
            <MeshHeading>
                <DescriptorName MajorTopicYN="N">Alleles</DescriptorName>
            </MeshHeading>
            <MeshHeading>
                <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName>
            </MeshHeading>
        </MeshHeadingList>
        <OtherID Source="NLM">PMC1712541</OtherID>
    </MedlineCitation>
</PubmedArticle>
1
  • 1
    I'd use xml.dom.minidom for this, it comes with Python and works fine. lxml is another good library but you'd have to install it. Commented Feb 28, 2012 at 18:49

5 Answers 5

5

Try look at the lxml module.

To locate the titles you can use Xpath with lxml, or you can use the xml object structure in lxml to "index" you down to the title elements.

Sign up to request clarification or add additional context in comments.

Comments

2

Try using Beautiful soup. I have found this library very handy. And as just pointed out, BeautifulStoneSoup specifically is for parsing XML.

4 Comments

Specifically, BeautifulStoneSoup
Thanks ALL, I choose BeautifulSoup as my route. I found that B.S. documenation was just clearer than that of lxml.
In the new version's documentation: There is no longer a BeautifulStoneSoup class for parsing XML. To parse XML you pass in “xml” as the second argument to the BeautifulSoup constructor.
I haven't read the new documentation. Feel free to edit my answer. Thanks
2

I'm not sure why you want each title in its own list, which your question leads me to believe.

How about all the titles in one list? The following example uses a trimmed version of your sample XML, plus I duplicated an <Article/> to show that using lxml.etree.xpath creates the list of <Title/>'s for you:

>>> import lxml.etree

>>> xml_text = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">0002-9297</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American journal of human genetics</Title>
        <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Print">9297-0002</ISSN>
        <!-- <JournalIssue ... /> -->
        <Title>American Journal of Pediatrics</Title>
        <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
      <!--<Pagination>
          ...
          </MeshHeadingList>-->
      <OtherID Source="NLM">PMC1712541</OtherID>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

XPaths are made to return nodes which lxml.etree.xpath translates into a Python list of node objects:

>>> xml_obj = lxml.etree.fromstring(xml_text)
>>> for title_obj in xml_obj.xpath('//Article/Journal/Title'):
        print title_obj.text 

American journal of human genetics
American Journal of Pediatrics

EDIT 1: Now with Python's xml.etree.ElementTree

I wanted to show this solution with the included module in case installing a third party module is not possible or unattractive.

>>> import xml.etree.ElementTree as ETree
>>> element = ETree.fromstring(xml_text)
>>> xml_obj = ETree.ElementTree(element)
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'):
    print title_obj.text


American journal of human genetics
American Journal of Pediatrics

It's small, but this XPath is not identical to the XPath in the lxml example: there is a period ('.') at the beginning. Without the period, I got this warning (with Python 2.7.2):

>>> xml_obj.findall('//Article/Journal/Title')

Warning (from warnings module):
  File "__main__", line 1
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version.  If you rely on the current behaviour, change it to './/Article/Journal/Title'

1 Comment

I finally got around to looking at all the answers posted. Thanks for the effort! I was able to install the lxml library no problem but I had a hellava time going thru the documentation. I just couldn't wrap my head around, at the time. I found BeautifulSoup's docs little easier to deal with.
1

Try lxml with xpath expressions.

A short snippet

>>> from lxml import etree
>>> xml = """<foo><bar/>baz!</foo>"""
>>> doc = etree.fromstring(xml)
>>> doc.xpath('//foo/text()') #xpath expr
['baz!']
>>>

if you have a xml file than

s = StringIO(xml)
doc = etree.parse(s)

You can use Firebug addon to fetch the xpath expr.

Comments

-1

ElementTree is awesome and comes with Python.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.