How to read an XML file from a folder with Python?

Question

I have an XML file like this:

xml_data = '''\
<author type="XXX" language="EN" gender="xx" feature="xx" web="foobar.com">
    <documents count="N">
        <document KEY="e95a9a6c790ecb95e46cf15bee517651" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="bc360cfbafc39970587547215162f0db" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="19e71144c50a8b9160b3f0955e906fce" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="21d4af9021a174f61b884606c74d9e42" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="28a45eb2460899763d709ca00ddbb665" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="a0c0712a6a351f85d9f5757e9fff8946" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="626726ba8d34d15d02b6d043c55fe691" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...]
]]>
        </document>
        <document KEY="2cb473e0f102e2e4a40aa3006e412ae4" web="www.foo_bar_exmaple.com"><![CDATA[A large text with lots of strings and punctuations symbols [...] [...]
]]>
        </document>
    </documents>
</author>
'''

Then I placed it into a pandas dataframe like this:

import pandas as pd
import xml.etree.ElementTree as ET

def iter_docs(author):
    author_attr = author.attrib
    for doc in author.iterfind('.//document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


etree = ET.fromstring(xml_data)  # parse the XML string; this returns the root Element
doc_df = pd.DataFrame(list(iter_docs(etree)))

I would like to pass the path of the file instead of creating an xml_data string variable. Any idea how to do this?

huderlem · Accepted Answer · 2015-02-03 00:34:46Z


From the docs: https://docs.python.org/2/library/xml.etree.elementtree.html#parsing-xml

You can do:

etree = ET.parse(filename)
root = etree.getroot()
doc_df = pd.DataFrame(list(iter_docs(root)))
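
For reference, here is a minimal end-to-end sketch of the same idea. It reuses iter_docs() from the question; the helper name read_author_xml, the file name my_data.xml, and the two-document sample are assumptions made up for illustration, and the file is written to a temporary folder only so the example can run on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

import pandas as pd


def iter_docs(author):
    """Yield one dict per <document>, merged with the <author> attributes."""
    author_attr = author.attrib
    for doc in author.iterfind('.//document'):
        doc_dict = author_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['data'] = doc.text
        yield doc_dict


def read_author_xml(path):
    """Hypothetical helper: parse the file at `path` into a DataFrame."""
    # ET.parse returns an ElementTree; getroot() gives the <author> Element.
    root = ET.parse(path).getroot()
    return pd.DataFrame(list(iter_docs(root)))


# Shortened, made-up sample in the same shape as the question's XML.
sample = '''<author type="XXX" language="EN">
    <documents count="2">
        <document KEY="k1" web="a.example"><![CDATA[first text]]></document>
        <document KEY="k2" web="b.example"><![CDATA[second text]]></document>
    </documents>
</author>
'''

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'my_data.xml')
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write(sample)
    doc_df = read_author_xml(path)

print(doc_df)
```

With a real file you would skip the temporary folder and call read_author_xml() with your own path.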

edited Feb 3, 2015 at 0:34

answered Feb 3, 2015 at 0:17


7 Comments

skwoi Over a year ago

Thanks, I tried this: etree = ET.parse(xml_data) and got this exception: AttributeError: 'ElementTree' object has no attribute 'attrib', where xml_data is the path of the file.

huderlem Over a year ago

Is xml_data the name of a file, or is it just the XML string? For example, you need to make a file called my_data.xml that contains your XML. Then call ET.parse('my_data.xml').

Anzel Over a year ago

@huderlem, I think you need to call getroot() after parsing in order to work with iterfind()

huderlem Over a year ago

Good call! Editing my answer.

huderlem Over a year ago

I updated my answer with @Anzel's suggestion. Your function iter_docs() was expecting the root of the parsed tree, which I didn't notice.
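
The difference discussed in these comments can be checked with a short sketch: ET.fromstring() returns the root Element directly, while ET.parse() returns an ElementTree wrapper that has no attrib, which is why getroot() is needed. The one-line XML and the variable names are made up for illustration, and an in-memory file object stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, shortened XML in the same shape as the question's data.
xml_text = (
    '<author type="XXX">'
    '<documents><document KEY="k1">text</document></documents>'
    '</author>'
)

# fromstring() hands back the root Element, so .attrib works right away.
root_from_string = ET.fromstring(xml_text)

# parse() takes a filename or file object and returns an ElementTree.
tree = ET.parse(io.BytesIO(xml_text.encode('utf-8')))
root_from_file = tree.getroot()

print(type(root_from_string).__name__)
print(type(tree).__name__)
print(hasattr(tree, 'attrib'))
print(root_from_file.attrib)
```

Passing the tree itself to iter_docs() is what raises the AttributeError quoted above; passing tree.getroot() avoids it.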
