I am very new to coding in Python, and there is an issue I have been trying to solve for some hours:

I have 1600+ XML files (0000.xml, 0001.xml, etc.) that need to be parsed for a text mining project.
But an error occurs when I run the following code:

from os import listdir, path 
import xml.etree.ElementTree as ET

mypath = '../project/content' 
files = [f for f in listdir(mypath) if f.endswith('.xml')]

for file in files:    
    tree = ET.parse("../project/content/"+file)
    root = tree.getroot()

The error message is the following:

Traceback (most recent call last):

  File "/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-13-cdc3ee6c3989>", line 6, in <module>
    tree = ET.parse("../project/content/"+file)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1196, in parse
    tree.parse(source, parser)

  File "/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 597, in parse
    self._root = parser._parse_whole(source)

  File "<string>", line unknown

ParseError: no element found: line 1, column 0

Where did I make a mistake?

Also, I only want to extract the text from one element of each XML file. Is it sufficient that I simply attach this line to the code? Moreover, how can I save each of the results to txt files?

maintext = root.find("mainText").text

Thank you very much!

  • Often if you search for the exception message (ParseError: no element found: line 1, column 0), you will find many SO Q&As that may point you in the right direction. It is possible that the file it is trying to parse is malformed or even empty. If you just want to skip those, catch the error and print the filename in the except suite so you can look at them later. Commented Feb 24, 2019 at 20:39
  • Instead of printing filenames that produce an Exception, you could also collect them in a list or write them to an error file - with the intent of possibly manually looking at them later. Commented Feb 24, 2019 at 20:45
  • ...is it sufficient that I simply attach this line to the code? - Try it in the shell with some test data.... ... how can I save each of the results to txt files? - Reading and writing files Commented Feb 24, 2019 at 20:52
  • Thank you! I have found that one of my XML files is empty. After I removed it, the code ran without errors. The element I wanted to extract also worked as I posted. However, I have problems saving the results as txt files, as I can only save them as xml files: output = codecs.open(file,"w","utf-8") output.write(content) output.close() Commented Feb 24, 2019 at 23:03

1 Answer

The right way to build file paths is with path.join, as in the code below.

Add print statements before you create the tree, so you can see which file fails.

Is the XML you are trying to parse valid?

Once you solve the parsing issue, you can use multiprocessing to parse many files at the same time.

from os import listdir, path
import xml.etree.ElementTree as ET

mypath = '../project/content'
files = [path.join(mypath, f) for f in listdir(mypath) if f.endswith('.xml')]

for file in files:
    print(file)
    tree = ET.parse(file)
    root = tree.getroot()
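
The comments suggest catching the error and keeping a list of the files that fail. A minimal sketch of that idea follows, assuming an empty or malformed file raises ET.ParseError as in the traceback above. The function name parse_all and the returned dict and list are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from os import listdir, path


def parse_all(folder):
    """Parse every .xml file in folder and return (roots, failed).

    parse_all is a hypothetical helper name, not part of any library.
    """
    roots = {}   # filename -> root element of each file that parsed
    failed = []  # filenames that raised ParseError (empty or malformed)
    for name in sorted(listdir(folder)):
        if not name.endswith('.xml'):
            continue
        try:
            roots[name] = ET.parse(path.join(folder, name)).getroot()
        except ET.ParseError as err:
            # An empty file produces "no element found: line 1, column 0"
            print('skipping', name, '-', err)
            failed.append(name)
    return roots, failed


# Example call, using the folder from the question:
# roots, failed = parse_all('../project/content')
```

The failed list can then be inspected by hand or written to an error file, so one bad file no longer stops the whole loop.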

2 Comments

“Use multiprocessing...” Good heavens, the OP has enough problems getting basic file reading and parsing to work without over-complicating things with a possibly completely unnecessary use of multiprocessing.
Thank you! The code is working for me! However, I do not know how to save the results as txt files instead of xml.
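
On the txt question: the name passed to open decides what the output file is called, so if the file variable in the codecs.open snippet still held the .xml name (an assumption about that snippet), the output would be written under an .xml name. A minimal sketch that extracts the mainText element and writes it to a matching .txt file follows. The function name export_main_text and the out_dir parameter are hypothetical, and the sketch assumes mainText is a direct child of the root element, as root.find("mainText") in the question implies.

```python
import xml.etree.ElementTree as ET
from os import listdir, makedirs, path


def export_main_text(src_dir, out_dir, tag='mainText'):
    """Write the text of <tag> from each XML file to a same-named .txt file.

    export_main_text and out_dir are hypothetical names for illustration.
    Returns the list of .txt filenames that were written.
    """
    makedirs(out_dir, exist_ok=True)
    written = []
    for name in sorted(listdir(src_dir)):
        if not name.endswith('.xml'):
            continue
        try:
            root = ET.parse(path.join(src_dir, name)).getroot()
        except ET.ParseError:
            continue  # skip empty or malformed files
        node = root.find(tag)  # find() searches direct children of root only
        if node is None or node.text is None:
            continue  # element missing or has no text
        out_name = path.splitext(name)[0] + '.txt'  # 0000.xml -> 0000.txt
        with open(path.join(out_dir, out_name), 'w', encoding='utf-8') as out:
            out.write(node.text)
        written.append(out_name)
    return written


# Example call; '../project/text' is an assumed output folder:
# export_main_text('../project/content', '../project/text')
```

The built-in open with encoding='utf-8' does the same job as codecs.open in Python 3, and the with block closes the file even if the write fails.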
