My input file is actually multiple XML documents concatenated into one file (it's from Google Patents). It has the structure below:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
Python's xml.dom.minidom can't parse this non-standard file, because a well-formed XML document may contain only one XML declaration and one root element. What's a better way to parse this file? I am not sure whether the code below performs well.
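from xml.dom import minidom

XMLstring = ''
for line in infile:
    # A new XML declaration marks the start of the next document.
    if line.startswith('<?xml') and XMLstring.strip():
        xmldoc = minidom.parseString(XMLstring)  # parse() expects a file, not a string
        XMLstring = ''
    XMLstring += line
xmldoc = minidom.parseString(XMLstring)  # parse the final document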
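For reference, the same splitting idea can be sketched as a generator that collects lines in a list and joins them once per document, which avoids repeated string concatenation. This is only a sketch under some assumptions: every document starts with an XML declaration at the beginning of its own line, and `iter_documents` and the inline `sample` data are hypothetical names made up for illustration, not anything from Google Patents or minidom.

```python
import io
from xml.dom import minidom


def iter_documents(lines):
    """Yield each concatenated XML document as one string (hypothetical helper)."""
    buf = []
    for line in lines:
        # Ignore blank lines that appear before a document has started.
        if not buf and not line.strip():
            continue
        # Assumption: an XML declaration at the start of a line opens a new document.
        if line.lstrip().startswith('<?xml') and buf:
            yield ''.join(buf)
            buf = []
        buf.append(line)
    if buf:
        # Emit the last document, which has no following declaration.
        yield ''.join(buf)


# Stand-in for the real input file, using the structure from the question.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    '<root_node>first</root_node>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    '<root_node>second</root_node>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    '<root_node>third</root_node>\n'
)

# Each chunk is now a standalone document that minidom can parse.
docs = [minidom.parseString(text) for text in iter_documents(sample)]
texts = [doc.documentElement.firstChild.data for doc in docs]
```

With a real file, `sample` would be replaced by the open file object, and each document could be processed as it is yielded instead of being collected into a list, so only one document is held in memory at a time.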