Python remove duplicate elements from xml tree

Question

I have an XML structure in which some elements are not unique. I managed to sort the subtrees, and I can correctly detect the elements that occur more than once. But the remove function does not seem to take effect.

My XML Structure looks simplified like this:

<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text><!-- line should be removed -->
    <text>2nd blabla blub again unique</text>
  </page>
</root>

I want to remove duplicate strings on each page, so I iterate over the pages and over the elements in each page with two nested for loops (an extract of the important lines; I hope I didn't forget anything):

import xml.etree.ElementTree as ET
self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:                     # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            page.remove(elem) # <---- removes just one line
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml") # 2 duplicate lines still there....

Update: the code does seem to work, but it removes only one duplicate, not all of them.

I think it's a list; if so, try making it a set and see if the duplicates are removed. I guess it boils down to how the eq method is implemented for a node, if at all.
– omu_negru, Commented Dec 18, 2014 at 15:14
It should be an Element of the ElementTree object, but I have no clue how it is implemented. When I try to remove from root it says: ValueError: list.remove(x): x not in list
– Karl Adler, Commented Dec 18, 2014 at 15:23
How do I make it a set? What do you mean by eq method? @omu_negru
– Karl Adler, Commented Dec 18, 2014 at 15:25
Well, just doing set(your_list), or any iterator for that matter, should do the trick. To check if the eq method is properly implemented, get the second and third nodes and see if second == third returns true (it should).
– omu_negru, Commented Dec 18, 2014 at 15:29
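
The check suggested in the comments above can be run directly. The following is only a minimal sketch: as far as I can tell, xml.etree.ElementTree.Element does not define its own equality, so == falls back to object identity and set() will not collapse elements that merely look the same. The element_key helper is hypothetical (it is not part of ElementTree) and only covers leaf elements.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<page><text>same</text><text>same</text><text>other</text></page>"
)
first, second = page[0], page[1]

# Element falls back to identity comparison, so two distinct
# objects with the same content do not compare equal.
identical_by_eq = (first == second)

# For the same reason, a set keeps every element object.
distinct_in_set = len(set(page))


def element_key(elem):
    """Hypothetical helper: a hashable summary of a leaf element."""
    return (elem.tag, elem.text, tuple(sorted(elem.attrib.items())))


# Comparing by content needs an explicit key (or an equality function).
distinct_keys = len({element_key(e) for e in page})

print(identical_by_eq, distinct_in_set, distinct_keys)
```

This is why a content-based comparison such as elements_equal is needed, rather than relying on set() or ==.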

Chance Shaffer · Accepted Answer · 2019-03-15 15:37:46Z

I don't know how you've defined elements_equal, but (shamelessly adapted from Testing Equivalence of xml.etree.ElementTree) this works for me:

EDIT: store a list of each element to be removed whilst iterating over page and then remove them rather than doing the removal within one loop.

EDIT: Noticed a small typo in the code in the comparison of the element tags and corrected it.

import xml.etree.ElementTree as ET

path = 'in.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    prev = None                       # duplicates are only removed within a page
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)   # defer removal until the loop is done
            continue
        prev = elem
    # remove only after iterating, so the child list is not mutated mid-loop
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

Gives:

$ python undupe.py
found duplicate: blabla blub not unique
found duplicate: 2nd blabla blub not unique
$ cat out.xml
<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>
</root>

The equals function works properly; that's not the problem. But I have a similar one, thanks.
@abimelex: then there is something else I don't understand about your problem. I've posted my full working code in case it helps.
Haha, OK, I was thinking in the wrong direction... The program works, and so does your code, with the example. The problem shows up with my real data, where an element is repeated more than twice, e.g. the same row four times. Both versions somehow fail to delete all duplicates... I don't know why... @xnx, I've updated my question.
Ah yes, it's not working because you're (we're) removing the text elements while iterating over the page's children. This won't work: removing an element shifts the remaining children, so the iteration skips the next one and comes to a premature halt.
Oh, okay... I don't understand why it behaves that way, but I figured out a solution... I'll just edit your answer and then accept it, if that's okay with you.
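
The behaviour discussed in the last comments can be reproduced with a small sketch. It assumes a simplified duplicate test (only the text of adjacent elements is compared, instead of the full elements_equal), and the function names are made up for illustration. It contrasts removing during iteration with iterating over a snapshot made with list(page), which is an alternative to collecting the elements in a separate list first.

```python
import xml.etree.ElementTree as ET

XML = (
    "<page>"
    "<text>unique</text>"
    "<text>dup</text><text>dup</text><text>dup</text><text>dup</text>"
    "<text>again unique</text>"
    "</page>"
)


def texts(page):
    """Return the text of every direct child, in document order."""
    return [child.text for child in page]


def remove_while_iterating(page):
    """Buggy variant: mutates the child list that is being iterated."""
    prev = None
    for elem in page:
        if prev is not None and elem.text == prev.text:
            page.remove(elem)  # children shift left; the next one is skipped
            continue
        prev = elem


def remove_over_snapshot(page):
    """Safe variant: iterate over a copy, remove from the real element."""
    prev = None
    for elem in list(page):  # list(page) is a snapshot of the children
        if prev is not None and elem.text == prev.text:
            page.remove(elem)
            continue
        prev = elem


buggy = ET.fromstring(XML)
remove_while_iterating(buggy)

fixed = ET.fromstring(XML)
remove_over_snapshot(fixed)

print(texts(buggy))
print(texts(fixed))
```

With four identical rows, the buggy variant leaves two of them behind, because every removal makes the loop step over the following child; the snapshot variant leaves exactly one.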
