Extract multiple attribute values from XML tags

Question

I have an XML file (test.xml) that can be summed up as follows (I filtered it to make it more readable):

<coverage complexity="0" line-rate="0.66" lines-covered="66" lines-valid="100">
    <packages>
        <package complexity="0" line-rate="0.66" name=".">
            <classes>
                <class complexity="0" name="file_a.py" line-rate="0.7674">
                <class complexity="0" name="file_b.py" line-rate="0.2727">
                <class complexity="0" name="file_c.py" line-rate="1">
            </classes>
        </package>
    </packages>
</coverage>

For each line, I want to extract both the name and line-rate values. For example, the output could be:

. 0.66
file_a.py 0.7674
file_b.py 0.2727
file_c.py 1

Note that I'd like to skip the first line since it has no name field.

Right now I managed to get that output with the following bash script:

#!/bin/bash

# Extract info in lines containing either "<package " or "<class "
linerates=`grep '<package \|<class ' test.xml | awk -F "line-rate=" '{print $2}' | awk -F '"' '{$
names=`grep '<package \|<class ' test.xml | awk -F "name=" '{print $2}' | awk -F '"' '{print $2}$

# Transform to array
linerates=(${linerates// / })
names=(${names// / })

# Print table
for i in "${!names[@]}"
do
        echo ${names[$i]} ${linerates[i]}
done

Since the code is quite ugly, I wonder whether there is a more elegant way to extract these two pieces of information, say in a single command line and without a for loop.

Edit

I switched to Python and got this:

from bs4 import BeautifulSoup as bs

with open('test.xml', 'r') as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content, 'lxml')

list_ = list(bs_content.find('classes').children)
list_ = list(filter(lambda a: a != '\n', list_))

for c in list_:
    print(c.get('name'), c.get('line-rate'))

The output is a bit reduced, since the package row is missing (but I'm OK with it):

file_a.py 0.7674
file_b.py 0.2727
file_c.py 1

I am still looking for a way to do it in a single command line, but for now I will go with the Python version.

Edit (following greybeard's comment)

I filtered my XML file to remove all unnecessary lines (none of them have a name or line-rate attribute). Example of removed lines:
```
<lines>
     <line hits="1" number="1"/>
</lines>
```
There are not many complications: my file is generated, so the attributes should always be in the same order. The coverage, package and class elements have more attributes than shown. For example, "coverage" also has timestamp and version attributes; "class" has a filename attribute, which has the same value as name.

Feel free to ask if I forgot any other information.

You tagged awk - decades ago, that would have been my first attempt; I'm confident it is still a good fit, as far as the task is defined above. There may be extensions in the wings, complications like attributes changing position: tell us (at least) as much as necessary for a helpful review.
– greybeard, Commented Apr 20, 2020 at 20:49
Is the actual XML guaranteed to have at most one line-rate/name pair per line, and those two attributes always on the same line as each other?
– Reinderien, Commented Apr 22, 2020 at 4:06
It's also worth noting that the XML you posted is malformed, because the class tags are not terminated.
– Reinderien, Commented Apr 22, 2020 at 4:10
Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
– Mast ♦, Commented Apr 22, 2020 at 12:56

Reinderien · Accepted Answer · 2020-04-23 15:24:07Z

There are many issues with your posted example that prevent it from being sanely parsed as XML, including the lack of closing tags and the lack of a starting xml tag. You say that the content is generated: if you generated it, you should try to fix this. Anyway.

import re

pat = re.compile('name="(.*?)".*'
                 'line-rate="([0-9.]*)"')

with open('test.xml') as f:
    for line in f:
        match = pat.search(line)
        if match:
            print(match.expand(r'\1 \2'))

This makes many assumptions:

- The attributes actually are in the same order every time (in your example, they aren't, despite you saying that they should be)
- The file is guaranteed to have at most one line-rate/name pair per line
- Those two attributes are always on the same line as each other

If all of those conditions are satisfied, the above works.
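
If the attribute order cannot be relied on, one option is to search for each attribute separately. The following is a minimal sketch rather than part of the original answer: `NAME_PAT`, `RATE_PAT` and `extract_pair` are hypothetical names, the sample lines are copied from the question, and it still assumes that both attributes sit on the same line with at most one pair per line.

```python
import re

# Hypothetical pattern names. Each attribute is searched independently,
# so their relative order on the line does not matter.
# \b keeps name= from matching inside filename=.
NAME_PAT = re.compile(r'\bname="([^"]*)"')
RATE_PAT = re.compile(r'\bline-rate="([0-9.]*)"')


def extract_pair(line):
    """Return (name, line_rate) if both attributes are on the line, else None."""
    name = NAME_PAT.search(line)
    rate = RATE_PAT.search(line)
    if name and rate:
        return name.group(1), rate.group(1)
    return None


# Lines taken from the question's sample; note the differing attribute order.
sample = [
    '<coverage complexity="0" line-rate="0.66" lines-covered="66" lines-valid="100">',
    '<package complexity="0" line-rate="0.66" name=".">',
    '<class complexity="0" name="file_a.py" line-rate="0.7674">',
]

for line in sample:
    pair = extract_pair(line)
    if pair:
        print(*pair)
```

With this variant the package line is reported as well, because `line-rate` appearing before `name` no longer prevents a match, while the coverage line is still skipped for lack of a `name` attribute.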

Actual XML

If (as it seems you suggested in a now-rolled-back edit) your input can actually be valid XML, then a method that will perform more validation is

from xml.sax import parse, ContentHandler

class Handler(ContentHandler):
    def startElement(self, name, attrs):
        name, rate = attrs.get('name'), attrs.get('line-rate')
        if name and rate:
            print(f'{name} {rate}')

parse('test.xml', Handler())

I do not know if it is faster, but it's probably a better idea to do this and halt-and-catch-fire if the XML is malformed, which SAX will do for you.
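
The same idea can also be sketched with the standard library's `xml.etree.ElementTree`, which likewise refuses malformed input. This assumes the real file is well-formed; the `XML` string below is the question's sample with an XML declaration added and the `class` elements self-closed, which is my guess at what the generated file looks like, and `name_rate_pairs` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed variant of the question's sample:
# XML declaration added, <class> elements self-closed.
XML = """<?xml version="1.0" ?>
<coverage complexity="0" line-rate="0.66" lines-covered="66" lines-valid="100">
    <packages>
        <package complexity="0" line-rate="0.66" name=".">
            <classes>
                <class complexity="0" name="file_a.py" line-rate="0.7674"/>
                <class complexity="0" name="file_b.py" line-rate="0.2727"/>
                <class complexity="0" name="file_c.py" line-rate="1"/>
            </classes>
        </package>
    </packages>
</coverage>
"""


def name_rate_pairs(xml_text):
    """Collect (name, line-rate) for every element carrying both attributes."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    pairs = []
    for elem in root.iter():  # document order, root element included
        name, rate = elem.get('name'), elem.get('line-rate')
        if name and rate:
            pairs.append((name, rate))
    return pairs


for name, rate in name_rate_pairs(XML):
    print(name, rate)
```

On this sample it prints the four rows the question asks for, including the package row, independent of attribute order or line breaks. Unlike SAX, it builds the whole tree in memory, which may matter for very large reports.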

Thanks! I made some modifications to the question. Your assumptions are OK, and it works perfectly! (I get the reduced output since the conditions are not met for the package attributes to be considered, but I'm OK with that.)
– Nuageux, Commented Apr 22, 2020 at 12:47
