I have an XML file (test.xml) that can be summed up as follows (I filtered it to make it more readable):
<coverage complexity="0" line-rate="0.66" lines-covered="66" lines-valid="100">
<packages>
<package complexity="0" line-rate="0.66" name=".">
<classes>
<class complexity="0" name="file_a.py" line-rate="0.7674">
<class complexity="0" name="file_b.py" line-rate="0.2727">
<class complexity="0" name="file_c.py" line-rate="1">
</classes>
</package>
</packages>
</coverage>
For each line, I want to extract both the name and line-rate attributes. For example, the output could be:
. 0.66
file_a.py 0.7674
file_b.py 0.2727
file_c.py 1
Note that I'd like to skip the first line since it has no name field.
Right now I managed to get that output with the following bash script:
#!/bin/bash
# Extract info in lines containing either "<package " or "<class "
linerates=`grep '<package \|<class ' test.xml | awk -F "line-rate=" '{print $2}' | awk -F '"' '{print $2}'`
names=`grep '<package \|<class ' test.xml | awk -F "name=" '{print $2}' | awk -F '"' '{print $2}'`
# Transform to array
linerates=(${linerates// / })
names=(${names// / })
# Print table
for i in "${!names[@]}"
do
echo ${names[$i]} ${linerates[i]}
done
Since the code is quite ugly, I wonder whether there is a more elegant way to extract these two pieces of information, say in a single command line, without the need for a for loop.
Edit
I switched to Python and got this:
from bs4 import BeautifulSoup as bs
with open('test.xml', 'r') as file:
    content = file.readlines()
    content = "".join(content)
    bs_content = bs(content, 'lxml')
list_ = list(bs_content.find('classes').children)
list_ = list(filter(lambda a: a != '\n', list_))
for c in list_:
    print(c.get('name'), c.get('line-rate'))
The output is a bit reduced, since the package line is missing (but I'm OK with it):
file_a.py 0.7674
file_b.py 0.2727
file_c.py 1
I am still looking for a way to do it in a single command line, but for now I will go with the Python version.
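For comparison, here is a minimal sketch of the same extraction using only the standard-library `xml.etree.ElementTree`, which also keeps the package line. It assumes the real file is well-formed XML; the embedded `SAMPLE` is a hypothetical variant of the excerpt above with the `<class>` tags self-closed, and `name_and_line_rate` is a helper name made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed variant of the excerpt: the <class> tags are
# self-closed here so that the parser accepts the document.
SAMPLE = """\
<coverage complexity="0" line-rate="0.66" lines-covered="66" lines-valid="100">
  <packages>
    <package complexity="0" line-rate="0.66" name=".">
      <classes>
        <class complexity="0" name="file_a.py" line-rate="0.7674"/>
        <class complexity="0" name="file_b.py" line-rate="0.2727"/>
        <class complexity="0" name="file_c.py" line-rate="1"/>
      </classes>
    </package>
  </packages>
</coverage>
"""


def name_and_line_rate(root):
    """Yield (name, line-rate) for each <package> and <class> element."""
    for element in root.iter():  # document order, root included
        if element.tag in ("package", "class"):
            yield element.get("name"), element.get("line-rate")


# For the real file: root = ET.parse("test.xml").getroot()
root = ET.fromstring(SAMPLE)
rows = list(name_and_line_rate(root))
for name, rate in rows:
    print(name, rate)
```

Filtering on the tag name mirrors the `grep '<package \|<class '` step of the bash script, so the `<coverage>` line is skipped without any special case.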
Edit (following greybeard's comment)
I filtered my XML file to remove all unnecessary lines (none of them have the attributes name or line-rate). Example of removed lines: <lines> <line hits="1" number="1"/> </lines>
Not many complications: my file is generated, so the attributes should always be in the same order.
Coverage, package and class have more attributes. For example, "coverage" also has timestamp and version attributes; "class" has a filename attribute which is the same as name.
Feel free to ask if I forgot some other information.
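To illustrate why the extra attributes matter for a line-based approach, here is a small sketch, assuming one tag per line as in the excerpt. The sample lines, their attribute values, and the helper name `extract` are hypothetical. The `\b` word boundary keeps `name=` from matching inside `filename=`, and the two patterns are searched independently, so the attribute order does not matter.

```python
import re

# \b prevents "name=" from matching the tail of "filename=".
NAME_RE = re.compile(r'\bname="([^"]*)"')
RATE_RE = re.compile(r'\bline-rate="([^"]*)"')
TAG_RE = re.compile(r"<(?:package|class)\s")


def extract(lines):
    """Return (name, line-rate) pairs from <package ...> and <class ...> lines."""
    pairs = []
    for line in lines:
        if not TAG_RE.search(line):
            continue  # skip <coverage>, <line>, closing tags, etc.
        name = NAME_RE.search(line)
        rate = RATE_RE.search(line)
        if name and rate:
            pairs.append((name.group(1), rate.group(1)))
    return pairs


# Hypothetical lines: attribute order varies and "filename" differs from
# "name" on purpose, to show which attribute is picked up.
sample_lines = [
    '<coverage line-rate="0.66" timestamp="0" version="x">',
    '<package line-rate="0.66" name="." complexity="0">',
    '<class filename="src/file_a.py" line-rate="0.7674" name="file_a.py">',
    '<line hits="1" number="1"/>',
]
pairs = extract(sample_lines)
for name, rate in pairs:
    print(name, rate)
```

By contrast, splitting on the plain string `name=` (as `awk -F "name="` does) also splits inside `filename=`, which only goes unnoticed as long as both attributes hold the same value.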
awk: decades ago, that would have been my first attempt; I'm confident it still is a good fit, as far as the task is defined above. There may be extensions in the wings, complications like attributes changing position: tell (at least) as much as necessary for a helpful review.
class tags are not terminated.