I have an XML file with multiple elements. I'd like to extract specific attributes from each package element: codepath, name and nativelibrarypath.

The system is very basic: a limited Linux terminal with bash, awk, grep etc. No extra packages such as xmllint are available, so all we have to work with is probably bash, awk, sed and grep.

In the script, I'd like to assign the attribute values to named shell variables so I can use them to create an output file, which should look like this:

[for each <package> element processed]
..
name:<from name attribute>
path:<from nativelibrarypath attribute>
apk:<from codepath attribute>
...

The XML source is:

<package codepath="/data/app/com.project.t2i-2.apk" flags="0" ft="13a837c2068" it="13a83704ea3" name="com.project.t2i" nativelibrarypath="/data/data/com.project.t2i/lib" userid="10040" ut="13a837c2ecb" version="1">
<sigs count="1">
<cert index="3" key="308201e53082014ea0030201020204506825ae300d06092a86
4886f70d01010505003037310b30090603550406130255533110300e060355040a13074
16e64726f6964311630140603550403130d416e64726f6964204465627567301e170d31
32303933303130353735305a170d3432303932333130353735305a3037310b300906035
50406130255533110300e060355040a1307416e64726f6964311630140603550403130d
416e64726f696420446562756730819f300d06092a864886f70d010101050003818d003
08189028181009ce1c5fd64db794fd787887e8a2dccf6798ddd2fd6e1d8ab04cd8cdd9e
bf721fb3ed6be1d67c55ce729b1e1d32b200cbcfc91c798ef056bc9b2cbc66a396aed6b
a3629a18e4839353314252811412202500f11a11c3bf4eb41b2a8747c3c791c89391443
39036345b15b5e080469ac5f536fd9edffcd52dcbdf88cf43c580abd0203010001300d0
6092a864886f70d01010505000381810071fa013b4560f16640ed261262f32085a51fca
63fa6c5c46fde9a862b56b6d6f17dd49643086a39a06314426ba9a38b784601197246f8
d568e349a93bc6af315455de7a8923f40d4051a51e1658ee34aca41494ab94ce978ae38
609803dfb3004806634e6e78dd0be26fe75843958711935ffc85f9fcf81523ce23c86bc
c5c7a">
</cert></sigs>
<perms>
<item name="android.permission.WRITE_EXTERNAL_STORAGE">
</item></perms>
</package>

I appreciate the purists will balk at this; however, with limited toolsets I'm afraid bash/awk is the only viable way. I accept that poorly formatted XML may not be parsed. But as it stands, all elements always include the same set of attributes in the same order as above.

I tried this, but it is hopelessly poor...

awk -F '"' '/<package.*?((codepath=)|(name=))+/{print $2}' packages.xml
  • I'm not a purist, I'm a practical engineer who spends his life fixing problems caused by people who take short cuts. Just don't do this. Commented Feb 7, 2019 at 14:56
  • There's nothing in your question so far to suggest that assigning anything to bash variables would be useful vs just doing whatever it is you need to do inside the one awk script that's parsing the input. So, clarify why you feel you need to populate bash variables and edit your question to include the expected output given your posted sample input. Also, if you want a tool to process multiple packages then include multiple packages (i.e. at least 2) in your sample input/output. Commented Feb 7, 2019 at 15:13

1 Answer

Without the expected output, and without input containing multiple packages, it's a guess whether this is what you want, but in any case, with any POSIX awk:

$ cat tst.awk
BEGIN {
    OFS=":"
    map["nativelibrarypath"] = "path"
    map["codepath"] = "apk"
    tags[++numTags] = "name"
    tags[++numTags] = "path"
    tags[++numTags] = "apk"
}
$1 == "<package"   { inPkg=1 }
$1 == "</package>" { prtPkg(); inPkg=0 }
inPkg {
    for (i=1; i<=NF; i++) {
        if ( match($i,/^[[:alnum:]_]+=/) ) {
            tag = substr($i,RSTART,RLENGTH-1)
            tag = (tag in map ? map[tag] : tag)
            val = substr($i,RSTART+RLENGTH)
            gsub(/^"|">?$/,"",val)
            tag2val[tag] = val
        }
    }
}
END { prtPkg() }

function prtPkg(        tag, tagNr) {
    if ("name" in tag2val) {
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            print tag, tag2val[tag]
        }
    }
    delete tag2val
}

.

$ awk -f tst.awk file
name:android.permission.WRITE_EXTERNAL_STORAGE
path:/data/data/com.project.t2i/lib
apk:/data/app/com.project.t2i-2.apk

Note that your input has 2 name attributes and you didn't say which one you wanted output. Also, your key is multi-line and there are ways to handle that, but since you don't want that output I just saved the first part of it from its first line when populating the tag2val array.
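If the values are still wanted in shell variables, as the question asks, one possible approach is to read the name:value lines printed above in a bash loop. This is only a sketch: the emit_records function name and the output.txt file name are hypothetical, and it assumes the three lines for each package always arrive in the order name, path, apk, as tst.awk prints them.

```shell
#!/bin/bash
# Hypothetical helper: read "key:value" lines (as printed by tst.awk) and
# collect them into the shell variables name, path and apk, one package at a time.
emit_records() {
    local key val name= path= apk=
    # IFS=: splits at the first colon only; the rest of the line goes to val.
    while IFS=: read -r key val; do
        case $key in
            name) name=$val ;;
            path) path=$val ;;
            apk)  apk=$val
                  # apk is the last line printed per package, so all three
                  # variables are populated here and can be used freely.
                  printf 'name:%s\npath:%s\napk:%s\n' "$name" "$path" "$apk"
                  name= path= apk= ;;
        esac
    done
}

# Usage (assumes tst.awk and packages.xml are in the current directory):
#   awk -f tst.awk packages.xml | emit_records > output.txt
```

The printf line is just a placeholder for whatever needs the variables; any other command could use "$name", "$path" and "$apk" at that point instead.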

5 Comments

Apologies... it is the 1st name attribute I seek. Yes! All other attributes are to be ignored. The input is as above, with exactly the same sequencing of elements and attributes, hence only one <package> is shown. Output is as per above also. I will test your code out with a tweak and then try to comprehend it!
Ed, could you clarify: 1. prtPkg has a prototype (tag, tagNr), which implies to me that tag/tagNr are local vars? If not declared there, would they act as global vars in the entire awk program? 2. delete tag2val is needed because this is a global var and we are re-initialising it between invocations of prtPkg()? 3. What on earth is going on here: match($i,/^[[:alnum:]_]+=/) and gsub(/^"|">?$/,"",val)
1) correct. 2) correct. 3) match() is finding the tag up to the = in the current field, if present, and the gsub() is removing the leading and trailing double quotes from val.
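To make point 3 concrete, here is a minimal sketch that runs the same match(), substr() and gsub() steps on a single field. The parse_attr wrapper is a hypothetical name used only for this illustration; it is not part of tst.awk.

```shell
# Hypothetical wrapper: feed one field of the input to awk and show
# how it is split into a tag and a value.
parse_attr() {
    printf '%s\n' "$1" | awk '{
        # match() sets RSTART and RLENGTH to where the regexp matched:
        # here, an attribute name immediately followed by "=".
        if ( match($0, /^[[:alnum:]_]+=/) ) {
            tag = substr($0, RSTART, RLENGTH-1)   # the name, without the "="
            val = substr($0, RSTART+RLENGTH)      # everything after the "="
            gsub(/^"|">?$/, "", val)              # strip leading " and trailing " or ">
            print tag ":" val
        }
    }'
}

parse_attr 'version="1">'              # prints version:1
parse_attr 'name="com.project.t2i"'    # prints name:com.project.t2i
```

A field such as <sigs has no name= prefix at its start, so match() returns 0 and nothing is printed for it.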
Most obliged Ed. Are those both regexp or specific to awk/functions? Also, I only want the first name i.e. the attribute in the package element rather than the name child element. How best would I amend your script? Finally, how many years have you been crafting awk?
1) ^[[:alnum:]_]+= is a regexp and match() and substr() are awk functions. 2) change tag2val[tag] = val to if ( !(tag in tag2val) ) tag2val[tag] = val so the array retains the first value seen for a given tag rather than the last. 3) about 36 years of shell programming, about 26 of them including using awk (the first 10 or so I got by with a mush of sed+grep+etc.).
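For reference, here is a sketch of the script with the one-line change from point 2 applied, wrapped in a shell function so it can be tried on a small input. The first_wins function name and the sample variable are hypothetical, and the sample is a trimmed copy of the question's XML (several attributes and the sigs element are left out for brevity).

```shell
# Trimmed sample input, for illustration only.
sample='<package codepath="/data/app/com.project.t2i-2.apk" flags="0" name="com.project.t2i" nativelibrarypath="/data/data/com.project.t2i/lib" version="1">
<perms>
<item name="android.permission.WRITE_EXTERNAL_STORAGE">
</item></perms>
</package>'

first_wins() {
    awk '
    BEGIN {
        OFS=":"
        map["nativelibrarypath"] = "path"
        map["codepath"] = "apk"
        tags[++numTags] = "name"
        tags[++numTags] = "path"
        tags[++numTags] = "apk"
    }
    $1 == "<package"   { inPkg=1 }
    $1 == "</package>" { prtPkg(); inPkg=0 }
    inPkg {
        for (i=1; i<=NF; i++) {
            if ( match($i,/^[[:alnum:]_]+=/) ) {
                tag = substr($i,RSTART,RLENGTH-1)
                tag = (tag in map ? map[tag] : tag)
                val = substr($i,RSTART+RLENGTH)
                gsub(/^"|">?$/,"",val)
                # Changed line: keep the first value seen for each tag.
                if ( !(tag in tag2val) ) tag2val[tag] = val
            }
        }
    }
    END { prtPkg() }

    function prtPkg(        tag, tagNr) {
        if ("name" in tag2val) {
            for (tagNr=1; tagNr<=numTags; tagNr++) {
                tag = tags[tagNr]
                print tag, tag2val[tag]
            }
        }
        delete tag2val
    }
    '
}

printf '%s\n' "$sample" | first_wins
# prints:
#   name:com.project.t2i
#   path:/data/data/com.project.t2i/lib
#   apk:/data/app/com.project.t2i-2.apk
```

With this change the name attribute of the package element is kept, and the later name on the item element is ignored because the tag is already present in tag2val.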