AWK get attribute value from XML element

Question

Hello everyone, I am trying to use AWK to extract the value of the version= attribute of the pkg-info element in an XML file.

I would like to just do something like:

cat file_below.txt | awk some_commands

using the data below

<?xml version="1.0" encoding="utf-8"?>
<pkg-info overwrite-permissions="true" relocatable="false" identifier="com.application.something" version="1.2.3" format-version="2" generator-version="ABC" install-location="/Applications" auth="root">
</pkg-info>

The desired output would be:

1.2.3

Thank you in advance

David C. Rankin · Accepted Answer · 2021-02-03 00:10:52Z

A simple way is to use sed to locate the line beginning with "<pkg-info..." and then isolate the version with a substitution that captures the version and reinserts it as a backreference, e.g.

sed -E -n '/^<pkg-info/s/^.*[ ]version="([^"]+)".*$/\1/p' file

Where -E specifies extended regex and -n suppresses normal output of pattern space, and:

/^<pkg-info/ locates the line beginning with "<pkg-info", then the normal
s/find/replace/ substitution where find is:
^.*[ ]version="([^"]+)".*$ ignores characters from the beginning of line to a space followed by version=", the capture group ([^"]+) captures one or more characters that follow that are not a '"' (i.e. the version number you want) and then ".*$ ignores from the closing '"' to end of line.
the replace is \1 which simply inserts the first backreference (the stuff captured in the first capture group above), and
/p then prints the result.

Example Use/Output

With your example in file you would have:

$ sed -E -n '/^<pkg-info/s/^.*[ ]version="([^"]+)".*$/\1/p' file
1.2.3

Breaks if there is a \n between <pkg-info and version=
@dawg 100% agreed, the explanation makes clear that /^<pkg-info/ is used to locate that specific line. If the wanted version isn't in the line beginning with /^<pkg-info/, then it will not return the wanted version.
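
To make the limitation discussed in these comments concrete, here is a minimal sketch that runs the same sed command on a wrapped copy of the element. The sample content and the names wrapped and wrapped_out are made up for illustration.

```shell
# Hypothetical sample: the same element with its attributes wrapped across lines.
wrapped=$(mktemp)
cat > "$wrapped" <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<pkg-info overwrite-permissions="true"
          version="1.2.3" format-version="2">
</pkg-info>
EOF

# The address /^<pkg-info/ selects line 2, which has no version= attribute,
# so the substitution does not match and nothing is printed.
wrapped_out=$(sed -E -n '/^<pkg-info/s/^.*[ ]version="([^"]+)".*$/\1/p' "$wrapped")
printf 'output: [%s]\n' "$wrapped_out"
```

The brackets come out empty, because the line that carries version= does not begin with <pkg-info.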

RavinderSingh13 · Accepted Answer · 2021-02-03 02:21:37Z

With your shown samples, could you please try the following. Written and tested in GNU awk. As the experts advise, an XML parsing tool would be the best way to parse an XML file, but since the OP is already using awk on this file, I am going with awk.

awk '
/^<pkg-info/ && match($0,/[[:space:]]+version="([0-9]+\.){2}[0-9]+"[[:space:]]+/){
  val=substr($0,RSTART,RLENGTH)
  gsub(/^[[:space:]]+version="|"[[:space:]]+$/,"",val)
  print val
}
' Input_file

Explanation: Adding detailed explanation for above.

awk '                             ##Starting awk program from here.
/^<pkg-info/ && match($0,/[[:space:]]+version="([0-9]+\.){2}[0-9]+"[[:space:]]+/){
                                  ##Checking condition if line starts from <pkg-info AND matches mentioned regex.
  val=substr($0,RSTART,RLENGTH)   ##Creating val which is sub string of matched regex.
  gsub(/^[[:space:]]+version="|"[[:space:]]+$/,"",val)
                                  ##Removing the leading version=" and the trailing quote and spaces from val.
  print val                       ##Printing val value here.
}
' Input_file                      ##Mentioning Input_file name here.

Does not work if the XML has any \n between pkg-info and version=
@dawg, sure, that's why I had mentioned its clearly written as per shown samples only.

RARE Kpop Manifesto · Accepted Answer · 2021-02-03 17:38:04Z

Assuming no newlines within the tag (works with gawk, mawk or mawk2):

awk 'BEGIN { FS = "version=\"" } /^[<]pkg-info/ {
    print substr($2, 1, index($2, "\"") - 1); exit }' file

Version to handle random \n:

awk 'BEGIN { FS = "version=\"" } (NF > 1) {
    if (seen++) { print substr($2, 1, index($2, "\"") - 1); exit } }' file

This skips the first line that contains version=" (the initial XML declaration). At the second such line it prints the version number and exits. The code makes no assumptions about how version numbers are formatted, other than that they are double-quoted.

Version to account for pkg-info being all over the place:

awk 'BEGIN { RS = "^$"; FS = "([<]pkg-info|[/]pkg-info[>])" }
match($2, /version=[^ ]+/) {
    print substr($2, RSTART + 9, RLENGTH - 10); exit }' file

This reads the whole XML file as a single record instead of splitting it on newlines. With FS set to exactly the opening and closing tags, $2 must be the text that follows the first occurrence of the opening tag.

This actually outputs 1.0 from <?xml version="1.0" ... in the first line. Suggest awk '/<pkg-info/ && match($0, /version=[^ ]+/) { ...
Does not work if the XML has any \n between pkg-info and version=
@dawg : created new variant to account for your feedback. does it work ?

dawg · Accepted Answer · 2021-02-03 18:42:13Z

What you have there is an XML Element with an attribute=value combination you wish to get.

While you could have a simple awk or sed that will retrieve 1.2.3 from the one-line example you have, you really should use an XML parser. If you don't, your command will likely stop working in the future, as soon as the layout of the file changes.

While you have given the attributes all-on-one-line example of:

<?xml version="1.0" encoding="utf-8"?>
<pkg-info overwrite-permissions="true" relocatable="false" identifier="com.application.something" version="1.2.3" format-version="2" generator-version="ABC" install-location="/Applications" auth="root">
</pkg-info>

The same data could just as easily be:

<?xml version="1.0" encoding="utf-8"?>
<pkg-info overwrite-permissions="true" 
          relocatable="false" identifier="com.application.something" 
          version="1.2.3" format-version="2" 
          generator-version="ABC" install-location="/Applications" auth="root">
</pkg-info>

Or,

<?xml version="1.0" encoding="utf-8"?><pkg-info overwrite-permissions="true" relocatable="false" identifier="com.application.something" version="1.2.3" format-version="2" generator-version="ABC" install-location="/Applications"  auth="root"/>

and still be parsed as the same data. All three examples are valid XML but none of the awk or sed answers here handle any but the first example.

For XML, a '\n', ' ', '\t' and '\r' are all the same¹ but to awk and sed those characters have very different meanings. Trying to coerce a line-oriented tool like awk or sed into dealing with tag-oriented data like XML is extremely fragile.
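
To illustrate the whitespace point only, here is a minimal sketch that flattens tabs, carriage returns and newlines to spaces before matching, so that all three layouts above look the same to sed. pkg_version is a hypothetical helper name, and the pattern assumes no attribute value contains a > character. It is still a regex, not a parser.

```shell
# pkg_version is a hypothetical helper name, not part of any tool.
pkg_version() {
    # Turn tabs, carriage returns and newlines into spaces so the whole
    # document becomes one line, then capture the value of the version
    # attribute inside the <pkg-info ...> start tag.
    # Assumption: no attribute value contains ">".
    tr '\t\r\n' '   ' < "$1" |
        sed -E -n 's/.*<pkg-info[^>]* version="([^"]+)".*/\1/p'
}
```

Called as pkg_version file, it treats a line break before version= the same way as a space.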

The best way to deal with this is to use an XPath query. The relevant query would be:

/pkg-info/@version

DEMO

Given file that has some valid form of XML as above, you can use one of these methods.

Here is a simple example in Ruby. Use nokogiri xml parser to parse with an xpath to the attribute of interest:

ruby -r nokogiri -e 'doc=Nokogiri::XML($<.read)
puts doc.xpath("/pkg-info").attribute("version").value' file
1.2.3

(You may need to install nokogiri with gem install nokogiri on your system...)

Or with XMLStarlet:

xml sel -t -v '/pkg-info/@version' file
1.2.3

If you have the XML::XPath module installed with your Perl (and most systems do), you also have a command line XPath query tool called xpath. You can do:

xpath -q -e '/pkg-info/@version' file
 version="1.2.3"

Then run that through sed to just get the value:

xpath -q -e '/pkg-info/@version' file | sed -E 's/[^"]*"([^"]*).*/\1/'
1.2.3

Note that an XML parser will work with any legitimate form of your XML data. The other sed or awk solutions here will not.

And if you wreally wreally wreally want to use a regex, Perl is a better bet. This works with all three examples above:

perl -0777 -lnE 'say $1 if /(?:\s|>)<pkg-info[\s\S]*?\sversion="([^"]+)"/m' file

If you hafta hafta hafta have an awk, you can set RS="^$", which has the effect of reading the entire file in as one string. Then:

Find the point with "<pkg-info ".
Since these are attributes and not nested tags, there will be no > in the attribute section. But, no matter how the <pkg-info element is terminated, there must be a > to terminate it.
Now sub everything on either side of the ' version=" value with ""
Print and profit.

This awk works with all my examples; HOWEVER, you really should use an XML parser.

awk -v RS="^$" '{ x=index($0, "<pkg-info ")
                  s=substr($0,x)
                  sub(/[^>]*\sversion="/,"", s)
                  sub(/".*/,"", s)
                  print s
                }' file
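
The command above leans on two things that POSIX awk does not guarantee: a multi-character RS and the \s regex operator, which is a GNU extension. Here is a rough sketch of the same steps using only portable constructs. pkg_version_posix is a hypothetical helper name, and the sketch assumes no attribute value contains >.

```shell
# pkg_version_posix is a hypothetical helper name, not part of any tool.
pkg_version_posix() {
    awk '
    { buf = buf $0 " " }                 # join all lines, newline becomes a space
    END {
        x = index(buf, "<pkg-info")      # find the start of the element
        if (!x) exit 1
        s = substr(buf, x)
        sub(/>.*/, "", s)                # keep only the start tag (assumes no ">" in values)
        # whitespace, then version="...": skip the 10 leading characters
        # and drop the closing quote
        if (match(s, /[ \t\r]version="[^"]*"/))
            print substr(s, RSTART + 10, RLENGTH - 11)
    }' "$1"
}
```

Called as pkg_version_posix file. The same advice applies: you really should use an XML parser.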

¹ So long as those characters are insignificant whitespace, which they are in this example...

Michael Back · Accepted Answer · 2021-03-16 14:25:29Z

Awk and XML are not the best of friends because awk is a regular-expression-driven, line-based tool. XML is not a simple format that can be easily filtered with line-based tools, and it is difficult to write a regular expression that reliably matches all the ways XML can be laid out.

To make sure we don't make a mistake, we leverage a state machine (a filter) that understands XML to transform it into something line-based that we can work with reliably. One such tool is xml2, which provides a parseable "flat" output from XML. Here is an example of the filtered result of your sample:

$ xml2 < some.xml
/pkg-info/@overwrite-permissions=true
/pkg-info/@relocatable=false
/pkg-info/@identifier=com.application.something
/pkg-info/@version=1.2.3
/pkg-info/@format-version=2
/pkg-info/@generator-version=ABC
/pkg-info/@install-location=/Applications
/pkg-info/@auth=root

After filtering the XML, it is trivial to create a reliable awk or sed filter to grab our output... Here are a couple ideas:

$ xml2 < some.xml | awk -F= '$1 == "/pkg-info/@version" { print $2 }'
1.2.3
$ xml2 < some.xml | sed -e 's,^/pkg-info/@version=,,; t; d'
1.2.3
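
One caveat with -F=, sketched here under the assumption that an attribute value may itself contain an = character: awk then splits the value as well, and $2 holds only its first piece. The note attribute and the variable names below are made up for illustration; cutting each line at the first = only avoids the problem.

```shell
# Hypothetical flat line in the path=value style shown above; the value contains "=".
flat='/pkg-info/@note=a=b'

# With -F= every "=" is a separator, so $2 stops at the second "=".
split_all=$(printf '%s\n' "$flat" | awk -F= '$1 == "/pkg-info/@note" { print $2 }')

# Cutting at the first "=" only keeps the whole value.
split_first=$(printf '%s\n' "$flat" | awk '{
    eq = index($0, "=")                  # position of the first "="
    if (eq && substr($0, 1, eq - 1) == "/pkg-info/@note")
        print substr($0, eq + 1)         # everything after it
}')

printf '%s\n%s\n' "$split_all" "$split_first"
```

The first command prints a, the second prints a=b. For the version attribute in this question the two approaches give the same result.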
