Awk to get the attribute value from XML file

Question

I want to get the value of the code attribute from each c tag in the XML file shown below.

random.xml

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

Currently the logic is:

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f' RS='"'

How does the above logic work to get the values of code from tag c?

It produces the expected output:

abc
efg

Just for clarification: the question boils down to this: in awk you have a field containing a string of the form foo="bar", and you want to extract bar from it. Is this understanding correct?
– user1934428, Commented Jan 10, 2023 at 12:00
In this case, it is solely a question about awk programming and the XML context is irrelevant. May I suggest that you edit the question so that it reflects only the problem at hand, dropping the whole XML part. Having said this: as you can see from the awk man page, the functions split, or alternatively substr (perhaps combined with index), would be an obvious solution. Did you consider using one of them?
– user1934428, Commented Jan 10, 2023 at 14:33
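
As a minimal sketch of the split approach suggested in the comment above: the helper name attr_value is hypothetical, and the input is assumed to be a single string of the form name="value" containing exactly one pair of double quotes.

```shell
# Hypothetical helper: print the value part of a name="value" string.
attr_value() {
    printf '%s\n' "$1" | awk '{
        # Split on the double quote: parts[1] is name=, parts[2] is the value.
        n = split($0, parts, "\"")
        if (n >= 2) print parts[2]
    }'
}

attr_value 'code="abc"'   # prints abc
```

The same result could be reached with index and substr; split is used here only because it needs a single call.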

Daweo · Accepted Answer · 2023-01-10 12:38:59Z

Firstly observe that

cat random.xml | egrep "<c.*/>" | awk -F1 ' /code=/ {f=NR} f&&NR-1==f'  RS='"'

is of dubious quality, as

egrep does not require standard input; it can read the file itself, so this is a useless use of cat
the pattern is simple and works equally well in plain grep, so there is no need to summon extended grep (egrep); this usage is overkill
1 is set as the field separator in awk (-F1), but the code makes no use of the fields mechanism

After fixing these issues, the code looks like this:

grep "<c.*/>" random.xml | awk ' /code=/ {f=NR} f&&NR-1==f'  RS='"'

How it works: grep selects the lines that contain <c followed by zero or more arbitrary characters followed by />. awk is then told, via RS='"', that records (rows) are separated by double quotes ("). When a record contains code=, the variable f is set to the number of that record. The second pattern, f&&NR-1==f, prints a record when f is non-zero and equal to the current record number minus one, that is, it prints the record directly after the one containing code=. Because that record is the text between the opening and closing quote, it is the attribute value.
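
To make the record splitting visible, here is a small sketch that prints every record awk sees together with its number. The function name show_records and the one-line sample are assumptions made for illustration; in the real pipeline the input is the grep output for the whole file.

```shell
# Hypothetical one-line sample taken from random.xml.
line='<c id="123" code="abc" date="12-12-2022"/>'

# Print each record with its record number when RS is a double quote.
show_records() {
    printf '%s' "$1" | awk '{ printf "%d: [%s]\n", NR, $0 }' RS='"'
}

show_records "$line"
# 1: [<c id=]
# 2: [123]
# 3: [ code=]
# 4: [abc]
# 5: [ date=]
# 6: [12-12-2022]
# 7: [/>]
```

Record 3 matches /code=/, so f becomes 3; record 4 then satisfies NR-1==f and is printed.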

Observe that GNU AWK is poorly suited for working with XML, and using regular expressions against XML is a very poor idea, as XML is not a regular (Chomsky Type 3) language.
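
A short sketch of how fragile the line-based approach is: the element below is the same as in random.xml, except that its attributes are wrapped onto a second line, which is still well-formed XML. The function name extract_code and both sample strings are assumptions for illustration.

```shell
# The grep/awk pipeline from above, reading XML from standard input.
extract_code() {
    grep "<c.*/>" | awk '/code=/ {f=NR} f&&NR-1==f' RS='"'
}

# Element written on one line, as in random.xml.
single='<c id="123" code="abc" date="12-12-2022"/>'

# Same element, attributes wrapped onto a second line.
wrapped='<c id="123"
   code="abc" date="12-12-2022"/>'

printf '%s\n' "$single"  | extract_code   # prints abc
printf '%s\n' "$wrapped" | extract_code   # prints nothing: no single line contains both <c and />
```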

If possible, use proper tools for working with XML data. For example, hxselect might be used in the following way. Let the content of file.xml be

<a>
    <b>
        <c id="123" code="abc" date="12-12-2022"/>
        <c id="123" code="efg" date="12-12-2022"/>
        <c id="123" date="12-12-2022"/>
    </b>
</a>

then

hxselect -c -s '\n' 'c[code]::attr(code)' < file.xml

gives output

abc
efg

Explanation: -c prints just the value rather than the name and value; -s '\n' separates the results with a newline, i.e. each value is put on its own line; c[code] is a CSS3 selector meaning any c tag with a code attribute; ::attr(code) is an hxselect feature meaning get the attribute named code. Observe that this solution is more robust than the peculiar cat-egrep-awk pipeline, as it is immune to e.g. different whitespace usage in the file (whitespace outside tags in XML is optional).

Fravadona · Accepted Answer · 2023-01-10 12:26:12Z

This might be an awk question but parsing XML should be done with XML tools.

Here's an example with Xidel (available for several operating systems) and a standard XPath expression:

xidel --xpath '//c[@code]/@code' random.xml

Note: //c[@code] selects the c nodes that have a code attribute, and .../@code outputs the value of the code attribute.

Output

abc
efg

edited Jan 10, 2023 at 12:26

answered Jan 10, 2023 at 11:02

Fravadona


blhsing · Accepted Answer · 2023-01-10 09:59:21Z

If your input always looks like the sample XML, then you can make the code attribute itself a field separator and < the record separator, so that you can easily extract the value as the second field when the first field is the tag name c:

awk -F' .*code="|" ' -v RS='<' '$1=="c"{print $2}' random.xml

Demo: https://awk.js.org/?snippet=Lz6yx7
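
To see why the test $1=="c" filters out tags without a code attribute, this sketch prints the first two fields of each non-empty record under the same separators. The function name show_fields and the two-element sample are assumptions for illustration.

```shell
# Print $1 and $2 for every non-empty record, using the separators from the answer.
show_fields() {
    awk -F' .*code="|" ' -v RS='<' 'NF { printf "$1=[%s] $2=[%s]\n", $1, $2 }'
}

# Hypothetical sample: one c tag with a code attribute, one without.
sample='<c id="123" code="abc" date="12-12-2022"/><c id="123" date="12-12-2022"/>'

printf '%s' "$sample" | show_fields
# $1=[c] $2=[abc]
# $1=[c id="123] $2=[date="12-12-2022"/>]
```

With a code attribute present, everything from the first space up to code=" is consumed as one separator, leaving c alone in $1. Without it, only the quote-plus-space alternative matches, so $1 is longer than c and the record is skipped.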

answered Jan 10, 2023 at 9:59

blhsing
