bash - select columns based on values

Question

I am new to bash and have the below requirement:

I have a file as below:

col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s1,s2..........s2

Notice that the values of these columns can be one of only 3 types: s1, s2, s3.

I can extract the last 2 rows from the given file, which gives me:

col1,col2,col3....col25
s3,s1,s2..........s2

I want to further parse the above lines so that I get only the columns with a given value, say s2.

Desired output: if col3 and col25 are the only columns with value s2, then a comma-separated list is fine, e.g.:

col3,col25

Can someone please help?

P.S. I found many examples where a file is parsed based on the value of, say, the 2nd (fixed) column, but how do we do it when the column number is not fixed? Checked URLs: awk one liner select only rows based on value of a column

If col2 and col4 are the only columns with value s1, then a comma-separated value is also fine, e.g.: col2, col4
– learner, Commented Sep 4, 2017 at 12:56
@learner, update your question with the desired result set; in this case you'll probably want to update your sample data since nothing (at the moment) shows contents for col4 (sure, we can imagine what the sample data looks like but it wouldn't hurt to show it in your question)
– markp-fuso, Commented Sep 4, 2017 at 13:15

markp-fuso · Accepted Answer · 2017-09-04 13:58:05Z

Assumptions:

there are 2 input lines
each input line has the same number of comma-separated items

We can use a couple of arrays to collect the input data, making sure to use the same array indexes for both. Once the data is loaded into the arrays, we loop through them looking for our value match.

$ cat col.awk
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END {
sep=""
for (i=1; i<=n; i++)
    { if (arr_s[i]==smatch)
         { printf "%s%s" ,sep,arr_c[i]
           sep=", "
         }
    }
}

/col1/ : for the line that contains col1, store the fields in array arr_c
n=NF : grab our max array index value (NF=number of fields)
! /col1/ : for line that does not contain col1, store the fields in array arr_s
END ... : executed once the arrays have been loaded
sep="" : set our initial output separator to a null string
for (...) : loop through our array indexes (1 to n)
if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
sep=", " : set our separator for the next match in the loop

We use printf because, unlike print, it does not append a newline unless '\n' is specified, so all matches end up on one line.

Example:

$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4

-F, : define the input field separator as a comma
smatch=s1 : pass our search value s1 in the awk variable named smatch (a plain variable, not an array), which is referenced in the awk code (see col.awk above)
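
A side note on how the value is passed: a var=value operand such as smatch=s1 is assigned only when awk reaches it in the argument list, so it is not yet set inside a BEGIN block. The POSIX -v option assigns the variable before BEGIN runs. Below is a minimal sketch of the same script using -v and reading from standard input; the wrapper name cols_with_value is hypothetical, and the final printf "\n" is an addition so the output ends with a newline.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the answer): print the headers whose
# value equals $1. Expects a header line containing "col1" and one value line.
cols_with_value() {
    awk -F, -v smatch="$1" '
          /col1/ { for (i = 1; i <= NF; i++) arr_c[i] = $i; n = NF }
        ! /col1/ { for (i = 1; i <= NF; i++) arr_s[i] = $i }
        END {
            sep = ""
            for (i = 1; i <= n; i++)
                if (arr_s[i] == smatch) { printf "%s%s", sep, arr_c[i]; sep = ", " }
            printf "\n"   # added: terminate the output line
        }
    '
}

# Same sample data as col.out
printf 'col1,col2,col3,col4,col5\ns3,s1,s2,s1,s3\n' | cols_with_value s1
```

With the sample data this prints col2, col4, the same result as the invocation above.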

If you want to do the whole thing at the command line:

$ awk -F, '
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END {
sep=""
for (i=1; i<=n; i++)
    { if (arr_s[i]==smatch)
         { printf "%s%s" ,sep,arr_c[i]
           sep=", "
         }
    }
}
' smatch=s1 col.out
col2, col4

Or collapsing the END block to a single line:

awk -F, '
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4

Amazing answer. It not only works, but the detailed explanation is also very useful. Truly a remarkable answer. Thanks a lot.

Marc Lambrichs · Accepted Answer · 2017-09-04 14:24:25Z

A solution in awk that prints a result row after parsing each pair of rows.

$ cat tst.awk
BEGIN {FS=","; p=0}
/s1|s2|s3/ {
   for (i=1; i<=NF; i++) {
      if ($i=="s2") str = sprintf("%s%s", str?str ", ":str, c[i])
   };
   p=1
}
!p { for (i=1; i<=NF; i++) { c[i] = $i } }
p { print str; p=0; str="" }

Rationale: build up your result string str while looping through the value row.

whenever the input line contains s1, s2 or s3, loop through the fields and, if a field equals s2, append the column name with index i to the result string str; then set the print flag p to 1.
if p = 0, fill the column-name array c
if p = 1, print the result string str, then reset p and str
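
If the file strictly alternates header row and value row, the row type can also be decided from the line number instead of matching /s1|s2|s3/. The following is only a sketch under that assumption, not the answer's own method; the wrapper name pairs_with_value and the variable val are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: for each header/value pair read from stdin,
# print the headers whose value equals $1 (an empty line if none match).
pairs_with_value() {
    awk -F, -v val="$1" '
        # Odd lines are assumed to be header rows: remember the column names.
        NR % 2 == 1 { for (i = 1; i <= NF; i++) c[i] = $i; next }
        # Even lines are value rows: collect the matching column names.
        {
            str = ""
            for (i = 1; i <= NF; i++)
                if ($i == val) str = (str == "" ? "" : str ", ") c[i]
            print str
        }
    '
}

printf '%s\n' \
    'col1,col2,col3,col4,col5' 's1,s2,s2,s3,s1' \
    'col1,col2,col3,col4,col5' 's1,s1,s1,s3,s3' | pairs_with_value s2
```

For these two pairs it prints col2, col3 followed by an empty line, since the second value row contains no s2.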

With input:

$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3

The result is:

$ awk -f tst.awk input.txt
col2, col3
col3

col3

Notice the empty 3rd line: no s2's for that one.

Aaron · Accepted Answer · 2017-09-04 13:47:44Z

I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding values are s1:

#<yourTwoLines> | 
  tac | 
  awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
              NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'

It works in the following way:

reverse the line order with tac, so that the values (the criteria) are handled before the headers (which we will print based on the criteria).
when handling the first line (now the values) with awk, record in an array which fields are s1
when handling the second line (now the headers) with awk, print those that correspond to an s1 value, using the previously filled array.
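
To try the steps above end to end, here is a small self-contained run. It assumes GNU tac is installed and uses tail -n 2 to stand in for the "last two lines" step from the question. The sample file and the final paste -sd, - stage (which joins the printed headers into one comma-separated line) are additions for illustration, not part of the answer.

```shell
#!/bin/sh
# Sample data (assumed for illustration): two header/value pairs.
input=$(mktemp)
printf '%s\n' \
    'col1,col2,col3,col4,col5' 's1,s2,s2,s3,s1' \
    'col1,col2,col3,col4,col5' 's3,s1,s2,s1,s3' > "$input"

# Keep the last pair, put the value row first, flag the s1 fields,
# then print the matching headers and join them with commas.
result=$(tail -n 2 "$input" |
  tac |
  awk -F ',' 'NR == 1 { for (f = 1; f <= NF; f++) relevant[f] = ($f == "s1") }
              NR == 2 { for (f = 1; f <= NF; f++) if (relevant[f]) print $f }' |
  paste -sd, -)

echo "$result"
rm -f "$input"
```

For the last pair in this sample the result is col2,col4.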

anubhava · Accepted Answer · 2017-09-04 14:14:55Z

Let's say you have this:

cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2

Then you can use this awk:

awk -F, -v val='s2' '{
   s="";
  for (i=1; i<=NF; i++)
     if (NR==1)
        hdr[i]=$i
     else if ($i==val)
        s=s hdr[i] FS;
  if (s) {
     sub(/,$/, "", s);
     print s
  }
}' file

col3,col25

Thanks for your response.
– learner

Vicky · Accepted Answer · 2017-09-08 12:35:56Z

If the order of the returned columns is not a concern:

awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]",";  gsub(/,$/,"", m); print m }'
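
The order in which for (i in b) visits the indexes is unspecified in awk, which is why the one-liner does not guarantee column order. If order does matter, a counting loop over the field numbers keeps the headers in their original sequence. A sketch under the same assumption (header on line 1, value rows after it); the wrapper name ordered_cols and the variable val are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: print, in header order, the columns that hold
# the value $1 in any value row. Reads the CSV from stdin.
ordered_cols() {
    awk -F, -v val="$1" '
        NR == 1 { n = NF; for (i = 1; i <= NF; i++) a[i] = $i; next }
        { for (i = 1; i <= NF; i++) if ($i == val) b[i] = 1 }
        END {
            m = ""
            for (i = 1; i <= n; i++)      # numeric loop: visits 1..n in order
                if (i in b) m = (m == "" ? "" : m ",") a[i]
            print m
        }
    '
}

printf 'col1,col2,col3,col4,col5\ns3,s1,s2,s1,s2\n' | ordered_cols s2
```

Here the output is col3,col5, always in that order.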
