I am new to bash and have the below requirement:

I have a file as below:

col1,col2,col3....col25
s1,s2,s2..........s1
col1,col2,col3....col25
s3,s2,s2..........s2

Note that the values in these columns can be of only 3 types: s1, s2, s3.

I can extract the last 2 rows from the given file, which gives me:

col1,col2,col3....col25
s3,s1,s2..........s2

I want to further parse the above lines so that I get only the columns with a given value, say s2.

Desired output: if, say, col3 and col25 are the only columns with value s2, then a comma-separated list is fine, e.g.:

col3,col25

Can someone please help?

P.S. I found many examples where a file is parsed based on the value of, say, the 2nd (fixed) column, but how do we do it when the column number is not fixed? Checked URLs: awk one liner select only rows based on value of a column

4
  • Show how the final result should look. Commented Sep 4, 2017 at 12:23
  • say col2,col4 are the only columns with value s1, then say a comma separated value is also fine ex: col2, col4 Commented Sep 4, 2017 at 12:56
  • @learner, update your question with the desired result set; in this case you'll probably want to update your sample data since nothing (at the moment) shows contents for col4 (sure, we can imagine what the sample data looks like but it wouldn't hurt to show it in your question) Commented Sep 4, 2017 at 13:15
  • @markp, updated the question with desired resultset. Commented Sep 4, 2017 at 13:19

5 Answers

2

Assumptions:

  • there are 2 input lines
  • each input line has the same number of comma-separated items

We can use a couple of arrays to collect the input data, making sure to use the same array indexes for both. Once the data is loaded into the arrays, we loop through them looking for our value match.

$ cat col.awk
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END {
sep=""
for (i=1; i<=n; i++)
    { if (arr_s[i]==smatch)
         { printf "%s%s" ,sep,arr_c[i]
           sep=", "
         }
    }
}
  • /col1/ : for the line that contains col1, store the fields in array arr_c
  • n=NF : grab our max array index value (NF=number of fields)
  • ! /col1/ : for the line that does not contain col1, store the fields in array arr_s
  • END ... : executed once the arrays have been loaded
  • sep="" : set our initial output separator to a null string
  • for (...) : loop through our array indexes (1 to n)
  • if (arr_s[i]==smatch) : if the s array value matches our input parameter (smatch - see below example), then ...
  • printf "%s%s",sep,arr_c[i] : printf our sep and the matching c array item, then ...
  • sep=", " : set our separator for the next match in the loop

We use printf rather than print because printf does not append a newline unless '\n' is specified, so all matches end up on one line.
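The difference between the two can be seen in isolation. This is only a small sketch; the function names with_print and with_printf are made up for the illustration and are not part of the answer's script.

```shell
# print appends a newline (ORS) after every call;
# printf emits only what its format string says.
with_print()  { awk 'BEGIN { print "col2"; print "col4" }'; }
with_printf() { awk 'BEGIN { printf "%s", "col2"; printf ", %s", "col4"; printf "\n" }'; }

with_print     # two lines: col2, then col4
with_printf    # one line:  col2, col4
```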

Example:

$ cat col.out
col1,col2,col3,col4,col5
s3,s1,s2,s1,s3
$ awk -F, -f col.awk smatch=s1 col.out
col2, col4
  • -F, : define the input field separator as a comma
  • here we pass in our search value s1 in the awk variable named smatch, which is referenced in the awk code (see col.awk above)
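A var=value operand such as smatch=s1 is assigned when awk reaches it in the argument list, that is, after any BEGIN block has run and before the following file is read. The -v option is a common alternative that assigns the variable before BEGIN. Below is a minimal sketch of the -v form wrapped in a shell function; the name cols_with is hypothetical, the input is read from stdin instead of col.out, and a final newline is added to the output, which the original script does not print.

```shell
# Hypothetical helper: print the names of the columns whose value equals $1.
# Expects the header line and the value line on stdin.
cols_with() {
  awk -F, -v smatch="$1" '
      /col1/ { for (i = 1; i <= NF; i++) arr_c[i] = $i; n = NF }   # header row
    ! /col1/ { for (i = 1; i <= NF; i++) arr_s[i] = $i }           # value row
    END {
      sep = ""
      for (i = 1; i <= n; i++)
        if (arr_s[i] == smatch) { printf "%s%s", sep, arr_c[i]; sep = ", " }
      printf "\n"    # end the line, even when nothing matched
    }'
}

printf 'col1,col2,col3,col4,col5\ns3,s1,s2,s1,s3\n' | cols_with s1
```

With the sample data from col.out this should again give col2, col4.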

If you want to do the whole thing at the command line:

$ awk -F, '
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END {
sep=""
for (i=1; i<=n; i++)
    { if (arr_s[i]==smatch)
         { printf "%s%s" ,sep,arr_c[i]
           sep=", "
         }
    }
}
' smatch=s1 col.out
col2, col4

Or collapsing the END block to a single line:

$ awk -F, '
  /col1/ { for (i=1; i<=NF; i++) { arr_c[i]=$i } ; n=NF }
! /col1/ { for (i=1; i<=NF; i++) { arr_s[i]=$i }        }
END { sep="" ; for (i=1; i<=n; i++) { if (arr_s[i]==smatch) { printf "%s%s" ,sep,arr_c[i] ; sep=", " } } }
' smatch=s1 col.out
col2, col4

1 Comment

Amazing answer. It not only works, the detailed explanation is very useful. Truly a remarkable answer. Thanks a lot.
2

A solution in awk that prints one result row after parsing each pair of rows.

$ cat tst.awk
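BEGIN {FS=","; p=0}
/s1|s2|s3/ {
   # value row: collect the names of all columns whose value is s2
   for (i=1; i<=NF; i++) {
      if ($i=="s2") str = sprintf("%s%s", str?str ", ":str, c[i])
   };
   p=1
}
# header row: remember the column names (i<=NF so the last column is included)
!p { for (i=1; i<=NF; i++) { c[i] = $i } }
p { print str; p=0; str="" }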

Rationale: build up the result string str while looping through the value row.

  • whenever the input line contains s1, s2 or s3, loop through the fields and, if the value is s2, append the column name at index i to the result string str; then set the print flag p to 1.
  • if p is 0, store the header fields in the column array c
  • if p is 1, print the result string str, then reset p and str

With input:

$ cat input.txt
col1,col2,col3,col4,col5
s1,s2,s2,s3,s1
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3
col1,col2,col3,col4,col5
s1,s1,s1,s3,s3
col1,col2,col3,col4,col5
s1,s1,s2,s3,s3

The result is:

$ awk -f tst.awk input.txt
col2, col3
col3

col3

Notice the empty 3rd line: no s2's for that one.
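The same per-pair idea can be sketched without hard-coding the value names. This variant assumes that headers always sit on odd-numbered lines and values on even-numbered lines (an assumption, not something the question states), so NR % 2 tells the rows apart. The function name pair_cols and the variable want are made up for the illustration.

```shell
# Hypothetical helper: for each header/value pair on stdin, print the
# names of the columns whose value equals $1 (empty line if none).
pair_cols() {
  awk -F, -v want="$1" '
    NR % 2 == 1 { for (i = 1; i <= NF; i++) c[i] = $i; next }   # header row
    {                                                            # value row
      str = ""
      for (i = 1; i <= NF; i++)
        if ($i == want) str = (str == "" ? "" : str ", ") c[i]
      print str
    }'
}

printf '%s\n' \
  'col1,col2,col3,col4,col5' 's1,s2,s2,s3,s1' \
  'col1,col2,col3,col4,col5' 's1,s1,s1,s3,s3' | pair_cols s2
```

Passing the value with -v means the same function works for s1, s2 or s3 without editing the awk program.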

1 Comment

Thanks for your response.
1

I'm not so good with awk, but here is something that seems to work, outputting only the column names whose corresponding values are s1 :

#<yourTwoLines> | 
  tac | 
  awk -F ',' 'NR == 1 { for (f=1; f<=NF; f++) { relevant[f]= ($f == "s1") } };
              NR == 2 { for (f=1; f<=NF; f++) { if(relevant[f]) print($f) } }'

It works in the following way :

  1. reverse the line order with tac, so the values (the criteria) are handled before the headers (which we will print based on the criteria).

  2. when handling the first line (now the values) with awk, store in an array which fields are s1

  3. when handling the second line (now the headers) with awk, print those that correspond to an s1 value, thanks to the previously filled array.
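Since this prints one column name per line, a comma-separated result needs one more step. Here is a minimal sketch that joins the lines with paste (a POSIX utility) and reads the two lines in their original order, so tac (a GNU coreutils tool) is not needed. The function name s1_cols is hypothetical.

```shell
# Hypothetical helper: header line then value line on stdin,
# comma-separated names of the s1 columns on stdout.
s1_cols() {
  awk -F, '
    NR == 1 { for (f = 1; f <= NF; f++) hdr[f] = $f }                    # header row
    NR == 2 { for (f = 1; f <= NF; f++) if ($f == "s1") print hdr[f] }   # value row
  ' | paste -sd, -
}

printf 'col1,col2,col3,col4,col5\ns3,s1,s2,s1,s3\n' | s1_cols
```

For the sample line above this should print col2,col4.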

1 Comment

Thanks, will check this answer further.
0

Let's say you have this:

cat file
col1,col2,col3,..,col25
s3,s1,s2,........,s2

Then you can use this awk:

awk -F, -v val='s2' '{
   s="";
  for (i=1; i<=NF; i++)
     if (NR==1)
        hdr[i]=$i
     else if ($i==val)
        s=s hdr[i] FS;
  if (s) {
     sub(/,$/, "", s);
     print s
  }
}' file

col3,col25
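To apply this to only the last two rows of a longer file, as described in the question, tail -n 2 can feed the same program. The following is a sketch under that assumption; the wrapper name last_pair_cols and the temporary sample file are made up for the illustration.

```shell
# Hypothetical wrapper: last_pair_cols VALUE FILE
# Looks only at the last header/value pair of FILE.
last_pair_cols() {
  tail -n 2 "$2" | awk -F, -v val="$1" '{
    s = ""
    for (i = 1; i <= NF; i++)
      if (NR == 1) hdr[i] = $i              # first of the two lines: header
      else if ($i == val) s = s hdr[i] FS   # second line: collect matches
    if (s) { sub(/,$/, "", s); print s }    # drop the trailing comma
  }'
}

tmp=$(mktemp)
printf '%s\n' 'col1,col2,col3' 's1,s2,s2' 'col1,col2,col3' 's3,s2,s2' > "$tmp"
last_pair_cols s2 "$tmp"
rm -f "$tmp"
```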

1 Comment

Thanks for your response.
0

If the order of the returned columns is not a concern (for (i in b) visits the array indexes in an unspecified order):

awk -F"," 'NR==1{for(i=1;i<=NF;i++){a[i]=$i};next}{for(i=1;i<=NF;i++){if($i=="s2")b[i]=$i}}END{for( i in b) m=m a[i]",";  gsub(/,$/,"", m); print m }'
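If the order does matter, looping over the numeric index instead of using for (i in b) keeps the columns in their original order. This is a minimal sketch of that variation, not the answer's own code; the function name ordered_cols and the variable want are hypothetical, and the input is read from stdin.

```shell
# Hypothetical helper: header line then value line(s) on stdin,
# matching column names printed in header order.
ordered_cols() {
  awk -F, -v want="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) a[i] = $i; n = NF; next }   # header row
    { for (i = 1; i <= NF; i++) if ($i == want) b[i] = 1 }          # mark matches
    END {
      for (i = 1; i <= n; i++) if (i in b) m = m a[i] ","
      sub(/,$/, "", m)    # drop the trailing comma
      print m
    }'
}

printf 'col1,col2,col3,col4\ns3,s1,s2,s2\n' | ordered_cols s2
```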
