4

I have a file with several rows and columns. I want to select columns where the number 2 appears more than x times.

My tab separated file looks like this:

Individuals  M1 M2 M3
Ind1          0 0  2
Ind2          0 2  2
Ind3          2 2  2

In this cartoon example, let's say that I want the columns where the number 2 appears two or more times. My output would be:

Individuals   M2 M3
Ind1          0  2
Ind2          2  2
Ind3          2  2

With R this is quite easy, but takes forever because the file is too big, so I would like to do it with awk or something similar. Could you please tell me how to achieve this?

1
  • I'm not sure how the performance of datamash transpose is for your file. Try datamash transpose -W < filename |grep -E -e Indi -e "(2.*){2,}" | datamash transpose -W Commented Sep 11, 2019 at 10:11

2 Answers 2

1
BEGIN { OFS = FS = "\t" }

FNR == NR {
        for (i = 2; i <= NF; ++i)
                if ($i == 2) ++c[i]
        next
}

{
        a[nf=1] = $1
        for (i = 2; i <= NF; ++i)
                if (c[i] >= t) a[++nf] = $i

        $0 = ""
        for (i = 1; i <= nf; ++i)
                $i = a[i]

        print
}

This awk program would count the number of occurrences of the value 2 in each column and store these counts in the array c (one lement in this array per column of data). It does this while reading the input file the first time (this is the FNR == NR block).

When reading the input file a second time it uses these counts to transfer the appropriate columns from the input to the array a for each line read. The value of the variable t is used as the threshold value to decide whether the column should be included or not. This is the first for loop in the last block in the code.

It then creates a new data record from this array and prints it.

Testing it (note that the input file is given twice on the command line for awk to be able to do two passes over it):

$ cat file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2
$ awk -v t=1 -f script.awk file file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2
$ awk -v t=2 -f script.awk file file
Individuals     M2      M3
Ind1    0       2
Ind2    2       2
Ind3    2       2
$ awk -v t=3 -f script.awk file file
Individuals     M3
Ind1    2
Ind2    2
Ind3    2
$ awk -v t=4 -f script.awk file file
Individuals
Ind1
Ind2
Ind3
2
  • It is working very well. Thanks!! Do you recommend me any material (web page or anything) to learn better awk? Commented Sep 11, 2019 at 11:26
  • @EricGonzález I can only say that I've learnt using awk by frequently consulting the manual (man awk), solving problems, and also by carefully reading other people's solutions on this site (and their corrections to my own answers). In general, you learn better by solving real problems. Commented Sep 11, 2019 at 11:43
1

Not sure if this is any fast:

awk -v value=0 '
NR==FNR{for(i=2;i<=NF;i++){if($i==value){s[i]++}}}
NR!=FNR {
  printf "%s"OFS,$1
  for (i=2;i<=NF;i++){if(s[i]>1)last=i}
  for (i=2;i<=NF;i++){
    if(s[i]>1){
      if (i==last)printf "%s\n",$i
      else printf "%s"OFS,$i}
  }
}
' file file

You might want to set OFS to tab (BEGIN{OFS="\t"}.)

1
  • maybe someone can come with a more elegant solution for finding last key with value > 1... Commented Sep 11, 2019 at 10:34

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.