Select columns where a value appears more than x times

Question

I have a file with several rows and columns. I want to select columns where the number 2 appears more than x times.

My tab separated file looks like this:

Individuals  M1 M2 M3
Ind1          0 0  2
Ind2          0 2  2
Ind3          2 2  2

In this cartoon example, let's say that I want the columns where the number 2 appears two or more times. My output would be:

Individuals   M2 M3
Ind1          0  2
Ind2          2  2
Ind3          2  2

With R this is quite easy, but takes forever because the file is too big, so I would like to do it with awk or something similar. Could you please tell me how to achieve this?

I'm not sure how well datamash transpose performs on a file of your size. Try datamash transpose -W < filename | grep -E -e Indi -e "(2.*){2,}" | datamash transpose -W
– Philippos, Commented Sep 11, 2019 at 10:11

Kusalananda · Accepted Answer · 2019-09-11 10:43:36Z

BEGIN { OFS = FS = "\t" }

FNR == NR {
        for (i = 2; i <= NF; ++i)
                if ($i == 2) ++c[i]
        next
}

{
        a[nf=1] = $1
        for (i = 2; i <= NF; ++i)
                if (c[i] >= t) a[++nf] = $i

        $0 = ""
        for (i = 1; i <= nf; ++i)
                $i = a[i]

        print
}

This awk program counts the number of occurrences of the value 2 in each column and stores these counts in the array c (one element in this array per column of data). It does this while reading the input file the first time (this is the FNR == NR block).

When reading the input file a second time it uses these counts to transfer the appropriate columns from the input to the array a for each line read. The value of the variable t is used as the threshold value to decide whether the column should be included or not. This is the first for loop in the last block in the code.

It then creates a new data record from this array and prints it.

Testing it (note that the input file is given twice on the command line for awk to be able to do two passes over it):

$ cat file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2

$ awk -v t=1 -f script.awk file file
Individuals     M1      M2      M3
Ind1    0       0       2
Ind2    0       2       2
Ind3    2       2       2

$ awk -v t=2 -f script.awk file file
Individuals     M2      M3
Ind1    0       2
Ind2    2       2
Ind3    2       2

$ awk -v t=3 -f script.awk file file
Individuals     M3
Ind1    2
Ind2    2
Ind3    2

$ awk -v t=4 -f script.awk file file
Individuals
Ind1
Ind2
Ind3
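
Before choosing a value for t, it can help to look at the per-column counts that the first pass collects. The following is a minimal sketch, not part of the original answer: it runs only the counting step and prints one line per column. The function name count_per_column, the variable v (the value to count), and the array name are hypothetical names chosen for illustration; the sample file is the one shown above.

```sh
# Recreate the tab-separated sample data used in the answer.
printf 'Individuals\tM1\tM2\tM3\nInd1\t0\t0\t2\nInd2\t0\t2\t2\nInd3\t2\t2\t2\n' > file

# count_per_column VALUE FILE (hypothetical helper)
# Single pass: remember the header names, count VALUE in each data column,
# then print "name<TAB>count" for every column after the first.
count_per_column() {
        awk -v v="$1" '
        BEGIN { FS = OFS = "\t" }
        FNR == 1 { for (i = 2; i <= NF; ++i) name[i] = $i; nf = NF; next }
        { for (i = 2; i <= NF; ++i) if ($i == v) ++c[i] }
        END { for (i = 2; i <= nf; ++i) print name[i], c[i] + 0 }
        ' "$2"
}

count_per_column 2 file
```

For the sample data this reports a count of 1 for M1, 2 for M2 and 3 for M3, which matches the columns kept at each threshold in the test runs above.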

It is working very well. Thanks!! Can you recommend any material (a web page or anything else) for learning awk better?
– Eric González, Commented Sep 11, 2019 at 11:26
@EricGonzález I can only say that I've learnt to use awk by frequently consulting the manual (man awk), solving problems, and also by carefully reading other people's solutions on this site (and their corrections to my own answers). In general, you learn better by solving real problems.
– Kusalananda ♦, Commented Sep 11, 2019 at 11:43

pLumo · Answer · 2019-09-11 10:33:48Z

Not sure whether this is any faster:

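awk -v value=2 '
# First pass: count how often value occurs in each column.
NR==FNR{for(i=2;i<=NF;i++){if($i==value){s[i]++}}}
# Second pass: print column 1 and every column with more than one occurrence.
NR!=FNR {
  # Index of the last selected column (stays unset if none qualifies).
  for (i=2;i<=NF;i++){if(s[i]>1)last=i}
  # End the line right after column 1 when no column was selected.
  printf "%s%s",$1,(last?OFS:"\n")
  for (i=2;i<=NF;i++){
    if(s[i]>1){
      if (i==last)printf "%s\n",$i
      else printf "%s%s",$i,OFS}
  }
}
' file file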

You might want to set OFS to tab (BEGIN{OFS="\t"}).

edited Sep 11, 2019 at 10:33

answered Sep 11, 2019 at 10:28

Maybe someone can come up with a more elegant solution for finding the last key with value > 1...
– pLumo, Commented Sep 11, 2019 at 10:34
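
One possible way to avoid searching for the last selected column, offered here as a sketch under assumptions rather than as pLumo's own method: collect the indices of the selected columns once, when the second pass starts, and print the separator before each selected field instead of after it. The function name select_columns, the variable min, and the array keep are hypothetical names; min=2 corresponds to the s[i]>1 test in the answer.

```sh
# Recreate the tab-separated sample data from the question.
printf 'Individuals\tM1\tM2\tM3\nInd1\t0\t0\t2\nInd2\t0\t2\t2\nInd3\t2\t2\t2\n' > file

# select_columns VALUE MIN FILE (hypothetical helper); FILE is read twice.
select_columns() {
  awk -v value="$1" -v min="$2" '
  BEGIN { FS = OFS = "\t" }
  # First pass: count occurrences of value in every column after the first.
  NR == FNR { for (i = 2; i <= NF; i++) if ($i == value) s[i]++; next }
  # Start of the second pass: collect the indices of the columns to keep.
  FNR == 1 { for (i = 2; i <= NF; i++) if (s[i] >= min) keep[++n] = i }
  # Every second-pass line: column 1, then each kept field preceded by OFS.
  {
    printf "%s", $1
    for (j = 1; j <= n; j++) printf "%s%s", OFS, $(keep[j])
    print ""
  }
  ' "$3" "$3"
}

select_columns 2 2 file
```

Because the separator is written before each kept field, no field needs special treatment as the last one, and a line with no selected columns still ends with a newline.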
