Bash - Filter rows with a certain proportion of columns occupied

Question

So, I've got a large number of files, each one with 8 columns and a lot of rows. Here's a head from one of them for an example.

ID       Ct       1          2          3          4           5             6
1        0        consensus  -          -          -           -             -
2        0        consensus  -          -          -           -             -
3        0        consensus  consensus  consensus  consensus   consensus     consensus
4        0        consensus  -          consensus  -           -             -
5        0        -          AT         AT         GC          GC            AT
6        0        consensus  -          -          -           consensus     -
7        0        consensus  -          -          -           -             -
8        0        consensus  consensus  consensus  -           consensus     consensus
9        0        consensus  -          -          -           -             -

I want to separate out all the rows where the last 6 columns are at least 5/6 occupied: IDs 3, 5 and 8 (rows 4, 6 and 9) in the sample above. Effectively, I want all the rows that have fewer than 2 columns containing "-".

I used to be able to do that with a simple awk script, because the program that produces the files recorded the number of occupied columns in the second column. It no longer seems to do that. What's the best way to do it?

Should the header be printed? Are all files in the same directory? Do all files have a header?
– jesse_b, Commented Jan 24, 2020 at 20:41
Does the file contain tabs or spaces between the columns? (The problem with your awk script may depend on that.)
– Volker Siegel, Commented Jan 24, 2020 at 21:29
The files are in separate directories; I've got a shell script to go through each. As for the header, whichever is easier. All the files have a header and tabs between columns.
– Dellion, Commented Jan 24, 2020 at 22:14

RudiC · Accepted Answer · 2020-01-24 22:33:35Z

How far would

awk 'gsub(/-/, "&") < 2' file
ID       Ct       1          2          3          4           5             6
3        0        consensus  consensus  consensus  consensus   consensus     consensus
5        0        -          AT         AT         GC          GC            AT
8        0        consensus  consensus  consensus  -           consensus     consensus

get you? Be aware that nothing was said regarding the desired output: do you want a single output file, file names prefixed to output lines, new files with names similar to the original ones, or something else?

EDIT (after comment on new file names):

awk 'gsub(/-/, "&") < 2 {print > (FILENAME ".new")}' /path/to/file/*

answered Jan 24, 2020 at 22:05, edited Jan 24, 2020 at 22:33 by RudiC

New files with similar names to the old one - so that should work if I just pipe that to a new file, right? Seems like that should do the job, thanks!

– Dellion, Commented Jan 24, 2020 at 22:21
Unless you use gawk, that will fail with a "too many open files" error after about a dozen output files have been created. You should close() them as you go.
– Ed Morton, Commented Jan 25, 2020 at 21:42
@Ed Morton: Right, keep an eye on OPEN_MAX, which is 1024 on my Linux system, and/or kern.maxfilesperproc, which is 1591 on my FreeBSD system.
– RudiC, Commented Jan 25, 2020 at 22:50
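
Following the comments above, here is a minimal sketch of the same filter that closes each output file before moving on to the next input, so only one output file is open at a time whichever awk is used. The function name `filter_to_new` is hypothetical; the `.new` suffix follows the answer.

```shell
# Hypothetical helper: for each input file, write the rows containing
# fewer than two "-" characters to "<input>.new".
filter_to_new() {
    awk '
        FNR == 1 {                      # first line of each input file
            if (out != "") close(out)   # release the previous output file
            out = FILENAME ".new"
            printf "" > out             # create the output even if no row matches
        }
        gsub(/-/, "&") < 2 { print > out }
    ' "$@"
}
```

It would be called as `filter_to_new /path/to/file/*`. As in the answer, `gsub` counts every "-" character on the line, so this assumes hyphens only ever appear as empty-cell markers.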

jesse_b · Accepted Answer · 2020-01-24 20:37:40Z

If all the files are in the same directory you can use a for loop/glob to loop over each file and run the awk command on them:

for file in /path/to/files/*; do
    awk '{
        # count the "-" entries in columns 3 to 8
        count=0
        for (i=3;i<=8;i++) {
            if ($i == "-") {
                count++
            }
        }
        # keep the line if at most one column is empty
        if (count <= 1) {
            print
        }
    }' "$file"
done

For each line, it loops through columns 3-8 and increments count whenever a column is exactly "-". A line is printed only if count is at most 1. The header line contains no "-" in those columns, so it is printed as well.
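
The counting logic described above can also be sketched as a small reusable function for the tab-separated files mentioned in the comments. The function name `filter_occupied` and the `max_empty` variable are hypothetical and not part of the original answer; the sketch assumes an empty cell is exactly `-` and that the data columns run from column 3 to the last field.

```shell
# Hypothetical helper: print lines with at most max_empty "-" cells
# in columns 3..NF of tab-separated input.
filter_occupied() {
    # $1 = maximum number of "-" cells allowed; remaining args = input files
    max_empty=$1
    shift
    awk -F '\t' -v max_empty="$max_empty" '
        {
            count = 0
            for (i = 3; i <= NF; i++)
                if ($i == "-") count++
        }
        count <= max_empty      # pattern only: print the line when true
    ' "$@"
}
```

With six data columns, `filter_occupied 1 "$file"` keeps the rows that are at least 5/6 occupied. Passing the threshold as an argument makes it easy to try a different proportion without editing the awk code.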

answered Jan 24, 2020 at 20:37 by jesse_b

A count seems like a good solution - I don't have all the files in the same directory, but I do have a shell script set up to apply the command to all the files. So adding that awk line into that should do nicely. Thanks!

– Dellion, Commented Jan 24, 2020 at 22:24

steeldriver · Accepted Answer · 2020-01-24 22:29:47Z

Perl is handy for this kind of thing - in particular, it allows fieldwise grep without an explicit loop, the result of which (when evaluated in scalar context) gives a count of matches. So for example

$ perl -lane 'print if 2 > grep { $_ eq "-" } splice @F, 2' file
ID       Ct       1          2          3          4           5             6
3        0        consensus  consensus  consensus  consensus   consensus     consensus
5        0        -          AT         AT         GC          GC            AT
8        0        consensus  consensus  consensus  -           consensus     consensus
