Group IDs with defined range

Question

I have a sorted file of IDs and numbers (positions). I need to group the rows by the positions in the 2nd column, using a gap of 500 as the cutoff.

If the difference between a row's position and the previous row's position is less than 500, the row goes into the same group as the previous row; if the difference is more than 500, the row starts a new group.

Input file:

snp00001    200
snp00002    300
snp00003    400
snp00004    500
snp00005    600
snp00006    900
snp00007    1500
snp00008    1800
snp00009    3000
snp00010    3500
snp00011    4000
snp00012    5000

Desired output

snp00001 200 Group1
snp00002 300 Group1
snp00003 400 Group1
snp00004 500 Group1
snp00005 600 Group1
snp00006 900 Group1
snp00007 1500 Group2
snp00008 1800 Group2
snp00009 3000 Group3
snp00010 3500 Group3
snp00011 4000 Group4
snp00012 5000 Group5

Extra note: snp00001 to snp00006 are grouped together because the difference between each consecutive pair (snp00002 - snp00001, snp00003 - snp00002, snp00004 - snp00003, ...) is less than 500.

snp00006 and snp00007 are placed in different groups because the difference between them (snp00007 - snp00006) is more than 500.

I've tried with awk, but with no success.

awk -v step=500 -v OFS='\t' '{if(NR==1 || $2+limit){group++} file="Group"group; print file,$0}' input_file

Stephen Harris · Accepted Answer · 2022-04-20 01:35:55Z

You need to keep track of the previous value and compare the current value to this saved one. If the difference is over 500 then increase the group number.

For example:

awk -v group=1 'NR>1 && $2-prev>500 { group++ } { prev=$2; $3="group" group; print }' input_file
snp00001 200 group1
snp00002 300 group1
snp00003 400 group1
snp00004 500 group1
snp00005 600 group1
snp00006 900 group1
snp00007 1500 group2
snp00008 1800 group2
snp00009 3000 group3
snp00010 3500 group3
snp00011 4000 group3
snp00012 5000 group4

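(FWIW, your desired output for snp00009 to snp00011 is inconsistent: 9->10 is a gap of exactly 500 and does not start a new group, but 10->11 is also exactly 500 and does. The code above keeps a gap of exactly 500 in the same group.)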
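
A slightly more general sketch of the same idea, wrapped in a shell function so the threshold can be changed. The names group_by_gap and max are my own and not from the question, and I am assuming that a gap of exactly max should stay in the current group and that the input is already sorted by position:

```shell
# Hypothetical helper: start a new group whenever the gap to the previous
# position exceeds max (first argument, default 500).
# Reads "id position" lines from stdin, sorted by position.
group_by_gap() {
    awk -v max="${1:-500}" '
        BEGIN { group = 1 }
        NR > 1 && $2 - prev > max { group++ }   # gap too large: next group
        { prev = $2; print $1, $2, "Group" group }
    '
}
```

It would be called as group_by_gap 500 < input_file. The NR > 1 guard keeps the first row in Group1 even when the first position is itself larger than the threshold. Change > to >= if a gap of exactly 500 should start a new group.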

Thank you @Stephen Harris!

– austin7923, Apr 20, 2022 at 1:44

jubilatious1 · Accepted Answer · 2022-06-18 02:16:01Z

Using Raku (formerly known as Perl_6)

This is a somewhat different grouping scheme that may prove useful. It uses Raku's ~~ smartmatch operator to test whether a position lies within a range:

~$ raku -e 'my $i = 1; my $r = 1..500; for lines() {my $a = .words;  \
            if ($a.[1].Int ~~ $r) {say "$a Group", $i, " ", $r} else {  \
            repeat { $r+=500 } until ($a.[1].Int ~~ $r);  \
            say "$a Group", ++$i, " ", $r };}' file

Sample Input:

snp00001    200
snp00002    300
snp00003    400
snp00004    500
snp00005    600
snp00006    900
snp00007    1500
snp00008    1800
snp00009    3000
snp00010    3500
snp00011    4000
snp00012    5000

Sample Output (groups SNPs every 500 nucleotides starting from nucleotide 1):

snp00001 200 Group1 1..500
snp00002 300 Group1 1..500
snp00003 400 Group1 1..500
snp00004 500 Group1 1..500
snp00005 600 Group2 501..1000
snp00006 900 Group2 501..1000
snp00007 1500 Group3 1001..1500
snp00008 1800 Group4 1501..2000
snp00009 3000 Group5 2501..3000
snp00010 3500 Group6 3001..3500
snp00011 4000 Group7 3501..4000
snp00012 5000 Group8 4501..5000

The Raku code above declares a group counter $i and an initial range $r of 1..500. Input is read line by line, and each line is split into whitespace-delimited words. If the second column smartmatches (~~) the range $r, the code prints the line, the group number, and the range. Otherwise it shifts $r up by 500 repeatedly until the smartmatch succeeds, then prints the same information with the group number incremented (++$i).

The advantage of this grouping scheme is that all groups span intervals of equal length, in this case 500 nucleotides. This prevents 'dilation' of group intervals, which can occur when several SNPs lie close together and chain into one long group, creating a false impression of 'clustering'.

To make this a more general grouping tool, you can move the interval width into a variable ($m):

~$ raku -e 'my $i=1; my $m=1000; my $r = 1..$m; for lines() {my $a = .words;   if ($a.[1].Int ~~ $r) {say "$a\tGroup$i\t", $r} else { repeat { $r+=$m } until ($a.[1].Int ~~ $r); say "$a\tGroup{++$i}\t", $r };}' file
snp00001 200    Group1  1..1000
snp00002 300    Group1  1..1000
snp00003 400    Group1  1..1000
snp00004 500    Group1  1..1000
snp00005 600    Group1  1..1000
snp00006 900    Group1  1..1000
snp00007 1500   Group2  1001..2000
snp00008 1800   Group2  1001..2000
snp00009 3000   Group3  2001..3000
snp00010 3500   Group4  3001..4000
snp00011 4000   Group4  3001..4000
snp00012 5000   Group5  4001..5000
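
For comparison with the awk answer, the fixed-window scheme described above can also be sketched in awk. This is my own translation, not part of the Raku code: the names group_by_window and width are hypothetical, positions are assumed to be integers of 1 or more, and the input is assumed to be sorted by position.

```shell
# Hypothetical helper: bin positions into fixed windows of `width`
# (first argument, default 500), numbering only the windows that
# actually contain a position. Reads "id position" lines from stdin.
group_by_window() {
    awk -v width="${1:-500}" '
        {
            lo = int(($2 - 1) / width) * width + 1   # window start, e.g. 501
            hi = lo + width - 1                      # window end, e.g. 1000
            if (NR == 1 || lo != prev_lo) group++    # first row or new window
            prev_lo = lo
            print $1, $2, "Group" group, lo ".." hi
        }
    '
}
```

It would be called as group_by_window 500 < file, or with 1000 for the wider windows. The window is computed directly from the position instead of stepping a range forward, so empty windows are skipped without a loop.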

https://docs.raku.org/type/Range
https://raku.org
