I have the following example of a dataframe, in which elements of the 3rd column can be duplicated. For each duplicated value, I want to keep the entry that has the highest value in column 5.

For example, for AGCCCGGGG I want to keep the second entry, whose 5th column has the value 49.

A00643:620:HFM7YDSX5:1:1124:7120:12352  ATCAGCCCGGGGCTTGGGCTAGGAC   GGGTGTGTG   548476  0   Corynebacterium
A00643:620:HFM7YDSX5:1:1150:15953:12524 CCTATCGTCGCTGGAATTCCCCGGG   AGCCCGGGG   1458266 1   Bordetella
A00643:620:HFM7YDSX5:1:1150:15628:12743 CCTATCGTCGCTGGAATTCCCCGGG   AGCCCGGGG   1458266 49  Bordetella
A00643:620:HFM7YDSX5:1:1450:4001:4507   GGCGATCGAAATGTCAAGCCCGGGG   TCTTGTGGT   585529  0   Corynebacterium
A00643:620:HFM7YDSX5:1:2124:8865:2472   ATCAGCCCGGGGCTTGGGCTAGGAC   GGGTGTGTG   548476  0   Corynebacterium
A00643:620:HFM7YDSX5:1:2476:4001:29496  ATTCACCCTATAGGAGCCCGGGGCA   TGCCCCGGG   1458266 0   Bordetella
  • Does your data use spaces or tabs as separators? Are all columns always present? May the data be resorted? How to handle if the number in $5 appears multiple times for equal $3? Commented May 15, 2023 at 10:41
  • yes it's tab separated and all columns are always present. If the number in $5 appears multiple times select randomly. Commented May 15, 2023 at 11:12

3 Answers

awk is a useful tool here:

awk -F'\t' 'l[$3] {if ($5>n[$3]) {n[$3]=$5; l[$3]=$0} ; next} 
            {n[$3]=$5 ; l[$3]=$0}
            END { for (i in l) {print l[i]}}' infile

-F'\t' - use tabs as field separators

Let's start with the second line: n[$3]=$5 stores the number in column 5 in an array n indexed by column 3, and l[$3]=$0 stores the whole line in an array l under the same index. However, this only happens at the first occurrence of each unique value in column 3, because of the first line:

l[$3] {...} means the commands in braces are only executed if an element in array l with index $3 (= column 3) is present. In that case, column 5 is compared to the value stored in n, and both arrays are updated if column 5 is larger. next means skip to the next record, i.e. the next line of the file.

END - loops through array l and prints all lines with unique $3 and the (first) highest value in $5. The order of the original file is not maintained.
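
If the original order matters, one possible variation is to record each column 3 value the first time it appears and print in that order. This is only a sketch, assuming tab-separated input as stated in the comments; the function name dedup_keep_order and the arrays order, max and line are made up for illustration and are not part of the answer above.

```shell
# Hypothetical helper: for each value of column 3, keep the line with the
# highest column 5, printing results in order of first appearance.
dedup_keep_order() {
    awk -F'\t' '
        # First time this key is seen: remember its position in the input.
        !($3 in max) { order[++count] = $3 }
        # Store the line if the key is new or column 5 is strictly larger.
        !($3 in max) || ($5 + 0 > max[$3]) { max[$3] = $5 + 0; line[$3] = $0 }
        # Print the stored lines in first-appearance order.
        END { for (i = 1; i <= count; i++) print line[order[i]] }
    ' "$@"
}
```

It can be called as dedup_keep_order infile, or fed from a pipe. Because the comparison is strict, the first line wins when several lines share the same highest value.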


Using any sort and any awk:

$ sort -rnk5,5 file | awk '!seen[$3]++'
A00643:620:HFM7YDSX5:1:1150:15628:12743 CCTATCGTCGCTGGAATTCCCCGGG       AGCCCGGGG       1458266 49      Bordetella
A00643:620:HFM7YDSX5:1:2476:4001:29496  ATTCACCCTATAGGAGCCCGGGGCA       TGCCCCGGG       1458266 0       Bordetella
A00643:620:HFM7YDSX5:1:2124:8865:2472   ATCAGCCCGGGGCTTGGGCTAGGAC       GGGTGTGTG       548476  0       Corynebacterium
A00643:620:HFM7YDSX5:1:1450:4001:4507   GGCGATCGAAATGTCAAGCCCGGGG       TCTTGTGGT       585529  0       Corynebacterium

or using any awk on its own:

$ awk '
    !($3 in max) || ($5 > max[$3]) { max[$3]=$5; line[$3]=$0 }
    END { for (key in max) print line[key] }
' file
A00643:620:HFM7YDSX5:1:2476:4001:29496  ATTCACCCTATAGGAGCCCGGGGCA       TGCCCCGGG       1458266 0       Bordetella
A00643:620:HFM7YDSX5:1:1150:15628:12743 CCTATCGTCGCTGGAATTCCCCGGG       AGCCCGGGG       1458266 49      Bordetella
A00643:620:HFM7YDSX5:1:2124:8865:2472   ATCAGCCCGGGGCTTGGGCTAGGAC       GGGTGTGTG       548476  0       Corynebacterium
A00643:620:HFM7YDSX5:1:1450:4001:4507   GGCGATCGAAATGTCAAGCCCGGGG       TCTTGTGGT       585529  0       Corynebacterium

We use !($3 in max) to initialize max[$3] the first time a given $3 is seen, rather than based on any $5 value, so it will work even if the max value is 0 or negative. The rule of thumb for min/max calculations is to always initialize using the first value read, not 0 or any other arbitrary value.
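
To illustrate that rule of thumb, here is a small sketch contrasting the two approaches on data where every value in column 5 is negative. Both function names, max_naive and max_first_value, are hypothetical and exist only for this comparison.

```shell
# Hypothetical helper: relies on the unset max[$3], which compares as 0,
# so a negative $5 never exceeds it and no line is ever stored.
max_naive() {
    awk '$5 > max[$3] { max[$3] = $5; line[$3] = $0 }
         END { for (key in max) print line[key] }' "$@"
}

# Hypothetical helper: seeds max[$3] from the first value read for each key.
max_first_value() {
    awk '!($3 in max) || ($5 > max[$3]) { max[$3] = $5; line[$3] = $0 }
         END { for (key in max) print line[key] }' "$@"
}
```

Given two lines with the same $3 and the values -7 and -3 in $5, max_first_value prints the line containing -3, while max_naive prints neither record.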

We don't need to set FS to a tab since you said all fields are always present and, as best I can tell, there can never be a blank in any of the first 5 fields.

Just use sort:

sort -k3,3 -k5,5nr /tmp/my_acgt_file | sort -k3,3 -u

Left-hand side sorts your file on the 3rd field (lowest first) and then, in case of equality, on the 5th field (highest first).

Right-hand side only considers the (already sorted) 3rd field and guarantees it will be unique by keeping the first line it meets and discarding the others.
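
One caveat worth hedging: POSIX only says that sort -u suppresses all but one line in each set with equal keys, without saying which one, so keeping the first line is behaviour of particular implementations such as GNU sort. A possible variation, sketched below, keeps the first sort but lets awk pick the first line per key, and sets the tab separator explicitly. The function name dedup_sorted is made up for this example.

```shell
# Hypothetical helper: sort by column 3, then by column 5 (numeric, descending),
# and keep only the first line seen for each column 3 value.
dedup_sorted() {
    tab=$(printf '\t')    # literal tab, passed to sort as the field separator
    sort -t "$tab" -k3,3 -k5,5nr "$@" | awk -F'\t' '!seen[$3]++'
}
```

Usage would be dedup_sorted /tmp/my_acgt_file. As with the original pipeline, the output is ordered by column 3 rather than by the input order.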
