Counting DNA codons in DNA file

Question

I want to create a bash script that takes a DNA file, checks that it has no newline or white space characters, and then outputs the unique codons along with the number of times each one occurs. I have used the following code, but it keeps giving me an output of "bash-3.2$". I am confused about whether my syntax is wrong and why I'm not getting the proper output.

! /bin/bash

for (( pos=1; pos < length - 1; ++pos )); do
    codon = substr($1, $pos, 3)
    tr-d '\n' $1 | awk -f '{print $codon}' | sort | uniq -c
done

For example, if a file named dnafile contains the pattern aacacgaactttaacacg, then the script should take the following input and produce this output:

 $script dnafile              
 aac 3
 acg 2
 ttt 1

A few notes: i) you really, really don't want to do this sort of thing in the shell, especially not for large files. ii) you should probably look at some basic tutorials on shell scripting: there is no such thing as substr in the shell, you can't have spaces around the = in variable assignments, and your shebang is wrong. iii) remember that there are 6 possible reading frames; are you sure you only need to look at one? iv) your DNA file will almost never just have sequence in it; you usually have some sort of header and extra information (fasta, fastq, sam, etc.).
– terdon ♦, Commented Apr 13, 2020 at 13:16
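
To illustrate the syntax points in the comment above, here is a minimal sketch of how the loop from the question might look in plain bash. It assumes a single reading frame starting at the first base, and list_codons is a hypothetical helper name, not something from the original post. Bash's substring expansion ${seq:pos:3} stands in for the non-existent substr.

```bash
#!/bin/bash
# Sketch only: a pure-bash version of the loop from the question.
# list_codons is a hypothetical helper name.
list_codons() {
    local seq pos
    # Assignment has no spaces around "="; tr reads the file on stdin.
    seq=$(tr -d ' \t\r\n' < "$1")
    # ${seq:pos:3} is bash substring expansion (offsets start at 0).
    # A trailing group of fewer than three bases is skipped.
    for (( pos = 0; pos + 3 <= ${#seq}; pos += 3 )); do
        printf '%s\n' "${seq:pos:3}"
    done
}

# Run only when a file name is given, e.g.: ./script dnafile
if [ "$#" -gt 0 ]; then
    list_codons "$1" | sort | uniq -c
fi
```

As the comment says, a shell loop like this is slow on large files, so it is only meant to show the syntax.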

Kusalananda · Accepted Answer · 2020-04-12 17:28:08Z

9

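You get that output because the first line of your script is not a comment: it runs /bin/bash as a command (the ! only negates its exit status), so you end up in a new interactive bash shell showing its bash-3.2$ prompt.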

That line should read

#!/bin/bash

(note the # at the start).

You then mix awk syntax with shell code in a way that will never work.

Instead, keep it simple and chop up your file in groups of three characters, sort these and count how many unique ones you get:

$ fold -w 3 dnafile | sort | uniq -c
   3 aac
   2 acg
   1 ttt

This would work as long as the input always contains a multiple of three characters, with no embedded spaces or other characters.
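
If you want the script to verify those conditions itself, something along these lines might do. This is only a sketch under assumptions: count_codons is a hypothetical helper name, a single final newline is tolerated, and any other white space is treated as an error. The awk step at the end just swaps the columns to give the "codon count" layout shown in the question.

```bash
#!/bin/bash
# Sketch only: validate the input, then count codons with fold | sort | uniq -c.
# count_codons is a hypothetical helper name.
count_codons() {
    local file=$1 seq
    # Reject files with more than one line or with any white space inside a line.
    if ! awk 'NR > 1 || /[[:space:]]/ { bad = 1 } END { exit bad + 0 }' "$file"; then
        echo "error: $file contains white space or extra newlines" >&2
        return 1
    fi
    seq=$(tr -d '\n' < "$file")
    if [ -z "$seq" ] || (( ${#seq} % 3 != 0 )); then
        echo "error: sequence is empty or not a multiple of three bases" >&2
        return 1
    fi
    # uniq -c prints "count codon"; awk swaps the columns to "codon count".
    printf '%s\n' "$seq" | fold -w 3 | sort | uniq -c | awk '{ print $2, $1 }'
}

# Run only when a file name is given, e.g.: ./script dnafile
if [ "$#" -gt 0 ]; then
    count_codons "$1"
fi
```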


How would I check for spaces and whether it contains a multiple of three characters?
– Rajveer Singh, Commented Apr 12, 2020 at 17:44
@RajveerSingh You could delete spaces and newlines etc. with tr -d '\n ' <dnafile | fold -w 3 | sort | uniq -c. If the file does not contain a multiple of three characters, there will be a short codon in the output.
– Kusalananda ♦, Commented Apr 12, 2020 at 18:03
When I use tr -d '\n ' <dnafile | fold -w 3 | sort | uniq -c, I get the following error: tr [-Ccsu] string1 string2 tr [-Ccu] -d string1 tr [-Ccu] -s string1 tr [-Ccu] -ds string1 string2.
– Rajveer Singh, Commented Apr 12, 2020 at 18:19
@RajveerSingh There is nothing wrong with the command in your latest comment. In fact, I'm able to copy it from the comment and run it directly.
– Kusalananda ♦, Commented Apr 12, 2020 at 18:56
No need to sort: fold -w 3 dna | perl -ne '$a{$_}++;END{print map {$a{$_}." ".$_} keys %a}'
– Ole Tange, Commented Apr 12, 2020 at 21:05


Ole Tange · Accepted Answer · 2020-04-13 14:19:06Z

3

(echo aacacgaactttaacacg ;echo aacacgaactttaacacg ) |
  perl -ne '# Split input into triplets (A3)
            # use each triplet as key in the hash table count
            #   and increase the value for the key
            map { $count{$_}++ } unpack("(A3)*",$_);
            # When we are at the end of the file
            END{ 
                 # Remove the key "" (which is wrong)
                 delete $count{""};
                 # For each key: Print key, count
                 print map { "$_ $count{$_}\n" } keys %count
            }'

edited Apr 13, 2020 at 14:19

answered Apr 12, 2020 at 16:04


bu5hman · Accepted Answer · 2020-04-12 17:28:30Z

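A slightly more long-winded awk version (GNU awk, since it relies on BEGINFILE and ENDFILE)

awk 'BEGINFILE { print FILENAME; delete codon }
     ENDFILE {
         # FNR (not NR) counts lines per file, so later files are judged on their own
         if (FNR != 1 || NF != 1 || length($0) % 3 != 0) {
             print "is broken"
         } else {
             for (i = 1; i <= length($0); i += 3) codon[substr($0, i, 3)]++
         }
         for (c in codon) print c, codon[c]
         print ""
     }' file*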

For this input

file1 : OK

aacacgaactttaacacg

file2 : space

aacacgaact ttaacacg

file3 : linebreak

aacacgaact
ttaacacg

file4 : not a multiple of 3 bases

aacacgaactttaacac

You get

file1
aac 3
ttt 1
acg 2

file2
is broken

file3
is broken

file4
is broken

If you just want to repair the files, and none of them is like file4, pass them through tr first, either by piping the result into awk or by putting a here-string at the end of the awk command in place of file*, just like in your example

<<< "$(cat file[1-3] | tr -d '\n ')"
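
Note that a single here-string over several files hands awk one joined sequence. If you would rather repair and count each file separately, a loop along these lines might work. This is only a sketch: count_repaired is a hypothetical helper name, it uses portable awk rather than the GNU-only BEGINFILE/ENDFILE, and it silently drops a trailing group of fewer than three bases.

```bash
#!/bin/bash
# Sketch only: strip spaces and newlines from each file, then count with awk.
# count_repaired is a hypothetical helper name.
count_repaired() {
    local f
    for f in "$@"; do
        printf '%s\n' "$f"
        # tr joins the file into one line; awk walks it three bases at a time.
        tr -d ' \n' < "$f" |
            awk '{ for (i = 1; i + 2 <= length($0); i += 3) codon[substr($0, i, 3)]++ }
                 END { for (c in codon) print c, codon[c] }' |
            sort
        echo
    done
}

# Run only when file names are given, e.g.: ./script file1 file2 file3
if [ "$#" -gt 0 ]; then
    count_repaired "$@"
fi
```

The final sort is only there to make the order of the output predictable, since awk's for (c in codon) loop visits keys in no particular order.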
