Jerry
  • 161
  • 1
  • 2
  • 9

Bash: Nested while loop to detect duplicates and number the duplicates

I have a text file of gene headers in which the same species can appear with several different gene sequences. I extracted the headers into headers.txt, copied them into uniqueheaders.txt, and removed all duplicates from uniqueheaders.txt. I am trying to read uniqueheaders.txt line by line and, for each line, loop over headers.txt to check for duplicates. The if statement detects a duplicate and increments a counter, which is then appended to the header. This should number all the headers in headers.txt so that I can insert them back into my FASTA file. My code is here:

while IFS= read -r uniqueline
do
    counter=0
    while IFS= read -r headline
    do
        if [ "$uniqueline" == "$headline" ]
        then
            let "counter++"
            #append counter to the headline variable to number it.
            sed "$headline s/$/$counter/" -i headers
        fi
    done < headers.txt
done < uniqueheaders.txt

The issue is that the terminal keeps printing the errors sed: -e expression #1, char 1: unknown command: 'M' and sed: -e expression #1, char 2: extra characters after command. Both files contain header names such as:

Mus musculus

Homo sapiens

Rattus norvegicus

How do I modify the sed to prevent this error? Is there a better way of doing this in bash?

Example of input (note that the gene sequences follow no fixed pattern in how many lines they take up):

Mus musculus

MDFJSGHDFSBGKJBDFSGKJBDFS

NGBJDFSBGKJDFSHNGKJDFSGHG

Rattus norvegicus

SNOFBDSFNLSFSFSFSJFJSDFSD

Mus musculus

NJALDJASJDLAJSJAPOJPOASDJG

DSFHBDSFHSDFHDFSHJDFSJKSSF

Desired output:

Mus musculus1

MDFJSGHDFSBGKJBDFSGKJBDFS

NGBJDFSBGKJDFSHNGKJDFSGHG

Rattus norvegicus

SNOFBDSFNLSFSFSFSJFJSDFSD

Mus musculus2

NJALDJASJDLAJSJAPOJPOASDJG

DSFHBDSFHSDFHDFSHJDFSJKSSF
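
For reference, the sed errors come from how the script is expanded: sed treats whatever precedes s as an address, so "Mus musculus s/$/1/" is not a valid address and sed tries to read M as a command. A regex address such as /^Mus musculus$/ would parse, but it would append the same number to every matching line. The following is only a minimal sketch of one alternative, not a definitive solution: it uses awk with two passes over the FASTA file, first counting each header and then appending a running number to headers that occur more than once. The function name number_duplicates is hypothetical, and the sketch assumes that a header is any line that matches a line in uniqueheaders.txt exactly.

```shell
# number_duplicates UNIQUE_HEADERS FASTA   (hypothetical helper name)
# Prints FASTA to stdout with a running number appended to every header
# that occurs more than once. Assumption: a "header" is any line that is
# identical to a non-empty line of UNIQUE_HEADERS.
number_duplicates() {
    unique_file=$1
    fasta_file=$2
    awk '
        # pass 0: remember which lines count as headers
        pass == 0 { if ($0 != "") is_header[$0] = 1; next }
        # pass 1: count how often each header occurs in the FASTA file
        pass == 1 { if ($0 in is_header) total[$0]++; next }
        # pass 2: number only the headers that occur more than once
        ($0 in is_header) && total[$0] > 1 { print $0 (++seen[$0]); next }
        # every other line (sequences, unique headers) is printed unchanged
        { print }
    ' pass=0 "$unique_file" pass=1 "$fasta_file" pass=2 "$fasta_file"
}
```

It could be called as number_duplicates uniqueheaders.txt genes.fasta > numbered.fasta, where genes.fasta and numbered.fasta are placeholder file names. Reading the FASTA file twice keeps headers that appear only once (Rattus norvegicus in the example) unnumbered, as in the desired output.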
