Revisions to Bash: Nested while loop to detect duplicates and number the duplicates

deleted 8 characters in body

edited Nov 5, 2020 at 10:44

I have a text file with headers for genes, and there are different gene sequences under the same species. I extracted the headers (headers.txt) and copied them into another file (uniqueheaders.txt), then removed all the duplicates in uniqueheaders.txt.

I am trying to read uniqueheaders.txt line by line and, for each line, read through headers.txt to check for duplicates. The if statement detects a duplicate and increments a counter, which is then appended to the header. This should number all the headers in headers.txt so that I can insert them back into my FASTA file. My code is here:

while IFS= read -r uniqueline
do
    counter=0
    while IFS= read headline
    do
        if [ "$uniqueline" == "$headline" ]
        then
            let "counter++"
            #append counter to the headline variable to number it.
            sed "$headline s/$/$counter/" -i headers
        fi
    done < headers.txt
done < uniqueheaders.txt

The issue is that the terminal keeps spitting out the error

sed: -e expression #1, char 1: unknown command: 'M'

and

sed: -e expression #1, char 2: extra characters after command

Both files contain unique header names:

Mus musculus
Homo sapiens
Rattus norvegicus

How do I modify the sed command to prevent this error? Is there a better way of doing this in bash?
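
For what it's worth, the first error looks consistent with sed parsing the expanded header as its script: "$headline s/$/$counter/" becomes Mus musculus s/$/1/, and M is not a sed command. The second error is presumably the same problem with a header whose first letter happens to be a valid command (such as H in Homo sapiens). A minimal sketch, assuming the header is meant to act as a regex address; the number_header name is made up for illustration, and the header is assumed to contain no slashes or regex metacharacters:

```shell
# Hypothetical helper: number_header HEADER NUMBER
# Reads lines on stdin and appends NUMBER to every line equal to HEADER.
# Assumption: HEADER contains no '/' and no regex metacharacters.
number_header() {
    # /^HEADER$/ is an address that selects matching lines;
    # s/$/NUMBER/ then appends NUMBER to the end of each selected line.
    sed "/^$1\$/ s/\$/$2/"
}

printf 'Mus musculus\nHomo sapiens\n' | number_header 'Mus musculus' 1
```

Note that this appends the same number to every matching line, so on its own it would not give each duplicate a different number.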

Example input (note that the gene sequences don't follow a pattern in terms of how many lines each one takes up; the gene sequences are all in one file):

Mus musculus 
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
 
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
 
Mus musculus
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF

Desired output:

Mus musculus1 
MDFJSGHDFSBGKJBDFSGKJBDFS
NGBJDFSBGKJDFSHNGKJDFSGHG
 
Rattus norvegicus
SNOFBDSFNLSFSFSFSJFJSDFSD
 
Mus musculus2
NJALDJASJDLAJSJAPOJPOASDJG
DSFHBDSFHSDFHDFSHJDFSJKSSF
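
One possible alternative to the nested loops is a single awk call that reads headers.txt first to count each header, then numbers only the headers that occur more than once while passing every other line through. This is a sketch under assumptions, not a definitive solution: the number_duplicates name and the sequences.txt file name are hypothetical, header lines are assumed to match the entries in headers.txt exactly (including any trailing spaces), and headers.txt is assumed to be non-empty.

```shell
# Hypothetical helper: number_duplicates HEADERS_FILE SEQUENCE_FILE
# Prints SEQUENCE_FILE with a running number appended to each header
# that appears more than once in HEADERS_FILE.
number_duplicates() {
    awk '
        # First file (headers): count how often each header occurs.
        NR == FNR { total[$0]++; next }
        # Second file (sequences): number repeated headers, print everything.
        {
            key = $0
            if ((key in total) && total[key] > 1) {
                n = ++seen[key]      # running count for this header
                $0 = key n           # append the count to the header line
            }
            print
        }
    ' "$1" "$2"
}

# Example call (sequences.txt is an assumed name for the combined file):
# number_duplicates headers.txt sequences.txt > numbered.txt
```

Headers that occur only once, like Rattus norvegicus in the example, are left unnumbered, which matches the desired output shown above.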

Formatting and tags

edited Nov 5, 2020 at 10:19

AdminBee

added 541 characters in body

edited Nov 5, 2020 at 10:14

Jerry

asked Nov 5, 2020 at 9:22

Jerry
