I am stuck with this script, and it would be great if you could help me with your input. My problem is that I think the script is not very efficient: it takes a long time to finish running.
I have a FASTA file with around 9000 sequence lines (example below). What my script does is:
- reads the first line (ignoring lines that start with >), makes 6mers (6-character blocks), and adds these 6mers to a list
- makes the reverse complement of those 6mers (list2)
- saves the line if none of the reverse-complement 6mers are in the line.
- then goes to the next line in the file and checks whether it contains any of the reverse-complement 6mers (in list2). If it does, the line is discarded. If it does not, the line is saved and the reverse-complement 6mers of the new line are added to list2, in addition to the reverse-complement 6mers that were already there.
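To make the two building blocks in the steps above concrete, here is a minimal sketch in plain Python (no Biopython). The helper names `kmers` and `reverse_complement` are hypothetical, chosen only for illustration, and the sketch assumes the sequences contain only upper-case A, C, G and T.

```python
# Translation table for the DNA complement.
# Assumption: sequences contain only upper-case A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers(seq, k=6):
    """Return all overlapping k-character blocks of seq, left to right."""
    # range() is empty when seq is shorter than k, so no length check is needed.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    line = "TCAGATGTGTAT"  # first 12 bases of seq1 from the example file
    blocks = kmers(line)
    rc_blocks = [reverse_complement(b) for b in blocks]
    print(blocks)
    print(rc_blocks)
```

Slicing only up to `len(seq) - k + 1` avoids producing the short trailing blocks, so the `len(kmer) == 6` test becomes unnecessary.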
My file:
>seq1
TCAGATGTGTATAAGAGACAGTTATTAGCCGGTTCCAGGTATGCAGTATGAGAA
>seq2
TCAGATGTGTATAAGAGACAGCGCCTTAATGTTGTCAGATGTCGAAGGTTAGAA
>seq3
TCAGATGTGTATAAGAGACAGTGTTACAGCGAGTGTTATTCCCAAGTTGAGGAA
>seq4
TCAGATGTGTATAAGAGACAGTTACCTGGCTGCAATATGGTTTTAGAGGACGAA
And this is my code:
import sys
from Bio import SeqIO
from Bio.Seq import Seq


def hetero_dimerization():
    script = sys.argv[0]
    file1 = sys.argv[1]
    list = []
    list2 = []
    with open(file1, 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 6])
                if len(kmer) == 6:
                    list.append(kmer)
            #print(record.seq)
            #print(list)

            for kmers in list:
                C_kmer = Seq(kmers).complement()
                list2.append(C_kmer[::-1])
            #print(list2)

            cnt=0
            if any(items in record.seq for items in list2):
                cnt +=1

            if cnt == 0:
                print('>'+record.id)
                print(record.seq)


if __name__ == '__main__':
    hetero_dimerization()
6mer, you calculate the reverse complement of each 6mer you have already found and append it to list2. Let's number the found 6mers m1, m2, ... and the respective complements c1, c2, ...; after the third iteration, list will contain [m1,m2,m3], and list2 will contain [c1,c1,c2,c1,c2,c3]. Could you please clarify if that is intended and, if yes, why?

[m1,m2,m3] from seq1, and their respective complements [c1,c2,c3] should be added to list2. When iterating over seq2, the script should first check whether any of [c1,c2,c3] are in seq2. If yes, seq2 should be discarded; otherwise it should be saved, its respective 6mer complements [c4,c5,c6] should be added to list2, and the updated list2 should be [c1,c2,c3,c4,c5,c6].

seq3, if any of the respective complements are in seq3, then seq3 should be discarded; otherwise it should be saved, its respective 6mer complements should be added to list2, and the updated list2 should be [c1,c2,c3,c4,c5,c6,c7,c8,c9,...].
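The intended behaviour clarified in these comments could be sketched roughly as follows. This is not the posted code but one possible reading of it, under some assumptions: reverse-complement 6mers are stored in a set rather than a list, each 6mer is added only once, and a sequence is tested by checking whether any of its own 6mers is in that set (which is equivalent to asking whether any stored reverse complement occurs in the sequence). The function names and the `check_self` flag are hypothetical. The flag covers the ambiguity between the question text, where the first line is also checked against its own reverse complements, and the comments, where only earlier sequences are mentioned.

```python
# Assumption: sequences contain only upper-case A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the resulting string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """Return the set of distinct overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_sequences(records, k=6, check_self=True):
    """Keep sequences that share no k-mer with the reverse complements
    of previously kept sequences.

    records is an iterable of (name, sequence) pairs; the kept pairs
    are returned in input order.
    """
    seen_rc = set()  # reverse-complement k-mers of every kept sequence
    kept = []
    for name, seq in records:
        own = kmer_set(seq, k)
        own_rc = {reverse_complement(m) for m in own}
        # A stored reverse complement occurs in seq exactly when it
        # equals one of seq's own k-mers, so a set intersection suffices.
        clash = not own.isdisjoint(seen_rc)
        # Optional: also reject sequences that contain the reverse
        # complement of one of their own k-mers.
        if check_self and not own.isdisjoint(own_rc):
            clash = True
        if not clash:
            kept.append((name, seq))
            seen_rc |= own_rc
    return kept
```

With Biopython, the input pairs could be produced with something like `((r.id, str(r.seq)) for r in SeqIO.parse(path, "fasta"))`. Each sequence then costs a number of set lookups proportional to its own length, instead of a substring search for every 6mer collected so far.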