Inefficient DNA heterodimerization script

Question

I have stuck with this script it would be great if you could help me with your inputs. My problem is that I think the script is not that efficient - it takes a lot of time to end running.

I have a fasta file with around 9000 sequence lines (example below) and What my script does is:

reads the first line (ignores lines start with >) and makes 6mers (6 character blocks)
adds these 6mers to a list
makes reverse-complement of previous 6mers (list2)
saves the line if non of the reverse-complement 6mers are in the line.
Then goes to the next line in the file, and check if it contains any of the reverse-complement 6mers (in list2). If it does, it discards it. If it does not, it saves that line, and reads all reverse complement 6-mers of the new one into the list2 - in addition to the reverse-complement 6-mers that were already there.

my file:

>seq1
TCAGATGTGTATAAGAGACAGTTATTAGCCGGTTCCAGGTATGCAGTATGAGAA
>seq2
TCAGATGTGTATAAGAGACAGCGCCTTAATGTTGTCAGATGTCGAAGGTTAGAA
>seq3
TCAGATGTGTATAAGAGACAGTGTTACAGCGAGTGTTATTCCCAAGTTGAGGAA
>seq4
TCAGATGTGTATAAGAGACAGTTACCTGGCTGCAATATGGTTTTAGAGGACGAA

and this is my code:

import sys
from Bio import SeqIO
from Bio.Seq import Seq

def hetero_dimerization():
    script = sys.argv[0]
    file1 = sys.argv[1]
    list = []
    list2 = []
    with open(file1, 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 6])
                if len(kmer) == 6:
                    list.append(kmer)
            #print(record.seq)
            #print(list)

            for kmers in list:
                C_kmer = Seq(kmers).complement()
                list2.append(C_kmer[::-1])
            #print(list2)

            cnt=0
            if any(items in record.seq for items in list2):
                cnt +=1

            if cnt == 0:
                print('>'+record.id)
                print(record.seq)
                
if __name__ == '__main__':
    hetero_dimerization()

The current question title, which states your concerns about the code, applies to too many questions on this site to be useful. The site standard is for the title to simply state the task accomplished by the code. Please see How do I ask a good question?. — BCdotWEB
– BCdotWEB, Commented Mar 17, 2021 at 15:37
I have zero biology knowledge but there is something I suspect may not be intended: every time you find a 6mer, you calculate the reverse complement of each 6mer you have already found and append it to list2. let's number the found 6mers m1, m2, ... and the respective complements c1, c2,...; after the third iteration, list will contain [m1,m2,m3], and list2 will contain [c1,c1,c2,c1,c2,c3]. Could you please clarify if that is intended and, if yes, why? — danzel
– danzel, Commented Mar 17, 2021 at 16:54
@danzel - that was already a good hint, could you please help me to solve this? it should not be like that - meaning let's suppose in the first iteration I have 6mers [m1,m2,m3] from seq1 and their respective complements are should be added to list2 [c1,c2,c3] and when iteration over the seq2 - the script first should look if any of the [c1,c2,c3] are in seq2 if yes then the seq2 should be discarded else should be saved and its respective 6mer complements [c4,c5,c6] should be added to the list2 and the updated list2 should be [c1,c2,c3,c4,c5,c6] — Apex
– Apex, Commented Mar 17, 2021 at 17:15
Same for the next line  when reading seq3, if any of respective complements are in seq3 then this seq3 should be discarded, else should be saved and its respective 6mer complements should be added to the list2 and the updated list2 should be [c1,c2,c3,c4,c5,c6,c7,c8,c9,...] — Apex
– Apex, Commented Mar 17, 2021 at 17:15
Dear @danzel it would be great if I can get any help from you - many thanks — Apex
– Apex, Commented Mar 17, 2021 at 20:12

FirefoxMetzger · Accepted Answer · 2021-03-18 20:40:53Z

Here is your original code with comments on which lines should be revised for clarity

import sys
from Bio import SeqIO
from Bio.Seq import Seq

def hetero_dimerization():
    script = sys.argv[0]  # script is unused; the line should be removed
    file1 = sys.argv[1]
    list = []  # list is a python built-in python; overwrite it at your own peril. I'd suggest renaming this variable
    list2 = []
    with open(file1, 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 6])
                if len(kmer) == 6:  # this check is redundant. kmer is guaranteed to be of length 6
                    list.append(kmer)
            #print(record.seq)  # if you submit code for review, avoid commented out source code. it is confusing
            #print(list)  # same here, avoid commented out code

            # this is a bug. You add every kmer to the list even if you discard the sequence
            # you also add redundant ones, which causes considerable slowdown
            for kmers in list:
                C_kmer = Seq(kmers).complement()
                list2.append(C_kmer[::-1])
            #print(list2)  # avoid commented out code

            # adding a counter for a simple if is not very clean
            # instead consider `if not any(...):`
            cnt=0
            if any(items in record.seq for items in list2):
                cnt +=1

            if cnt == 0:
                print('>'+record.id)
                print(record.seq)
                
if __name__ == '__main__':
    hetero_dimerization()

Your code contains a bug (@danzel already pointed that out), as you append the list if inverse complements for every 6mer you encounter, irrespective of it being from a sequence you wish to discard and irrespective of it being already contained in the list. You've fixed the bug in your updated code and in the process also changed the algorithm to avoid a lot of unnecessary work.

You can slightly (~15%) improve this further by (a) switching the complement list to a set and searching the set for each 6mer in the sequence (instead of the other way around) and by stopping the search early (breaking) if you encounter a 6mer that is already in the list. I also took the liberty to python-ify your code a bit in the process.

def kmer_generator(record, k=6):
    sequence = record.seq
    for idx in range(len(sequence) - (k-1)):
        yield sequence[idx:(idx+6)]

def hetero_dimerization_optimized():
    taboo_list = set()
    resulting_sequences = list()
    with open("fasta-file", 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            new_taboo_items = set()
            for kmer in kmer_generator(record):
                if str(kmer) in taboo_list:
                    break
                complement = str(kmer.reverse_complement())
                new_taboo_items.add(complement)
            else:
                # this is entered only if no break occurs
                # i.e. if no kmer was in the taboo_list :)
                taboo_list = taboo_list.union(new_taboo_items)
                resulting_sequences.append(record.seq)
    return resulting_sequences

From here, your most noticable speedups would come from (a) writing your own reverse_complement function that reduces constant overhead (~50% of the runtime is spend on this line), (b) switching to numpy and using chararray to get access to np.view and avoid further copies of your data; this is in addition to (a) and could save you an additional 20%+ runtime.

After this, you could look into multiprocessing to use additional cores to pre-filter sequences, but tbh. this will likely be over-engieered for a mere 9000 sequences.

Apex · Accepted Answer · 2021-03-18 12:17:48Z

I think this should work - this is my own answer, please give your inputs

I removed one list to avoid another layer of for loop in the code and performed the reverse-complement conversion in one compact loop. Also, I added the pass - else command to avoid writing 6mers of lines that do not meet the condition.

with open('file.fasta', 'r') as file:
    list = []
    for record in SeqIO.parse(file, 'fasta'):

        cnt = 0
        if any(items in record.seq for items in list):
            cnt += 1
            pass
        else:
            for i in range(len(record.seq)):
                kmer = str(record.seq[i:i + 6])
                if len(kmer) == 6:
                    C_kmer = Seq(kmer).complement()
                    list.append(C_kmer[::-1])

        if cnt == 0:
            print('>' + record.id)
            print(record.seq)

A rewrite is not a code review. You'll need to explain the changes. — BCdotWEB
– BCdotWEB, Commented Mar 17, 2021 at 22:53
Welcome to Code Review! You have presented an alternative solution, but haven't reviewed the code. Please explain your reasoning (how your solution works and why it is better than the original) so that the author and other readers can learn from your thought process. — Vogel612
– Vogel612, Commented Mar 18, 2021 at 12:02

Stack Exchange Network

Inefficient DNA heterodimerization script

2 Answers 2

You must log in to answer this question.

Hot Network Questions

Inefficient DNA heterodimerization script

2 Answers 2

You must log in to answer this question.

Related

Hot Network Questions