edited Jul 3, 2020 at 18:15

I'm not very familiar with functions (it's why I have randomly assigned globals at the bottom; it's my "cheating" way of trying to make everything work). Additionally, this is my first time designing user inputs in the terminal and using them as "flags" (i.e. if the user types this in, do this). In its current state it's a little ugly: both main_loop and the reverse/forward loops depend on user input and contain multiple nested loops.

Thus I'm looking for 2 things:

1. A way to clean up some of the user input lines so I don't have this multiply nested main loop.
2. Feedback on the design/structure and use of my functions.

Is the code structured properly? Is it clean? Are the methodologies used "best practices"? In other words, are there better ways to do what I am attempting to do?

I'm writing this program with the aim of learning how to write longer/cleaner programs, learning how to design a program to work via the terminal (instead of a GUI), and as an excuse to learn selenium (although I do think it has some practical applications as well).

asked Jul 3, 2020 at 17:24

samman

DNA Translator and Verifier (using BLAST)

Proteins are chains of amino acids. Each amino acid is coded by a codon, a sequence of 3 DNA/RNA nucleotides. A DNA strand also has 3 possible reading frames: the same sequence read starting from the first, second, or third base (i.e. no skipping, skip the 1st base, skip the first 2 bases). Thus, you will have 3 different translations. Additionally, for some sequencing techniques, the length of DNA that can be sequenced in one read is short, so you may need to sequence both forward and backward (-f and -r in my code). Finally, these amino acid sequences start with a specific codon (ATG, which codes for M) and end with one of the stop codons (TAA, TAG, or TGA).
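To make the reading-frame idea concrete, here is a minimal sketch (not the code from my script). The tiny CODONS table and the function names translate_frame and three_frames are made up for this illustration, and the table only covers the codons in the example sequence.

```python
# Hypothetical, deliberately tiny codon table: just enough for this example.
CODONS = {'ATG': 'M', 'TGC': 'C', 'TGT': 'C',
          'GCG': 'A', 'GTG': 'V', 'CGC': 'R'}

def translate_frame(dna, offset):
    """Translate dna starting at offset (0, 1 or 2); '?' marks unknown codons."""
    protein = []
    # Step through the sequence 3 bases at a time, ignoring a trailing partial codon.
    for start in range(offset, len(dna) - 2, 3):
        protein.append(CODONS.get(dna[start:start + 3], '?'))
    return ''.join(protein)

def three_frames(dna):
    """Return the translations of all 3 reading frames of one strand."""
    return [translate_frame(dna, offset) for offset in range(3)]

print(three_frames('ATGTGCGC'))  # ['MC', 'CA', 'VR']
```

Only the first frame starts with M here, so only that frame would yield a candidate protein.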

This code takes the DNA and translates it to an amino acid sequence, using the start and stop codons as borders. It offers the user 3 options: forward sequencing only, reverse sequencing only (where the DNA sequence needs to be reversed and then complemented), or a combination of both. If both are picked, the script looks for a point of intersection and combines the forward and reverse sequences at that intersection. It also lets the user pick among all the potential sequences found. Finally, it uses BLAST to search the picked sequence against a database to confirm the identity of the protein.
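The two less obvious steps above, reverse-complementing and joining at the intersection, can be sketched roughly like this. It is a simplified illustration, not a drop-in replacement for my functions: the names reverse_complement and merge_at_overlap and the example protein strings are made up, and unlike my script it simply concatenates when no overlap is found.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}

def reverse_complement(dna):
    """Read the strand backwards and swap each base for its partner."""
    return ''.join(COMPLEMENT[base] for base in reversed(dna))

def merge_at_overlap(forward, reverse, window=5):
    """Join two protein strings at the point where the first `window`
    residues of `reverse` occur in `forward` (assumed overlap size: 5)."""
    seed = reverse[:window]
    position = forward.find(seed) if len(seed) == window else -1
    if position == -1:
        # No intersection found: fall back to plain concatenation.
        return forward + reverse
    # Keep forward up to the overlap, then continue with reverse.
    return forward[:position] + reverse

print(reverse_complement('ATGC'))                      # GCAT
print(merge_at_overlap('MKTAYIAKQR', 'YIAKQRQISF'))    # MKTAYIAKQRQISF
```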

A basic schematic:

#DNA
ATGTGCGC
#translated
1st reading frame: MC
2nd reading frame: CA
3rd reading frame: VR
#since only 1st reading frame has seq that starts with M
#sequence to search
MC
#Blast will search MC

That's the basic idea.

To run: python script.py -f forward_file.txt -r reverse_file.txt

The correct options to pick when presented with the translations are 1 and 0.
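For reference, the command-line shape described above could be expressed with argparse from the standard library. This is only a sketch of the interface, not what my script does (it reads sys.argv directly); parse_args, forward_file and reverse_file are names I made up for the illustration.

```python
import argparse

def parse_args(argv=None):
    """Parse -f/-r; at least one of the two files must be given."""
    parser = argparse.ArgumentParser(
        description='Translate DNA and pick a sequence to BLAST.')
    parser.add_argument('-f', dest='forward_file',
                        help='sequence file from forward sequencing')
    parser.add_argument('-r', dest='reverse_file',
                        help='sequence file from reverse sequencing')
    args = parser.parse_args(argv)
    if args.forward_file is None and args.reverse_file is None:
        parser.error('give at least one of -f or -r')
    return args

# Passing a list instead of reading sys.argv makes the sketch easy to try out.
args = parse_args(['-f', 'forward_file.txt', '-r', 'reverse_file.txt'])
print(args.forward_file, args.reverse_file)
```

With this shape, a missing flag shows up as None instead of an IndexError on sys.argv.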

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import sys

dna_codon_dict={'TTT':'F','TTC':'F',
                'TTA':'L','TTG':'L',
                'CTT':'L','CTC':'L',
                'CTA':'L','CTG':'L',
                'ATT':'I','ATC':'I',
                'ATA':'I','ATG':'M',
                'GTT':'V','GTC':'V',
                'GTA':'V','GTG':'V',
                'TCT':'S','TCC':'S',
                'TCA':'S','TCG':'S',
                'CCT':'P','CCC':'P',
                'CCA':'P','CCG':'P',
                'ACT':'T','ACC':'T',
                'ACA':'T','ACG':'T',
                'GCT':'A','GCC':'A',
                'GCA':'A','GCG':'A',
                'TAT':'Y','TAC':'Y',
                'CAT':'H','CAC':'H',
                'CAA':'Q','CAG':'Q',
                'AAT':'N','AAC':'N',
                'AAA':'K','AAG':'K',
                'GAT':'D','GAC':'D',
                'GAA':'E','GAG':'E',
                'TGT':'C','TGC':'C',
                'TGG':'W','CGT':'R',
                'CGC':'R','CGA':'R',
                'CGG':'R','AGT':'S',
                'AGC':'S','AGA':'R',
                'AGG':'R','GGT':'G',
                'GGC':'G','GGA':'G',
                'GGG':'G'}


DNA_complement_dict={'A':'T',
                     'T':'A',
                     'G':'C',
                     'C':'G',
                     'N':'N'}

def load_file(files):
    codon_list=[]
    with open(files) as seq_result:
        for lines in seq_result:
            if lines.startswith('>'):
                continue
            remove_white_spaces=lines.strip().upper()
            for codon in remove_white_spaces:
                codon_list.append(codon)
    return codon_list

def rev(files):
    reverse_codon_list=[]
    codon_list=load_file(files)
    codon_list.reverse()
    for codons in codon_list:
        reversed_codon=DNA_complement_dict[codons]
        reverse_codon_list.append(reversed_codon)
    return reverse_codon_list

def codon_translation(global_codon_list):
    codon_counter=0
    codon_triple_list=[]
    open_reading_frame_lists=[[],[],[],]
    for i in range(3):
        open_reading_frame_count=1
        codon_triple_list.clear()
        codon_counter=0
        for codons in global_codon_list:
            if open_reading_frame_count>=(i+1):
                codon_counter+=1
                codon_triple_list.append(codons)
                if codon_counter == 3:
                    codon_counter=0
                    join_codons=''.join(codon_triple_list)
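                    # Stop codons are not in dna_codon_dict; mark them with 'X'.
                    # Codons containing an unread base (N) are skipped.
                    if join_codons in {'TAA','TAG','TGA'}:
                        open_reading_frame_lists[i].append('X')
                    elif join_codons in dna_codon_dict:
                        open_reading_frame_lists[i].append(dna_codon_dict[join_codons])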
                    codon_triple_list.clear()
            else:
                open_reading_frame_count+=1
    return open_reading_frame_lists

def find_open_reading_frames(global_codon_list):
    sequences_to_search=[]
    sequence_to_add_to_search_list=[]
    add_to_string=False
    for open_reading_frames in codon_translation(global_codon_list):
        for amino_acids in open_reading_frames:
            if amino_acids == 'M':
                add_to_string=True
            if add_to_string is True:
                sequence_to_add_to_search_list.append(amino_acids)
                if amino_acids == 'X':
                    add_to_string=False
                    if len(sequence_to_add_to_search_list)>0:
                        sequences_to_search.append(''.join(sequence_to_add_to_search_list))
                        sequence_to_add_to_search_list.clear()
                    else:
                        sequence_to_add_to_search_list.clear()
    return sequences_to_search

def forward_loop():
    files=sys.argv[2]
    forward_flag=False
    if sys.argv[1] == '-f':
        forward_flag=True
    if forward_flag is True:
        codon_list=load_file(files)
        return codon_list

def reverse_loop():
    # Returns the reverse-complemented bases, or None if no -r file was given.
    if sys.argv[1] == '-f':
        # Combined mode: -f forward_file -r reverse_file
        if len(sys.argv) > 4 and sys.argv[3] == '-r':
            return rev(sys.argv[4])
    elif sys.argv[1] == '-r':
        # Reverse-only mode: -r reverse_file
        return rev(sys.argv[2])



def overlay(sequence_list1,sequence_list2):
    new_list1=[word for line in sequence_list1 for word in line]
    new_list2=[word for line in sequence_list2 for word in line]
    temp_list=[]
    modified_list1=[]
    counter=0
    for x in new_list1:
        temp_list.append(x)
        modified_list1.append(x)
        counter+=1
        if counter >= 5:
            if temp_list == new_list2[0:5]:
                break
            else:
                temp_list.pop(0)

    del new_list2[0:5]
    return ''.join(modified_list1+new_list2)


sequence_list1=[]
sequence_list2=[]
global_codon_list=[]
def main_loop():
    global global_codon_list
    global sequence_list1
    global sequence_list2
    if sys.argv[1] == '-f':
        global_codon_list=forward_loop()
        sequences_to_search=find_open_reading_frames(global_codon_list)
        sequence_to_search=[]
        for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
            print(f'row {number} sequence: {sequence}')
            sequence_to_search.append(sequence)
        pick_sequence_to_search=input('indicate which row # sequence to search: ')
        sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])
        # Only prompt for a reverse sequence when -r and a file were also given.
        if len(sys.argv) > 4 and sys.argv[3] == '-r':
            global_codon_list=reverse_loop()
            sequences_to_search=find_open_reading_frames(global_codon_list)
            sequence_to_search=[]
            for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
                print(f'row {number} sequence: {sequence}')
                sequence_to_search.append(sequence)
            pick_sequence_to_search=input('indicate which row # sequence to search: ')
            sequence_list2.append(sequence_to_search[int(pick_sequence_to_search)])
    else:
        sequence_to_search=[]
        global_codon_list=reverse_loop()
        sequences_to_search=find_open_reading_frames(global_codon_list)
        for sequence,number in zip(sequences_to_search,range(len(sequences_to_search))):
            print(f'row {number} sequence: {sequence}')
            sequence_to_search.append(sequence)
        pick_sequence_to_search=input('indicate which row # sequence to search: ')
        sequence_list1.append(sequence_to_search[int(pick_sequence_to_search)])

main_loop()
driver = webdriver.Chrome()
driver.get('https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome')
fill_box = driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[3]/fieldset/div[1]/div[1]/textarea')
fill_box.clear()
fill_box.send_keys(overlay(sequence_list1,sequence_list2))
submit_button=driver.find_element_by_xpath('/html/body/div[2]/div/div[2]/form/div[6]/div/div[1]/div[1]/input')
submit_button.click()

#DNA forward 
>Delta_fl_pETDuet_1F

NNNNNNNNNNNNNNNNANTTAATACGACTCACTATAGGGGAATTGTGAGCGGATAACAATTCCCCTCTAGAAATAATTTT
GTTTAACTTTAAGAAGGAGATATACCATGGGCAGCAGCCATCACCATCATCACCACAGCCAGGATCCAATGATTCGGTTG
TACCCGGAACAACTCCGCGCGCAGCTCAATGAAGGGCTGCGCGCGGCGTATCTTTTACTTGGTAACGATCCTCTGTTATT
GCAGGAAAGCCAGGACGCTGTTCGTCAGGTAGCTGCGGCACAAGGATTCGAAGAACACCACACTTTTTCCATTGATCCCA
ACACTGACTGGAATGCGATCTTTTCGTTATGCCAGGCTATGAGTCTGTTTGCCAGTCGACAAACGCTATTGCTGTTGTTA
CCAGAAAACGGACCGAATGCGGCGATCAATGAGCAACTTCTCACACTCACCGGACTTCTGCATGACGACCTGCTGTTGAT
CGTCCGCGGTAATAAATTAAGCAAAGCGCAAGAAAATGCCGCCTGGTTTACTGCGCTTGCGAATCGCAGCGTGCAGGTGA
CCTGTCAGACACCGGAGCAGGCTCAGCTTCCCCGCTGGGTTGCTGCGCGCGCAAAACAGCTCAACTTAGAACTGGATGAC
GCGGCAAATCAGGTGCTCTGCTACTGTTATGAAGGTAACCTGCTGGCGCTGGCTCAGGCACTGGAGCGTTTATCGCTGCT
CTGGCCAGACGGCAAATTGACATTACCGCGCGTTGAACAGGCGGTGAATGATGCCGCGCATTTCACCCCTTTTCATTGGG
TTGATGCTTTGTTGATGGGAAAAAGTAAGCGCGCATTGCATATTCTTCAGCAACTGCGTCTGGAAGGCAGCGAACCGGTT
ATTTTGTTGCGCACATTAN

#DNA Reverse
>Delta_FL_pETDuet_R-T7-Term_B12.ab1
NNNNNNNNNNNNNAGCTGCGCTAGTAGACGAGTCCATGTGCTGGCGTTCAAATTTCGCAGCAGCGGTTTCTTTACCAGAC
TCGAGTTAACCGTCGATAAATACGTCCGCCAGGGGTTTATGGCACAACAGAAGAGATAACCCTTCCAGCTCTGCCCACAC
TGACTGACCGTAATCTTGTTTGAGGGTGAGTTCCGTTCGTGTCAGGAGTTGCACGGCCTGACGTAACTGCGTCTGACTTA
AGCGATTTAACGCCTCGCCCATCATGCCCCGGCGGTTCTGCCATACCCGATGCTTATCAAACAACGCACGCAGTGGCGTA
TGGGCAGACTGGCGTTTCAGGTTAACCAGTAACAACAGTTCACGTTGTAATGTGCGCAACAAAATAACCGGTTCGCTGCC
TTCCAGACGCAGTTGCTGAAGAATATGCAATGCGCGCTTACTTTTTCCCATCAACAAAGCATCAACCCAATGAAAAGGGG
TGAAATGCGCGGCATCATTCACCGCCTGTTCAACGCGCGGTAATGTCAATTTGCCGTCTGGCCAGAGCAGCGATAAACGC
TCCAGTGCCTGAGCCAGCGCCAGCAGGTTACCTTCATAACAGTAGCAGAGCACCTGATTTGCCGCGTCATCCAGTTCTAA
GTTGAGCTGTTTTGCGCGCGCAGCAACCCAGCGGGGAAGCTGAGCCTGCTCCGGTGTCTGACAGGTCACCTGCACGCTGC
GATTCGCAAGCGCAGTAAACCACGCGGCATTTTCTTGCGCTTTGCTTAATTTATTACCGCGGACGATCAACAGCNNNCGT
CATGCAGAAGTCCGGTGAGTGTGAGAAGTTGCTCATNGATCGCCCGCATTCGGNCCGTTTTCTGGTANCANCAGNNATAC
CGTTTGTCGANTGGCAAACANACN

python
