
I have a large CSV file of DNA sequences (strings of characters), like this small example (infile.csv):

2840,GTGGCCCGGGAGGCC
291,GCATGTCCGTAGGTTCGT
147,GCATGTCCG

I need to translate each DNA sequence into a peptide sequence (using the function below) and add it as a third column. Here is the expected output:

2840,GTGGCCCGGGAGGCC,VAREA
291,GCATGTCCGTAGGTTCGT,ACP*VR
147,GCATGTCCG,ACP

To do so, I wrote the following short script:

import pandas
df = pandas.read_csv('infile.csv')
seq = csv_data[1]

def translate(seq):
    table = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
    }
    protein =""
    if len(seq)%3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            protein+= table[codon]
    return protein


peptide_seq=translate(seq)
df[peptide_seq]
df.to_csv("outfile.csv")

But it does not produce the expected output. Do you know how I can change the code to get it?

  • What is the script returning? Commented Sep 28, 2021 at 17:38
  • Please include a minimal reproducible example, which includes possible input data, the results and the expected results. Commented Sep 28, 2021 at 18:17

1 Answer

import pandas

def translate(seq):
    table = {
        'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M',
        'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T',
        'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K',
        'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R',
        'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L',
        'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P',
        'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q',
        'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R',
        'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V',
        'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A',
        'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E',
        'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G',
        'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S',
        'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L',
        'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*',
        'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W',
    }
    protein = ""
    if len(seq) % 3 == 0:
        for i in range(0, len(seq), 3):
            codon = seq[i:i + 3]
            protein += table[codon]
    return protein

# Read the CSV without a header row, so the columns are named 0 and 1,
# and use the first column (the IDs) as the index of df.
df = pandas.read_csv('infile.csv', header=None, index_col=0)
# Take the second column as a Series and apply translate() to each element,
# one sequence at a time; the resulting Series becomes the new column 'peptide_seq'.
df['peptide_seq'] = df[1].apply(translate)
# Save df without a header row to match the target output.
df.to_csv('outfile.csv', header=False)

Output:

2840,GTGGCCCGGGAGGCC,VAREA
291,GCATGTCCGTAGGTTCGT,ACP*VR
147,GCATGTCCG,ACP
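
Since the question mentions a big file, here is a minimal sketch of the same idea without pandas, streaming one row at a time with the standard csv module so the whole file never has to sit in memory. The names CODON_TABLE and add_peptide_column are my own and not from the original code, and the codon table is built compactly from the standard genetic code, which should be equivalent to the 64-entry dictionary above. It assumes the sequence is always in the second column and keeps the original behaviour of returning an empty peptide when the length is not a multiple of 3.

```python
import csv
from itertools import product

# Standard genetic code built compactly (assumption: equivalent to the
# 64-entry dictionary used in translate() above).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA string; return "" if its length is not a multiple of 3."""
    if len(seq) % 3 != 0:
        return ""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def add_peptide_column(in_path, out_path):
    """Copy rows from in_path to out_path, appending the peptide as a new last column."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:  # skip blank lines
                continue
            # Assumption: the DNA sequence is in the second column (index 1).
            writer.writerow(row + [translate(row[1])])
```

Calling add_peptide_column('infile.csv', 'outfile.csv') should write the same three-column rows as the pandas version. The pandas approach is shorter; this one only trades convenience for lower memory use on very large inputs.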

1 Comment

Your answer would be so much better if you actually explain what you're doing differently and why, instead of just "this will work".
