Optimizing getitem for a bioinformatics script in Python

Question

I'm writing a script for a bioinformatics application in Python that iterates through several sequences looking for amino acids in specific positions to calculate their frequency in relation to a genomic database. However, I'm new to the area and I've been learning while developing the project, so a lot of what I've done is poorly optimized. For a task that I consider relatively simple, the script takes half a minute for files and sometimes much longer for larger files

Can anyone suggest how I might try to get around this problem? Apologies if it's trivial, I'm a biologist just starting to get to grips with programming

def main(self):
    # Process input data and extract necessary information
    mutation_count, mutation_subjects, mutation_positions, alignment_file, seq_ID = self.process_input()

    # Initialize a dictionary to count amino acids at each mutation position
    aa_counts = {pos: {} for pos in mutation_positions}

    # Read the alignment file using Bio.AlignIO
    alignment = AlignIO.read(alignment_file, 'clustal')
    
    # Find the reference sequence in the alignment
    ref_seq = None
    for record in alignment:
        if record.id == seq_ID:
            ref_seq = record
            break

    # Convert sequences in the alignment to strings for easier processing
    ref_seq_str = str(ref_seq.seq)
    alignment_str = {record.id: str(record.seq) for record in alignment}

    # Count amino acids at each position in the alignment
    for seq in alignment_str.values():
        pos_in_ref_seq = 0
        for pos, aa in enumerate(seq):
            if ref_seq_str[pos] != '-':
                pos_in_ref_seq += 1

            if pos_in_ref_seq in aa_counts:
                aa_counts[pos_in_ref_seq][aa] = aa_counts[pos_in_ref_seq].get(aa, 0) + 1
                    
    # If a specific position is provided, calculate and print the amino acid frequencies at that position
    if self.args.position:
        position = self.args.position
        total_count = sum(aa_counts[position].values())
        for aa, count in aa_counts[position].items():
            freq = (count / total_count * 100) if total_count > 0 else 0
            print(f"Amino Acid: {aa} | Frequency: {freq:.2f}% | Count: {count}")
        return
                
    # Analyze mutations and calculate frequencies
    mutation_info = {}
    for mutation, count in sorted(mutation_count.items(), key=lambda x: int(x[0][1:-1])):
        seq_pos = int(mutation[1:-1])
        query_aa = mutation[0]
        subject_aa = mutation[-1]
        
        if query_aa == '-' or subject_aa == '-':
            continue
        
        real_count = aa_counts[seq_pos].get(subject_aa, 0)
        total_count = sum(aa_counts[seq_pos].values())
        query_aa_frequency = (aa_counts[seq_pos].get(query_aa, 0) / total_count * 100) if total_count > 0 else 0
        subject_aa_frequency = (aa_counts[seq_pos].get(subject_aa, 0) / total_count * 100) if total_count > 0 else 0
        
        if subject_aa_frequency <= 10 and real_count > 2:
            mutation_info[mutation] = {
                'query_aa_frequency': query_aa_frequency,
                'subject_aa_frequency': subject_aa_frequency,
                'real_count': real_count,
            }
    
    # Identify strains with specific mutations
    strains_with_mutations = {}
    for mutation in mutation_count:
        query_aa, pos, subject_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
        strains_with_this_mutation = []

        for record in alignment:
            strain_name = record.id[:3] 
            sequence = str(record.seq)
            
            pos_in_ref_seq = 0
            
            for i, aa in enumerate(sequence):
                if ref_seq.seq[i] != '-':
                    pos_in_ref_seq += 1
                
                if pos_in_ref_seq == pos:
                    
                    if aa == subject_aa:
                        strains_with_this_mutation.append(strain_name[:3])
                    break  
        strains_with_mutations[mutation] = strains_with_this_mutation
            
    # Write mutation information to a text file
    with open('gapslist.txt', 'w') as f:
        for mutation, info in mutation_info.items():
            f.write(f"Mutation: {mutation} | {mutation[0]} Frequency: {info['query_aa_frequency']:.2f}% | {mutation[-1]} Frequency: {info['subject_aa_frequency']:.2f}% | Count: {info['real_count']}\n\n")

    # Write mutation and strain information to a CSV file
    with open('rare_mutations.csv', 'w', newline='') as csvfile:
        fieldnames = ['Mutation', 'Strains']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        
        for mutation, info in mutation_info.items():
            writer.writerow({
                'Mutation': mutation,
                'Strains': ';'.join(strains_with_mutations.get(mutation, []))
            })

Profiling the code, I realized that the biggest bottleneck is the number of times the getitem function of the Seq class in Bio.Seq is being called:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)

 69006091  25.002  0.000     99.135   0.000   Seq.py:470(__getitem__)

 69098925  6.311   0.000     6.311    0.000   SeqRecord.py:334(<lambda>).

 69008225  10.165  0.000     49.965   0.000   <frozen abc>:117(__instancecheck__)

 69006091  10.330  0.000     22.310   0.000   <frozen abc>:121(__subclasscheck__)

I've been looking into ways of dealing with this but nothing seems to be working, or it's only marginally reducing execution time.

J_H · Accepted Answer · 2023-11-28 21:18:17Z

new to the area and I've been learning while developing

Great, welcome! Software is almost as interesting as biology. And it's much less messy.

measurements

much longer for larger files

“When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.” --Lord Kelvin

You're saying that "app-level throughput is low". That would be a more useful metric if it was in terms of "rows per second" or similar units that are meaningful to the application.

I am delighted to see that you found python -m cProfile! Good job, keep using that. But follow the advice below to make it more useful to you. I see a measurement of 99 seconds getting sequence items, but that happens in more than one place so I don't really know where your app-level hot spot is. Do pay attention to the 25 s versus 99 s on total versus cumulative. That suggests that three quarters of the time spent de-referencing x[i] is consumed by still lower layers, which surprises me slightly.

Here's an idiom that will help you narrow down app-level bottlenecks.

from time import time
...
    t0 = time()
    do_some_time_consuming_stuff()
    print("elapsed seconds:", time() - t0)

Record such timings, author refactored functions, and do a bake-off to measure whether the refactoring is much of a win.

When authors introduce the concept of a python @decorator, they often will start with an elapsed time example. You may find that helpful, as well.

helper functions

You don't have any.

Now, main() is a good looking function, with many helpful identifiers. But it is entirely too long. In particular, we must scroll vertically if we are to visually take it all in. If there's one software engineering skill you should learn this year, it is how to Extract Helper functions. Fortunately it's not hard, it just takes practice.

One reason for breaking out helpers is coupling of variables. All of your local variables are essentially global. Having many such variables makes it harder to correctly reason about the code, since so many variables and side effects can affect how the current line of code behaves.

Your code is nicely organized into distinct processing chunks, and you wrote many helpful comments.

    # Find the reference sequence in the alignment
    ...

    # Count amino acids at each position in the alignment
    ...
                    
    # If a specific position is provided, calculate and print the amino acid frequencies at that position
    ...
                
    # Analyze mutations and calculate frequencies
    ...
    
    # Identify strains with specific mutations
    ...
            
    # Write mutation information to a text file
    ...

    # Write mutation and strain information to a CSV file
    ...

Every single one of those comments is telling me: "Here comes a new function!" Listen to the code. Break them out as small functions. The English sentence you wrote is practically giving you the function name, e.g. def write_mutation_and_strain_to_csv( ... ):. It's not always necessary, but feel free to copy-paste those very nice sentences into a """docstring"" for each helper function.

review context

You stripped out the import statements, so it's hard to see exactly what your deps are. But it appears for example that at least one of them is the AlignIO library.

In a way it doesn't matter too much, since we don't have a reproducible example. If I suggest a change, it's not like I can run and re-run the modified code to measure size-of-effect.

Consider offering more context if you post any subsequent questions about refactored code.

test suite

Writing automated unit tests is a good way to instill confidence in your code's correctness and to notice any regressions introduced by refactoring. Some would say that it "gives us the courage to refactor" safely.

In your requirements document you wrote down more than one thing which each chunk of code must do in order to be Correct. For example, it must not only find proper strains or write a file, but it must do it fast. So a unit test might simply verify that a file was processed in less than a minute. And then that Green bar turns Red if a refactor introduces surprisingly large delays.

big-Oh complexity

    for mutation in mutation_count:
        ...
        for record in alignment:
            ...            
            for i, aa in enumerate(sequence):

I don't know why your code is slow -- it is large and complex. But I notice nested loops in several places. We're looping over mutation_count, alignment, and sequence. Call their lengths m, a, s. If this chunk was broken out as a helper function, you might note in the """docstring""" or in a # comment that time complexity is O(m × a × s). If you know the rough magnitudes of those inputs, that can help guide you toward places where algorithm improvements are needed.

@greybeard, thank you for helping me to sharpen my quill, which I plucked from the copy-pasta monster and dipped once again into the ink well. I remember being dissatisfied with where the commas were placed as I read it, which I think made me not even re-read it to the end. I will cite my reference; revised text comes from an improved source. — J_H
– J_H, Commented Nov 28, 2023 at 21:24
Lord Kelvin 1824–1907 British scientist When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be. often quoted as ‘If you cannot measure it, then it is not science’ Popular Lectures and Addresses vol. 1 (1889) ‘Electrical Units of Measurement’, delivered 3 May 1883 — pippo1980
– pippo1980, Commented Nov 29, 2023 at 16:38

Stack Exchange Network

Optimizing getitem for a bioinformatics script in Python

1 Answer 1

measurements

helper functions

review context

test suite

big-Oh complexity

You must log in to answer this question.

Hot Network Questions

Optimizing __getitem__ for a bioinformatics script in Python

1 Answer 1

measurements

helper functions

review context

test suite

big-Oh complexity

You must log in to answer this question.

Related

Hot Network Questions

Optimizing getitem for a bioinformatics script in Python