13
votes
Filter out ambiguous bases from a DNA sequence
Perhaps there aren't enough test cases here, because I don't see why you can't just use:
...
11
votes
Accepted
Implementing a DNA codon table in C
An array of char* is not a great data structure for storing AA or codon sequence data. A pointer takes 4 or 8 bytes per codon, but there are only 64 possible ...
11
votes
Accepted
Counting relevant entries in a large bioinformatics file
To speed this up, you'll need to avoid as many string creation operations as possible, because they are expensive. The split operation is especially expensive. Not only does this create many ...
10
votes
Counting relevant entries in a large bioinformatics file
If you are in for raw performance, try to avoid repeating potentially cost-intensive operations.
In this case, you split the lines twice with the same parameter, which repeatedly applies a regular ...
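The advice above (split each line once and reuse the result) can be sketched as follows. This is a minimal Python illustration, not the code from the question: the original language and file layout are not shown in this excerpt, so the tab delimiter, the column indices, and the function name count_matching are all assumptions.

```python
def count_matching(lines, wanted, key_col=0, value_col=2, sep="\t"):
    """Count lines whose key column equals `wanted` and whose value column is non-empty.

    The column layout is hypothetical; adjust key_col/value_col to the real file.
    """
    count = 0
    for line in lines:
        # Split exactly once per line and reuse the resulting list,
        # instead of calling split() again for every column lookup.
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= max(key_col, value_col):
            continue  # skip short or malformed lines
        if fields[key_col] == wanted and fields[value_col]:
            count += 1
    return count
```

Passing an open file object as lines keeps memory use flat, because the file is then read one line at a time.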
10
votes
Accepted
Counting the occurrences of certain amino-acids in a file
Add functions
Everything is currently in the global namespace; you can add a few functions to split up the code and make it more readable.
You could use a ...
9
votes
Translate nucleic acid sequence into its corresponding amino acid sequence
This covers an interesting topic. Great work!
Because I am unfamiliar with this area, I used your unit tests to ensure that the changes I made did not break functionality. If they do, I apologize and ...
9
votes
Accepted
Simple mutation simulation for use in science class
Naming
As already mentioned, you should follow the PEP 8 style guide. But simply converting the leading character of a name to lower case might not be what you want to do in all cases. In function <...
8
votes
FASTA-to-tsv conversion script
Welcome to Code Review!
I'll add to the other answer from Reinderien.
PEP-8
In Python, it is common (and recommended) to follow the PEP-8 style guide for writing clean, maintainable and consistent ...
8
votes
Filter out ambiguous bases from a DNA sequence
There's an inconsistency between the two functions. check_and_clean_sequence() has an alphabet parameter, but this isn't used ...
7
votes
Implementing a DNA codon table in C
The line
(*aminoacid_string) = malloc(aminoacid_count);
allocates aminoacid_count bytes. The code needs that many ...
7
votes
Implementing a DNA codon table in C
You should try to refactor the code to have fewer hard-coded constants. E.g. nucleobase_to_aminoacid has both tcag and codon_table hard-coded.
That is in general something that hinders re-use.
You ...
7
votes
Accepted
Highly nested bioinformatics processing
Non-code / very-high-level considerations.
The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code ...
7
votes
FASTA-to-tsv conversion script
Path?
Surely path is not a single path, since you loop through it. So at the least, this is poorly-named and should be paths. ...
7
votes
Accepted
Filter out ambiguous bases from a DNA sequence
From the bioinformatics side, not the Python one: your return value will not be useful for further processing whenever an ambiguous base has been present, because it changes index locations! You'll want to ...
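The index-shift problem described above can be seen in a small sketch. The answer's own recommendation is truncated here and is not reproduced; masking with N is shown only as one common coordinate-preserving alternative, and the names drop_ambiguous and mask_ambiguous are hypothetical.

```python
UNAMBIGUOUS = set("ACGT")

def drop_ambiguous(seq):
    """Remove every base outside A/C/G/T (this shifts downstream positions)."""
    return "".join(base for base in seq if base in UNAMBIGUOUS)

def mask_ambiguous(seq, mask="N"):
    """Replace every base outside A/C/G/T with `mask` (positions stay put)."""
    return "".join(base if base in UNAMBIGUOUS else mask for base in seq)

seq = "ACRGT"                   # R is an IUPAC ambiguity code (A or G)
dropped = drop_ambiguous(seq)   # same bases, but one character shorter
masked = mask_ambiguous(seq)    # same length as the input

# The G sits at index 3 in the input; dropping moves it, masking does not.
print(seq.index("G"), dropped.index("G"), masked.index("G"))
```

Any annotation that refers to positions in the original sequence (variant calls, feature coordinates) stays valid against the masked string but not against the dropped one.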
7
votes
Simple mutation simulation for use in science class
The code adheres to many good coding practices already, and it should
be simple for beginners to follow. Here are some minor suggestions.
Documentation
The PEP 8 style guide recommends
adding ...
6
votes
Accepted
Genetic Sequence Visualizer - Generating large images
Parser
Your parser has a bug in line 62:
raw = ''.join([n for n in file.readlines() if not n.startswith('>')]).replace('\n', "").lower()
will ...
6
votes
Implementing a DNA codon table in C
In addition to current (and future) answers:
You use malloc(), but from what I can see, you do not free() it later on. In my ...
6
votes
Accepted
Filtering FASTQ file based on read names from other file (how to increase performance) Python
Do not reinvent the wheel. There are bioinformatics tools that accomplish this task.
To extract reads from fastq files by IDs, use seqtk subseq.
Extract sequences ...
6
votes
Accepted
Rust program to one hot encode genetic sequences from .fa files
Your programs aren't quite equivalent; one looks at whether a line starts with >, the other looks for chr in each line. That ...
6
votes
Simple mutation simulation for use in science class
I think the most confusing line for a beginner is:
Mutate_mat = list(map(list, zip(*Mutate_mat)))
I would personally remove the map for beginners in a course that ...
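For readers unsure what that line does: zip(*matrix) pairs up the i-th elements of every row, so the expression transposes a list of lists. A map-free version might look like the sketch below; the function name transpose and the explicit loops are illustrative choices, not the answer's actual rewrite, which is truncated here.

```python
def transpose(matrix):
    """Return the transpose of a rectangular list of lists using explicit loops."""
    if not matrix:
        return []
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    result = []
    for col in range(n_cols):
        new_row = []
        for row in range(n_rows):
            new_row.append(matrix[row][col])  # column `col` becomes a row
        result.append(new_row)
    return result

# For rectangular input this matches the one-liner from the question.
grid = [[1, 2, 3], [4, 5, 6]]
assert transpose(grid) == list(map(list, zip(*grid)))
```

The loop version is longer, but each step can be traced by hand, which may suit a classroom setting better than argument unpacking.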
5
votes
Accepted
Genomic Range Query in Python
I'm not sure that I trust Codility's detected time complexity. As far as I know, it's not possible to programmatically calculate time complexity, but it is possible to plot out a performance curve ...
5
votes
Accepted
DNA reverse complement as fast as possible
The main operation is substitution via a small table, which is also what _mm_shuffle_epi8 does. The low 4 bits of the indexes clash though, and I could not find an ...
5
votes
Mapping DNA nucleotides into two-dimensional coordinates
First of all I'd like to say that your code is fast. From studying it, the major bottlenecks that I've found come from the input being a string, and from converting to numpy arrays and back.
Since ...
5
votes
Accepted
Lazy Loading a Bioinformatic SAM record
Performance
There is one thing that I believe could increase the performance of your application.
You often call findElement, which goes through the SAM record ...
5
votes
Accepted
Run an external program and extract a pattern match along with the result file
This doesn't look bad as far as I can see. If the example file is representative of the lengths of the input files, then I don't foresee any real problems, though others may of course disagree.
Naming:
<...
5
votes
Accepted
FASTA-to-tsv conversion script
In addition to the points raised in the other answers:
Extraneous import
The first line is import sys, but I don't see sys used ...
5
votes
Accepted
Counting the number of k-mers like monomers, dimers to hexamers from the fasta file
The code can be simplified quite a bit.
Using itertools.product, the code like this:
...
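The answer's code is elided above, so the following is only a sketch of how itertools.product is commonly used for this task, not a reconstruction of it. The function names all_kmers and count_kmers, the ACGT alphabet, and the choice to count overlapping windows are assumptions.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Enumerate every possible k-mer over `alphabet`, in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

def count_kmers(seq, k, alphabet="ACGT"):
    """Count overlapping occurrences of each k-mer in `seq`.

    Windows containing characters outside `alphabet` (e.g. N) are ignored.
    """
    counts = dict.fromkeys(all_kmers(k, alphabet), 0)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts
```

Calling count_kmers(seq, k) for k from 1 to 6 would cover monomers through hexamers. Pre-filling the dictionary means that k-mers absent from the sequence still appear with a count of zero.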