13 votes

Filter out ambiguous bases from a DNA sequence

Perhaps there aren't enough test cases here, because I don't see why you can't just use: ...
Kraigolas
  • 927
11 votes
Accepted

Implementing a DNA codon table in C

An array of char* is not a great data structure for storing AA or codon sequence data. A pointer takes 4 or 8 bytes per codon, but there are only 64 possible ...
Peter Cordes
  • 3,761
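
The snippet above is cut off, so the answer's actual recommendation is not visible here. As a minimal sketch of the underlying idea (only 64 codons exist, so each one fits in a small integer rather than a pointer), the following Python packs a codon into an index from 0 to 63 using two bits per base. The function names are hypothetical; the lookup string is the standard genetic code in TCAG order.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
# One character per codon: 64 entries in total, '*' marks a stop codon.
CODON_TABLE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASE_CODE = {"T": 0, "C": 1, "A": 2, "G": 3}  # two bits per base


def codon_index(codon):
    """Pack a three-letter codon into an integer in the range 0..63."""
    first, second, third = (BASE_CODE[base] for base in codon)
    return (first << 4) | (second << 2) | third


def translate(dna):
    """Translate a DNA string whose length is a multiple of three."""
    return "".join(
        CODON_TABLE[codon_index(dna[i:i + 3])]
        for i in range(0, len(dna) - len(dna) % 3, 3)
    )
```

With this layout a whole codon table is a single 64-character string, and a codon sequence could be stored as one byte per codon.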
11 votes
Accepted

Counting relevant entries in a large bioinformatics file

To speed this up, you'll need to avoid as many string creation operations as possible, because they are expensive. The split operation is especially expensive. Not only does this create many ...
RoToRa
  • 11.6k
10 votes

Counting relevant entries in a large bioinformatics file

If you are after raw performance, try to avoid repeating potentially cost-intensive operations. In this case, you split the lines twice with the same parameter, which repeatedly applies a regular ...
mtj
  • 5,002
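
The advice above (split each line only once, and avoid re-applying the same pattern) can be sketched as follows. This is an illustration in Python rather than the code from the answer; the function name, the tab-separated layout, and the `type_column` parameter are assumptions made for the example.

```python
import re

# Compiled once and reused, instead of re-parsing the pattern for every line.
TAB = re.compile(r"\t")


def count_matching(lines, wanted_type, type_column=2):
    """Count lines whose field at `type_column` equals `wanted_type`.

    The column position is a hypothetical layout chosen for this sketch.
    """
    count = 0
    for line in lines:
        fields = TAB.split(line.rstrip("\n"))  # split exactly once per line
        if len(fields) > type_column and fields[type_column] == wanted_type:
            count += 1
    return count
```

The split result is kept in `fields` and reused for every check on that line, so no line is tokenized twice.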
10 votes
Accepted

Counting the occurrences of certain amino-acids in a file

Add functions: Everything is currently in the global namespace; you can add a few functions to split up the code and make it more readable. You could use a ...
Ludisposed
  • 11.8k
9 votes

Translate nucleic acid sequence into its corresponding amino acid sequence

This covers an interesting topic. Great work! Because I am unfamiliar with this area, I used your unit tests to ensure the changes I made did not break functionality. If they do, I apologize and ...
esote
  • 3,800
9 votes
Accepted

Simple mutation simulation for use in science class

Naming: As already mentioned, you should follow the PEP 8 style guide. But simply converting the leading character of a name to lower case might not be what you want to do in all cases. In function <...
Booboo
  • 3,666
8 votes

FASTA-to-tsv conversion script

Welcome to Code Review! I'll add to the other answer from Reinderien. PEP-8: In Python, it is common (and recommended) to follow the PEP-8 style guide for writing clean, maintainable and consistent ...
hjpotter92
  • 8,921
8 votes

Filter out ambiguous bases from a DNA sequence

There's an inconsistency between the two functions. check_and_clean_sequence() has an alphabet parameter, but this isn't used ...
Toby Speight
  • 88.4k
7 votes

Implementing a DNA codon table in C

The line (*aminoacid_string) = malloc(aminoacid_count); allocates aminoacid_count bytes. The code needs that many ...
vnp
  • 58.7k
7 votes

Implementing a DNA codon table in C

You should try to refactor the code to have fewer hard-coded constants. E.g. nucleobase_to_aminoacid has both tcag and codon_table hard-coded. In general, that hinders re-use. You ...
Hans Olsson
7 votes
Accepted

Highly nested bioinformatics processing

Non-code / very-high-level considerations. The first thing you can do to speed up the performance would be to get a better computer. My computer is several years old, but it runs the unmodified code ...
Peter Taylor
  • 24.5k
7 votes

FASTA-to-tsv conversion script

Path? Surely path is not a single path, since you loop through it. So at the least, this is poorly-named and should be paths. ...
Reinderien
  • 71.1k
7 votes
Accepted

Filter out ambiguous bases from a DNA sequence

From the bioinformatics side, not the Python one: Your return value will not be useful for further processing whenever an ambiguous base has been present, because it changes index locations! You'll want to ...
Bennie
  • 186
7 votes

Simple mutation simulation for use in science class

The code adheres to many good coding practices already, and it should be simple for beginners to follow. Here are some minor suggestions. Documentation: The PEP 8 style guide recommends adding ...
toolic
  • 15.9k
6 votes
Accepted

Genetic Sequence Visualizer - Generating large images

Parser: Your parser has a bug in line 62: raw = ''.join([n for n in file.readlines() if not n.startswith('>')]).replace('\n', "").lower() will ...
FirefoxMetzger
6 votes

Counting relevant entries in a large bioinformatics file

Possible bug: ...
Imus
  • 4,387
6 votes

Implementing a DNA codon table in C

In addition to current (and future) answers: You use malloc(), but from what I can see, you do not free() it later on. In my ...
esote
  • 3,800
6 votes
Accepted

Filtering FASTQ file based on read names from other file (how to increase performance) Python

Do not reinvent the wheel. There are bioinformatics tools that accomplish this task. To extract reads from fastq files by IDs, use seqtk subseq. Extract sequences ...
Timur Shtatland
6 votes
Accepted

Rust program to one hot encode genetic sequences from .fa files

Your programs aren't quite equivalent; one looks at whether a line starts with >, the other looks for chr in each line. That ...
AKX
  • 241
6 votes

Simple mutation simulation for use in science class

I think the most confusing line for a beginner is: Mutate_mat = list(map(list, zip(*Mutate_mat))) I would personally remove the map for beginners in a course that ...
user286929
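
The line quoted in the snippet above transposes a list of lists: zip(*rows) pairs up the elements column by column, and map(list, ...) turns each resulting tuple back into a list. The answer's own replacement is cut off here, so the following is only one possible map-free spelling, using a list comprehension; the helper name `transpose` is made up for the example.

```python
def transpose(matrix):
    """Return the transpose of a rectangular list of lists."""
    # zip(*matrix) yields one tuple per column; convert each tuple to a list.
    return [list(column) for column in zip(*matrix)]


example = [[1, 2, 3],
           [4, 5, 6]]

# Both spellings produce the same result.
via_map = list(map(list, zip(*example)))
via_comprehension = transpose(example)
```

For beginners, the comprehension keeps the "for each column, make a list" step visible instead of hiding it inside map.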
5 votes
Accepted

Genomic Range Query in Python

I'm not sure that I trust Codility's detected time complexity. As far as I know, it's not possible to programmatically calculate time complexity, but it is possible to plot out a performance curve ...
mochi
  • 1,144
5 votes
Accepted

DNA reverse complement as fast as possible

The main operation is substitution via a small table, which is also what _mm_shuffle_epi8 does. The low 4 bits of the indexes clash though, and I could not find an ...
user555045
  • 12.4k
5 votes

Mapping DNA nucleotides into two-dimensional coordinates

First of all I'd like to say that your code is fast. From studying it, the major bottlenecks that I've found come from the input being a string, and from converting to numpy arrays and back. Since ...
maxb
  • 1,582
5 votes
Accepted

Lazy Loading a Bioinformatic SAM record

Performance: There is one thing that I believe could increase the performance of your application. You often call findElement, which goes through the SAM record ...
IEatBagels
  • 12.7k
5 votes
Accepted

Run an external program and extract a pattern match along with the result file

Not looking bad as far as I can see. If the example file is accurate for the lengths of the input files, then I don't foresee any real problems, though others may of course disagree. Naming: <...
Gloweye
  • 1,746
5 votes
Accepted

GenBank to FASTA format using regular expressions without Biopython

Line iteration ...
Reinderien
  • 71.1k
5 votes
Accepted

FASTA-to-tsv conversion script

In addition to the points raised in the other answers: Extraneous import: The first line is import sys, but I don't see sys used ...
Mike
  • 166
5 votes
Accepted

Counting the number of k-mers like monomers, dimers to hexamers from the fasta file

The code can be simplified quite a bit. Using itertools.product, the code like this: ...
RootTwo
  • 10.7k
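
The code from the answer above is elided in this listing. As a rough sketch of how itertools.product can drive k-mer counting, the following enumerates every possible k-mer over an alphabet and counts overlapping occurrences in a sequence. The function name and the default alphabet are assumptions for the example, not taken from the answer.

```python
from itertools import product


def count_kmers(sequence, k, alphabet="ACGT"):
    """Count overlapping occurrences of every possible k-mer over `alphabet`."""
    # product(alphabet, repeat=k) yields every k-length tuple of letters,
    # so each possible k-mer starts with a count of zero.
    counts = {"".join(letters): 0 for letters in product(alphabet, repeat=k)}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        if kmer in counts:  # windows containing other symbols (e.g. N) are skipped
            counts[kmer] += 1
    return counts


# Monomers through hexamers for one sequence.
all_counts = {k: count_kmers("ACGTACGT", k) for k in range(1, 7)}
```

Pre-filling the dictionary means k-mers that never occur still appear in the output with a count of zero, which is convenient when writing a fixed-width table.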

Only top scored, non community-wiki answers of a minimum length are eligible