I've been giving my grade tens an introduction to computational phylogenomics, based around computing dissimilarity between DNA strands, and a simplified model of mutation. They've been exploring how mutations introduce variation, and how dissimilarity with the progenitor increases with the number of generations. We're working towards and understanding of a molecular clock.
So far they've developed "algorithms" of their own and been doing this by hand, but next I want to give them a tool to run multiple simulations, so they can do some statistics so they can arrive at conclusions with more confidence. Rather than just giving them a black box that will magically produce results, I want them to be able to relate the computer algorithms with the algorithms they've been using, and thereby having a better understanding of how it works. Most of the students have little experience with actual programming, so I want to keep the code as simple and intuitive as possible, with few abstractions.
For example, I'm not too happy about the use of zip() and how the lists are "transposed", but that might be a necessary evil of not having proper arrays. If the code can be made less efficient and more verbose, in order to make it easier for a novice to read and understand, I want that. I've also steered away from using modules that aren't in the standard Python 3.x distribution, as I don't want to deal with the extra hassle, otherwise I probably would have used numpy. The plan is for them to simply copy/paste the code into the Python interpreter in Terminal (MacOS).
import random
import os
import csv
# Function that counts number of dissimilarities
# between two sequences
def diss(x, y):
n = 0
for a, b in zip(x, y):
if a != b:
n += 1
return (n)
# Function that mutates a string (Dna) Gen times.
# On average 0.75 mutations per generation, as the resampling will
# pick the existing nucleotide in one out of four times.
def mutate(Dna, Gen):
# Store a copy of the original DNA sequence that
# the mutated sequence can be compared to.
Dna_o = Dna[:]
Len = len(Dna)
Dis = []
# For loop that mutates the sequence Gen times and calculates the
# dissimilarity after each mutation.
for _ in range(Gen):
# Random location for mutation
Mut = random.randint(0, Len-1)
# Randomly selected nucleotide
Dna[Mut] = random.choice(['A', 'T', 'G', 'C'])
# Calculate proportional dissimalirity
Dis.append(diss(Dna, Dna_o)/Len)
# Print proportional dissimilarities.
return (Dis)
# Iterate (repeat) the mutate function i times
# Outputs a nested list
def mutate_i(Dna, Gen, i):
Mat = []
for _ in range(i):
Mat.append(mutate(Dna, Gen))
return (Mat)
# Function for helping export simulation result as CSV file.
def export_mutate(Mutate_mat, Filename="mutation_simulation.csv"):
# Add a count of the generations
Mutate_mat.insert(0, list(range(1, len(Mutate_mat[0])+1)))
# Transpose the list (swap rows and columns)
Mutate_mat = list(map(list, zip(*Mutate_mat)))
# Add headers
header = ["Generation"]
for i in range(1, len(Mutate_mat[0])):
header.append("sim_" + str(i))
Mutate_mat.insert(0, header)
# Export simulation data as a CSV file that can be imported to
# Google Sheets
with open(Filename, 'w') as csvfile:
csvwriter = csv.writer(csvfile, delimiter=',')
csvwriter.writerows(Mutate_mat)
# Print the path to the CSV file
return (os.path.join(os.getcwd(), Filename))
# # # Simulation start
# Original DNA sequence
Dna = list("ATGC" * 4)
# Set random seed so the simulation is repeatable
random.seed(1)
# Generate four simulations, each ten generations long
Mutate_mat = mutate_i(Dna, 12, 4)
# Export simulation results
export_mutate(Mutate_mat)
Produces this output:
Generation,sim_1,sim_2,sim_3,sim_4
1,0.0,0.0625,0.0625,0.0625
2,0.0,0.125,0.125,0.125
3,0.0,0.125,0.125,0.125
4,0.0,0.1875,0.125,0.1875
5,0.0625,0.25,0.1875,0.25
6,0.125,0.25,0.25,0.3125
7,0.1875,0.3125,0.3125,0.3125
8,0.25,0.3125,0.375,0.3125
9,0.3125,0.375,0.375,0.3125
10,0.375,0.375,0.375,0.375
11,0.3125,0.3125,0.4375,0.375
12,0.3125,0.3125,0.5,0.375