I have been attempting to learn Python for the last few months, and one thing that has helped me is dabbling in random projects. To that end, I wrote a very rudimentary script for simulating the evolution of a DNA sequence (click on Summary) to show how descent with modification produces a nested hierarchy.
Reminder: I’m just a beginner, so forgive any poor scripting on my part (already made two edits within 20 minutes of posting, apologies if you saw previous code):
latest version 0.1 in post 8
Summary
import string
import random
# starting sequence
ancestral = 'atggaacgccatatgcgcccggcccaagatataggttggcagcccgaaacagcctcaaccccacaccttgacgttgatcgcgtaagccgcccgatgtgcatacctgtgttcagcatagggataaaactctggtcttccataaaggcacccgttcgttactcggcccagactgcatacctgtccgaccggggcttcccgtcttcccctttaagtagtcctcttcgcggattctcgcgacaagttgtcctaggcggatggagcccacacgaattcccaagctttcgcggaataatatgggtagccatgcttatttttttatcctggcaactcgtacagcctagtaactcctcgcgcccggctaagggcgggcacgtgctgaacttcagggcacgtacgttggctacgaagtacacagaagacgttgctttaggatctcgcaggcgtcggggacggtacggatacccagtgacacaaaacgtaaaaagtgaactcgaaacggatcagtcacgcacttacgtacccttcgcctcaggtggatataagccgatttcggtccttgggccatggtctcacagatgcctaccgaaaattttactataa'
# initializes species list A-Z
specieslist = list(string.ascii_uppercase)
# initializes tree in Newick format
speciestree = '(A)'
# modifies species tree for new speciation events
# send species that is speciating, name of offspring species, and the current tree
# returns modified tree in Newick format
def splitspecies(species, offspring, tree):
speciate = species
newbranch = offspring
tree = tree.replace(speciate, f'({speciate}, {newbranch})')
return(tree)
# randomly chooses one base and its replacement i.e. snp
# returns mutated sequence as string
def mutateseq(seq):
seqlist = list(seq)
baseset = {'a', 't', 'c', 'g'}
seqlen = len(seq)
mutbasenum = random.randrange(0, seqlen, 1)
mutbase = set(seq[mutbasenum])
mutation = random.choice(list(baseset.difference(mutbase)))
seqlist[mutbasenum] = mutation
outseq = ''.join(seqlist)
return(outseq)
# initializes dictionary of sequences for species, sets A as starting sequence
speciesdict = {k:'' for k in specieslist}
speciesdict['A'] = ancestral
# a generation is a round of speciation events and mutations
generations = 1000
# for each generation, speciationrate determines chance of specciation
# in each generation, all sequences are mutated at one base
# loop is stopped if there are 10 or more species at start of new loop
speciationrate = 5
listcounter = 1
for i in range(generations):
if listcounter >= 10:
break
for k in speciesdict:
if speciesdict[k] != '':
speciatechance = random.randrange(0, speciationrate, 1)
if speciatechance == 0:
speciesdict[specieslist[listcounter]] = mutateseq(speciesdict[k])
speciesdict[k] = mutateseq(speciesdict[k])
speciestree = splitspecies(k, specieslist[listcounter], speciestree)
listcounter += 1
else:
speciesdict[k] = mutateseq(speciesdict[k])
# data clean up and printing results in Newick and fasta formats
newicklist = list(speciestree)
del newicklist[0]
del newicklist[-1]
newick = ''.join(newicklist)
print(f'{newick};')
for k, v in speciesdict.items():
if v != '':
print(f'>{k}')
print(v)
For those unfamiliar with Python, I would suggest using Jupyter notebook which is part of the Anaconda package. You can also run the script online at Online Python Compiler (Interpreter) .
The script starts with a random open reading frame. In each iteration there are two things that happen. First, a random number is chosen between 0 and speciationrate. If the number is 0 then a branch splits. Second, a base is randomly mutated in every branch. The script also keeps track of the real ancestral history of the species group in Newick format.
The script prints out the real ancestral history and the evolved sequences in fasta format. For visualization, I have been using Trex-online, and for constructing a phylogeny from the sequence data I have been using Simple Phylogenetic Tree < Phylogeny < EMBL-EBI .
The tree constructed from sequence closely matches the real ancestral history. The only difference is the deepest nodes which are flattened out in the sequence phylogeny tool. Again, this is a very crude simulation so I don’t factor in such things as mutational bias, indels, selection, differing mutation rates, and so forth. I am also a newb when it comes to phylogenetics, so any suggestions for different phylogenetic analyses would be greatly helpful.
Please feel free to suggest changes and improvements to the script itself.