Fun with Python: Sequence Evolution Simulation

I have been attempting to learn Python for the last few months, and one thing that has helped me is dabbling in random projects. To that end, I wrote a very rudimentary script for simulating the evolution of a DNA sequence (click on Summary) to show how descent with modification produces a nested hierarchy.

Reminder: I’m just a beginner, so forgive any poor scripting on my part (already made two edits within 20 minutes of posting, apologies if you saw previous code):

latest version 0.1 in post 8

Summary
import string
import random

# starting sequence
ancestral = 'atggaacgccatatgcgcccggcccaagatataggttggcagcccgaaacagcctcaaccccacaccttgacgttgatcgcgtaagccgcccgatgtgcatacctgtgttcagcatagggataaaactctggtcttccataaaggcacccgttcgttactcggcccagactgcatacctgtccgaccggggcttcccgtcttcccctttaagtagtcctcttcgcggattctcgcgacaagttgtcctaggcggatggagcccacacgaattcccaagctttcgcggaataatatgggtagccatgcttatttttttatcctggcaactcgtacagcctagtaactcctcgcgcccggctaagggcgggcacgtgctgaacttcagggcacgtacgttggctacgaagtacacagaagacgttgctttaggatctcgcaggcgtcggggacggtacggatacccagtgacacaaaacgtaaaaagtgaactcgaaacggatcagtcacgcacttacgtacccttcgcctcaggtggatataagccgatttcggtccttgggccatggtctcacagatgcctaccgaaaattttactataa'

# initializes species list A-Z
specieslist = list(string.ascii_uppercase)

# initializes tree in Newick format
speciestree = '(A)'

# modifies species tree for new speciation events
# send species that is speciating, name of offspring species, and the current tree
# returns modified tree in Newick format
def splitspecies(species, offspring, tree):    
    speciate = species
    newbranch = offspring 
    tree = tree.replace(speciate, f'({speciate}, {newbranch})')    
    return(tree)

# randomly chooses one base and its replacement i.e. snp
# returns mutated sequence as string
def mutateseq(seq):
    seqlist = list(seq)
    baseset = {'a', 't', 'c', 'g'}
    seqlen = len(seq)
    mutbasenum = random.randrange(0, seqlen, 1)
    mutbase = set(seq[mutbasenum])    
    mutation = random.choice(list(baseset.difference(mutbase)))       
    seqlist[mutbasenum] = mutation
    outseq = ''.join(seqlist)
    return(outseq)

# initializes dictionary of sequences for species, sets A as starting sequence
speciesdict = {k:'' for k in specieslist}
speciesdict['A'] = ancestral

# a generation is a round of speciation events and mutations
generations = 1000

# for each generation, speciationrate determines chance of specciation
# in each generation, all sequences are mutated at one base
# loop is stopped if there are 10 or more species at start of new loop
speciationrate = 5
listcounter = 1
for i in range(generations):    
    if listcounter >= 10:
        break
    for k in speciesdict:
        if speciesdict[k] != '':
            speciatechance = random.randrange(0, speciationrate, 1)
            if speciatechance == 0:
                speciesdict[specieslist[listcounter]] = mutateseq(speciesdict[k])
                speciesdict[k] = mutateseq(speciesdict[k])                                
                speciestree = splitspecies(k, specieslist[listcounter], speciestree)
                listcounter += 1
            else:                
                speciesdict[k] = mutateseq(speciesdict[k])
                
# data clean up and printing results in Newick and fasta formats
newicklist = list(speciestree)
del newicklist[0]
del newicklist[-1]
newick = ''.join(newicklist)
print(f'{newick};')
for k, v in speciesdict.items():
    if v != '':
        print(f'>{k}')
        print(v)
    

For those unfamiliar with Python, I would suggest using Jupyter notebook which is part of the Anaconda package. You can also run the script online at Online Python Compiler (Interpreter) .

The script starts with a random open reading frame. In each iteration there are two things that happen. First, a random number is chosen between 0 and speciationrate. If the number is 0 then a branch splits. Second, a base is randomly mutated in every branch. The script also keeps track of the real ancestral history of the species group in Newick format.

The script prints out the real ancestral history and the evolved sequences in fasta format. For visualization, I have been using Trex-online, and for constructing a phylogeny from the sequence data I have been using Simple Phylogenetic Tree < Phylogeny < EMBL-EBI .

The tree constructed from sequence closely matches the real ancestral history. The only difference is the deepest nodes which are flattened out in the sequence phylogeny tool. Again, this is a very crude simulation so I don’t factor in such things as mutational bias, indels, selection, differing mutation rates, and so forth. I am also a newb when it comes to phylogenetics, so any suggestions for different phylogenetic analyses would be greatly helpful.

Please feel free to suggest changes and improvements to the script itself.

3 Likes

So the higher the speciation rate, the less likely there is to be a speciation?

I’d have implemented mutateseq as a probability of a mutation at each position rather than exactly one mutation. If I was feeling ambitious I’d also have implemented insertions and deletions.

If I had implemented exactly one mutation, I wouldn’t have converted the genome from a string to a list and back again, I’d have just used substring manipulation.

I have a dozen or so simple EA’s implemented in Python if you’re interested.

1 Like

Correct.

Odd terminology. Also, let me comment on one other property of your simulation. Is it true that at each time-step you change one base in one lineage? Or is it one base per lineage? If the former, then the substitution rate drops as there are more and more lineages coexisting.

Fair criticism. I can probably think of a better variable name.

One base is changed in all sequences in each iteration. The sequences are kept in a dictionary and the for loop iterates over each sequence in the dictionary.

It would be very helpful to see how the professionals do it. 90% of the reason I wrote the script was to see if I could actually do it, so I am definitely eager to learn what I can.

Sequence Evolution Simulation (SES) version 0.1

– mutateseq function keeps incoming sequence as string
– speciationrate is now defined as a probability with two decimal places (e.g. 0.25)

Summary
import string
import random
import numpy

# starting sequence
ancestral = 'atggaacgccatatgcgcccggcccaagatataggttggcagcccgaaacagcctcaaccccacaccttgacgttgatcgcgtaagccgcccgatgtgcatacctgtgttcagcatagggataaaactctggtcttccataaaggcacccgttcgttactcggcccagactgcatacctgtccgaccggggcttcccgtcttcccctttaagtagtcctcttcgcggattctcgcgacaagttgtcctaggcggatggagcccacacgaattcccaagctttcgcggaataatatgggtagccatgcttatttttttatcctggcaactcgtacagcctagtaactcctcgcgcccggctaagggcgggcacgtgctgaacttcagggcacgtacgttggctacgaagtacacagaagacgttgctttaggatctcgcaggcgtcggggacggtacggatacccagtgacacaaaacgtaaaaagtgaactcgaaacggatcagtcacgcacttacgtacccttcgcctcaggtggatataagccgatttcggtccttgggccatggtctcacagatgcctaccgaaaattttactataa'

# initializes species list A-Z
specieslist = list(string.ascii_uppercase)

# initializes tree in Newick format
speciestree = '(A)'

# modifies species tree for new speciation events
# send species that is speciating, name of offspring species, and the current tree
# returns modified tree in Newick format
def splitspecies(species, offspring, tree):    
    speciate = species
    newbranch = offspring 
    tree = tree.replace(speciate, f'({speciate}, {newbranch})')    
    return(tree)

# randomly chooses one base and its replacement i.e. snp
# returns mutated sequence as string
def mutateseq(seq):    
    baseset = {'a', 't', 'c', 'g'}
    seqlen = len(seq)
    mutbasenum = random.randrange(0, seqlen, 1)        
    mutation = random.choice(list(baseset.difference(set(seq[mutbasenum]))))          
    seq = seq[:mutbasenum] + mutation + seq[mutbasenum+1:]    
    return(seq)

# initializes dictionary of sequences for species, sets A as starting sequence
speciesdict = {k:'' for k in specieslist}
speciesdict['A'] = ancestral

# a generation is a round of speciation events and mutations
generations = 1000

# for each generation, speciationrate determines chance of specciation
# and must be no more than two decimal places
# in each generation, all sequences are mutated at one base
# loop is stopped if there are 10 or more species at start of new loop
speciationrate = 0.25

listcounter = 1

for i in range(generations):    
    if listcounter >= 10:
        break
    for k in speciesdict:
        if speciesdict[k] != '':
            speciatechance = random.randrange(1, 100, 1)
            if speciatechance <= speciationrate*100:
                speciesdict[specieslist[listcounter]] = mutateseq(speciesdict[k])
                speciesdict[k] = mutateseq(speciesdict[k])                                
                speciestree = splitspecies(k, specieslist[listcounter], speciestree)
                listcounter += 1
            else:                
                speciesdict[k] = mutateseq(speciesdict[k])
                
# data clean up and printing results in Newick and fasta formats
newicklist = list(speciestree)
del newicklist[0]
del newicklist[-1]
newick = ''.join(newicklist)
print(f'{newick};')
for k, v in speciesdict.items():
    if v != '':
        print(f'>{k}')
        print(v)
    

        

Here’s one I’m familiar with:

Doubtless there are newer ones.

1 Like

You won’t want mine then :slight_smile: These days I spend my professional time critiquing others development processes rather than doing any myself.

I’ll send you a couple.