Fun with Python: Sequence Evolution Simulation

T_aquaticus · September 12, 2022, 4:05pm

I have been attempting to learn Python for the last few months, and one thing that has helped me is dabbling in random projects. To that end, I wrote a very rudimentary script for simulating the evolution of a DNA sequence (click on Summary) to show how descent with modification produces a nested hierarchy.

Reminder: I’m just a beginner, so forgive any poor scripting on my part (already made two edits within 20 minutes of posting, apologies if you saw previous code):

latest version 0.1 in post 8

Summary

import string
import random

# starting sequence
ancestral = 'atggaacgccatatgcgcccggcccaagatataggttggcagcccgaaacagcctcaaccccacaccttgacgttgatcgcgtaagccgcccgatgtgcatacctgtgttcagcatagggataaaactctggtcttccataaaggcacccgttcgttactcggcccagactgcatacctgtccgaccggggcttcccgtcttcccctttaagtagtcctcttcgcggattctcgcgacaagttgtcctaggcggatggagcccacacgaattcccaagctttcgcggaataatatgggtagccatgcttatttttttatcctggcaactcgtacagcctagtaactcctcgcgcccggctaagggcgggcacgtgctgaacttcagggcacgtacgttggctacgaagtacacagaagacgttgctttaggatctcgcaggcgtcggggacggtacggatacccagtgacacaaaacgtaaaaagtgaactcgaaacggatcagtcacgcacttacgtacccttcgcctcaggtggatataagccgatttcggtccttgggccatggtctcacagatgcctaccgaaaattttactataa'

# initializes species list A-Z
specieslist = list(string.ascii_uppercase)

# initializes tree in Newick format
speciestree = '(A)'

# modifies species tree for new speciation events
# send species that is speciating, name of offspring species, and the current tree
# returns modified tree in Newick format
def splitspecies(species, offspring, tree):    
    speciate = species
    newbranch = offspring 
    tree = tree.replace(speciate, f'({speciate}, {newbranch})')    
    return(tree)

# randomly chooses one base and its replacement i.e. snp
# returns mutated sequence as string
def mutateseq(seq):
    seqlist = list(seq)
    baseset = {'a', 't', 'c', 'g'}
    seqlen = len(seq)
    mutbasenum = random.randrange(0, seqlen, 1)
    mutbase = set(seq[mutbasenum])    
    mutation = random.choice(list(baseset.difference(mutbase)))       
    seqlist[mutbasenum] = mutation
    outseq = ''.join(seqlist)
    return(outseq)

# initializes dictionary of sequences for species, sets A as starting sequence
speciesdict = {k:'' for k in specieslist}
speciesdict['A'] = ancestral

# a generation is a round of speciation events and mutations
generations = 1000

# for each generation, speciationrate determines chance of specciation
# in each generation, all sequences are mutated at one base
# loop is stopped if there are 10 or more species at start of new loop
speciationrate = 5
listcounter = 1
for i in range(generations):    
    if listcounter >= 10:
        break
    for k in speciesdict:
        if speciesdict[k] != '':
            speciatechance = random.randrange(0, speciationrate, 1)
            if speciatechance == 0:
                speciesdict[specieslist[listcounter]] = mutateseq(speciesdict[k])
                speciesdict[k] = mutateseq(speciesdict[k])                                
                speciestree = splitspecies(k, specieslist[listcounter], speciestree)
                listcounter += 1
            else:                
                speciesdict[k] = mutateseq(speciesdict[k])
                
# data clean up and printing results in Newick and fasta formats
newicklist = list(speciestree)
del newicklist[0]
del newicklist[-1]
newick = ''.join(newicklist)
print(f'{newick};')
for k, v in speciesdict.items():
    if v != '':
        print(f'>{k}')
        print(v)

For those unfamiliar with Python, I would suggest using Jupyter notebook which is part of the Anaconda package. You can also run the script online at Online Python Compiler (Interpreter) .

The script starts with a random open reading frame. In each iteration there are two things that happen. First, a random number is chosen between 0 and speciationrate. If the number is 0 then a branch splits. Second, a base is randomly mutated in every branch. The script also keeps track of the real ancestral history of the species group in Newick format.

The script prints out the real ancestral history and the evolved sequences in fasta format. For visualization, I have been using Trex-online, and for constructing a phylogeny from the sequence data I have been using Simple Phylogenetic Tree < Phylogeny < EMBL-EBI .

The tree constructed from sequence closely matches the real ancestral history. The only difference is the deepest nodes which are flattened out in the sequence phylogeny tool. Again, this is a very crude simulation so I don’t factor in such things as mutational bias, indels, selection, differing mutation rates, and so forth. I am also a newb when it comes to phylogenetics, so any suggestions for different phylogenetic analyses would be greatly helpful.

Please feel free to suggest changes and improvements to the script itself.

Joe_Felsenstein · September 13, 2022, 12:35am

So the higher the speciation rate, the less likely there is to be a speciation?

Roy · September 13, 2022, 8:43am

I’d have implemented mutateseq as a probability of a mutation at each position rather than exactly one mutation. If I was feeling ambitious I’d also have implemented insertions and deletions.

If I had implemented exactly one mutation, I wouldn’t have converted the genome from a string to a list and back again, I’d have just used substring manipulation.

I have a dozen or so simple EA’s implemented in Python if you’re interested.

T_aquaticus · September 13, 2022, 2:55pm

Correct.

Joe_Felsenstein · September 13, 2022, 6:48pm

Odd terminology. Also, let me comment on one other property of your simulation. Is it true that at each time-step you change one base in one lineage? Or is it one base per lineage? If the former, then the substitution rate drops as there are more and more lineages coexisting.

T_aquaticus · September 13, 2022, 7:40pm

Fair criticism. I can probably think of a better variable name.

One base is changed in all sequences in each iteration. The sequences are kept in a dictionary and the for loop iterates over each sequence in the dictionary.

T_aquaticus · September 13, 2022, 8:17pm

It would be very helpful to see how the professionals do it. 90% of the reason I wrote the script was to see if I could actually do it, so I am definitely eager to learn what I can.

T_aquaticus · September 13, 2022, 9:04pm

Sequence Evolution Simulation (SES) version 0.1

– mutateseq function keeps incoming sequence as string
– speciationrate is now defined as a probability with two decimal places (e.g. 0.25)

Summary

import string
import random
import numpy

# starting sequence
ancestral = 'atggaacgccatatgcgcccggcccaagatataggttggcagcccgaaacagcctcaaccccacaccttgacgttgatcgcgtaagccgcccgatgtgcatacctgtgttcagcatagggataaaactctggtcttccataaaggcacccgttcgttactcggcccagactgcatacctgtccgaccggggcttcccgtcttcccctttaagtagtcctcttcgcggattctcgcgacaagttgtcctaggcggatggagcccacacgaattcccaagctttcgcggaataatatgggtagccatgcttatttttttatcctggcaactcgtacagcctagtaactcctcgcgcccggctaagggcgggcacgtgctgaacttcagggcacgtacgttggctacgaagtacacagaagacgttgctttaggatctcgcaggcgtcggggacggtacggatacccagtgacacaaaacgtaaaaagtgaactcgaaacggatcagtcacgcacttacgtacccttcgcctcaggtggatataagccgatttcggtccttgggccatggtctcacagatgcctaccgaaaattttactataa'

# initializes species list A-Z
specieslist = list(string.ascii_uppercase)

# initializes tree in Newick format
speciestree = '(A)'

# modifies species tree for new speciation events
# send species that is speciating, name of offspring species, and the current tree
# returns modified tree in Newick format
def splitspecies(species, offspring, tree):    
    speciate = species
    newbranch = offspring 
    tree = tree.replace(speciate, f'({speciate}, {newbranch})')    
    return(tree)

# randomly chooses one base and its replacement i.e. snp
# returns mutated sequence as string
def mutateseq(seq):    
    baseset = {'a', 't', 'c', 'g'}
    seqlen = len(seq)
    mutbasenum = random.randrange(0, seqlen, 1)        
    mutation = random.choice(list(baseset.difference(set(seq[mutbasenum]))))          
    seq = seq[:mutbasenum] + mutation + seq[mutbasenum+1:]    
    return(seq)

# initializes dictionary of sequences for species, sets A as starting sequence
speciesdict = {k:'' for k in specieslist}
speciesdict['A'] = ancestral

# a generation is a round of speciation events and mutations
generations = 1000

# for each generation, speciationrate determines chance of specciation
# and must be no more than two decimal places
# in each generation, all sequences are mutated at one base
# loop is stopped if there are 10 or more species at start of new loop
speciationrate = 0.25

listcounter = 1

for i in range(generations):    
    if listcounter >= 10:
        break
    for k in speciesdict:
        if speciesdict[k] != '':
            speciatechance = random.randrange(1, 100, 1)
            if speciatechance <= speciationrate*100:
                speciesdict[specieslist[listcounter]] = mutateseq(speciesdict[k])
                speciesdict[k] = mutateseq(speciesdict[k])                                
                speciestree = splitspecies(k, specieslist[listcounter], speciestree)
                listcounter += 1
            else:                
                speciesdict[k] = mutateseq(speciesdict[k])
                
# data clean up and printing results in Newick and fasta formats
newicklist = list(speciestree)
del newicklist[0]
del newicklist[-1]
newick = ''.join(newicklist)
print(f'{newick};')
for k, v in speciesdict.items():
    if v != '':
        print(f'>{k}')
        print(v)

John_Harshman · September 13, 2022, 9:04pm

Here’s one I’m familiar with:

Doubtless there are newer ones.

Roy · September 14, 2022, 8:41am

You won’t want mine then These days I spend my professional time critiquing others development processes rather than doing any myself.

I’ll send you a couple.

Topic		Replies	Views
Simulated Evolution Conversation Science , Design , Education	12	1011	January 9, 2022
Geneological Adam and Eve software Conversation	37	1731	January 11, 2022
Simulating 500 million years of evolution with a language model Conversation Science , Artificial-Intelligence	9	183	February 2, 2025
Frustrated Evolutionary Networks Conversation	7	773	June 3, 2021
Human Genetics Confirms Mutations as the Drivers of Diversity and Evolution Conversation Science	10	471	February 21, 2019

Fun with Python: Sequence Evolution Simulation

Related topics