I've been Testing Tomkins' Methods and I'd like some Peer Review!

Hey Everyone,

My apologies for being rather absent. I've been slammed with PhD work and keeping the YouTube channel running.

I've grown tired of seeing the "Tomkins numbers" for human/chimp similarity floating around, so a few weeks ago I decided to use his methods to run some additional comparative genomics. Now, here on Peaceful Science, I know everyone understands that Jeffrey Tomkins' methods are entirely inappropriate. But for the uninitiated: Jeffrey Tomkins is a YEC plant geneticist who made waves in 2013ish (and then every few years since) by arguing that the "right" way to compare genomes yields a human/chimp similarity somewhere in the 70-88% range.

There are oodles of shenanigans going on here.

Initially Tomkins published his 2013 “study”: Comprehensive Analysis of Chimpanzee and Human Chromosomes Reveals Average DNA Similarity of 70%. This piece was rather quickly superseded by a subsequent work when Glenn Williamson discovered that Tomkins was using a bugged version of BLAST for his analysis. The new piece was titled: Documented Anomaly in Recent Versions of the BLASTN Algorithm and a Complete Reanalysis of Chimpanzee and Human Genome-Wide DNA Similarity Using Nucmer and LASTZ.

So, he started by claiming humans and chimpanzees are 70% similar, followed by a re-run of the initial study that upped the similarity to around 88%, although he speculates that the "real" number hovers around 80% or less: “…the majority of flow cytometry studies of chimpanzee nuclei along with the cytogenetic analysis of chromosomes indicate a genome size difference of about 8%, with the chimpanzee genome having a significantly larger amount of heterochromatic DNA compared to human (Formenti et al. 1983; Pellicciari et al. 1982, 1988, 1990a, 1990b; Seuanez et al. 1977). Thus, the actual genome similarity with human, even using the high end estimate of 88% for just the alignable regions, is realistically only about 80% or less when the cytogenetic data is taken into account” (Tomkins, 2015).

In 2018, another piece was shoved into the limelight by Tomkins: Comparison of 18,000 De Novo Assembled Chimpanzee Contigs to the Human Genome Yields Average BLASTN Alignment Identities of 84%. This work is different from the others because the methods actually vary slightly (we will discuss this later), and he additionally uses the first published contigs of panTro6, a chimp genome assembled without using the human genome as a reference. As you can see from the title, it concludes that the percent similarity between humans and chimps is actually 84% (for real this time).

So, as a super quick explanation of why Tomkins' methods are all wrong/inappropriate, here is my understanding. In 2013 he was obviously using a bugged version of BLAST, and in 2015 he essentially re-ran the same analysis with the same methods in BLAST. He splits each chimp chromosome into 300 bp slices for his comparison to the human genome: “As in the previous study, a sequence slicing strategy of the chromosomal query sequence was employed to overcome the inability of the BLASTN algorithm to produce alignments beyond a few hundred bases. A sequence slice of 300 bases was used which according to the alignment length results produced by the nucmer algorithm (discussed below) was appropriate because Nucmer, on average, produced alignment lengths of 300 bases or more for all chimpanzee chromosomes.”

Then the following BLAST parameters were used: evalue 10, word_size 11, max_target_seqs 1, dust no, soft_masking false, ungapped, and num_threads 6.

Immediately you can see the issue: the analysis is ungapped, allowing no insertions or deletions. That approach is typically used for extremely similar to near-identical sequences, and is not considered appropriate for whole-genome comparison even within a species.

The 2018 work was different. The parameters are potentially less egregious: evalue 0.1, word_size 11, outfmt 10 (qseqid, qstart, qend, mismatch, gapopen, pident, nident, length, qlen), max_target_seqs 1, max_hsps 1, dust no, soft_masking false, perc_identity 50, gapopen 3, gapextend 3, num_threads 10 (although that perc_identity is rough), but the worst part has little to do with the actual comparison. As noticed again by Glenn Williamson, Tomkins does not weight his sequences. This means a 30 bp sequence that is 50% similar is weighted the same as a 30,000 bp sequence that is identical between human and chimp. This is evident in the material published by Tomkins on GitHub, where it is revealed he is simply averaging the pident values.
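To illustrate why the weighting matters, here is a quick toy example (hypothetical alignments, not Tomkins' actual data) comparing a naive average of pident values with a length-weighted average:

# Minimal sketch of the weighting problem (hypothetical alignments, not Tomkins' data).
# Each tuple is (alignment_length_bp, percent_identity).
alignments = [
    (30, 50.0),      # a short, poorly matching alignment
    (30000, 100.0),  # a long, identical alignment
]

# Naive average of pident values (what the unweighted approach does):
naive = sum(pident for _, pident in alignments) / len(alignments)

# Length-weighted average (each aligned base counts equally):
total_bases = sum(length for length, _ in alignments)
weighted = sum(length * pident for length, pident in alignments) / total_bases

print(f"naive average:    {naive:.2f}%")     # 75.00%
print(f"weighted average: {weighted:.2f}%")  # 99.95%

Same two alignments, wildly different "similarity" depending on whether you weight by alignment length.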

So, we have all been through this before. We all know his methods are bogus, and yet the numbers are repeated constantly on every YEC site around. Tomkins will occasionally try to defend his methods, suggesting his approach is the most honest and the least influenced by the evolutionary paradigm. But what I, and many others, have found very curious indeed is that Tomkins has ONLY ever compared the human and chimpanzee genomes using these methods. What would happen if he compared a chimp and a bonobo? Or a chimp and a gorilla? Or a dog and a wolf? Or how about two humans? Seeing as his methods clearly aim to maximize differences, and his 2013 pieces certainly do so (I believe the 2018 piece is deceptive in a subtler way), myself and others hypothesized that the above comparisons would also yield highly divergent results, on par with the human/chimp disparity.

And if a chimp and a gorilla are similarly divergent, what would that say about the proposal that these apes are in the same "kind"? Suppose human/human comparisons aren't 99.9% similar, as Ken Ham proposes; suppose they come out 94% similar, or less? That would suggest Tomkins' methods are simply changing the scale of the differences, not "reducing" the similarity.

So if I may, I'd like to run my work by some resident geneticists, as I have been running these exact comparisons (and more). I would like to share my methods before I release anything, to be sure that I am appropriately representing Tomkins' methods.

My methods:

I downloaded the most up-to-date version of BLAST from NCBI: Index of /blast/executables/blast+/LATEST
I downloaded some genomes from Ensembl (Ensembl genome browser 109), UCSC, and NCBI.

I am using unmasked genomes, so when only soft-masked reference genomes were available, I used a script to convert the entire .fa file to uppercase.
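The script is nothing fancy; here is a minimal sketch of the idea (file names are placeholders and my actual script differs slightly). It uppercases the sequence lines while leaving the FASTA headers alone:

# Minimal sketch: "unmask" a soft-masked FASTA by uppercasing sequence lines.
# (File names are placeholders; my actual script differs in the details.)
import sys

def unmask_fasta(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                fout.write(line)            # keep headers as-is
            else:
                fout.write(line.upper())    # lowercase (soft-masked) bases -> uppercase

if __name__ == "__main__":
    unmask_fasta(sys.argv[1], sys.argv[2])  # e.g. python unmask.py in.fa out.fa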

Ensembl usually works, but does not have chromosome-level assemblies for several species, which is what my pipeline requires. Tomkins also rails on about how panTro3 is too "humanized", so I got panTro6 from here: Index of /goldenpath/hg38/vsPanTro6

Han population and Japanese population reference genomes come from here, and are soft-masked as far as I can tell, so I had to "unmask" them.
Han: Han1 - Genome - Assembly - NCBI
Japan: Construction of JRG (Japanese reference genome) with single-molecule real-time sequencing - PMC

I use a cmd script to generate random 300 bp sequences from a target chromosome and query them 1000-10000 times (depending on the run) against the same chromosome from another species, using the following parameters (a rough sketch of the whole workflow is shown a few lines below):

-outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_hsps 1 -max_target_seqs 1

I term this the “Good” method haha.

The alignments are written to a .csv.
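For reference, here is a rough Python sketch of that workflow (file names and paths are placeholders, and this is simplified from my actual cmd script). It samples random 300 bp slices from one chromosome, builds a BLAST database from the other species' chromosome, and runs blastn with the "Good" parameters above:

# Minimal sketch of the "Good" workflow (placeholder file names; simplified from my actual script).
# Requires NCBI BLAST+ (blastn, makeblastdb) on PATH and single-chromosome FASTA files.
import random
import subprocess

def load_sequence(fasta_path: str) -> str:
    """Read a single-sequence FASTA into one string."""
    with open(fasta_path) as f:
        return "".join(line.strip() for line in f if not line.startswith(">"))

def sample_slices(seq: str, n: int = 1000, size: int = 300) -> str:
    """Return n random slices of `size` bp as a multi-FASTA string, skipping N-heavy slices."""
    out = []
    while len(out) < n:
        start = random.randrange(0, len(seq) - size)
        piece = seq[start:start + size]
        if piece.count("N") < size // 2:   # avoid slices that are mostly assembly gaps
            out.append(f">slice_{len(out)}_{start}\n{piece}")
    return "\n".join(out) + "\n"

chimp_chr = load_sequence("panTro6_chr1.fa")          # placeholder file name
with open("queries.fa", "w") as f:
    f.write(sample_slices(chimp_chr, n=1000, size=300))

# Build a BLAST database for the target chromosome (one-time step; placeholder file names).
subprocess.run(["makeblastdb", "-in", "hg38_chr1.fa", "-dbtype", "nucl",
                "-out", "hg38_chr1_db"], check=True)

# "Good" method: default gapped blastn, CSV output.
subprocess.run([
    "blastn",
    "-query", "queries.fa",
    "-db", "hg38_chr1_db",
    "-outfmt", "10 qseqid qstart qend mismatch gapopen pident nident length qlen",
    "-max_hsps", "1",
    "-max_target_seqs", "1",
    "-out", "good_method.csv",
], check=True)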

I also used Tomkins' exact 2013/2015 methods.

These parameters are:

-outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -evalue 10 -word_size 11
-max_target_seqs 1 -dust no -soft_masking false -ungapped

In his GitHub-published work, oddly enough, he occasionally uses the -task blastn option, which looks like this:

-task blastn -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -evalue 10 -word_size 11
-max_target_seqs 1 -dust no -soft_masking false -ungapped

However, I stuck with the first set of parameters instead of using -task blastn. I skipped his 2018 methods because the parameters matter less than how he actually averages the pident values.
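For completeness, the Tomkins-style run is just the earlier sketch with his flags swapped in (again, a rough sketch of my re-implementation, not his actual pipeline):

# Tomkins-style 2013/2015 run: same placeholder query file and database as the sketch above,
# with his parameters appended.
import subprocess

subprocess.run([
    "blastn",
    "-query", "queries.fa",
    "-db", "hg38_chr1_db",
    "-outfmt", "10 qseqid qstart qend mismatch gapopen pident nident length qlen",
    "-evalue", "10",
    "-word_size", "11",
    "-max_target_seqs", "1",
    "-dust", "no",
    "-soft_masking", "false",
    "-ungapped",                      # the critical flag: no indels allowed
    "-out", "tomkins_method.csv",
], check=True)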

The real place where my analysis diverges from his is in the number of queries. Tomkins ends up with around 30 million base pairs (300 bp × 100,000 slices) aligned for each chromosome, since he is using a monster of a computer. I only have 300,000 bp per chromosome (300 bp × 1,000 slices). I do not think this limitation is too problematic, as Tomkins publishes a table in his 2015 paper comparing the % similarity obtained with 10, 100, 1,000, 10,000, and 100,000 slices of 300 bp, and the values are roughly equivalent beyond 1,000 queries. My computer, even my lab's computer, simply isn't strong enough to run the larger analyses within a reasonable time frame.

So what do you think? Were these methods reasonable?

9 Likes

I'm not a geneticist, so I don't know the amount of time, effort, and money involved, but I think you should do exactly that. Produce some results under Tomkins' methods just to see how silly they are, then let him explain the nonsense if he can.

7 Likes

A good place to start would be the methods from the 2005 Chimp Genome paper, since they were actually competent, unlike Tomkins.

1 Like

Let me second this. The development of ‘good’ methods should be secondary. Focus on using his methods exactly on species pairs that creationists would insist ‘should’ be close. And a few tests between different genomes of the same species, just for fun.

5 Likes

From what I can tell so far, this all seems good. It would definitely be interesting to see how similar the genomes of two species that they consider to be of the same "kind" are to each other (e.g. chimps to orangutans, or mice to rats), and to see how those similarities compare to Tomkins' figure for human-chimp genomic similarity. Expect them to respond with excuses… or dare I say "rescuing devices"… to the effect that you cannot use this method both to compare species of the same "kind" and to compare species of different "kinds".

Lastly, I would also include the following caveat. Comparing two genomes and describing the degree of similarity in terms of a single percentage is rather specious, IMO. As you alluded to, this works straightforwardly if you are dealing with highly similar sequences that (mostly) differ by point mutations. However, what if you are dealing with large indels? E.g. two sequences are identical except for indels that sum to 1,000 bp. Do you treat those 1,000 bp as equal to 1,000 point mutations when calculating the percentage difference? And what about translocations and inversions? It gets really muddy to come up with exact percentage figures.
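To make the ambiguity concrete, here is a quick toy calculation (the numbers are made up purely for illustration):

# Hypothetical illustration: two 100,000 bp sequences, identical except for
# a single 1,000 bp insertion in one of them. How "different" are they?
aligned_length = 100_000
indel_bases = 1_000

# Counting every inserted base as a difference:
per_base = 100 * (1 - indel_bases / aligned_length)   # 99.0% "identity"

# Counting the indel as a single mutational event:
per_event = 100 * (1 - 1 / aligned_length)            # 99.999% "identity"

print(per_base, per_event)

Same pair of sequences, and the answer swings by an order of magnitude depending on an arbitrary counting convention.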

Instead of percentages, I prefer to compare genomes in terms of synteny. This also allows for showing (even the structural) similarities of the genomes with visuals that are more intuitive to understand rather than just an abstract number.

Some examples:
Here is the Human (Homo sapiens) compared to Chimp (Pan troglodytes).
Below is the Comparative Genome Viewer from NCBI. Ribbons correspond to synteny blocks (green = forward alignment, purple = reverse alignment). The minimum alignment size is set at 100 kb.


And here is a plot that shows the synteny blocks in colours from Genomicus at default settings:

Note how human chromosome 2 aligns with chimp chromosomes 2a and 2b.

Here is the Genomicus figure for chimp and orangutan (the Comparative Genome Viewer doesn't have the comparison for these two, unfortunately).


It's hard to tell, but there are more structural differences between these two (more inversions) than in the chimp and human comparison.

Now, for something more obvious, let's look at the genomic comparison between Rat (Rattus norvegicus) and Mouse (Mus musculus). Comparative Genome Viewer, same settings:


And the Genomicus figure.

Admittedly, the chromosomes in the first figure could've been ordered better, but it's still obvious that (genome-wise) humans are more similar to chimps than rats are to mice.

7 Likes

I totally agree with this; synteny is more appropriate! However, I am trying to stick with BLAST comparisons in order to show the flaws in Tomkins' argument specifically.

2 Likes

Oh, yes, absolutely. I am not saying that you shouldn't. I was just suggesting (if you end up recording a video on this) that you also mention this bit about synteny, to further emphasize just how similar our genome is to that of chimps, as a supplement to the results of your BLAST comparison.

Compare mice and rats.

And also compare human to human.

Using the same methods, of course. Those two controls make it obvious how absurd his approach is.

3 Likes

He doesn’t always use BLAST.

You also might find this analysis helpful: Human and Chimp Similarity (Mind the Controls)

Here are my comparisons thus far:

human/chimp (panTro6)
human/bonobo
bonobo/chimp (panTro6)
bonobo/chimp (panTro3)
Han population/Japanese population
human/gorilla
chimp (panTro6)/gorilla
mouse/rat
mouse/Algerian mouse
lion/cat
Bos indicus/Bos taurus
dog/wolf
olive baboon/gelada
rhesus macaque/long-tailed macaque

Most of these are done, with 1,000 queries of 300 bp slices of the former against the latter. I have run both the "Good" and "Tomkins" methods each time. Let me know if you can think of any others I should perform, with links if you can! I can unmask soft-masked genomes if need be.

2 Likes

He has used LASTZ and nucmer as well, but the overall approach is the same. The issue is the ungapped parameter for everything prior to 2018, and the lack of length weighting for 2018.

I saw this post; it was very helpful! I will say, though, that as I have gone through his methods and papers, given he is a plant geneticist, I am having a very hard time chalking this (Tomkins' methods and reported results) up to mistake. I am very concerned about intentional cherry-picking, particularly when comparing the 2015 work to the 2018 work.

I hope I am wrong though!

Greatly looking forward to your video on this, Erika! (if you end up recording one)

To top it all off, it would be funny to do a distance-based phylogeny using Tomkins' irrational distance measures, just to show that you recover the standard phylogeny regardless of whether you exaggerate or minimize the genetic differences between species. That, in either case, you still get a consistent nested hierarchy.
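As a minimal sketch of the point (with made-up distances, purely for illustration, not real genomic data), uniformly inflating every pairwise distance leaves the neighbor-joining topology unchanged:

# Illustration only: hypothetical pairwise distances, not real data.
# Inflating every distance by a constant factor does not change the NJ topology.
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

names = ["human", "chimp", "gorilla", "mouse"]
base = [
    [0.0],
    [0.01, 0.0],
    [0.02, 0.02, 0.0],
    [0.30, 0.30, 0.30, 0.0],
]  # lower-triangular matrix of hypothetical distances

constructor = DistanceTreeConstructor()

for factor in (1, 5):   # "standard" vs "exaggerated" distances
    inflated = [[d * factor for d in row] for row in base]
    tree = constructor.nj(DistanceMatrix(names, inflated))
    print(f"--- distances x{factor} ---")
    Phylo.draw_ascii(tree)

Both runs print the same tree shape; only the branch lengths change.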

2 Likes

It’s fine to do all these pairs, but it’s critical to present them in small doses so as to explain the significance and expectations to the audience.

If you'd like to write this up in an article for PS too, we'd love to publish it.

2 Likes

Also, I think it is really important to use the actual reads, which are NOT the same as just 300 bp slices. Using the NIH's SRA database, you can access the raw reads.

If you really want to insist on using the assembled genomes, use a "read simulator"; see A Simple Introduction to Read Simulators | by Vijini Mallawaarachchi | The Computational Biology Magazine | Medium. These realistically add in sequencing errors to simulate any platform you like.

The read-level sequencing errors are critical for understanding the behavior of his approach. Crucially, this also enables you to compare simulated reads realistically to the genome from which they were simulated, so as to compute human-human similarity.
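As one concrete option (the article above covers several tools, so take this as an assumption on my part), wgsim from the samtools project can generate error-containing reads from an assembly; something like this sketch, with placeholder file names and flags as I understand wgsim's usage:

# Sketch: simulate error-containing paired reads from an assembly with wgsim.
# Assumes wgsim is installed; file names are placeholders.
import subprocess

subprocess.run([
    "wgsim",
    "-e", "0.01",        # per-base sequencing error rate
    "-N", "100000",      # number of read pairs
    "-1", "150",         # length of read 1
    "-2", "150",         # length of read 2
    "-S", "42",          # random seed, for reproducibility
    "hg38_chr1.fa",      # reference to simulate from (placeholder)
    "sim_reads_1.fq",    # output read 1
    "sim_reads_2.fq",    # output read 2
], check=True)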

That’s important to do. Creationists are always making a great effort to show that the difference between human and chimp is Much Bigger Than That. But they don’t point out that if all differences are made bigger, we still get the same tree.

2 Likes

I would love to publish on Peaceful Science once finished! The idea is to get the numbers out there. And I agree, I'm not looking to simply dump numerous comparisons online somewhere.

2 Likes

Using actual reads is the reasonable way of doing things, but it is not how Tomkins does things in his 2013-2015 pieces. I believe he uses the de novo assembled contigs in 2018, but the issue with that work was his final pident calculation. However, I am aiming to perform his exact methods for the above pairs, which is why the 300 bp slices are (unfortunately) imperative.

I think your point on actual reads would be great though for the sake of comparison?

1 Like

Sure, you should mirror his approach. In some cases he uses reads, and read simulators might be helpful in those cases.

Have you looked into cloud-based solutions? Cloud hosts such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure let you rent time on the kind of monster computer you're talking about here on a pay-as-you-go basis. AWS also offers something called "spot instances", which give you up to a 90% discount if you can be flexible about when your code runs and can set it up to pause and resume from time to time. I would imagine that Tomkins is using something like that rather than anything owned by himself or Answers in Genesis.

Turns out there’s a version of BLAST that will automate the process of setting things up and then shutting them down again once you’re finished.

2 Likes