Hey Everyone,
My apologies for being rather absent. I’ve been slammed with PhD work and keeping the YouTube channel running.
I’ve grown tired of seeing the “Tomkins Numbers” for human/chimp similarity floating around, so a few weeks ago I decided to use his methods to run some additional comparative genomics. Now, here on Peaceful Science, I know everyone understands that Jeffrey Tomkins uses methods that are entirely inappropriate. But for the uninitiated: Jeffrey Tomkins is a YEC plant geneticist who made waves in 2013ish (and then every few years since) by arguing that the right way to compare genomes yields a human/chimp similarity of 80-88%.
There are oodles of shenanigans going on here, for various reasons.
Initially, Tomkins published his 2013 “study”: Comprehensive Analysis of Chimpanzee and Human Chromosomes Reveals Average DNA Similarity of 70%. This piece was rather quickly superseded by a follow-up after Glenn Williamson discovered that Tomkins had been using a bugged version of BLAST for his analysis. The new piece was titled: Documented Anomaly in Recent Versions of the BLASTN Algorithm and a Complete Reanalysis of Chimpanzee and Human Genome-Wide DNA Similarity Using Nucmer and LASTZ.
So, he started by claiming humans and chimpanzees are 70% similar, followed by a re-run of the initial study that upped the similarity to around 88%, although he speculates that the "real" number hovers around 80% or less: “…the majority of flow cytometry studies of chimpanzee nuclei along with the cytogenetic analysis of chromosomes indicate a genome size difference of about 8%, with the chimpanzee genome having a significantly larger amount of heterochromatic DNA compared to human (Formenti et al. 1983; Pellicciari et al. 1982, 1988, 1990a, 1990b; Seuanez et al. 1977). Thus, the actual genome similarity with human, even using the high end estimate of 88% for just the alignable regions, is realistically only about 80% or less when the cytogenetic data is taken into account” (Tomkins, 2015).
In 2018, Tomkins pushed another piece into the limelight: Comparison of 18,000 De Novo Assembled Chimpanzee Contigs to the Human Genome Yields Average BLASTN Alignment Identities of 84%. This work is different from the others because the methods actually vary slightly (we will discuss this later), and he additionally uses the first published contigs of panTro6, a chimp genome assembled without using the human genome as a reference. As you can see from the title, it concludes that the percent similarity between humans and chimps is actually 84% (for real this time).
So, as a super quick explanation for why Tomkins's methods are all wrong/inappropriate, here is my understanding. In 2013 he was obviously using a bugged version of BLAST, and in 2015 he re-ran essentially the same analysis, with the same methods, still in BLAST. He splits each chimp chromosome into 300 bp slices for his comparison to the human genome: “As in the previous study, a sequence slicing strategy of the chromosomal query sequence was employed to overcome the inability of the BLASTN algorithm to produce alignments beyond a few hundred bases. A sequence slice of 300 bases was used which according to the alignment length results produced by the nucmer algorithm (discussed below) was appropriate because Nucmer, on average, produced alignment lengths of 300 bases or more for all chimpanzee chromosomes.”
Then the following BLAST parameters were used: evalue 10, word_size 11, max_target_seqs 1, dust no, soft_masking false, ungapped, and num_threads 6.
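To make that concrete, here is a rough sketch of what that slicing-plus-BLAST step looks like. This is my own reconstruction, not Tomkins's actual code; the file name, the pre-built human BLAST database name, and the decision to skip slices containing Ns are all placeholders/assumptions on my part.

```python
# Sketch: slice a chimp chromosome into non-overlapping 300 bp pieces and BLAST
# them against a human chromosome database using the flags listed above.
# Requires Biopython and the blastn binary; file/DB names are placeholders.
import subprocess
from Bio import SeqIO

SLICE = 300
chrom = next(SeqIO.parse("panTro_chr1.fa", "fasta"))  # one chromosome per file

with open("chimp_slices.fa", "w") as out:
    for start in range(0, len(chrom.seq) - SLICE + 1, SLICE):
        piece = str(chrom.seq[start:start + SLICE])
        if "N" in piece.upper():   # skip assembly gaps (my choice, not necessarily his)
            continue
        out.write(f">slice_{start}\n{piece}\n")

# The 2015 parameters as listed above (note the -ungapped flag).
subprocess.run([
    "blastn", "-query", "chimp_slices.fa", "-db", "human_chr1",
    "-evalue", "10", "-word_size", "11", "-max_target_seqs", "1",
    "-dust", "no", "-soft_masking", "false", "-ungapped",
    "-num_threads", "6", "-outfmt", "10 qseqid pident length",
    "-out", "tomkins_style_hits.csv",
], check=True)
```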
Immediately you can see the issue: the analysis is ungapped, allowing for no insertions or deletions. This method is typically used for extremely similar to near-identical sequences, and is not considered appropriate for full genomic comparison even within a species.
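To see why ungapped alignment is so punishing, here's a toy illustration with made-up sequences: a single inserted base shifts every downstream position in a column-by-column comparison, so the measured identity collapses even though the two sequences differ by one indel.

```python
# Toy example: one inserted base wrecks an ungapped, column-by-column comparison.
a = "ACGTACGTACGTACGTACGT"      # 20 bp fragment (made up)
b = a[:5] + "G" + a[5:-1]       # same fragment with a single base inserted

matches = sum(x == y for x, y in zip(a, b))
print(f"ungapped identity: {matches}/{len(a)} = {matches/len(a):.0%}")  # 25%
# A gapped aligner would recover the near-perfect match by placing a single 1 bp gap.
```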
The 2018 work was different. The parameters are potentially less egregious: evalue 0.1, word_size 11, outfmt 10, qseqid, qstart, qend, mismatch, gapopen, pident, nident, length, qlen, max_target_seqs 1, max_hsps 1, dust no, soft_masking false, perc_identity 50, gapopen 3, gapextend 3, num_threads 10 (although that perc_identity is rough), but the worst part has little to do with the actual comparison. As noticed again by Glenn Williamson, Tomkins does not weight his alignments. This means a 30 bp alignment that is 50% similar is weighted the same as a 30,000 bp alignment that is identical between human and chimp. This is evident in the material Tomkins published on GitHub, where it is revealed he is simply averaging the pident values.
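Here's a toy example of how much that averaging choice matters, using exactly the two hypothetical hits above (numbers invented purely for illustration):

```python
# Unweighted vs length-weighted averaging of BLAST hits.
# (alignment_length, percent_identity) pairs are invented for illustration only.
hits = [(30, 50.0), (30_000, 100.0)]

unweighted = sum(pid for _, pid in hits) / len(hits)
weighted = sum(length * pid for length, pid in hits) / sum(length for length, _ in hits)

print(f"unweighted mean pident: {unweighted:.1f}%")   # 75.0% -- the Tomkins-style average
print(f"length-weighted mean:   {weighted:.2f}%")     # ~99.95% -- what the bases actually show
```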
So, we have all been through this before. We all know his methods are bogus, and yet the numbers are repeated constantly on every YEC site around. Tomkins will occasionally try to defend his methods, suggesting his approach is the most honest and the least influenced by the evolutionary paradigm. But what I, and many others, have found very curious indeed is that Tomkins ONLY compared the human and chimp genomes using these methods. He has only EVER compared these two. What would happen if he compared a chimp and a bonobo? Or a chimp and a gorilla? Or a dog and a wolf? Or how about two humans? Seeing as his methods clearly aim to maximize differences, and his 2013 pieces certainly do so (I believe the 2018 one is more brazenly deceptive), others and I hypothesized that the above comparisons would yield much more divergent results, on par with the human/chimp disparity.
And if a chimp and a gorilla are similarly divergent, what would that say about the proposal that these apes are in the same “kind”? Suppose human/human comparisons aren’t 99.9% similar, as Ken Ham proposes, but instead 94% similar, or less? That would suggest Tomkins's methods are simply shifting the scale of every comparison, not “reducing” human/chimp similarity in any meaningful way.
So if I may, I’d like to run my work by some resident geneticists, as I have been running these exact comparisons (and more). I would like to share my methods before I release anything, to be sure that I am appropriately representing Tomkins's methods.
My methods:
I downloaded the most up-to-date version of BLAST from NCBI: Index of /blast/executables/blast+/LATEST
I downloaded some genomes from Ensembl (Ensembl genome browser 109), UCSC, and NCBI.
I am using unmasked genomes, so when only soft-masked reference genomes were found, I used a script to auto-capitalize the entire .fa file.
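For what it's worth, the "unmasking" script is nothing fancier than upper-casing the sequence lines while leaving the header lines alone; something along these lines (file names are placeholders):

```python
# Sketch: convert a soft-masked FASTA (lowercase repeats) into an unmasked one
# by upper-casing sequence lines and leaving ">" header lines untouched.
with open("genome_softmasked.fa") as src, open("genome_unmasked.fa", "w") as dst:
    for line in src:
        dst.write(line if line.startswith(">") else line.upper())
```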
Ensembl usually works, but it does not have chromosome-level assemblies for several species, which is what my executable requires. Tomkins also rails on about how panTro3 is too “humanized,” so I got panTro6 from here: Index of /goldenpath/hg38/vsPanTro6
Han population and Japanese population reference genomes come from here, and are soft-masked as far as I can tell, so I had to “unmask” them.
Han:Han1 - Genome - Assembly - NCBI
Japan: Construction of JRG (Japanese reference genome) with single-molecule real-time sequencing - PMC
I use a cmd script to generate random 300 bp sequences from a target chromosome and query them 1,000-10,000 times (depending on the run) against the same chromosome from another species using the following parameters:
-outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_hsps 1 -max_target_seqs 1
I term this the “Good” method haha.
The alignments are written to a .csv.
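In rough Python, the sampling-and-query step looks something like this. It's a sketch of the approach rather than my literal script, and the file and database names are placeholders:

```python
# Sketch: sample random 300 bp windows from one species' chromosome and BLAST them
# against the homologous chromosome of another species with the "Good" parameters.
# Requires Biopython and a pre-built blastn database; names are placeholders.
import random
import subprocess
from Bio import SeqIO

SLICE, N_QUERIES = 300, 1000
chrom = next(SeqIO.parse("speciesA_chr1.fa", "fasta"))
limit = len(chrom.seq) - SLICE

with open("random_slices.fa", "w") as out:
    written = 0
    while written < N_QUERIES:
        start = random.randint(0, limit)
        piece = str(chrom.seq[start:start + SLICE])
        if "N" in piece.upper():            # resample windows that hit assembly gaps
            continue
        out.write(f">q_{start}\n{piece}\n")
        written += 1

subprocess.run([
    "blastn", "-query", "random_slices.fa", "-db", "speciesB_chr1",
    "-outfmt", "10 qseqid qstart qend mismatch gapopen pident nident length qlen",
    "-max_hsps", "1", "-max_target_seqs", "1",
    "-out", "good_method_hits.csv",
], check=True)
```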
I also used Tomkins's exact 2013/2015 methods.
These parameters are:
-outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -evalue 10 -word_size 11 -max_target_seqs 1 -dust no -soft_masking false -ungapped
In the work he published on GitHub, oddly enough, he occasionally uses the -task blastn option, which looks like this:
-task blastn -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -evalue 10 -word_size 11 -max_target_seqs 1 -dust no -soft_masking false -ungapped
However, I stuck with the first parameter set instead of using -task blastn. I skipped his 2018 methods because those parameters matter less than how he actually averages the pident values.
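Concretely, the only thing that changes between my runs is the flag set handed to blastn. This is just my own bookkeeping of the sets described above, not anything from Tomkins's code:

```python
# The blastn flag sets compared in my runs; only these change between runs.
OUTFMT = "10 qseqid qstart qend mismatch gapopen pident nident length qlen"

PARAM_SETS = {
    "good": ["-outfmt", OUTFMT, "-max_hsps", "1", "-max_target_seqs", "1"],
    "tomkins_2013_2015": ["-outfmt", OUTFMT, "-evalue", "10", "-word_size", "11",
                          "-max_target_seqs", "1", "-dust", "no",
                          "-soft_masking", "false", "-ungapped"],
    "tomkins_task_blastn": ["-task", "blastn", "-outfmt", OUTFMT, "-evalue", "10",
                            "-word_size", "11", "-max_target_seqs", "1",
                            "-dust", "no", "-soft_masking", "false", "-ungapped"],
}
```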
The real place where my analysis diverges from his is in the number of queries. Tomkins ends up with around 30 million base pairs (300 bp × 100,000 slices) aligned for each chromosome, since he is using a monster of a computer. I only have 300,000 bp per chromosome. I do not think this limitation is too problematic: Tomkins publishes a table in his 2015 paper comparing % similarity when the 300 bp slices are sampled 10, 100, 1,000, 10,000, and 100,000 times, and the results are roughly equivalent once you reach 1,000 queries. My computer, even my lab’s computer, simply isn’t strong enough to run these analyses within a reasonable time frame.
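As a rough sanity check on whether 1,000 slices per chromosome is enough, the standard error of a mean shrinks with the square root of the sample size. A back-of-the-envelope sketch (the per-slice spread of 5 percentage points is an invented, purely illustrative number):

```python
# Back-of-the-envelope: how precise is a mean pident estimated from n random slices?
# The per-slice standard deviation (5 percentage points) is invented for illustration.
import math

sd = 5.0
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}: standard error of the mean pident ~ {sd / math.sqrt(n):.2f} points")
```

Under that assumption, 1,000 queries already pins the mean down to a fraction of a percentage point, which is consistent with the plateau in Tomkins's own table.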
So what do you think? Were these methods reasonable?