Chimpanzee Contigs and the Human Genome: 84% Similar?

Patrick · September 10, 2018, 6:19pm

Anyone want to comment on this?

swamidass · September 10, 2018, 6:28pm

Missing key controls. Obvious error too.

evograd · September 10, 2018, 10:11pm

If you mean the use of “-max_hsps 1” I’m not sure it’s an error, as Tomkins goes out of his way to use it and has done so repeatedly.

For those who don’t know, setting max_hsps to 1 means that the BLAST will only return a single alignment result per query sequence. I played around with this myself. I made 2 random sequences. The first is 140bp long:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCACTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

The second is exactly identical, but with a 17bp sequence inserted into the middle of it (bold), so it has 70bp identical to the first sequence, the 17bp insertion, and then another 70bp identical to the first sequence:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCAGCGTCATGCTTGACTGTCTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

There are a couple of ways you could calculate the identity between the 2 sequences. First, you could say they’re 100% identical with a single INDEL, but most people would probably say “140bp match, 17bp don’t, so they’re 140/157=89.2% identical”.
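As a rough check, the per-base figure can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the variable names are illustrative and do not come from any alignment tool.

```python
# Lengths taken from the two test sequences described above.
shared_bp = 140     # bases present in both sequences
inserted_bp = 17    # bases present only in sequence 2

# A gapped alignment of the two sequences has one column per base of sequence 2.
alignment_columns = shared_bp + inserted_bp  # 157

# Every inserted base is counted as one difference.
identity_per_base = shared_bp / alignment_columns
print(f"{identity_per_base:.1%}")  # 89.2%
```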

Not Tomkins though. When I used his parameters in a BLAST of sequence 1 against sequence 2, the result came out to a 100% match over 70bp, meaning that overall, sequence 1 scores as having a 50% match with sequence 2. This is because the BLAST aligned the first 70bp just fine, but then the cost to extend the gap 17bp was too much, so once the cost got too big, the alignment stopped. The other side of the gap wasn’t even considered, even though there’s 70bp of 100% identical sequence, all because of Tomkins’ use of -max_hsps 1. If I relax that to max_hsps 2, I get 2 alignment results, one representing the first 70bp (100% identity), and another representing the second 70bp (100% identity). I’d bet that if Tomkins relaxed this parameter, we’d find that his calculated percentage identity would go up significantly.
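The effect of keeping only one alignment per query can be imitated without BLAST. The sketch below is not BLAST and does no scoring: it uses Python's `difflib` to find exact matching blocks as a stand-in for HSPs, and the helper `query_coverage` is a hypothetical name introduced purely for illustration. Sequence 2 is rebuilt from sequence 1 plus the 17bp insertion.

```python
from difflib import SequenceMatcher

# Sequence 1 from the post, split into its two 70bp halves for readability.
SEQ1 = ("ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
        "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG")
INSERT = "GCGTCATGCTTGACTGT"               # the 17bp insertion
SEQ2 = SEQ1[:70] + INSERT + SEQ1[70:]      # sequence 2, 157bp


def query_coverage(query, subject, max_blocks):
    """Fraction of the query covered by the largest `max_blocks` exact-match blocks.

    Exact-match blocks are only a crude stand-in for BLAST HSPs (assumption).
    """
    matcher = SequenceMatcher(None, query, subject, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    blocks.sort(key=lambda b: b.size, reverse=True)
    return sum(b.size for b in blocks[:max_blocks]) / len(query)


one_block = query_coverage(SEQ1, SEQ2, max_blocks=1)   # analogous to -max_hsps 1
two_blocks = query_coverage(SEQ1, SEQ2, max_blocks=2)  # analogous to -max_hsps 2
print(f"{one_block:.1%} vs {two_blocks:.1%}")  # 50.0% vs 100.0%
```

Keeping one block reports half the query as matched; keeping two recovers all 140bp.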

The parameters:
-evalue 0.1 -word_size 11 -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_target_seqs 1 -max_hsps 1 -dust no -soft_masking false -perc_identity 50 -gapopen 3 -gapextend 3 -num_threads 10
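For anyone wanting to try these settings, one way to keep the quoting straight is to build the argument list in Python. This is a sketch only: the query file and database names are hypothetical placeholders, the remaining flags are copied from the list above, and nothing here actually runs BLAST.

```python
import shlex

blastn_cmd = [
    "blastn",
    "-query", "chimp_contigs.fa",   # hypothetical file name
    "-db", "human_genome_db",       # hypothetical database name
    "-evalue", "0.1",
    "-word_size", "11",
    # The whole format specifier must be passed as ONE argument.
    "-outfmt", "10 qseqid qstart qend mismatch gapopen pident nident length qlen",
    "-max_target_seqs", "1",
    "-max_hsps", "1",               # the setting under discussion
    "-dust", "no",
    "-soft_masking", "false",
    "-perc_identity", "50",
    "-gapopen", "3",
    "-gapextend", "3",
    "-num_threads", "10",
]

# Print a shell-safe version of the command for inspection.
print(shlex.join(blastn_cmd))
```

Changing the value after `-max_hsps` is the only edit needed to compare the two behaviours.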

I’m glad to see that he’s stopped using -ungapped though, that’s progress…

EDIT: I might have been too hasty. Unlike in his previous analyses, he seems to have separated out the unaligned sequences rather than counting them in his final percentage identity. On average, 34% of sequences aligned to the human genome, and those aligned sequences apparently have an identity of 84%. I need to investigate a bit further.

evograd · September 10, 2018, 10:16pm

It must be a sad existence for Tomkins at AIG; all he seems to spend his time doing is big BLASTs.

swamidass · September 10, 2018, 10:23pm

Does he reference the new Ape assemblies that do not use the human genome as a template? Improved ape genome.

T_aquaticus · September 10, 2018, 10:28pm

Large genomic comparisons aren’t my expertise. With that said, would other tools be a better fit? For example, I input your test sequences into LALIGN and selected the global alignment option and it returned the expected 89.2% without changing the rest of the default settings.

https://embnet.vital-it.ch/software/LALIGN_form.html
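The global-alignment figure can also be reproduced with a bare-bones Needleman-Wunsch sketch. This is not LALIGN's implementation: the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for simplicity, and `global_identity` is a hypothetical helper name.

```python
SEQ1 = ("ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
        "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG")
SEQ2 = SEQ1[:70] + "GCGTCATGCTTGACTGT" + SEQ1[70:]  # 17bp insertion in the middle


def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Return (matches, alignment_columns) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, counting matches and columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1      # gap in b
        else:
            j -= 1      # gap in a
    return matches, columns


matches, columns = global_identity(SEQ1, SEQ2)
print(f"{matches}/{columns} = {matches / columns:.1%}")  # 140/157 = 89.2%
```

Because the alignment is forced to span both sequences end to end, the second 70bp block cannot be dropped the way it is with a single local HSP.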

Am I wrong in thinking that BLAST is a poor tool for measuring percent identity between large genome sequences?

evograd · September 10, 2018, 10:28pm

Yes:

At the time of this publication, a new version of the chimpanzee genome has been announced (PanTro6) that was assembled completely de novo without the use of a human as a reference scaffold (Kronenberg et al. 2018). According to correspondence with UCSC genome browser staff at the time of this report, “The panTro6 assembly has not yet been reviewed by our Quality Assurance team” and is not available for public download. However, LASTZ alignments with the human genome have been performed and are available for download (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/vsPanTro6/). LASTZ is a large-scale genome alignment tool that can efficiently align chromosomal or genomic sequences millions of nucleotides in length.

evograd · September 10, 2018, 10:30pm

I’m no expert either. BLAST returns the expected 89.2% using default parameters too, it’s just Tomkins that wants to use -max_hsps 1 for some reason.

Joel_Duff · September 10, 2018, 10:51pm

Thanks so much for providing this explanation and example. It is enormously helpful!

swamidass · September 10, 2018, 10:52pm

@glipsnort, your exchange with Buggs on the Human-Chimp similarity was referenced by this article too. Amazing how that works. Tomkins cites Buggs’ analysis as independent evidence that his result is correct. That’s a bit annoying, because that took place when I was first banned from the forum. I could have helped on that one with Buggs.

Joel_Duff · September 10, 2018, 11:02pm

I’m still waiting for Tomkins to do the apples to apples work that I and many others have asked him to. First, I want him to use the same methods and compare two complete human genomes so we have some idea how much difference there is just between humans. He also needs to compare a human and Neanderthal genome and he should compare the chimp and gorilla genomes. Remember that most YECs are saying that gorillas and chimps are the same “kind”, the “ape kind”. I would love to have Tomkins produce a similarity chart of all of these together. But I doubt he is going to do that because it will undermine most of the points he thinks he is making.

John_Harshman · September 10, 2018, 11:09pm

Aside from the problem already mentioned, there’s the problem that a single indel of 17 bases counts as (at least) 17 differences, when it should count as one difference at most. The common figure of 98.7% sequence identity is for aligned sequences, which doesn’t count indels.

It’s not really kosher to include indels along with aligned sites in a single measure of genetic distance, since the two aren’t comparable. One might try it by constructing an evolutionary model in which the frequencies of indels of various sizes would be parameters alongside the frequencies of the various base changes and then use that to estimate genetic distances, but I don’t think anyone has ever done that. All the low similarity measures (<98%) have assumed that a 100-base indel counts as 100 differences.
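To see how much the counting convention matters, here is a small sketch using the 140bp/17bp test case from earlier in the thread. The three conventions and their variable names are illustrative assumptions, not standard published measures.

```python
shared_bp = 140          # aligned, identical bases in the test case
substitutions = 0        # no base changes in the test case
indel_lengths = [17]     # one insertion of 17 bases

aligned_sites = shared_bp + substitutions

# Convention 1: aligned sites only, indels ignored.
aligned_only = shared_bp / aligned_sites

# Convention 2: each indel counts as one difference, whatever its length.
per_event = shared_bp / (aligned_sites + len(indel_lengths))

# Convention 3: each indel base counts as one difference.
per_base = shared_bp / (aligned_sites + sum(indel_lengths))

print(f"{aligned_only:.1%}, {per_event:.1%}, {per_base:.1%}")  # 100.0%, 99.3%, 89.2%
```

The same pair of sequences gives three different percentages, so a figure is only meaningful when the convention is stated alongside it.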

Now, I’m not sure Tomkins even gets that far. He’s done a lot of weird things before. I believe that one of his distance measures has just counted the percentage of randomly chosen 30-base segments identical between species — no indels, no base changes — as a distance measure, which is odd in itself and really can’t be compared to any other sort of measure.

evograd · September 10, 2018, 11:15pm

My pleasure. However, as I say in the edit at the bottom of my comment, I’m a bit unclear as of yet how what I said applies to this particular analysis of Tomkins’. I know it applies to some of his previous work, for example where he estimated the human-chimp genome identity to be 70% by aligning short (500bp) sequences using -ungapped. In that case, take what I said about the 17bp INDEL in a 140bp sequence and change it to a 1bp INDEL in a 500bp sequence.

T_aquaticus · September 11, 2018, 4:24pm

As long as the differences are accurately described I don’t have a problem with using a straight base to base comparison. The chimp genome paper has the figures we need for comparisons to Tomkins’ work.

67 million bases of unique DNA in a 3 billion base genome is an additional ~2% difference stacked on top of the 1-2% difference due to substitutions. This puts the total at around 96-97% for a base to base comparison.
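A back-of-the-envelope sketch of that calculation, using only the round numbers quoted above (variable names are illustrative):

```python
unique_bases = 67e6        # bases of unique DNA, as quoted above
genome_size = 3e9          # approximate genome size, as quoted above

indel_fraction = unique_bases / genome_size
print(f"{indel_fraction:.1%}")  # 2.2%

# Substitution difference quoted as a 1-2% range.
for substitution_fraction in (0.01, 0.02):
    identity = 1 - indel_fraction - substitution_fraction
    print(f"{identity:.1%}")  # 96.8%, then 95.8%
```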

As mentioned by others, he used an ungapped analysis which means that a 300 bp comparison that differs by a single indel in the middle of the sequence would be counted as ~75% similar even though it differs by a single base.

At least this time around Tomkins appears to be including gaps, but time will tell if they were correctly accounted for.

T_aquaticus · September 11, 2018, 4:45pm

Since the past is prologue, it might be worth linking a debunking of Tomkins’ previous attempt at challenging the consensus on the human-chimp genome comparison:

It is interesting that Tomkins’ previous methods actually agreed with the consensus once the bugs were removed and gaps were accounted for properly. It would appear that Tomkins is in search of a method that will give him the results he wants.

evograd · September 11, 2018, 5:52pm

Glenn Williamson blogs at roohif.wordpress.com

He’s done quite a few analyses of creationist claims regarding genomics there, especially regarding the human chromosome 2 fusion. He also makes videos, e.g. this one, about other observed instances of telomere-telomere chromosomal fusions:

I know him pretty well - we talk about these things now and then, and he’s helped me out a lot in the past with some of the bioinformatics involved. I’ve already sent him this new article from Tomkins to have a look at.

roohif · September 11, 2018, 11:29pm

Yoohoo!!

I’ll be having a look at this today for a few hours, and hopefully I’ll have a video published within a week (which I’ll link to here as well).

John_Harshman · September 12, 2018, 12:51am

I do. Some distance measures make sense, others don’t. Percent difference in aligned sequences estimates the number of point mutations, and thus is also an estimate of time since divergence. That can’t be integrated with the number of bases of indel in any sensible way.

swamidass · September 12, 2018, 1:17am

Welcome @roohif. Glad you could join us.

Robert_Byers · September 12, 2018, 2:07am

For this YEC dude, I welcome as close a genetic likeness with primates as possible.
Humans are uniquely the only creatures who have another creature’s body plan.
This is because we are made in God’s image. Yet God’s biology has controlled options in its blueprint, so it’s impossible for us to have our own body that represents our true identity. All other creatures’ bodies show, originally, what they are in essence. So we are renting a body plan, the best one for fun and profit.
If we did not have a primate body, THEN what could possibly be another option, while staying within code?
Many YECs want to find great gaps between us and primates, but why?
We look close enough that squeezing out points one way or another makes no difference.
We have a chimp body. If we were so different, we would have a different body, the most different from any critter.
Yet we alone have the same body as another, especially in the YEC view.
Therefore it’s proof we can’t have a biological frame any more than God could have one. That should be our equation.
