# Chimpanzee Contigs and the Human Genome: 84% Similar?

Anyone want to comment on this?

Missing key controls. Obvious error too.

If you mean the use of "-max_hsps 1", I'm not sure it's an error, as Tomkins goes out of his way to use it and has done so repeatedly.

For those who don't know, setting max_hsps to 1 means that BLAST will only return a single alignment (HSP) for each query against a given subject sequence. I played around with this myself. I made 2 random sequences. The first is 140bp long:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCACTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

The second is identical, but with a 17bp sequence (GCGTCATGCTTGACTGT) inserted into the middle of it, so it has 70bp identical to the first sequence, the 17bp insertion, and then another 70bp identical to the first sequence:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCAGCGTCATGCTTGACTGTCTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

There are a couple of ways you could calculate the identity between the 2 sequences. First, you could say they're 100% identical with a single INDEL, but most people would probably say "140bp match, 17bp don't, so they're 140/157=89.2% identical".

Not Tomkins though. When I used his parameters in a BLAST of sequence 1 against sequence 2, the result came out to a 100% match over 70bp, meaning that overall, sequence 1 scores as having a 50% match with sequence 2. This is because the BLAST aligned the first 70bp just fine, but then the cost to extend the gap 17bp was too much, so once the cost got too big, the alignment stopped. The other side of the gap wasn't even considered, even though there's 70bp of 100% identical sequence, all because of Tomkins' use of -max_hsps 1. If I relax that to max_hsps 2, I get 2 alignment results, one representing the first 70bp (100% identity), and another representing the second 70bp (100% identity). I'd bet that if Tomkins relaxed this parameter, we'd find that his calculated percentage identity would go up significantly.
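To make the effect concrete, here is a toy Python sketch. It is not BLAST: it simply treats every maximal run of exact matches on any diagonal as an "HSP" (keeping runs of at least 11 bases, echoing the -word_size value in the parameters), then reports how much of the query is covered when only the best one or the best two runs are kept. The function names and the exact-match-only rule are assumptions made for illustration; real BLAST also scores mismatches and gaps.

```python
# Toy model of "keep only N HSPs"; NOT a reimplementation of BLAST.
SEQ1 = ("ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
        "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG")
INSERT = "GCGTCATGCTTGACTGT"                  # the 17bp insertion
SEQ2 = SEQ1[:70] + INSERT + SEQ1[70:]         # 70bp + 17bp + 70bp = 157bp


def exact_match_segments(query, subject, min_len=11):
    """Return (q_start, s_start, length) for each maximal exact-match run
    of at least min_len bases on any diagonal, longest first."""
    segments = []
    for offset in range(-(len(query) - 1), len(subject)):
        q, s = max(0, -offset), max(0, offset)
        run = 0
        while q < len(query) and s < len(subject):
            if query[q] == subject[s]:
                run += 1
            else:
                if run >= min_len:
                    segments.append((q - run, s - run, run))
                run = 0
            q += 1
            s += 1
        if run >= min_len:                    # run that reaches the end
            segments.append((q - run, s - run, run))
    return sorted(segments, key=lambda seg: -seg[2])


def query_coverage(segments, max_hsps, query_len):
    """Fraction of query positions covered by the top max_hsps segments."""
    covered = set()
    for q_start, _s_start, length in segments[:max_hsps]:
        covered.update(range(q_start, q_start + length))
    return len(covered) / query_len


segments = exact_match_segments(SEQ1, SEQ2)
for n in (1, 2):
    pct = 100 * query_coverage(segments, n, len(SEQ1))
    print(f"top {n} segment(s): {pct:.0f}% of the query covered")
```

With the two sequences above, keeping one segment covers 50% of the query and keeping two covers all of it, which mirrors the max_hsps 1 versus max_hsps 2 behaviour described in the post.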

The parameters:
-evalue 0.1 -word_size 11 -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_target_seqs 1 -max_hsps 1 -dust no -soft_masking false -perc_identity 50 -gapopen 3 -gapextend 3 -num_threads 10

I'm glad to see that he's stopped using -ungapped though, that's progress...

EDIT: I might have been too hasty. Unlike in his previous analyses, he seems to have separated out the unaligned sequences and not counted them in his final percentage identity. On average, 34% of his sequences aligned to the human genome, and those aligned sequences apparently have an identity of 84%. I need to investigate a bit further.

It must be a sad existence for Tomkins at AIG; all he seems to spend his time doing is big BLASTs.

Does he reference the new Ape assemblies that do not use the human genome as a template? Improved ape genome.

Large genomic comparisons aren't my expertise. With that said, would other tools be a better fit? For example, I input your test sequences into LALIGN and selected the global alignment option, and it returned the expected 89.2% without changing the rest of the default settings.
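For anyone who wants to see why a global alignment lands on 89.2%, below is a minimal Needleman-Wunsch sketch in Python. It is not LALIGN's algorithm or scoring: the +1 match, -1 mismatch and -1 per-gap-base scores are arbitrary values chosen for illustration, and the function name is made up. Identity is reported as matches divided by alignment columns.

```python
# Plain global alignment (Needleman-Wunsch) with a linear gap penalty.
# Scores are illustrative assumptions, not LALIGN or BLAST defaults.
SEQ1 = ("ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
        "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG")
SEQ2 = SEQ1[:70] + "GCGTCATGCTTGACTGT" + SEQ1[70:]


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Return (matches, alignment_columns) for one optimal global alignment."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back along one optimal path, counting matches and columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1                      # base in a aligned to a gap
        else:
            j -= 1                      # base in b aligned to a gap
        columns += 1
    return matches, columns


matches, columns = global_identity(SEQ1, SEQ2)
print(f"{matches}/{columns} = {100 * matches / columns:.1f}% identity")
```

Because a global alignment has to span both sequences end to end, the 17bp insertion shows up as gap columns inside one alignment instead of splitting the match into two separate local hits.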

https://embnet.vital-it.ch/software/LALIGN_form.html

Am I wrong in thinking that BLAST is a poor tool for measuring percent identity between large genome sequences?

Yes:

At the time of this publication, a new version of the chimpanzee genome has been announced (PanTro6) that was assembled completely de novo without the use of a human as a reference scaffold (Kronenberg et al. 2018). According to correspondence with UCSC genome browser staff at the time of this report, "The panTro6 assembly has not yet been reviewed by our Quality Assurance team" and is not available for public download. However, LASTZ alignments with the human genome have been performed and are available for download (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/vsPanTro6/). LASTZ is a large-scale genome alignment tool that can efficiently align chromosomal or genomic sequences millions of nucleotides in length.

I'm no expert either. BLAST returns the expected 89.2% using default parameters too; it's just Tomkins that wants to use -max_hsps 1 for some reason.

Thanks so much for providing this explanation and example. It is enormously helpful!

@glipsnort, your exchange with Buggs on the human-chimp similarity was referenced by this article too. Amazing how that works. Tomkins cites Buggs' analysis as independent evidence that his result is correct. That's a bit annoying, because that exchange took place when I was first banned from the forum. I could have helped on that one with Buggs.

I'm still waiting for Tomkins to do the apples-to-apples work that I and many others have asked him to do. First, I want him to use the same methods to compare two complete human genomes, so we have some idea how much difference there is just between humans. He also needs to compare a human and a Neanderthal genome, and he should compare the chimp and gorilla genomes. Remember that most YECs are saying that gorillas and chimps are the same "kind", the "ape kind". I would love to have Tomkins produce a similarity chart of all of these together. But I doubt he is going to do that, because it will undermine most of the points he thinks he is making.

Aside from the problem already mentioned, there's the problem that a single indel of 17 bases counts as (at least) 17 differences, when it should count as one difference at most. The common figure of 98.7% sequence identity is for aligned sequences, which doesn't count indels.

It's not really kosher to include indels along with aligned sites in a single measure of genetic distance, since the two aren't comparable. One might try it by constructing an evolutionary model in which the frequencies of indels of various sizes would be parameters alongside the frequencies of the various base changes and then use that to estimate genetic distances, but I don't think anyone has ever done that. All the low similarity measures (<98%) have assumed that a 100-base indel counts as 100 differences.
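The gap between these ways of counting is easy to see on a small example. The Python sketch below takes a gapped pairwise alignment and reports three figures: identity over aligned sites only, identity when every indel base counts as a difference, and identity when each indel counts as a single event. The function name, the dictionary keys, and the stand-in sequences (70 matching bases, a 17-base indel, 70 matching bases) are all assumptions made for illustration, not anyone's published method.

```python
def identity_measures(aligned_a, aligned_b):
    """Compare two equal-length gapped sequences ('-' marks a gap)."""
    assert len(aligned_a) == len(aligned_b)
    matches = mismatches = gap_columns = gap_events = 0
    prev_gap_in = None                  # which sequence was gapped last column
    for x, y in zip(aligned_a, aligned_b):
        gap_in = "a" if x == "-" else "b" if y == "-" else None
        if gap_in:
            gap_columns += 1
            if gap_in != prev_gap_in:   # a new run of gaps is one indel event
                gap_events += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
        prev_gap_in = gap_in
    aligned = matches + mismatches
    return {
        "aligned_only": matches / aligned,
        "indel_per_base": matches / (aligned + gap_columns),
        "indel_per_event": matches / (aligned + gap_events),
    }


# Hypothetical stand-in: 70 matching bases, a 17-base indel, 70 matching bases.
flank = "ACGT" * 35
aligned_a = flank[:70] + "-" * 17 + flank[70:]
aligned_b = flank[:70] + "G" * 17 + flank[70:]
for name, value in identity_measures(aligned_a, aligned_b).items():
    print(f"{name}: {100 * value:.1f}%")
```

On this stand-in the same alignment comes out at 100% over aligned sites, 99.3% with the indel counted once, and 89.2% with every indel base counted, which is why the choice of measure has to be stated before any two figures are compared.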

Now, I'm not sure Tomkins even gets that far. He's done a lot of weird things before. I believe that one of his distance measures has just counted the percentage of randomly chosen 30-base segments identical between species (no indels, no base changes) as a distance measure, which is odd in itself and really can't be compared to any other sort of measure.

My pleasure. However, as I say in the edit at the bottom of my comment, I'm a bit unclear as of yet how what I said applies to this particular analysis of Tomkins'. I know it applies to some of his previous work, for example where he estimated the human-chimp genome identity to be 70% by aligning short (500bp) sequences using -ungapped. In that case, take what I said about the 17bp INDEL in a 140bp sequence and change it to a 1bp INDEL in a 500bp sequence.

As long as the differences are accurately described, I don't have a problem with using a straight base-to-base comparison. The chimp genome paper has the figures we need for comparisons to Tomkins' work.

67 million bases of unique DNA in a 3 billion base genome is an additional ~2% difference stacked on top of the 1-2% difference due to substitutions. This puts the total at around 96-97% for a base to base comparison.
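The arithmetic behind that estimate can be written out in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the indel and substitution percentages can simply be subtracted from 100, and the function name is invented for the example.

```python
def base_to_base_identity(unique_bases, genome_size, substitution_pct):
    """Percent identity after subtracting indel and substitution differences."""
    indel_pct = 100 * unique_bases / genome_size
    return 100 - indel_pct - substitution_pct


indel_pct = 100 * 67e6 / 3e9
print(f"indel difference: {indel_pct:.1f}%")
for substitution_pct in (1.0, 2.0):
    total = base_to_base_identity(67e6, 3e9, substitution_pct)
    print(f"substitutions {substitution_pct:.0f}% -> identity {total:.1f}%")
```

The indel term works out to a little over 2%, so the total lands at roughly 96 to 97% depending on which end of the 1-2% substitution range is used.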

As mentioned by others, he used an ungapped analysis which means that a 300 bp comparison that differs by a single indel in the middle of the sequence would be counted as ~75% similar even though it differs by a single base.

At least this time around Tomkins appears to be including gaps, but time will tell if they were correctly accounted for.

Since the past is prologue, it might be worth linking a debunking of Tomkins' previous attempt at challenging the consensus on the human-chimp genome comparison:

It is interesting that Tomkins' previous methods actually agreed with the consensus once the bugs were removed and gaps were accounted for properly. It would appear that Tomkins is in search of a method that will give him the results he wants.

Glenn Williamson blogs at roohif.wordpress.com

He's done quite a few analyses of creationist claims regarding genomics there, especially regarding the human chromosome 2 fusion. He also makes videos, e.g. this one, about other observed instances of telomere-telomere chromosomal fusions:

I know him pretty well - we talk about these things now and then, and he's helped me out a lot in the past with some of the bioinformatics involved. I've already sent him this new article from Tomkins to have a look at.

Yoohoo!!

I'll be having a look at this today for a few hours, and hopefully have a video published within a week (which I'll link to here as well).

I do. Some distance measures make sense, others don't. Percent difference in aligned sequences estimates the number of point mutations, and thus is also an estimate of time since divergence. That can't be integrated with the number of bases of indel in any sensible way.