Chimpanzee Contigs and the Human Genome: 84% Similar?


(Dr. Patrick Trischitta) #1

Anyone want to comment on this?


(S. Joshua Swamidass) #2

Missing key controls. Obvious error too.


(Blogging Graduate Student) #3

If you mean the use of “-max_hsps 1” I’m not sure it’s an error, as Tomkins goes out of his way to use it and has done so repeatedly.

For those who don’t know, setting max_hsps to 1 means that the BLAST will only return a single alignment result per query sequence. I played around with this myself. I made 2 random sequences. The first is 140bp long:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCACTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

The second is exactly identical, but with a 17bp sequence (GCGTCATGCTTGACTGT) inserted into the middle of it, so it has 70bp identical to the first sequence, the 17bp insertion, and then another 70bp identical to the first sequence:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCAGCGTCATGCTTGACTGTCTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG

There are a couple of ways you could calculate the identity between the 2 sequences. First, you could say they’re 100% identical with a single INDEL, but most people would probably say “140bp match, 17bp don’t, so they’re 140/157=89.2% identical”.

Not Tomkins though. When I used his parameters in a BLAST of sequence 1 against sequence 2, the result came out to a 100% match over 70bp, meaning that overall, sequence 1 scores as having a 50% match with sequence 2. This is because the BLAST aligned the first 70bp just fine, but then the cost to extend the gap 17bp was too much, so once the cost got too big, the alignment stopped. The other side of the gap wasn’t even considered, even though there’s 70bp of 100% identical sequence, all because of Tomkins’ use of -max_hsps 1. If I relax that to max_hsps 2, I get 2 alignment results, one representing the first 70bp (100% identity), and another representing the second 70bp (100% identity). I’d bet that if Tomkins relaxed this parameter, we’d find that his calculated percentage identity would go up significantly.
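To make the effect of -max_hsps concrete, here is a rough sketch in Python. It is a toy model, not BLAST: it assumes each HSP is simply one of the two exactly matching 70bp blocks on either side of the insertion, and the helper names (common_prefix_len, exact_blocks, query_coverage) are made up for illustration.

```python
# Toy model only: real BLAST builds HSPs by seeded, scored extension.
# Here an "HSP" is just an exactly matching block before or after the insertion.
FIRST_HALF = "ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
SECOND_HALF = "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG"
INSERT = "GCGTCATGCTTGACTGT"  # the 17bp insertion from the post

query = FIRST_HALF + SECOND_HALF             # sequence 1 (140bp)
subject = FIRST_HALF + INSERT + SECOND_HALF  # sequence 2 (157bp)


def common_prefix_len(a, b):
    """Number of leading positions at which a and b are identical."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def exact_blocks(query, subject):
    """Lengths of the identical block at the start and at the end of the query."""
    prefix = common_prefix_len(query, subject)
    suffix = common_prefix_len(query[::-1], subject[::-1])
    # Do not count any query base twice.
    suffix = min(suffix, len(query) - prefix)
    return [prefix, suffix]


def query_coverage(blocks, query_len, max_hsps):
    """Fraction of the query covered when only the longest max_hsps blocks are kept."""
    kept = sorted(blocks, reverse=True)[:max_hsps]
    return sum(kept) / query_len


blocks = exact_blocks(query, subject)
print("blocks:", blocks)
print("max_hsps=1:", query_coverage(blocks, len(query), 1))
print("max_hsps=2:", query_coverage(blocks, len(query), 2))
```

Under this simplification, keeping one block covers half of the query and keeping two covers all of it, which mirrors the 50% versus two-times-70bp result described above.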

The parameters:
-evalue 0.1 -word_size 11 -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_target_seqs 1 -max_hsps 1 -dust no -soft_masking false -perc_identity 50 -gapopen 3 -gapextend 3 -num_threads 10
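For anyone who wants to poke at output in that format: -outfmt 10 is comma-separated, with columns in the order given in the format string. The sketch below reads such rows in Python and reports how much of each query the single reported alignment covers. The SAMPLE row is hypothetical, written to mirror the 70bp result described above; it is not taken from any real run, and summarize is an illustrative helper name.

```python
import csv
import io

# Column order taken from the -outfmt string in the parameters.
FIELDS = ["qseqid", "qstart", "qend", "mismatch", "gapopen",
          "pident", "nident", "length", "qlen"]

# Hypothetical row: a 140bp query with one 70bp alignment at 100% identity.
SAMPLE = "seq1,1,70,0,0,100.000,70,70,140\n"


def summarize(csv_text):
    """Return per-row identity within the alignment and over the whole query."""
    rows = csv.DictReader(io.StringIO(csv_text), fieldnames=FIELDS)
    summary = []
    for row in rows:
        qlen = int(row["qlen"])
        summary.append({
            "qseqid": row["qseqid"],
            # Identity inside the reported alignment only.
            "pident": float(row["pident"]),
            # Fraction of the query that the alignment spans.
            "query_covered": (int(row["qend"]) - int(row["qstart"]) + 1) / qlen,
            # Identical bases divided by the full query length.
            "ident_over_query": int(row["nident"]) / qlen,
        })
    return summary


result = summarize(SAMPLE)
for entry in result:
    print(entry)
```

The point of keeping pident and ident_over_query separate is that the first ignores everything outside the alignment, while the second penalises the unaligned half of the query.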

I’m glad to see that he’s stopped using -ungapped though, that’s progress…

EDIT: I might have been too hasty. Unlike in his previous analyses, he seems to have separated out the unaligned sequences, not counting them in his final percentage identity. On average, 34% of his sequences aligned to the human genome, and those aligned sequences apparently have an identity of 84%. I need to investigate a bit further.


(Blogging Graduate Student) #4

It must be a sad existence for Tomkins at AIG; all he seems to spend his time doing is big BLASTs.


(S. Joshua Swamidass) #5

Does he reference the new Ape assemblies that do not use the human genome as a template? Improved ape genome.


#6

Large genomic comparisons aren’t my expertise. With that said, would other tools be a better fit? For example, I input your test sequences into LALIGN and selected the global alignment option and it returned the expected 89.2% without changing the rest of the default settings.

https://embnet.vital-it.ch/software/LALIGN_form.html
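The global-alignment figure can be reproduced with a small Needleman-Wunsch sketch in Python. This is not LALIGN's algorithm or scoring: it assumes a simple linear gap penalty (match +1, mismatch -1, gap -1 per base), whereas LALIGN uses its own defaults, and global_identity is an illustrative name.

```python
FIRST_HALF = "ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
SECOND_HALF = "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG"
INSERT = "GCGTCATGCTTGACTGT"

seq1 = FIRST_HALF + SECOND_HALF
seq2 = FIRST_HALF + INSERT + SECOND_HALF


def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch with a linear gap penalty (an assumed scoring scheme).

    Returns (identical columns, total alignment columns) for one optimal alignment.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, counting columns and matches.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                matches += int(a[i - 1] == b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return matches, columns


matches, columns = global_identity(seq1, seq2)
print(f"{matches}/{columns} = {100 * matches / columns:.1f}% identity")
```

With this scoring, every optimal alignment of the two test sequences has all 140 bases of sequence 1 matched and 17 gap columns, so the identity is 140/157 regardless of exactly where the traceback places the gaps.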

Am I wrong in thinking that BLAST is a poor tool for measuring percent identity between large genome sequences?


(Blogging Graduate Student) #7

Yes:

At the time of this publication, a new version of the chimpanzee genome has been announced (PanTro6) that was assembled completely de novo without the use of a human as a reference scaffold (Kronenberg et al. 2018). According to correspondence with UCSC genome browser staff at the time of this report, “The panTro6 assembly has not yet been reviewed by our Quality Assurance team” and is not available for public download. However, LASTZ alignments with the human genome have been performed and are available for download (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/vsPanTro6/). LASTZ is a large-scale genome alignment tool that can efficiently align chromosomal or genomic sequences millions of nucleotides in length.


(Blogging Graduate Student) #8

I’m no expert either. BLAST returns the expected 89.2% using default parameters too, it’s just Tomkins that wants to use -max_hsps 1 for some reason.


(Joel Duff) #9

Thanks so much for providing this explanation and example. It is enormously helpful!


(S. Joshua Swamidass) #10

@glipsnort, your exchange with Buggs on the Human-Chimp similarity was referenced by this article too. Amazing how that works. Tomkins cites Buggs’ analysis as independent evidence that his result is correct. That’s a bit annoying, because that took place when I was first banned from the forum. I could have helped on that one with Buggs.


(Joel Duff) #11

I’m still waiting for Tomkins to do the apples to apples work that I and many others have asked him to. First, I want him to use the same methods and compare two complete human genomes so we have some idea how much difference there is just between humans. He also needs to compare a human and Neanderthal genome and he should compare the chimp and gorilla genomes. Remember that most YECs are saying that gorillas and chimps are the same “kind”, the “ape kind”. I would love to have Tomkins produce a similarity chart of all of these together. But I doubt he is going to do that because it will undermine most of the points he thinks he is making.


(John Harshman) #12

Aside from the problem already mentioned, there’s the problem that a single indel of 17 bases counts as (at least) 17 differences, when it should count as one difference at most. The common figure of 98.7% sequence identity is for aligned sequences, which doesn’t count indels.

It’s not really kosher to include indels along with aligned sites in a single measure of genetic distance, since the two aren’t comparable. One might try it by constructing an evolutionary model in which the frequencies of indels of various sizes would be parameters alongside the frequencies of the various base changes and then use that to estimate genetic distances, but I don’t think anyone has ever done that. All the low similarity measures (<98%) have assumed that a 100-base indel counts as 100 differences.
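To show how much the choice of convention matters, here is a small Python sketch applied to the toy 140bp/157bp example from earlier in the thread (140 matching columns, no mismatches, one 17-base indel). The function name identity_measures and the three labels are illustrative, not standard terminology.

```python
def identity_measures(matches, mismatches, gap_bases, gap_events):
    """Percent identity under three different ways of treating indels."""
    aligned_sites = matches + mismatches
    return {
        # Every gapped base counts as one difference.
        "indel_per_base": 100 * matches / (aligned_sites + gap_bases),
        # Each indel counts as a single difference, whatever its length.
        "indel_per_event": 100 * matches / (aligned_sites + gap_events),
        # Indels ignored; only aligned sites are compared.
        "aligned_sites_only": 100 * matches / aligned_sites,
    }


# Toy example: 140 matches, 0 mismatches, one indel of 17 bases.
measures = identity_measures(matches=140, mismatches=0, gap_bases=17, gap_events=1)
for name, value in measures.items():
    print(f"{name}: {value:.1f}%")
```

The same pair of sequences comes out anywhere from about 89% to 100% identical depending only on the bookkeeping, which is why figures computed under different conventions cannot be compared directly.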

Now, I’m not sure Tomkins even gets that far. He’s done a lot of weird things before. I believe that one of his distance measures has just counted the percentage of randomly chosen 30-base segments identical between species — no indels, no base changes — as a distance measure, which is odd in itself and really can’t be compared to any other sort of measure.
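As a back-of-the-envelope illustration of why counting identical segments is not comparable to per-base identity, here is a Python sketch. It assumes, purely for illustration, that differences fall independently at each base (real genomes are not that tidy), and it uses the 98.7% figure mentioned above; segment_identical_fraction is a made-up name.

```python
def segment_identical_fraction(per_base_identity, segment_length):
    """Expected fraction of segments with no differences at all,
    assuming differences occur independently at each base."""
    return per_base_identity ** segment_length


fraction = segment_identical_fraction(0.987, 30)
print(f"Expected fraction of identical 30-base segments: {fraction:.3f}")
```

Under that assumption only about two thirds of 30-base segments would be perfectly identical even though the sequences are 98.7% identical base for base, so the segment count says little about per-base similarity.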


(Blogging Graduate Student) #13

My pleasure. However, as I say in the edit at the bottom of my comment, I’m still a bit unclear on how what I said applies to this particular analysis of Tomkins’. I know it applies to some of his previous work, for example where he estimated the human-chimp genome identity to be 70% by aligning short (500bp) sequences using -ungapped. In that case, take what I said about the 17bp INDEL in a 140bp sequence and change it to a 1bp INDEL in a 500bp sequence.


#14

As long as the differences are accurately described I don’t have a problem with using a straight base to base comparison. The chimp genome paper has the figures we need for comparisons to Tomkins’ work.

67 million bases of unique DNA in a 3 billion base genome is an additional ~2% difference stacked on top of the 1-2% difference due to substitutions. This puts the total at around 96-97% for a base to base comparison.
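The arithmetic behind that estimate, sketched in Python. The inputs are the round numbers quoted above (67 million bases, a 3 billion base genome, 1-2% substitutions), not values pulled from the paper itself.

```python
unique_bases = 67e6   # bases of unique DNA, as quoted above
genome_size = 3e9     # approximate genome size, as quoted above

indel_pct = 100 * unique_bases / genome_size
print(f"Difference from unique DNA: {indel_pct:.1f}%")

# Stack the indel difference on top of the quoted 1-2% substitution difference.
for substitution_pct in (1.0, 2.0):
    total_identity = 100 - indel_pct - substitution_pct
    print(f"Substitutions {substitution_pct:.0f}% -> base-to-base identity {total_identity:.1f}%")
```

That lands just under 96% to just under 97%, consistent with the rough 96-97% range given above.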

As mentioned by others, he used an ungapped analysis which means that a 300 bp comparison that differs by a single indel in the middle of the sequence would be counted as ~75% similar even though it differs by a single base.

At least this time around Tomkins appears to be including gaps, but time will tell if they were correctly accounted for.


#15

Since the past is prologue, it might be worth linking a debunking of Tomkins’ previous attempt at challenging the consensus on the human-chimp genome comparison:

It is interesting that Tomkins’ previous methods actually agreed with the consensus once the bugs were removed and gaps were accounted for properly. It would appear that Tomkins is in search of a method that will give him the results he wants.


(Blogging Graduate Student) #16

Glenn Williamson blogs at roohif.wordpress.com

He’s done quite a few analyses of creationist claims regarding genomics there, especially regarding the human chromosome 2 fusion. He also makes videos, e.g. this one, about other observed instances of telomere-telomere chromosomal fusions:

I know him pretty well - we talk about these things now and then, and he’s helped me out a lot in the past with some of the bioinformatics involved. I’ve already sent him this new article from Tomkins to have a look at.


(Glenn Williamson) #17

Yoohoo!!

I’ll be having a look at this today for a few hours, and hopefully I’ll have a video published within a week (which I’ll link to here as well).


(John Harshman) #18

I do. Some distance measures make sense, others don’t. Percent difference in aligned sequences estimates the number of point mutations, and thus is also an estimate of time since divergence. That can’t be integrated with the number of bases of indel in any sensible way.


(S. Joshua Swamidass) #19

Welcome @roohif. Glad you could join us.


(Robert Byers) #20

For this YEC dude, I welcome as close a genetic likeness with primates as possible.
Humans are the only creatures who have another creature’s body plan.
This is because we are made in God’s image. Yet God’s biology has controlled options in its blueprint, so it’s impossible for us to have our own body that represents our true identity. All other creatures’ bodies show, originally, what they are in essence. So we are renting a body plan, the best one for fun and profit.
If we did not have a primate body, THEN what could possibly be another option, while staying within code?
Many YECs want to find great gaps between us and primates, but why?
We look close enough that squeezing out points one way or another makes no difference.
We have a chimp body. If we were so different, we would have a different body, the most different from any critter.
Yet we alone have the same body as another creature, especially in the YEC view.
Therefore it’s proof we can’t have a biological frame any more than God could have one. That should be our equation.