Anyone want to comment on this?
Missing key controls. Obvious error too.
If you mean the use of "-max_hsps 1", I'm not sure it's an error, as Tomkins goes out of his way to use it and has done so repeatedly.
For those who don't know, setting max_hsps to 1 means that the BLAST will only return a single alignment result per query sequence. I played around with this myself. I made 2 random sequences. The first is 140bp long:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCACTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG
The second is identical, but with a 17bp sequence (GCGTCATGCTTGACTGT) inserted into the middle of it, so it has 70bp identical to the first sequence, the 17bp insertion, and then another 70bp identical to the first sequence:
ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCAGCGTCATGCTTGACTGTCTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG
There are a couple of ways you could calculate the identity between the 2 sequences. First, you could say they're 100% identical with a single INDEL, but most people would probably say "140bp match, 17bp don't, so they're 140/157=89.2% identical".
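For anyone who wants to check that arithmetic, here is a minimal Python sketch. It rebuilds sequence 2 from sequence 1 plus the 17bp insertion and computes the "140 out of 157" figure. The variable names are my own, and this is plain counting, not an aligner.

```python
# Sequence 1 from the post, split into its two 70bp halves.
FIRST_HALF = "ATCCTCTTTGAAAAGAAGGACAAACAACAACAATTAAGTACTTGTTTTGTGCAATATACTGTGATAGGCA"
SECOND_HALF = "CTGGTGGTACAAAGACAAAAATAAGATAGTCCCTGCCCTCAAGGAATTTACATTCAACTAGGGAGGATAG"
INSERT = "GCGTCATGCTTGACTGT"  # the 17bp insertion

seq1 = FIRST_HALF + SECOND_HALF
seq2 = FIRST_HALF + INSERT + SECOND_HALF

# In a gapped global alignment every base of seq1 matches seq2,
# and the insertion shows up as 17 gap columns.
matches = len(seq1)
alignment_length = len(seq2)
pct_identity = 100.0 * matches / alignment_length

print(len(seq1), len(seq2), round(pct_identity, 1))
```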
Not Tomkins though. When I used his parameters in a BLAST of sequence 1 against sequence 2, the result came out to a 100% match over 70bp, meaning that overall, sequence 1 scores as having a 50% match with sequence 2. This is because the BLAST aligned the first 70bp just fine, but then the cost to extend the gap 17bp was too much, so once the cost got too big, the alignment stopped. The other side of the gap wasn't even considered, even though there's 70bp of 100% identical sequence, all because of Tomkins' use of -max_hsps 1. If I relax that to -max_hsps 2, I get 2 alignment results, one representing the first 70bp (100% identity), and another representing the second 70bp (100% identity). I'd bet that if Tomkins relaxed this parameter, we'd find that his calculated percentage identity would go up significantly.
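To make the effect of the cap concrete, here is a toy Python model. It is not BLAST and does no alignment. It just takes the two 70bp hits described above as given, keeps the top `max_hsps` of them, and divides the matched bases by the query length. The function name `query_identity` and the (length, matches) tuples are hypothetical, and it assumes the hits don't overlap.

```python
def query_identity(hsps, query_len, max_hsps):
    """Percent of the query covered by identical bases, using only the
    top `max_hsps` hits. Each hit is a (length, matches) tuple and the
    hits are assumed not to overlap on the query."""
    kept = sorted(hsps, key=lambda hit: hit[1], reverse=True)[:max_hsps]
    matched = sum(matches for _length, matches in kept)
    return 100.0 * matched / query_len

# The two 70bp, 100% identical hits on either side of the insertion.
hits = [(70, 70), (70, 70)]

one_hsp = query_identity(hits, query_len=140, max_hsps=1)
two_hsps = query_identity(hits, query_len=140, max_hsps=2)
print(one_hsp, two_hsps)
```

With one hit kept, half of the query is simply never counted, which is where the 50% comes from.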
The parameters:
-evalue 0.1 -word_size 11 -outfmt "10 qseqid qstart qend mismatch gapopen pident nident length qlen" -max_target_seqs 1 -max_hsps 1 -dust no -soft_masking false -perc_identity 50 -gapopen 3 -gapextend 3 -num_threads 10
I'm glad to see that he's stopped using -ungapped though, that's progress...
EDIT: I might have been too hasty. Unlike in his previous analyses, he seems to have separated out the unaligned sequences and not counted them in his final percentage identity. He has an average of 34% of sequences aligned to the human genome, and those aligned sequences apparently have an identity of 84%. I need to investigate a bit further.
It must be a sad existence for Tomkins at AIG; all he seems to spend his time doing is big BLASTs.
Does he reference the new Ape assemblies that do not use the human genome as a template? Improved ape genome.
Large genomic comparisons aren't my expertise. With that said, would other tools be a better fit? For example, I input your test sequences into LALIGN and selected the global alignment option and it returned the expected 89.2% without changing the rest of the default settings.
https://embnet.vital-it.ch/software/LALIGN_form.html
Am I wrong in thinking that BLAST is a poor tool for measuring percent identity between large genome sequences?
Yes:
At the time of this publication, a new version of the chimpanzee genome has been announced (PanTro6) that was assembled completely de novo without the use of a human as a reference scaffold (Kronenberg et al. 2018). According to correspondence with UCSC genome browser staff at the time of this report, "The panTro6 assembly has not yet been reviewed by our Quality Assurance team" and is not available for public download. However, LASTZ alignments with the human genome have been performed and are available for download (ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/vsPanTro6/). LASTZ is a large-scale genome alignment tool that can efficiently align chromosomal or genomic sequences millions of nucleotides in length.
I'm no expert either. BLAST returns the expected 89.2% using default parameters too, it's just Tomkins that wants to use -max_hsps 1 for some reason.
Thanks so much for providing this explanation and example. It is enormously helpful!
@glipsnort, your exchange with Buggs on the Human-Chimp similarity was referenced by this article too. Amazing how that works. Tomkins cites Buggs' analysis as independent evidence that his result is correct. That's a bit annoying, because that took place when I was first banned from the forum. I could have helped on that one with Buggs.
I'm still waiting for Tomkins to do the apples to apples work that I and many others have asked him to. First, I want him to use the same methods and compare two complete human genomes so we have some idea how much difference there is just between humans. He also needs to compare a human and Neanderthal genome and he should compare the chimp and gorilla genomes. Remember that most YECs are saying that gorillas and chimps are the same "kind", the "ape kind". I would love to have Tomkins produce a similarity chart of all of these together. But I doubt he is going to do that because it will undermine most of the points he thinks he is making.
Aside from the problem already mentioned, there's the problem that a single indel of 17 bases counts as (at least) 17 differences, when it should count as one difference at most. The common figure of 98.7% sequence identity is for aligned sequences, which doesn't count indels.
It's not really kosher to include indels along with aligned sites in a single measure of genetic distance, since the two aren't comparable. One might try it by constructing an evolutionary model in which the frequencies of indels of various sizes would be parameters alongside the frequencies of the various base changes and then use that to estimate genetic distances, but I don't think anyone has ever done that. All the low similarity measures (<98%) have assumed that a 100-base indel counts as 100 differences.
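To show how much the counting convention matters, here is a rough Python sketch using the 140bp/17bp example from earlier in the thread. The three conventions and the helper `percent_identity` are my own framing for illustration, not anyone's published method.

```python
def percent_identity(matches, differences):
    """Identity as matches over matches plus counted differences."""
    return 100.0 * matches / (matches + differences)

matches = 140     # bases that line up and agree
indel_length = 17 # a single insertion event, 17 bases long

per_base = percent_identity(matches, indel_length)  # indel = 17 differences
per_event = percent_identity(matches, 1)            # indel = 1 difference
aligned_only = percent_identity(matches, 0)         # indel not counted

print(round(per_base, 1), round(per_event, 1), aligned_only)
```

Same two sequences, three quite different numbers, which is why figures computed under different conventions can't be compared directly.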
Now, I'm not sure Tomkins even gets that far. He's done a lot of weird things before. I believe that one of his distance measures has just counted the percentage of randomly chosen 30-base segments identical between species (no indels, no base changes) as a distance measure, which is odd in itself and really can't be compared to any other sort of measure.
My pleasure. However, as I say in the edit at the bottom of my comment, I'm a bit unclear as of yet how what I said applies to this particular analysis of Tomkins'. I know it applies to some of his previous work, for example where he estimated the human-chimp genome identity to be 70% by aligning short (500bp) sequences using -ungapped. In that case, take what I said about the 17bp INDEL in a 140bp sequence and change it to a 1bp INDEL in a 500bp sequence.
As long as the differences are accurately described I don't have a problem with using a straight base to base comparison. The chimp genome paper has the figures we need for comparisons to Tomkins' work.
67 million bases of unique DNA in a 3 billion base genome is an additional ~2% difference stacked on top of the 1-2% difference due to substitutions. This puts the total at around 96-97% for a base to base comparison.
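The back-of-envelope sum behind that, as a short Python sketch. The inputs are just the round numbers quoted above (67 million bases, a 3 billion base genome, 1-2% substitutions), so treat the output as approximate.

```python
unique_bases = 67e6    # bases of unique (indel) DNA, as quoted above
genome_size = 3e9      # approximate genome size in bases

indel_pct = 100.0 * unique_bases / genome_size

# Substitution difference is quoted as a 1-2% range.
identity_high = 100.0 - indel_pct - 1.0
identity_low = 100.0 - indel_pct - 2.0

print(round(indel_pct, 1), round(identity_low, 1), round(identity_high, 1))
```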
As mentioned by others, he used an ungapped analysis which means that a 300 bp comparison that differs by a single indel in the middle of the sequence would be counted as ~75% similar even though it differs by a single base.
At least this time around Tomkins appears to be including gaps, but time will tell if they were correctly accounted for.
Since the past is prologue, it might be worth linking a debunking of Tomkins' previous attempt at challenging the consensus on the human-chimp genome comparison:
It is interesting that Tomkins' previous methods actually agreed with the consensus once the bugs were removed and gaps were accounted for properly. It would appear that Tomkins is in search of a method that will give him the results he wants.
Glenn Williamson blogs at roohif.wordpress.com
He's done quite a few analyses of creationist claims regarding genomics there, especially a lot regarding the human chromosome 2 fusion. He also makes videos, e.g. this one, about other observed instances of telomere-telomere chromosomal fusions:
I know him pretty well - we talk about these things now and then, and he's helped me out a lot in the past with some of the bioinformatics involved. I've already sent him this new article from Tomkins to have a look at.
Yoohoo!!
I'll be having a look at this today for a few hours, and hopefully I'll have a video published within a week (which I'll link to here as well).
I do. Some distance measures make sense, others don't. Percent difference in aligned sequences estimates the number of point mutations, and thus is also an estimate of time since divergence. That can't be integrated with the number of bases of indel in any sensible way.
Welcome @roohif. Glad you could join us.
For this YEC dude, I welcome as close a genetic likeness with primates as possible.
Humans are the only creatures who have another creature's body plan.
This is because we are made in God's image. Yet God's biology has controlled options in its blueprint, so it's impossible for us to have our own body that represents our true identity. All other creatures' bodies show, originally, what they are in essence. So we are renting a body plan. The best one for fun and profit.
If we did not have a primate body, THEN what possibly could be another option?? Yet staying within code.
Many YECs want to find great gaps between us and primates, but why?
We look close enough that squeezing out points one way or another makes no difference.
We have a chimp body. If we were so different we would have a different body, and the most different from any critter.
Yet we alone have the same as another, especially in the YEC view.
Therefore it's proof we can't have a biological frame any more than God could have one. That should be our equation.