First off, thank you to those who have replied. It has been fruitful. So far I’ve already gained some of that “seeing what you see” that I needed. I mentioned this before, and am willing to repeat: I see that there was more evidence for common descent than what I was previously privy to.
I’ve got a few other questions though, so I’ll call this part 1(a).
The first reply I received was from @swamidass: His testimony (which I’ve listened to twice, taking notes).
And also this post:
I read through it, and have a question. (I posted this in the previous thread, but didn’t hear a response)
To ensure I’m understanding this correctly, would this be similar to homology/homoplasy? Where SIFTER would equate to homology and BLAST to homoplasy?
Is that a valid assessment?
Homoplasy is similarity that is NOT consistent with a tree, and therefore NOT well explained by common ancestry alone. These are the exception to the rule of phylogeny.
Homology is similarity that is consistent with a tree, and therefore well explained by common ancestry alone. These are the rule.
Both Homoplasy and Homology are types of similarity. If common descent were true, we would expect to see mainly Homology, with some Homoplasy, and that is what we see.
BLAST is a way of measuring similarity between sequences, but it does not use a tree to do so. So, effectively, it groups Homoplasy and Homology together and makes no distinction between them.
SIFTER groups sequences by a tree, so it tries to ignore Homoplasy and focus on Homology.
If similarity were explained merely by common design for common purposes, we would expect BLAST to work better than SIFTER. However, we see that SIFTER works better than BLAST. That is what we expect if common descent is true.
Of course, in this case, we are looking at (mainly) microbes. There is no clear reason why creationists should reject common ancestry of microbes. It is the common ancestry of humans and apes that is really difficult to work through. The evidence for that was in the video I linked you.
Similarly to how phylogeny better predicts function than mere similarity, phylogenetic reconstruction of inferred ancestral proteins outperforms consensus sequences for creating more thermostable and conformationally flexible proteins.
That is, proteins which have been constructed based on phylogenies are more adaptable (can perform a wider range of functions), can tolerate more mutations, and remain functional at higher temperatures, than proteins constructed based on mere similarities between extant sequences.
Although that sounds interesting, it is technical enough that I’m going to have to put it at the bottom of my queue.
And just a note to everyone else: @swamidass has given me one good reply, and I can think of at least three follow-up questions to just that post (when I get a chance). Plus my other thread has multiple posts I have yet to reply to.
I know making these requests doesn’t always work, but can everyone please hold off on any more posts to this thread? I’m certainly not trying to be rude, it’s just that I’m triaging these enough as it is.
Ah yes, makes more sense.
BLAST would include both homology and homoplasy, and SIFTER should be more aligned with just homology.
If I have that right, then on to this statement:
“We find that phylogeny informed predictions are two times better than similarity informed predictions.”
I’m trying to understand “better” here. Is it the case that both find the same similarity, only SIFTER culls data with an informed phylogeny, thereby finding it sooner (fewer iterations, searches,…)? Or better in what it finds?
Given a particular DNA sequence (or a protein sequence), we want to know, what is the function of this protein?
We are going to answer this question using a database of proteins where we know the answer to the question. We will guess that the unknown protein has the same function as the proteins that are “close” to it. The key thing though is that there are many methods of computing “closeness.”
We can measure the performance of different methods by measuring how accurate they are at guessing the function of an unknown protein.
We are now considering two methods of measuring closeness: SIFTER (tree-based closeness) and BLAST (similarity-based closeness). We can directly measure which one is more accurate at predicting the function of unknown proteins. We find that SIFTER works much better than BLAST.
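To make "measuring how accurate they are" concrete, here is a minimal sketch of one way such a comparison could be scored. It implements neither BLAST nor SIFTER: `shared_positions` is a deliberately crude stand-in for a closeness measure, and the sequences and function labels in `toy_proteins` are invented for illustration only. The idea is to hide each known label in turn, guess it from the closest remaining protein, and count how often the guess is right.

```python
def shared_positions(seq_a, seq_b):
    """Toy 'closeness': number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def predict_function(query, database, closeness):
    """Guess the query's function from its closest labelled neighbour."""
    _, best_label = max(database, key=lambda entry: closeness(query, entry[0]))
    return best_label


def leave_one_out_accuracy(labelled, closeness):
    """Hide each protein's label in turn and check whether it is recovered."""
    correct = 0
    for i, (seq, label) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        if predict_function(seq, rest, closeness) == label:
            correct += 1
    return correct / len(labelled)


# Hypothetical sequences and labels, made up for this example.
toy_proteins = [
    ("MKLVA", "kinase"),
    ("MKLVG", "kinase"),
    ("MRLVA", "kinase"),
    ("TTPQA", "deaminase"),
    ("TTPQG", "deaminase"),
    ("TSPQA", "deaminase"),
]

accuracy = leave_one_out_accuracy(toy_proteins, shared_positions)
print(f"accuracy with this closeness measure: {accuracy:.2f}")
```

Any other closeness function with the same two-argument shape, including a tree-based one, could be passed to `leave_one_out_accuracy` in place of `shared_positions`, and the two accuracies compared directly.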
First a thanks to Jeff for his questions & those who have answered. As someone who is biology and phylogeny illiterate, this has been a very helpful discussion. I hope a couple of questions are appropriate for this discussion.
I am still unclear with the definition of “function” in this context. Can you point me to any somewhat non-technical definition, or even some examples of functions in this context?
How can we directly measure similarity? Or more specifically what objective measure do we use for similarity?
I’m not an expert on these algorithms, so don’t take my word as gospel. Hopefully others will correct any mistakes I make.
The two main categories of protein function are enzyme activity and binding to other molecules. Enzymes are proteins that catalyze chemical reactions, such as breaking apart sugars to drive another chemical reaction that adds a phosphate group to ADP to make ATP. Antibodies bind to foreign substances to guide immune reactions, and also have domains that other proteins from the immune system can bind to.
In one of the SIFTER papers the authors looked at deaminase activity:
This enzyme replaces an amine group with an oxygen on an adenosine molecule (ignore the bars below the molecules):
There are many algorithms that try to align sequences and then determine which positions in the sequence are different. The two algorithms we are discussing in this thread are BLAST and SIFTER. BLAST is pretty straightforward in that it only looks at the sequences themselves and tries to create the best alignment possible. For example, here is an alignment of human and chicken MMP3 using BLAST:
Score 490 bits
expect 7e-176
Identity 254/481(53%)
Positives 322/481(66%)
Gaps 8/481(1%)
Query 1 MKSLPILLLLCVAVCSAYPLDGAARGEDTSMNLVQKYLENYYDLKKDVKQFVRRKDSGPV 60
MK+L LLLL A+ A+P + E+ M L+QKYLENYY KD + F+ + +S +
Sbjct 1 MKNLQFLLLLYAALSHAFPAHTRQK-EEEGMQLIQKYLENYYSFTKDGESFIWKTNSA-M 58
Query 61 VKKIREMQKFLGLEVTGKLDSDTLEVMRKPRCGVPDVGHFRTFPGIPKWRKTHLTYRIVN 120
KKI+EMQ+F GLEVTG+ DS L++++K RCG PDV F TF G PKW K LTYRI+N
Sbjct 59 AKKIKEMQEFFGLEVTGRPDSSILDLVQKRRCGFPDVAGFSTFAGEPKWAKQVLTYRILN 118
Query 121 YTPDLPKDAVDSAVEKALKVWEEVTPLTFSRLYEGEADIMISFAVREHGDFYPFDGPGNV 180
YTPDL V++A++KA +W VTPL F + G+ADIMISFA H DF PFDGPG
Sbjct 119 YTPDLRPADVNAAIKKAFSIWSSVTPLKFIKRDRGDADIMISFATGGHNDFIPFDGPGGS 178
Query 181 LAHAYAPGPGINGDAHFDDDEQWTKDTTGTNLFLVAAHEIGHSLGLFHSANTEALMYPLY 240
+AHAYAPG GDAHFD+DE WTK T G NLF VAAHE GHSLGLFHS ALMYP+Y
Sbjct 179 VAHAYAPGKDFGGDAHFDEDETWTKSTEGANLFYVAAHEFGHSLGLFHSKEPNALMYPIY 238
Query 241 HSLTDLTRFRLSQDDINGIQSLYGPPP----DSpetplvptepvppepgtpANCDPALSF 296
D + F L QDDINGIQSLYGP P D ++ + EP P +C P L+F
Sbjct 239 RKF-DPSVFPLHQDDINGIQSLYGPSPNTSNDQKDSAEIKDPTESKEPVLPNSCGPDLTF 297
Query 297 DAVSTLRGEILIFKDRHFWRKSLRKLEPELHLISSFWPSLPSGVDAAYEVTSKDLVFIFK 356
DAV+T RGEI+ FKD+HFWRK E + L+S FWP LPSGVDAAYE+ +D + +FK
Sbjct 298 DAVTTFRGEIIFFKDKHFWRKHPAVREVDFDLVSLFWPRLPSGVDAAYEIPEEDKILLFK 357
Query 357 GNQFWAIRGNEVRAGYPRGIHTLGFPPTVRKIDAAISDKEKNKTYFFVEDKYWRFDEKRN 416
GN+FW +RG + GYP+ ++ LGF V KIDAA D+ K K Y+F +K+W +D++
Sbjct 358 GNEFWVVRGETIPPGYPQKLYVLGFSKDVAKIDAAFYDRNKGKAYYFTANKFWSYDKRNQ 417
Query 417 SMEPGFPKQIAEDFPGIDSKIDAVfeefgffyffTGSSQLEFDPNAKKVTHTLKSNSWLN 476
S++ P+ I + FPGI++ IDAVF+ F YFF G Q EFDP+ K+VT LK+N W +
Sbjct 418 SVDRK-PRLIKDAFPGINANIDAVFQYENFLYFFQGRKQFEFDPDKKRVTRLLKTNFWFS 476
Query 477 C 477
C
Sbjct 477 C 477
You have the two sequences and the shared amino acids listed in the middle of them. You will notice that there is a gap in the sequence marked by dashes (GPPP----DSpetplvpte). The algorithm introduced that gap to increase the overall match between sequences. The algorithm also tells you how many positive matches there are (including chemically similar amino acids marked by +) and what the overall similarity is. The score (if I understand it correctly) is the algorithm’s measurement of how close the two sequences are to one another.
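As a rough illustration of where the "Identity" and "Gaps" numbers come from, the sketch below counts them for just the first 60-column block of the alignment above. This is not how BLAST itself works (it also has to find the alignment, and the "Positives" count needs a substitution matrix, which is left out here); the function name `alignment_stats` is made up for this example.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in two aligned sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(query.upper(), subject.upper()):  # ignore letter case
        if q == "-" or s == "-":
            gaps += 1          # a dash in either row is a gap column
        elif q == s:
            identities += 1    # same amino acid in both rows
    return identities, gaps, len(query)


# First block of the alignment above (Query 1-60, Sbjct 1-58).
query_block = "MKSLPILLLLCVAVCSAYPLDGAARGEDTSMNLVQKYLENYYDLKKDVKQFVRRKDSGPV"
sbjct_block = "MKNLQFLLLLYAALSHAFPAHTRQK-EEEGMQLIQKYLENYYSFTKDGESFIWKTNSA-M"

identities, gaps, length = alignment_stats(query_block, sbjct_block)
print(f"Identities {identities}/{length}, Gaps {gaps}/{length}")
```

Summing these counts over every block of the alignment is what gives the totals reported in the header.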
SIFTER also measures closeness but it includes phylogenetic information which BLAST does not.
Thanks, that clarifies things a lot. I appreciate you taking the time to teach what I probably should have learned in high school. I find it a lot more interesting now than I did then.
Thank you. That does make it a little clearer, getting closer to grasping it. I still feel like I need a more solid understanding in order to assess its potential for evaluating descent vs design. Btw that’s not any kind of ‘dismissal’, because I really am trying to follow this.
Also I’m sorry it’s taking me a while to get back to this. And I feel like I may fall behind in class here…
(BTW I’ve given up on that other thread of mine, I just can’t keep up).
@swamidass, you were gracious to try and keep this thread paced for me, but since I’ve been real busy lately (especially this week), I’m probably going to be going too slow for others. Not sure what to do about that. Should I drop off and let others discuss it?
Perhaps this could help me: Is there an analogy here within my domain: software development, that might help me follow this one more accurately?
Given:
There are many different types of software application (and functions within them): reading/writing to DBs, communicating over networks, many varieties of User Interfaces, desktop, phone, web. All these within a nested hierarchy.
Software lines of code compile into machine code (similar to DNA).
With that in mind, would this be similar: Needing to search through existing machine code to find the function of various sections within it, and using other existing apps to assist. One (like BLAST), does pure machine-code matching, another (like SIFTER) uses commonality with a nested hierarchy to try and find matches.
In general, computer code is so limited a parallel that it quickly becomes misleading. One issue is that, unlike the DNA we find in biology, computer code just does not follow a tree-like structure at all. So no one has proposed something like SIFTER to find similar code, and no one expects that such an algorithm would work terribly well.
To reiterate, computer software does not fit in nested hierarchy at all really.
Very gracious of you, thank you. I suppose I need to step it up then…
Ok, so I have an idea here. Instead of me trying to gain a much better understanding of the models (which would really slow things down), let me provide my initial feelings contrary to this approach. I’m sure you’ve thought these through, so I’ll let you respond to them as follows:
Modeling God accurately: According to the article at OUP.com, the creators of SIFTER modeled their algorithm off of phylogeny. I’m assuming they did their best to model it accurately. Then the article says: “BLAST (27) is a sequence matching method”. And the obvious question is: Does “sequence matching” represent God accurately? Followed by: even if it might, did these creators model God accurately? To me, that’s a big pill to swallow. I don’t know if I can answer that. Or if I can take other people’s word that “yes it does”.
Correlation vs. Causality & algorithm performance: I can certainly see how SIFTER performs better than BLAST. It applies an added heuristic in order to improve search results. But this is not causality, it’s simply correlation. Grouping items together by similarity, then running a search algorithm based on that will almost always perform better, even if it’s simply a correlation. All developers know that linear sequence matched algorithms are the least performant searches.
Circularity. Using phylogeny to prove phylogeny to me seems a bit circular. But I am open to other views on that.
Only (or mainly?) for microbes: As compelling as it might be, I would like to see this broaden to more than microbes. Technology is certainly advancing, at quite a rapid pace. It’s great to see. I’m not being dismissive here, but I am saying that I don’t mind holding off on a conclusion on something like this, simply because more data is continually coming in.
Validates ID?? Just a final comment: So if BLAST did accurately model God in a testable way and thereby support common descent over design, doesn’t it also validate the testability of design?
To summarize: it makes sense to me that SIFTER performs better even with design (based on that heuristic), and I don’t feel like BLAST represents God accurately.
So…I’m hoping posting that will help speed things up a bit. Tell me your thoughts.
Lastly real quick: although there’s not a lot gained to debate it, I have to respectfully disagree. We can define a nested hierarchy of software. This goes back to point #2 a bit. If I did want to create a similar application for searching machine code, first creating a nested hierarchy of software applications, then using that as a heuristic in the search algorithm would improve its performance.
Is it possible you are misunderstanding this test?
We are not considering God vs not God, or designed vs. not designed. Rather, we are considering whether similarity in protein sequences is better explained by common descent or common function.
None of this rules out God’s role in common descent. None of this models God. None of this rules out design.
Is it possible that you do not know what a nested hierarchy is?
Let me give you an example to consider. Imagine several software programs that do or do not import specific modules:
| | M1 | M2 | M3 |
|---|---|---|---|
| P1 | X | X | X |
| P2 | X | X | |
| P3 | | | X |
| P4 | | X | X |
You can extend this table, even with real data. Though I have not specified specific programs and modules here, you should know this is hardly an unreasonable example.
Can you put these programs into nested clades, without any discordant features? If you can, they fit into nested clades. If you can’t, they don’t.
You will find that you can’t put them into nested clades. And, if we increased the number of modules and the number of programs, the problem would get harder. More and more features would be discordant in more and more places.
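For anyone who wants to check this mechanically, here is a small sketch using the table above. It relies on a simplifying assumption: each module is gained once and never lost, so a module's set of programs must form a clade. Under that assumption, features fit a single nested hierarchy only if every pair of feature sets is either disjoint or one sits entirely inside the other. The names `imports` and `discordant_pairs` are made up for this example.

```python
from itertools import combinations

# Which programs import which module, copied from the table above.
imports = {
    "M1": {"P1", "P2"},
    "M2": {"P1", "P2", "P4"},
    "M3": {"P1", "P3", "P4"},
}


def discordant_pairs(features):
    """Return pairs of features whose program sets overlap without nesting."""
    conflicts = []
    for (name_a, set_a), (name_b, set_b) in combinations(features.items(), 2):
        overlap = set_a & set_b
        nested = set_a <= set_b or set_b <= set_a
        if overlap and not nested:
            conflicts.append((name_a, name_b))
    return conflicts


conflicts = discordant_pairs(imports)
print(conflicts)  # an empty list would mean the programs fit nested clades
```

M1 sits inside M2, so that pair is fine, but M3 overlaps both M1 and M2 without containing or being contained by either, so no single tree fits all three modules.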