Then there’s something I’m grossly misunderstanding. I thought the goal was to predict protein function, and the point was that BLAST, which uses only sequence similarity, is inferior at this to SIFTER, which uses the phylogenetic tree.
That’s the point.
But that fits what I said, not what you said.
???
I take @swamidass to be correct here. One might set out to deliberately build some rudimentary exception, but in the real world, software violates nested hierarchy incessantly. Consider computer languages such as BASIC, C, Python, and Java as representing independent branches. Along comes a new technology to incorporate, say USB or 5G or the latest video card. All these separate programming branches can be, and routinely are, simultaneously updated with the new capability or application programming interfaces. Biological nested hierarchies cannot do this. There is no rolling out a new industry standard for animal flight or vision that is deployed throughout the animal kingdom. This is a crucial distinction between technology and the tree of life which seems to be constantly missed by YECs. Nested hierarchies are defined not just by the presence of traits, but also by the constraints of the nesting.
But what about convergence? Doesn’t that demonstrate that new features can be added to distant branches? Actually, convergence validates the basic principle that adaptations are not transferable across branches (other than by HGT). When bats decided they wanted wings, they were not free to license hawk feathers; they had to work with what they had. Whales could not just order up off-the-shelf hydrofoils from sharks. We humans cannot just upgrade our eyesight to best in class. Sonar has a different heritage and operation in whales than in bats. Rarely in biological convergence are the constraints of lineage not at play. It is both the presence of features and the absence of traits which define a phylogeny.
Compare that to technology. How often have I seen YEC-devised trees branching from vehicles to cars to trucks to backhoes to farm combines and whatever, in order to diminish phylogeny as evidence for common descent. But transportation branches are not at all independent. One year, those vehicles do not have intermittent wipers; the next they all do. Every branch suddenly ceases to feature an epidermis of lead paint, or to consume leaded gasoline. Seat belts make their way across every branch. None have backup cameras, then all are so equipped. Internal organs such as carburetors are ripped out across the board in favor of fuel injection. Skins of steel become composites in cars and trucks. I may have belabored the point, but this is an important distinction: technology has always been and always will be a crisscrossing web, whereas nature has always and ever been a forking tree. Software is no exception; in fact, violating nested hierarchy is the whole point of object-oriented programming and plug-ins. They exist to be utilized in otherwise unrelated programs. The analogy between technology and phylogeny is deeply and fundamentally flawed. The reason is very simple: the constraints of biological descent.
If there’s a correlation between a phylogenetic heuristic and better search results, why might that be the case, if it’s not causal?
This isn’t what’s happening. We’re demonstrating that a phylogenetic heuristic leads to improved predictions of protein function, and this is evidence that the phylogenies have utility because they likely reflect the reality of protein sequence evolution.
If you test a prediction of a hypothesis and see that it’s accurate, that finding is evidence in favour of the hypothesis being accurate. There’s nothing circular there.
Truthfully, I could have given a more educated response on #1. Correct, it’s not intended to ‘Model God’.
#1 really served to, as I mentioned, get some dialog going.
You have a lot of visitors here, and for some of them that might be their first impression on a first pass at that article. So if anything it serves as a chance to clear that up. To word this one better, the first step in this process is to test the question: Is this really a good test of descent vs. God’s direct design? Personally I haven’t strongly settled on that answer. And that’s a hindrance (BUT, let’s table that topic).
As for the other points: I really didn’t believe strongly on #3, and really don’t care either way on #4. So those are worth dropping.
#2 is where I really want to focus on. So I’d like to proceed with that.
But to do so, I first want to respond to this:
Actually, you make a valid point when you accurately describe what we call a dependency graph. Those certainly are different from nested hierarchies. In fact, I’m glad you mentioned those. I have a follow-up post related to that (that would be Part 2).
Understand that I’m using an analogy for sake of communication. When I mentioned nested hierarchies for software, I meant something along these lines:
The first branch could be:
Console apps, web services, user interfaces.
User interfaces could be branched into:
Webpages, phone apps, desktop apps.
Desktop apps (written in C#) could be branched into:
WinForms, WPF, UWP
I meant branching like that.
So again, following that analogy: if I were to write a program to scan existing machine code and wanted to use existing apps, then organizing them this way and using that organization as a heuristic would improve the performance of my algorithm.
For example, if an app had code that functions as an event queue (mouse events for example), then only searching other user interfaces would be the best approach to find that match.
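To make the analogy above concrete, here is a minimal sketch (in Python, with made-up app names and categories, not any real tool) of using a nested hierarchy as a search heuristic: candidates are collected only from the relevant branch rather than from the whole tree.

```python
# Toy nested hierarchy of app types, mirroring the branching described above.
# All app names are invented for illustration.
APP_TREE = {
    "apps": {
        "console": ["grep_clone", "todo_cli"],
        "web_services": ["rest_api", "auth_service"],
        "user_interfaces": {
            "webpages": ["landing_page"],
            "phone_apps": ["weather_app"],
            "desktop_apps": {
                "winforms": ["legacy_editor"],
                "wpf": ["photo_viewer"],
                "uwp": ["mail_client"],
            },
        },
    }
}

def apps_under(node):
    """Collect all leaf apps under a subtree (lists are leaves)."""
    if isinstance(node, list):
        return list(node)
    found = []
    for child in node.values():
        found.extend(apps_under(child))
    return found

# Searching for an event-queue pattern: restrict the search to the
# "user_interfaces" branch instead of scanning every app.
all_apps = apps_under(APP_TREE)
ui_apps = apps_under(APP_TREE["apps"]["user_interfaces"])
print(len(all_apps), len(ui_apps))  # the heuristic shrinks the candidate set
```

The point of the sketch is only that the hierarchy narrows where the algorithm looks, which is the sense in which it acts as a heuristic.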
Getting back to #2 above (since again, that’s really what I want to focus on), I can see the value of using nested hierarchies, regardless of whether or not there was an actual ancestral relationship between them.
(LASTLY: Since I wrote this, I noticed there were some other posts. Just fyi, I have not had the time to read those yet, so this is just a reply to Joshua’s response.)
What you would need to explain is why God’s direct design would look exactly like common ancestry and evolution when there is no discernible reason why it would have to be that way.
To use an analogy, we can track the movement of planets around the Sun, and they follow the path predicted by Newton’s and Einstein’s equations. We take that as evidence for the accuracy of those theories. However, what is to stop someone from asking if that is a valid test of gravity vs. the direct design of invisible pink fairies? It could be that invisible pink fairies move planets about the Sun in a path that just happens to mimic Newton’s and Einstein’s theories.
For this reason, science uses the law of parsimony. When the evidence is consistent with a natural process we don’t toss out that natural explanation just because the evidence could also be consistent with a supernatural process that exactly mimics the proposed natural process. In fact, we would have to throw out every single theory in science if we did not have this law of parsimony.
The ends of the branches need to be the individual programs, and you would need to show that the features you describe are distributed among those programs in a tree like manner.
The problem here is that there are programs that are nearly identical in their function and appearance but use completely different machine code. For example, Google Chrome looks the same on a PC and a Mac, but the machine code underneath is very different.
@swamidass, as slow as I am to replying, I certainly don’t want to rush you. But I did want to see if you could respond to this one. I see this thread is about to get closed out tomorrow, and just wanted to check with you before it did.
This is a great thread. Thanks for starting and continuing it
I just increased the timer to give @swamidass more time to respond
Thank you Michelle!
I’d like to wrap up this thread and move on to other topics.
@swamidass, if you’d like to reply, I’ll read your response, but basically I don’t see BLAST vs SIFTER performance result differences to be very compelling in regard to belief in common descent. I may not have a strong biology background, but I have studied algorithms in graduate school. BTW that included studying (and writing) genetic algorithms, which to me are another reason to reject belief in biological evolution (but that’s a side topic).
Basically it looks like BLAST performs a more sequential search on a large group of data (which is known to be less performant) whereas SIFTER includes a heuristic (reducing the search space by selecting things more closely related). That’s certainly going to improve the performance. And that improvement is going to exist regardless of why we group things closely: common descent or common design.
@swamidass, I do appreciate you requesting this thread be for just us two, but I’m good with opening up this thread to others for discussion. I have other phylogeny questions I’d like to get to when I get the chance…
Sure.
That isn’t true at all, on either point. Neither BLAST nor SIFTER performs a sequential search. BLAST chooses things closer in sequence space, but SIFTER chooses things closer in the tree. This also isn’t about speed (and SIFTER does not reduce the search space). Rather it is about how accurate they are, not how fast. BLAST is far faster (more performant) than SIFTER, by the way.
If the tree isn’t a tree of actual past relationships, what is it?
Perhaps you would like to take a shot at explaining why SIFTER performs better than blast at predicting function without appealing to common descent?
Another way of phrasing @swamidass excellent question is: If the tree isn’t a good approximation of reality, why is it not only a useful guide to predicting and understanding functions of proteins, but better than similarity is?
I was under the impression it was a sequential search. I had to go back and re-read the oup.com article, and now I see why I thought that. It says the following:
BLAST ([27]) is a sequence matching method,
Ok, so the algorithm is not sequential, but it’s the data that is being matched via similar ‘sequences’. Is that a better understanding?
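For what it’s worth, “sequence matching” can be illustrated with a toy similarity score. Real BLAST uses seeded local alignment with substitution matrices; this sketch (with invented sequences) just counts identical positions, but it shows that the matching is over sequence content, not over the order items happen to sit in a database.

```python
# Toy "sequence matching": score how similar two sequences are,
# position by position. This is a stand-in for real alignment scoring.

def percent_identity(a, b):
    """Fraction of matching positions between two sequences."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / min(len(a), len(b))

query = "MKTAYIAKQR"
database = {
    "protein_A": "MKTAYIAKQW",  # 9 of 10 positions match the query
    "protein_B": "GGGGGGGGGG",  # unrelated sequence
}

# Pick the database entry most similar in sequence space.
best = max(database, key=lambda name: percent_identity(query, database[name]))
print(best)  # prints "protein_A"
```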
So, regarding the phrase “performant”: typically that’s a reference to speed (or number of iterations), but it could mean “better results.” All right, so it’s not an issue of fewer iterations, or speed, but of the quality of the results. With that, help me understand “performant” in that sense. Or… help me understand this:
We find that phylogeny informed predictions are two times better than similarity informed predictions.
What is meant by “two times better”?
Yes.
Not here. My OP should make that clear. The studies themselves don’t even address run time, except maybe in passing. Certainly none of the figures I referenced have anything to do with run time.
SIFTER is 2x more accurate at predicting the function of unknown proteins, but it is far slower at making predictions than BLAST.
I believe he’s asking you to describe the actual metric used.
I’m not sure. He thought I meant SIFTER works faster, not more accurately. If he wants to understand the metric, I’m sure it is already explained quite clearly in several places.
It seemed quite clear from the paper and related figures in your OP.
Yes, originally, but as I mentioned:
Fair enough. I’ll go back and re-read the article (and supporting links) with a little more attentiveness. Patience, please…
Ok, I’ve done some reading up on this. I suppose I need to start from the very beginning. My biology knowledge is not as strong as my computer science knowledge.
I came across the following here:
From: Protein Molecular Function Prediction by Bayesian Phylogenomics
It is broadly recognized that this method produces high-quality results for annotating proteins with specific molecular functions [16]. Three problems limit its feasibility for universal application. First, phylogenomic analysis is a labor-intensive manual process that requires significant effort from dedicated scientists. Second, the quality of the predictions depends on the expertise of the scientist performing the annotation and the quality and availability of functions for the homologous proteins. Third, phylogenomics does not provide a consistent methodology for reporting when a function has insufficient support because of sparse, conflicting, or evolutionarily distant evidence. These three problems motivate the development of a statistical methodology for phylogenomics.
So what I’m gathering here is that protein function identification (or as they call it “annotating”) was a labor-intensive experimental process. Now algorithms (such as SIFTER and BLAST) assist with the process.
So here’s an important question for me: Are the results of the algorithms verified experimentally? And if so, if I’m reading this correctly, one means of validating these algorithms is by running them against previously determined protein function (that is previously determined experimentally). Correct?
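The validation scheme being asked about can be sketched simply: take proteins whose functions were already determined experimentally, have the algorithm predict them, and measure agreement. A toy version, with invented protein IDs and annotations standing in for a real benchmark:

```python
# Toy validation: compare predicted annotations against functions
# previously determined experimentally. All data here is made up.

experimental_annotations = {
    "P1": "kinase",
    "P2": "protease",
    "P3": "kinase",
    "P4": "transporter",
}

predicted_annotations = {
    "P1": "kinase",
    "P2": "protease",
    "P3": "transporter",  # a wrong prediction
    "P4": "transporter",
}

# Accuracy = fraction of proteins whose prediction matches the
# experimentally determined function.
correct = sum(
    1 for p in experimental_annotations
    if predicted_annotations[p] == experimental_annotations[p]
)
accuracy = correct / len(experimental_annotations)
print(accuracy)  # 3 of 4 correct -> 0.75
```

Comparing two predictors this way (each scored against the same experimentally annotated set) is one way a claim like “2x more accurate” can be grounded.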