It was a large sample, not a small one, that the algorithm was allowed to see. The holdout set is the small part.
And just FYI, that’s a massive improvement in most subfields of computational biology. Most papers show robust gains of just 5 or 10% over a 75% baseline, and that is still a very good result.
Jumping from 75% to 95% only happens if you have really found a radically different and powerful insight into the problem. In this case they did: inferred phylogenetic history is more informative than sequence similarity.
This also illustrates nicely that similarity is not the same thing as a tree. Because if they were the same thing, you would expect them to have identical performance.
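To make that distinction concrete, here is a toy sketch in Python (the sequences are made up, not real data): by raw similarity, the closest match to A is C, even though on the assumed true tree A’s sister lineage is B, which simply evolved faster.

```python
def hamming(a, b):
    """Raw sequence dissimilarity: count of mismatched positions."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical toy sequences. On the (assumed) true tree, A and B are
# sister taxa, but B has diverged quickly, so by raw similarity A's
# closest match is the more distant but slowly-evolving relative C.
A = "ACGTACGTAC"
B = "ATGAACCTGA"   # A's sister on the tree, but highly diverged
C = "ACGTACGAAC"   # more distant relative, but slowly evolving

print(hamming(A, B))  # 5
print(hamming(A, C))  # 1
```

So a closest-sequence method would hand A the annotation of C, while a tree-aware method could recognize B as the nearest relative. Similarity and tree topology can give different answers.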
Part of that database is a phylogeny, and for 3% of that phylogeny they were able to assign function. What “assign function to 3%” means isn’t entirely clear from my brief reading of the paper, but I would imagine it indicates coverage of the species in the tree. With that small amount of coverage, SIFTER was able to predict the function of the “unknown” genes they used in the test with 96% accuracy.
The paper gets pretty dense in parts, but here is a snippet from one section and a figure from later in the paper:
I’m afraid that was opaque to me. The database appears to be a protein database with annotated functions. The phylogeny is a separate input. And what does “they were able to assign function” mean? Is this prior to the analysis? How was accuracy determined if they couldn’t assign function to the proteins in the analysis?
The quote also is largely opaque.
Yes, but then the way you phrased the next part made it seem like they erased most of the information about functions in the sample, and then from a small subset were able to predict the rest. I just got the sense that you had a different idea of the relative scale of the erased and intact parts in mind. But from what Swamidass said earlier I got the opposite impression.
So they erased a small subset of the information and from the intact majority of the sample they were able to predict the few erased bits?
That is my impression yes.
It’s a very common technique for training a model on a set of data. You take the data and break it up into sets - a set with which you’ll train the model and a set with which you’ll validate the performance of the model.
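A minimal sketch of that kind of train/holdout split, in Python (the names and the 20% fraction are illustrative, not anything from the paper):

```python
import random

def train_test_split(data, holdout_frac=0.2, seed=0):
    """Split a dataset into a training set and a holdout (validation) set."""
    rng = random.Random(seed)
    items = list(data)
    rng.shuffle(items)                       # randomize before splitting
    n_holdout = int(len(items) * holdout_frac)
    return items[n_holdout:], items[:n_holdout]   # (train, holdout)

annotated = [f"protein_{i}" for i in range(100)]  # placeholder names
train, holdout = train_test_split(annotated)
print(len(train), len(holdout))  # 80 20
```

The model only ever sees the training portion; the holdout portion is used afterward to check whether the model’s predictions generalize.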
If I am understanding it correctly, the database has annotations for 3% of the species in the phylogeny.
From the paper:
5 of the 128 proteins (1 protein per species, I would assume) in the database had annotations from empirical measurements of function. They placed those known proteins in the phylogeny and used SIFTER to predict the function of the other 123 proteins. A literature search independent of the database found empirically measured functions for an additional 28 proteins. SIFTER accurately predicted function for 96% of those 28, and the other methods were less accurate.
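Roughly how that kind of scoring works, as a sketch in Python (the protein names and functions here are invented placeholders, not the paper’s data): predictions are scored only against the subset of proteins whose function was independently confirmed.

```python
def holdout_accuracy(predictions, literature):
    """Score predictions only on proteins whose function was later
    confirmed by an independent literature search."""
    scored = {p: fn for p, fn in literature.items() if p in predictions}
    correct = sum(predictions[p] == fn for p, fn in scored.items())
    return correct / len(scored)

# Hypothetical stand-ins for the paper's 28 literature-validated proteins.
predictions = {"p1": "kinase", "p2": "phosphatase", "p3": "kinase", "p4": "ligase"}
literature  = {"p1": "kinase", "p2": "phosphatase", "p3": "hydrolase"}

print(holdout_accuracy(predictions, literature))  # 2 of 3 confirmed predictions correct
```

In the paper’s case the analogous calculation over the 28 validated proteins came out to 96% for SIFTER.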
Now that makes sense.
Agreed, but now to dig a little more.
What does it mean that SIFTER includes phylogenetic information which BLAST does not?
I think what I really mean here is: what phylogenetic information does SIFTER incorporate, and how does it do that?
SIFTER constructs a tree, a reconstruction of the past, and uses that tree to make inferences. BLAST does not make use of a tree.
Just to complete the point, BLAST can derive trees from its alignments; they are far from the best trees mathematically, but they can be sufficient to illustrate some phylogenetic points.
The way they were using BLAST did not construct a tree of any sort.
I know, but if @jeffb explores BLAST more deeply, he’ll find that a tree option is offered with the results.
Turns out you were correct. From the paper:
SIFTER predicts function for each domain of a protein separately, using the phylogenetic tree of the family of each domain.
Good catch. Thank you.
This new paper is possibly relevant to the discussion around using phylogeny to guide inferences about protein function:
Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.
Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.
OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at GitHub - DessimozLab/omamer: OMAmer - tree-driven and alignment-free protein assignment to sub-families.
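For intuition, here is a toy Python sketch of the general idea behind alignment-free k-mer assignment (this is not OMAmer’s actual algorithm, and the reference sequences are made up): pool the k-mers of each subfamily’s members and assign a query to the subfamily with the highest k-mer overlap.

```python
def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def assign_subfamily(query, subfamily_refs, k=3):
    """Toy alignment-free assignment: pick the subfamily whose pooled
    k-mer set has the highest Jaccard overlap with the query's k-mers."""
    qk = kmers(query, k)
    def jaccard(ref_seqs):
        rk = set().union(*(kmers(s, k) for s in ref_seqs))
        return len(qk & rk) / len(qk | rk)
    return max(subfamily_refs, key=lambda name: jaccard(subfamily_refs[name]))

# Hypothetical reference subfamilies (not real hemoglobin sequences).
refs = {
    "alpha": ["MVLSPADKTN", "MVLSAADKGN"],
    "beta":  ["MVHLTPEEKS", "MVHLTAEEKA"],
}
print(assign_subfamily("MVHLTPEEKT", refs))  # beta
```

No alignment is computed anywhere, which is why this style of approach can be so much faster than Smith-Waterman or even DIAMOND on large databases.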
Got a question.
First: We’re using this to assist in the common design/common descent question. The proposal is that SIFTER performs better than BLAST, and SIFTER is based on trees constructed by (implied) evolutionary phylogeny.
Questions: Does BLAST’s lower performance really rule out design? BLAST just sounds like a rather simplistic algorithm to begin with, partially indicated by its faster run-time. I’m not very convinced that BLAST represents common design very well, which is key to this comparison.
As I alluded to earlier, I certainly see how the SIFTER algorithm performed better simply because it was implementing a heuristic. Even as a creationist, I could see wanting to implement something like this.
I’d say for me, to have this really be apples-to-apples, we would need to run SIFTER in a similar manner, only against a set of data grouping together organisms (or genes, in this case apparently) simply due to some commonality (not necessarily evolutionary), grouped by those who adhere to design. Then feed that to the model. After all, there may be a bias that exists simply in the SIFTER application itself.
Yes. And the argument here is that it is the tree that makes it perform better, because the tree more accurately reflects something about reality than mere degree of sequence-similarity does.
That depends on what you mean by design. In its broadest possible construal, design is compatible with all conceivable observations. After all, we can imagine a designer that in principle desires to “design” life using a sort of “blind chance” evolutionary process, even without any guidance. A “design” that was implemented at the level of the laws of physics, such that these laws gave rise to life and facilitated its evolution.
Unless you’re going to put more meat on what you put into that word “design”, and give it some constraints, then nothing can really be ruled out. Design has to both be something and not be something else.
I agree only because it’s not clear what “common design” actually means, other than it having something vaguely to do with things being similar.
So when you use this term “common design”, what are you really saying?
What does it mean to say, for example (and relevantly to the topic of this thread about SIFTER vs BLAST) that a collection of protein sequences share a “common design”? When you say that, what are you saying about those protein sequences? How were they designed and what should we expect from the data when we analyze protein sequences that were “commonly designed”?
To the extent you are seeing a lack of a test of “design” or “common design”, it is only because creationists continually refuse to actually specify what that means.
BLAST implements a heuristic. It may be a simpler one than SIFTER to be sure, but if there’s some sort of creationist method of function-inference that can be employed algorithmically to predict the functions of protein sequences, please provide one.
Meanwhile, what we have is that a tree-informed method works better than one without the tree. That implies the data supports a tree better than no tree. Which should cause a curious person to wonder what a good explanation for that might be. One of those could be that there really is a tree, and that similar protein sequences share a real genealogical history of branching descent with modification.
You’re welcome to offer a better one.
Great. Contact those who adhere to design and ask them to come up with a method of “grouping together genes(or organisms) simply due to some commonality”. It’s high time. 160 years since Darwin and still nothing.
Yes, definitely. That bias is the tree. Very explicitly and intentionally designed to use the tree to bias its inference of function based on the functions of the closest relatives on a phylogenetic tree.
I would say that we are testing the much more discrete hypothesis that the similarity we observe was produced by common design, not common descent.
Do you see the difference?