JeffB and Swamidass: Understanding Evidence for Phylogeny

swamidass · March 25, 2021, 6:41pm

It was a large sample, not a small sample, that the algorithm is allowed to see. The holdout set is small.

swamidass · March 25, 2021, 6:48pm

And just FYI, that’s a massive improvement in most subfields of computational biology. Most papers will show robust gains of only 5 or 10% over a 75% baseline, and that is still a very good result.

Jumping from 75% to 95% only happens if you have really found a radically different and powerful insight into the problem. In this case they did: inferred phylogenetic history is more informative than sequence similarity.

This also illustrates nicely that similarity is not the same thing as a tree. Because if they were the same thing, you would expect them to have identical performance.

T_aquaticus · March 25, 2021, 7:42pm

Part of that database is a phylogeny, and for 3% of that phylogeny they were able to assign function. What “assign function to 3%” means isn’t entirely clear from my brief reading of the paper, but I would imagine it indicates coverage of the species in the tree. With that small amount of coverage, SIFTER was able to predict the function of the “unknown” genes they used in the test with 96% accuracy.

The paper gets pretty dense in parts, but here is a snippet from one section and a figure from later in the paper:

We chose these 100 families to meet one of the following two criteria: (1) greater than 10% proteins with experimental annotations (and more than 25 proteins), or (2) more than nine experimental annotations. Families with fewer than two incompatible experimental GO functions were excluded. The families had an average of 235 proteins, ranging from 25 to 1,116 proteins. On average, 3.3% of the proteins in a family had IDA annotations, and 0.4% had IMP annotations. Both SIFTER and Orthostrapper relied on this particularly sparse dataset for inference; evaluative techniques involving the removal of any of these annotations from inference tended to trivialize the results (e.g., removing a lone experimental annotation for a particular function did not enable that function prediction for homologous proteins). Selecting well-annotated families via these criteria assists SIFTER, but it should also enhance the performance of all of the function transfer methods evaluated here. Note also that SIFTER does not require this level of annotation accuracy to be effective, as discussed below. Finally, it is important to note that many of the IEA annotations from the GOA database may come from one of the assessed methods, so we can expect consistency to be quite high.

Figure 3. Results for Pruned Version of the AMP/Adenosine Deaminase Family

The reconciled phylogeny used in inference is shown, along with inferential results (both the posterior probabilities for the deaminase substrates and the function prediction based on the maximum posterior probability). Eight of the proteins in this tree were annotated with growth factor activity, with the second highest probability being adenosine deaminase. The function observations used for inference are denoted by filled boxes to the left of the column with the posterior probabilities. For each substrate specificity that arises, a single edge in the phylogeny identifies a possible location for that mutation. The highlighted sequences are discussed in the text. The blue vertices represent speciation events and the red vertices represent duplication events. The tree was rendered using ATV software, version 1.92 [68].

Protein Molecular Function Prediction by Bayesian Phylogenomics | PLOS Computational Biology

John_Harshman · March 25, 2021, 8:57pm

I’m afraid that was opaque to me. The database appears to be a protein database with annotated functions. The phylogeny is a separate input. And what does “they were able to assign function” mean? Is this prior to the analysis? How was accuracy determined if they couldn’t assign function to the proteins in the analysis?

The quote also is largely opaque.

Rumraket · March 26, 2021, 1:20am

Yes, but then the way you phrased the next part made it seem like they erased most of the information about functions in the sample, and then from a small subset were able to predict the rest. I just got the sense that you had a different idea of the relative scale in the erased and intact parts in mind. But from what Swamidass said earlier I got the opposite impression.

John_Harshman · March 26, 2021, 1:41pm

So they erased a small subset of the information and from the intact majority of the sample they were able to predict the few erased bits?

Rumraket · March 26, 2021, 2:19pm

That is my impression yes.

davecarlson · March 26, 2021, 2:37pm

It’s a very common technique for training a model on a set of data. You take the data and break it up into sets: a set with which you’ll train the model and a set with which you’ll validate the performance of the model.
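
A minimal sketch of that kind of split in Python. The record list, the function name `split_holdout`, and the 20% holdout fraction are all hypothetical and chosen only for illustration; none of them come from the SIFTER paper.

```python
import random

def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Shuffle the records and return (training, holdout) lists."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

# Hypothetical annotated proteins: (identifier, known function).
records = [(f"protein_{i}", f"function_{i % 3}") for i in range(10)]
train, holdout = split_holdout(records)
print(len(train), "records for training,", len(holdout), "held out")
```

The model only ever sees `train`; its predictions are then scored against the known answers in `holdout`.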

T_aquaticus · March 26, 2021, 2:49pm

If I am understanding it correctly, the database has annotations for 3% of the species in the phylogeny.

From the paper:

5 of the 128 proteins (1 protein per species I would assume) in the database had annotations from empirical measurements of function. They placed those known proteins in the phylogeny and used SIFTER to predict the function of the other 123 proteins. A literature search independent of the database was able to find function for an additional 28 proteins based on empirical measurements of function. Of those, SIFTER was able to accurately predict function in 96% of them, and the other methods were less accurate.
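
The numbers in this summary can be checked with a little arithmetic. The sketch below uses only the counts quoted above (5, 128, 28, and the 96% figure). The number of correct predictions is not stated in this post, so the code only works out which whole number is consistent with 96%. The variable names are illustrative.

```python
total_proteins = 128        # proteins in the family
annotated_in_database = 5   # proteins with experimental annotations
found_in_literature = 28    # extra proteins with function found by literature search

# Fraction of the family that SIFTER could use as known functions.
coverage = annotated_in_database / total_proteins
print(f"annotated fraction: {coverage:.1%}")

# Which counts of correct predictions out of 28 round to 96%?
consistent_counts = [
    k for k in range(found_in_literature + 1)
    if round(100 * k / found_in_literature) == 96
]
print("correct predictions consistent with 96%:", consistent_counts)
```

So 5 of 128 is a little under 4% of the family, and the only whole-number result out of 28 that rounds to 96% is 27 correct.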

John_Harshman · March 26, 2021, 7:02pm

Now that makes sense.

cdods · March 27, 2021, 2:45pm

Agreed, but now to dig a little more.

What does it mean that SIFTER includes phylogenetic information which BLAST does not?
I think what I really mean here is: what phylogenetic information does SIFTER incorporate, and how does it do that?

swamidass · March 27, 2021, 2:53pm

SIFTER constructs a tree, a reconstruction of the past, and uses that tree to make inferences. BLAST does not make use of a tree.

Mercer · March 27, 2021, 5:15pm

Just to complete the point, BLAST can derive trees from its alignments; they are far from the best trees mathematically, but they can be sufficient to illustrate some phylogenetic points.

swamidass · March 27, 2021, 6:43pm

The way they were using BLAST did not construct a tree of any sort.

Mercer · March 28, 2021, 8:07pm

I know, but if @jeffb explores BLAST more deeply, he’ll find that a tree option is offered with the results.

swamidass · March 30, 2021, 4:22pm

Turns out you were correct. From the paper.

SIFTER predicts function for each domain of a protein separately, using the phylogenetic tree of the family of each domain.

Good catch. Thank you.

davecarlson · March 31, 2021, 7:49pm

This new paper is possibly relevant to the discussion around using phylogeny to guide inferences about protein function:

Motivation

Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive.

Results

Here, we first show that in multiple animal and plant datasets, 18 to 62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily-informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.

Availability

OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available on GitHub (DessimozLab/omamer).
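
The hemoglobin example from the Motivation section can be made concrete with a toy sketch. Everything here is hypothetical: the tree shape, the clade labels, and the distances are invented for illustration, and the code does not implement OMAmer, SIFTER, or BLAST. It only contrasts "copy the label of the closest sequence" with "read the label off the query's position in the tree".

```python
# Toy gene tree as child -> parent links (hypothetical topology).
# The query attaches at the root, i.e. it branched off before the
# alpha/beta duplication.
parent = {
    "alpha_1": "alpha_clade", "alpha_2": "alpha_clade",
    "beta_1": "beta_clade", "beta_2": "beta_clade",
    "alpha_clade": "duplication", "beta_clade": "duplication",
    "duplication": "root", "query": "root",
}

# Labels attached to clades (hypothetical).
clade_label = {
    "alpha_clade": "hemoglobin alpha",
    "beta_clade": "hemoglobin beta",
    "root": "hemoglobin",
}

# Made-up distances from the query to each reference sequence.
distance_to_query = {"alpha_1": 0.30, "alpha_2": 0.35,
                     "beta_1": 0.40, "beta_2": 0.45}

def label_from_tree(node):
    """Walk up from a node and return the first labeled clade containing it."""
    while node in parent:
        node = parent[node]
        if node in clade_label:
            return clade_label[node]
    return None

def label_from_closest(distances):
    """Copy the label of whichever reference sequence is closest."""
    closest = min(distances, key=distances.get)
    return label_from_tree(closest)

by_similarity = label_from_closest(distance_to_query)
by_tree = label_from_tree("query")
print("closest sequence says:", by_similarity)
print("tree placement says:  ", by_tree)
```

With these invented numbers the closest-sequence rule assigns the query to the alpha subfamily, while its position in the tree only supports the family-level label. That is the over-specific assignment the abstract describes.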

jeffb · April 1, 2021, 7:35pm

Got a question.

First: We’re using this to assist in the common design/common descent question. The proposal is that SIFTER performs better than BLAST, and SIFTER is based on trees constructed by (implied) evolutionary phylogeny.

Questions: Does BLAST’s lower performance really rule out design? BLAST just sounds like a rather simplistic algorithm to begin with, as partially indicated by its faster run-time. I’m not very convinced that BLAST represents common design very well, which is key to this comparison.

As I alluded to earlier, I can certainly see how the SIFTER algorithm performed better simply because it was implementing a heuristic. Even as a creationist, I could see wanting to implement something like this.

I’d say for me, to have this really be apples-to-apples, we would need to run SIFTER in a similar manner, only against a set of data that groups together organisms (or genes in this case, apparently) simply due to some commonality (not necessarily evolutionary), with the grouping done by those who adhere to design. Then feed that to the model. After all, there may be a bias that exists simply in the SIFTER application itself.

Rumraket · April 1, 2021, 10:27pm

Yes. And the argument here is that it is the tree that makes it perform better, because the tree more accurately reflects something about reality than mere degree of sequence-similarity does.

That depends on what you mean by design. In its broadest possible construal, design is compatible with all conceivable observations. After all, we can imagine a designer that in principle desires to “design” life using a sort of “blind chance” evolutionary process even without any guidance. A “design” that was implemented at the level of the laws of physics, and that these laws gave rise to life and facilitated its evolution.

Unless you’re going to put more meat on what you put into that word “design”, and give it some constraints, then nothing can really be ruled out. Design has to both be something and not be something else.

I agree only because it’s not clear what “common design” actually means, other than it having something vaguely to do with things being similar.

So when you use this term “common design”, what are you really saying?

What does it mean to say, for example (and relevantly to the topic of this thread about SIFTER vs BLAST) that a collection of protein sequences share a “common design”? When you say that, what are you saying about those protein sequences? How were they designed and what should we expect from the data when we analyze protein sequences that were “commonly designed”?

To the extent you are seeing a lack of a test of “design” or “common design”, it is only because creationists continually refuse to actually specify what that means.

BLAST implements a heuristic. It may be a simpler one than SIFTER to be sure, but if there’s some sort of creationist method of function-inference that can be employed algorithmically to predict the functions of protein sequences, please provide one.

Meanwhile, what we have is that a tree-informed method works better than one without the tree. That implies the data supports a tree better than no tree. Which should cause a curious person to wonder what a good explanation for that might be. One of those could be that there really is a tree, and that similar protein sequences share a real genealogical history of branching descent with modification.

You’re welcome to offer a better one.

Great. Contact those who adhere to design and ask them to come up with a method of “grouping together genes (or organisms) simply due to some commonality”. It’s high time. 160 years since Darwin and still nothing.

Yes, definitely. That bias is the tree. SIFTER is very explicitly and intentionally designed to use the tree to bias its inference of function toward the functions of the closest relatives on a phylogenetic tree.

Mercer · April 2, 2021, 1:35pm

I would say that we are testing the much more discrete hypothesis that the similarity we observe was produced by common design, not common descent.

Do you see the difference?
