JeffB and Swamidass: Understanding Evidence for Phylogeny

The process of identifying function is indeed experimental and therefore laborious (or as I would call it, fun). Annotation is not that process. Annotation is the assignment of function to a protein or domain. The phylogenomic process they describe is IMO at the interface of those two things: phylogenomic analysis can assign a function to a protein or a domain through inference. This was laborious a decade and a half ago when the paper was written, and perhaps it still is, but it isn’t/wasn’t experimental in the sense that a biochemical or cell biological experiment can demonstrate that a particular protein domain is a kinase domain or that a particular protein is a transcription factor.

They can be and sometimes are. I think most biologists here will agree with me that a claim that a protein domain is (for example) an enzymatically functional kinase is a claim that is straightforwardly verifiable experimentally. There is usually no barrier in principle to verifying the existence of a particular protein function. The barrier will likely be related to cost-benefit, i.e. whether the effort of making the protein and testing its function is worth it in the specific context.

I hope that helps; I’m not always sure of what you are asking or why.

3 Likes

The sequencing of new genomes is far outpacing our ability to determine function "the old-fashioned way". Annotations are mostly done in silico, so there is something to be said for algorithms that take phylogenetics into account.

I will definitely agree with those sentiments. In the work I am familiar with, the only reason to confirm an in silico or weakly evidenced annotation is if the gene becomes important for a specific hypothesis. For example, if a putative kinase is differentially expressed between controls and an experimental group, you might want to confirm that it really is a kinase and that it phosphorylates the proposed target, and even then it would probably need to sit at an important junction in a pathway analysis. MicroRNAs are another good example, where you would need to confirm that a specific miRNA downregulates the targets it is predicted to downregulate. Most annotations outside of a few heavily studied model organisms are understood to be educated guesses.

2 Likes

Ultimately, yes. Experimental evidence for the functions of protein sequences is the gold standard. But it is expensive and time-consuming to do all the sorts of biochemical experiments necessary to conclusively answer what the function of some given protein is.

So it is significantly easier and cheaper to get a computer algorithm to predict the function (effectively, make an informed guess) without having to do a lot of complicated biochemical experiments.
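To make that concrete, here is a minimal sketch of the simplest kind of informed guess: copy the annotation of the most similar characterized protein. The sequences, labels, and similarity function below are toy stand-ins (not real BLAST output), just to show the shape of the idea.

```python
# Toy "annotation by best hit": transfer the label of the most similar
# characterized protein. A real pipeline would use BLAST or HMM searches;
# SequenceMatcher is just a crude placeholder for a similarity score.
from difflib import SequenceMatcher

# Hypothetical proteins with experimentally supported functions.
characterized = {
    "MKKLVAAGTT": "kinase",
    "MGGTACCGTA": "transcription factor",
}

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def annotate(query_seq):
    """Predict a function by copying the label of the closest known protein."""
    best_hit = max(characterized, key=lambda s: similarity(query_seq, s))
    return characterized[best_hit]

print(annotate("MKKLVAAGTA"))  # closest to the kinase-like sequence -> 'kinase'
```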

But algorithms can be wrong, of course, when they calculate their predictions, so they are ultimately always less reliable than direct experimental characterization. So there's going to be some trade-off between the time and money saved by relying on an algorithm and the confidence lost by not doing those expensive and time-consuming experiments.

So now the question is: which algorithm is best, how much better is it than the competitor we are comparing it to, and what explains its greater success?

Yep.

1 Like

So @jeffb, I'd differ in my answer from them on some of the nuance.

In silico experiments are a type of experiment too, and that is how this was verified. Likewise, I'm pretty sure the annotations in this case are not of domains, but of proteins.

The initial data was annotated by curating data from the literature that reports wet-lab experiments. A hold-out procedure was used to test each method:

  1. Some of the function labels were randomly selected and “held out”

  2. The algorithms were run on the proteins whose label was held out, to see whether they could guess the missing label.

  3. The accuracy in guessing the missing label was recorded and reported.

There are some nuances here. At times they might rotate the hold-out set through the whole dataset (cross-validation). I'd have to reread the papers to know the exact details, but the basic idea is always the same.

We delete part of the data where we know the answer to see if the algorithm can fill it in. That tells us how well it can fill in the missing values for the cases where we don't have the answer.
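Here is a rough sketch of that hold-out idea in code, with made-up protein IDs and labels and a placeholder predictor standing in for SIFTER or BLAST:

```python
import random

# Curated "ground truth": protein ID -> experimentally supported GO label
# (hypothetical entries; real datasets have thousands of curated proteins).
annotations = {
    "P001": "GO:0004672",  # protein kinase activity
    "P002": "GO:0003700",  # DNA-binding transcription factor activity
    "P003": "GO:0004672",
    "P004": "GO:0016787",  # hydrolase activity
    "P005": "GO:0004672",
}

def predict_function(protein_id, known_labels):
    """Placeholder predictor; a real method would use sequence similarity
    or a gene phylogeny. This toy just guesses the most common known label."""
    labels = list(known_labels.values())
    return max(set(labels), key=labels.count)

# 1. Randomly select some labels and hold them out.
held_out = set(random.sample(sorted(annotations), k=2))
training = {p: go for p, go in annotations.items() if p not in held_out}

# 2. Ask the algorithm to guess each missing label.
# 3. Record how often the guess matches the hidden answer.
correct = sum(predict_function(p, training) == annotations[p] for p in held_out)
print(f"Recovered {correct}/{len(held_out)} held-out labels")
```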

Comparing performance in this test is an experimental validation.

The actual training labels were curated from the literature and summarize tens of thousands of wet-lab experiments.

3 Likes

Question: How was it determined that SIFTER is twice as good as BLAST in determining protein function?

The original papers make that clear, as does my OP. There were many validations done, but I believe I showed a precision/recall curve that showed about twice the recall at equivalent precision.

1 Like

That was entirely opaque to me. I suppose it must be some kind of bioinformatics jargon that I don’t know.

It’s just basic statistics:

https://acutecaretesting.org/en/articles/precision-recall-curves-what-are-they-and-how-are-they-used

1 Like

For the benefit of any non-specialists who may read this, you really should provide some kind of explanation here, not just a link. And you should relate it directly to the actual tests performed, not a general discussion of disease diagnosis. This seems like elementary courtesy. I'm certainly willing to describe common methods in phylogenetics for the layman and have done so here on many occasions.

@John_Harshman you are well qualified to do this yourself, or just quote what I wrote in the OP:

See that the SIFTER curve is higher than the BLAST curve? That means it is performing better. The area under the SIFTER curve is about 2x more, so it’s about 2x more accurate.

The curve on the left is a pROC curve (a ROC curve plotted on a semilog scale), and the curve on the right is a PR curve.

That 2x is very much an eyeballing, not an exact number.
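For anyone who wants the metrics spelled out: precision asks how many of the predicted functions were correct, and recall asks how many of the true functions were recovered. Here is a small sketch that compares two hypothetical methods by the area under their PR curves, using scikit-learn and simulated scores (not the data from the paper):

```python
# precision = TP / (TP + FP): of the predictions made, how many were right
# recall    = TP / (TP + FN): of the true positives, how many were found
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)  # 1 = protein really has the function

# Simulated confidence scores: "method A" separates the classes better than "method B".
scores_a = y_true * 1.5 + rng.normal(size=200)
scores_b = y_true * 0.5 + rng.normal(size=200)

for name, scores in [("method A", scores_a), ("method B", scores_b)]:
    precision, recall, _ = precision_recall_curve(y_true, scores)
    print(f"{name}: area under PR curve = {auc(recall, precision):.2f}")
```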

That’s true but barely. They discuss multi-domain proteins specifically and had to choose one function (GO term) for each protein (at least as of 16 years ago when this was published). Note this from the paper:

This degradation trend in prediction quality highlights a problem with function annotation methods and their application to multifunction or multi-domain proteins [16,53]. SIFTER in particular appears prone to this degradation, which may be addressed in part by a more problem-specific decision rule that selects function predictions from posterior probabilities, although ultimately the statistical model for SIFTER could explicitly take protein domain architecture into account.

More to the point, though, annotation is a process of assigning functions (and other characteristics) to sequences. That was apparently not clear to @jeffb, nor I suspect will most of your technical discussion be clear. I’ll leave now but I wonder if it would be better to explain things at more lay levels.

2 Likes

You might be right. It’s hard for me to parse out the points of confusion because it seems so clear to me. So it helps when you point out the blind spots (@John_Harshman , @sfmatheson , and @jeffb ).

I am thinking a bit about doing a full-fledged article on this…maybe I'd get it right there.

1 Like

Assuming that I manage to find time at some point, I could probably write up a genome annotation tutorial. Wouldn’t be for a while, though. I still need to finish my variant calling tutorial.

2 Likes

What you wrote in the OP is not clear. These terms "precision", "recall", pROC, ROC, PR, TP, FN, FP, and TN may indeed be familiar in bioinformatics or perhaps epidemiology, but they aren't familiar to me. What do these curves actually mean?

Further, what is being measured? Correct predictions of function? If so, how is it determined that the predictions are correct? And so on.

From the abstract of the paper:

They queried sequences with empirically determined function and SIFTER accurately predicted function 96% of the time compared to 11-89% accuracy for the other algorithms tested. BLAST was 75% accurate.

3 Likes

The fog has lifted. Yes, now this makes way more sense. SIFTER and BLAST computational results were compared to known wet-lab results. I know that simple concept may have been obvious to the rest of you, but that was something I only started gleaning after reading through the supporting links. And then it helped to have you all verify my assumption.

So as the self-appointed spokesman for the 'novice' crowd in this thread, I concur with @John_Harshman & @sfmatheson. Clarifying that point helps a lot.

4 Likes

That's correct. Wet-lab results are the gold standard, which is why they used that standard in the initial SIFTER paper.

1 Like

So if I now understand, they use a database of protein functions, give the algorithms a small sample from that database, and test their ability to predict the contents of the rest of that database. Is that it?

1 Like

I read it more like they take a sample of proteins with known (biochemically characterized) functions, then erase the information about function for some subset of the proteins in that sample. Then, for that subsample with its function information erased, they ask the algorithm to predict the functions on the basis of the known functions of the remainder.

And here the tree-based algorithms are substantially better than the similarity-based algorithms: SIFTER apparently infers the correct function 96% of the time, compared to 75% for BLAST.

4 Likes

That seems to be exactly what I said. Isn’t it?