Fascinating article that bridges my expertise (HMMs) and @John_Harshman’s (phylogenetics). It also touches on @Kirk’s work, which is likewise based on HMMs, but it demonstrates that there are relationships between different protein families, which he neglects.

Question: after it gets the distance matrix, what algorithm does this program use to build trees?

I do not know, and I am also concerned about this from a methodological point of view. A distance matrix breaks the ability to look for nested clades, right?

No, not necessarily. A least-squares fit, for example, works for phylogenetics to the extent that the distance measure represents actual patristic distances. It can even be bootstrapped. But actual sequences would certainly be preferable.
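To make the least-squares idea a bit more concrete, here is a minimal Python sketch. It only scores one fixed candidate tree against a distance matrix by summing the squared differences between observed and patristic (path-length) distances. It does not optimize branch lengths or search over topologies, which a real program would do. The quartet, its branch lengths, and the names `patristic`, `least_squares_score`, and `add_edge` are all made up for illustration.

```python
def add_edge(tree, u, v, w):
    """Store an undirected, weighted edge in an adjacency dict."""
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w

def patristic(tree, a, b):
    """Path length between nodes a and b in an undirected weighted tree."""
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        for nxt, w in tree[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + w))
    raise ValueError(f"no path from {a} to {b}")

def least_squares_score(tree, observed):
    """Unweighted sum of squared (observed - patristic) differences."""
    return sum((d - patristic(tree, a, b)) ** 2
               for (a, b), d in observed.items())

# Hypothetical quartet ((A,B),(C,D)) with internal nodes x and y.
tree = {}
for u, v, w in [("A", "x", 1.0), ("B", "x", 2.0), ("x", "y", 3.0),
                ("C", "y", 1.5), ("D", "y", 0.5)]:
    add_edge(tree, u, v, w)

# Hypothetical observed distances; these happen to fit the tree exactly.
observed = {("A", "B"): 3.0, ("A", "C"): 5.5, ("A", "D"): 4.5,
            ("B", "C"): 6.5, ("B", "D"): 5.5, ("C", "D"): 2.0}

print(least_squares_score(tree, observed))
```

The lower the score, the better the tree's path lengths reproduce the matrix. How meaningful that is depends entirely on whether the distance measure behaves like a patristic distance in the first place.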

I do wonder what kind of algorithm would even be appropriate to use to infer phylogenies of large and really old protein families present in all of bacteria, archaea, and eukaryota. Presumably any particular substitution model employed is likely to be inaccurate over such spans of time and diversity.

Wouldn’t that seem to rule out maximum likelihood and similar algorithms, or is it more complicated?

I do know that maximum likelihood has been used on universal phylogenies. Protein models are something I’m not all that familiar with, but I do suppose that these analyses just use some kind of symmetrical transition matrix. While transition probabilities certainly vary across life, a single matrix might be a good enough approximation.

Thanks. I don’t understand the term ‘symmetrical transition matrix’.

It just means that the model is reversible: the probability of a change from A to B is the same as the probability of a change from B to A.
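For anyone who wants the reversibility condition spelled out: in the general time-reversible formulation it is usually written as detailed balance, pi_A * q_AB = pi_B * q_BA, where pi is the equilibrium frequency of each state. When the equilibrium frequencies are equal, this reduces to equal rates in both directions. Below is a small Python sketch that checks the condition. The three-state frequencies and exchangeabilities are invented numbers, not taken from any published protein matrix, and `build_rates` and `is_reversible` are hypothetical helper names.

```python
def build_rates(exchange, pi):
    """Off-diagonal rates q_ij = s_ij * pi_j from symmetric exchangeabilities."""
    n = len(pi)
    return [[exchange[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
            for i in range(n)]

def is_reversible(rates, pi, tol=1e-12):
    """Detailed balance: pi_i * q_ij equals pi_j * q_ji for every pair."""
    n = len(pi)
    return all(abs(pi[i] * rates[i][j] - pi[j] * rates[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))

# Made-up equilibrium frequencies and symmetric exchangeabilities.
pi = [0.5, 0.3, 0.2]
exchange = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 0.5],
            [2.0, 0.5, 0.0]]

rates = build_rates(exchange, pi)
print(is_reversible(rates, pi))    # detailed balance holds
print(rates[0][1], rates[1][0])    # the raw rates themselves differ
```

So with unequal frequencies the symmetric part is the exchangeability matrix, while the raw rate of A to B and of B to A can differ.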

Alright, simple enough. Thanks again.

In retrospect I suppose the answer to my initial question is obvious. In the absence of data indicating particular biases in the transitions of protein evolution over such timescales, the modest choice is to assume no bias as a first approximation, which can be revised later if new evidence indicates the process has historically been biased in some way.

I know there are models for DNA sequences that allow for parameters to change over the tree, notably GC content. Don’t know about anything corresponding for protein sequences.

I suppose you could also test the distance matrix for tree-likeness. See here:
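Separately from whatever that link covers, one simple notion of tree-likeness is the four-point condition: for every quartet of taxa, of the three ways to pair them up and sum the two distances, the two largest sums must be equal if the distances fit some tree exactly. Here is a minimal Python sketch of that check. The matrices are invented, the tolerance is arbitrary, and `four_point_violations` and `symmetric` are hypothetical names. It is not a statistical test, just a check for exact additivity.

```python
from itertools import combinations

def four_point_violations(dist, taxa, tol=1e-9):
    """List quartets whose two largest pairwise sums differ by more than tol."""
    violations = []
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([dist[a][b] + dist[c][d],
                       dist[a][c] + dist[b][d],
                       dist[a][d] + dist[b][c]])
        if sums[2] - sums[1] > tol:
            violations.append((a, b, c, d))
    return violations

def symmetric(pairs):
    """Expand {(a, b): value} into a nested dict usable in both directions."""
    dist = {}
    for (a, b), value in pairs.items():
        dist.setdefault(a, {})[b] = value
        dist.setdefault(b, {})[a] = value
    return dist

taxa = ["A", "B", "C", "D"]

# Invented distances that fit a quartet tree exactly.
additive = symmetric({("A", "B"): 3.0, ("A", "C"): 5.5, ("A", "D"): 4.5,
                      ("B", "C"): 6.5, ("B", "D"): 5.5, ("C", "D"): 2.0})

# Same matrix with one cell distorted, so no tree fits it exactly.
distorted = symmetric({("A", "B"): 3.0, ("A", "C"): 4.0, ("A", "D"): 4.5,
                       ("B", "C"): 6.5, ("B", "D"): 5.5, ("C", "D"): 2.0})

print(four_point_violations(additive, taxa))
print(four_point_violations(distorted, taxa))
```

Real distance estimates are noisy, so in practice you would look at how large the violations are rather than whether any exist at all.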

Do you know much about these sorts of tests? How many are there? How good are they?

Not much. I haven’t worked with a distance matrix in years. I only know how to do jackknife and bootstrap, and for bootstrap you either need to assume an error distribution for cell contents (parametric method) or have lots of replicates of each cell (non-parametric method). If you have actual data, why reduce it to a distance matrix?
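To illustrate why having the actual data helps: with an alignment in hand, a non-parametric bootstrap can resample alignment columns with replacement and recompute the whole distance matrix for each replicate, with no assumed error distribution per cell. Here is a minimal Python sketch under several assumptions: the three short sequences are invented, the uncorrected p-distance stands in for whatever distance measure a real analysis would use, and `p_distance` and `bootstrap_distances` are hypothetical names.

```python
import random

def p_distance(s, t):
    """Uncorrected proportion of sites that differ between two sequences."""
    return sum(x != y for x, y in zip(s, t)) / len(s)

def bootstrap_distances(alignment, n_reps, seed=0):
    """Resample alignment columns with replacement; one distance dict per replicate."""
    rng = random.Random(seed)
    names = sorted(alignment)
    length = len(alignment[names[0]])
    replicates = []
    for _ in range(n_reps):
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {n: [alignment[n][c] for c in cols] for n in names}
        replicates.append({(a, b): p_distance(resampled[a], resampled[b])
                           for i, a in enumerate(names)
                           for b in names[i + 1:]})
    return replicates

# Invented toy alignment; s1 and s2 are deliberately identical.
alignment = {"s1": "ACDEFGHIKL",
             "s2": "ACDEFGHIKL",
             "s3": "ACDQFGHLKL"}

for rep in bootstrap_distances(alignment, 3, seed=1):
    print(rep)
```

Each replicate matrix could then be fed to whatever tree-building method is in use, and clade support tallied across replicates.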