Merging HMM with Phylogenies

swamidass · January 24, 2019, 9:17pm

Fascinating article that bridges my expertise (HMMs) and @John_Harshman’s (phylogenetics). This also touches on @Kirk’s work (which is based on HMMs), but it demonstrates that there are relationships between different protein families, which he neglects.

John_Harshman · January 24, 2019, 11:05pm

Question: after it gets the distance matrix, what algorithm does this program use to build trees?

swamidass · January 24, 2019, 11:08pm

I do not know, and I am also concerned about this from a methodological point of view. A distance matrix breaks the ability to look for nested clades, right?

John_Harshman · January 24, 2019, 11:16pm

No, not necessarily. A least-squares fit, for example, works for phylogenetics to the extent that the distance measure represents actual patristic distances. It can even be bootstrapped. But certainly actual sequence would be preferable.
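As a rough illustration of the least-squares idea, here is a minimal sketch that fits branch lengths to a distance matrix on a fixed four-taxon topology. The topology ((A,B),(C,D)), the toy branch lengths, and the helper names `path_row` and `fit_branch_lengths` are all assumptions made for this example; nothing here is taken from the program discussed in the article.

```python
import numpy as np
from itertools import combinations

# Hypothetical toy setup: unrooted tree ((A,B),(C,D)) with five branches,
# four terminal (one per taxon) and one internal branch.
taxa = ["A", "B", "C", "D"]
left_side = {"A", "B"}  # taxa on one side of the internal branch

def path_row(x, y):
    """Indicator row: which branches lie on the path between taxa x and y.
    Columns are the terminal branches of A, B, C, D, then the internal branch."""
    row = np.zeros(5)
    row[taxa.index(x)] = 1.0
    row[taxa.index(y)] = 1.0
    # The internal branch is crossed only when x and y sit on opposite sides.
    if (x in left_side) != (y in left_side):
        row[4] = 1.0
    return row

pairs = list(combinations(taxa, 2))
X = np.array([path_row(x, y) for x, y in pairs])  # 6 pairs x 5 branches

def fit_branch_lengths(observed):
    """Ordinary least-squares branch lengths and residual sum of squares."""
    lengths, *_ = np.linalg.lstsq(X, observed, rcond=None)
    rss = float(np.sum((observed - X @ lengths) ** 2))
    return lengths, rss

# Distances generated from known (made-up) branch lengths are perfectly
# patristic, so the fit should recover them with essentially zero residual.
true_lengths = np.array([0.1, 0.2, 0.15, 0.25, 0.3])
observed = X @ true_lengths
lengths, rss = fit_branch_lengths(observed)
```

In a full analysis the residual would be compared across candidate topologies; this sketch only covers the branch-length step for one topology.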

Rumraket · January 24, 2019, 11:20pm

I do wonder what kind of algorithm would even be appropriate for inferring phylogenies of large and really old protein families present in bacteria, archaea, and eukaryota. Presumably any particular substitution model employed is likely to be inaccurate over such spans of time and diversity.

Wouldn’t that seem to rule out maximum likelihood and similar algorithms, or is it more complicated?

John_Harshman · January 24, 2019, 11:30pm

I do know that maximum likelihood has been used on universal phylogenies. Protein models are something I’m not all that familiar with, but I do suppose that these analyses just use some kind of symmetrical transition matrix. While transition probabilities certainly vary across life, a single matrix might be a good enough approximation.

Rumraket · January 24, 2019, 11:39pm

Thanks. I don’t understand the term ‘symmetrical transition matrix’.

John_Harshman · January 24, 2019, 11:44pm

It just means that the model is reversible: the probability of a change from A to B is the same as the probability of a change from B to A.
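For readers who want the formal version: time-reversibility is usually stated as detailed balance, pi_A * q(A->B) = pi_B * q(B->A), where pi holds the equilibrium frequencies. The rate matrix itself is symmetric only in the special case of equal frequencies; in general it is the exchangeability part that is symmetric. A minimal sketch, with made-up four-state values and hypothetical helper names `build_rate_matrix` and `is_reversible`:

```python
import numpy as np

def build_rate_matrix(exchange, freqs):
    """Reversible-style rate matrix: Q[i, j] = S[i, j] * pi[j] for i != j,
    with the diagonal set so that every row sums to zero."""
    Q = exchange * freqs[np.newaxis, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def is_reversible(Q, freqs, tol=1e-12):
    """Detailed balance check: pi_i * Q[i, j] == pi_j * Q[j, i] for all i, j."""
    flux = freqs[:, np.newaxis] * Q
    return bool(np.allclose(flux, flux.T, atol=tol))

# Symmetric exchangeabilities (toy values, not an empirical protein matrix).
S = np.array([
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 3.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.0, 3.0, 1.0, 0.0],
])

unequal = np.array([0.1, 0.2, 0.3, 0.4])
equal = np.full(4, 0.25)

Q_unequal = build_rate_matrix(S, unequal)
Q_equal = build_rate_matrix(S, equal)

print(is_reversible(Q_unequal, unequal))   # detailed balance holds
print(np.allclose(Q_unequal, Q_unequal.T)) # but Q itself is not symmetric
print(np.allclose(Q_equal, Q_equal.T))     # symmetric when frequencies are equal
```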

Rumraket · January 24, 2019, 11:45pm

Alright, simple enough. Thanks again.

Rumraket · January 24, 2019, 11:51pm

In retrospect I suppose the answer to my initial question is obvious. In the absence of data that indicates particular biases in the transitions of protein evolution over such timescales, the modest first approximation is to assume no bias. That assumption can later be revised if new evidence indicates the process has historically been biased in some way.

John_Harshman · January 25, 2019, 1:56am

I know there are models for DNA sequences that allow for parameters to change over the tree, notably GC content. Don’t know about anything corresponding for protein sequences.

swamidass · January 25, 2019, 2:58am

I suppose you could also do a test for tree-likeness on the distance matrix. See here:

Do you know much about these sorts of tests? How many are there? How good are they?
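One classical criterion for tree-likeness is the four-point condition: for every quartet of taxa, the two largest of the three pairwise distance sums must be equal if the distances fit a tree exactly. This is not necessarily the test in the link above; it is a minimal sketch of that one criterion, with hypothetical function names and toy matrices.

```python
from itertools import combinations

def four_point_violation(D, i, j, k, l):
    """Gap between the two largest of the three quartet sums.
    Zero means the quartet satisfies the four-point condition."""
    sums = sorted([
        D[i][j] + D[k][l],
        D[i][k] + D[j][l],
        D[i][l] + D[j][k],
    ])
    return sums[2] - sums[1]

def max_violation(D):
    """Largest four-point violation over all quartets in the matrix."""
    n = len(D)
    return max(four_point_violation(D, *q) for q in combinations(range(n), 4))

# Additive distances from a toy tree ((A,B),(C,D)) with terminal branch
# lengths 1, 2, 3, 4 and an internal branch of length 5.
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# Distances around a four-node cycle, which no tree can reproduce exactly.
cycle = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

print(max_violation(tree_like))
print(max_violation(cycle))
```

With real, noisy distances the violation is never exactly zero, so in practice it would need to be judged against some scale or null distribution, which this sketch does not attempt.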

John_Harshman · January 25, 2019, 4:19am

Not much. I haven’t worked with a distance matrix in years. I only know how to do jackknife and bootstrap, and for bootstrap you either need to assume an error distribution for cell contents (parametric method) or have lots of replicates of each cell (non-parametric method). If you have actual data, why reduce it to a distance matrix?
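When the underlying alignment is available, one way to get replicate distance matrices is to resample alignment columns with replacement and recompute the distances each time. The sketch below assumes a simple p-distance (the proportion of differing sites) and a tiny made-up alignment; both are stand-ins chosen for illustration, not the distance measure from the article.

```python
import numpy as np

def p_distance_matrix(aln):
    """Pairwise proportion of differing sites.
    aln is a 2D array of single characters, shape (n_taxa, n_sites)."""
    n = aln.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = np.mean(aln[i] != aln[j])
    return D

def bootstrap_distance_matrices(aln, n_reps, seed=0):
    """Non-parametric bootstrap: resample columns with replacement and
    recompute the distance matrix for each replicate."""
    rng = np.random.default_rng(seed)
    n_sites = aln.shape[1]
    reps = []
    for _ in range(n_reps):
        cols = rng.integers(0, n_sites, size=n_sites)
        reps.append(p_distance_matrix(aln[:, cols]))
    return np.stack(reps)

# Made-up three-taxon alignment, ten sites each.
aln = np.array([
    list("ACGTACGTAC"),
    list("ACGTACGTTT"),
    list("ACCTACGAAC"),
])

reps = bootstrap_distance_matrices(aln, n_reps=100)
# Per-cell spread across replicates gives a rough sense of sampling error.
cell_sd = reps.std(axis=0)
```

Each replicate matrix could then be passed to whatever tree-building method is in use, and clade support tallied across the resulting trees.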
