I wanted to read this paper because it is the first case I’ve come across where structural alignments were used to infer ancestral states of a protein. The authors needed them because the proteins are so distantly related in sequence that several sites were inferred ambiguously.
Ancestral sequence inference. We inferred three ancestral nodes by maximum likelihood (ref. 10): the most probable ancestor of all chalcone isomerases (ancCHI), of all CHI-like proteins (ancCHIL), and the CHI/CHIL common ancestor (ancCC). Given the wide divergence between CHIs and FAPs, an earlier ancestor was not inferred. Details of the procedure and prediction statistics are provided in the Supplementary Information and in Methods. Briefly, because protein sequence divergence between CHI, CHIL, and FAPs is high, and includes insertions and deletions (InDels), we generated a structure-based alignment (Supplementary Table 1, Supplementary Fig. 1, and Supplementary Dataset 1). No systematic InDels were found between the CHI and CHIL lineages. Hence, the structural alignment was trimmed in loop regions and at the N and C termini and a phylogenetic tree was generated (Fig. 1b; see Supplementary Fig. 2 for the complete tree). Remaining gaps and ambiguously aligned positions were handled manually in the reconstructed ancestors (Supplementary Dataset 2 and Supplementary Fig. 3).
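To make "most probable ancestor" and "ambiguously inferred" concrete for myself, here is a minimal Python sketch. It assumes the maximum-likelihood step has already produced, for each alignment column, a posterior probability for each amino acid. The function name, the input layout, and the 0.8 cutoff are my own illustrative assumptions, not anything taken from the paper.

```python
def most_probable_ancestor(posteriors, min_top=0.8):
    """Pick the most probable residue per site and flag uncertain sites.

    posteriors: list of dicts, one per alignment column, mapping an
                amino acid letter to its posterior probability
                (hypothetical input format).
    min_top:    hypothetical cutoff; a site is flagged as ambiguous when
                its best residue has a posterior below this value.
    """
    sequence = []
    ambiguous = []
    for index, site in enumerate(posteriors):
        # Rank residues by decreasing posterior; break ties alphabetically.
        ranked = sorted(site.items(), key=lambda kv: (-kv[1], kv[0]))
        best_residue, best_prob = ranked[0]
        sequence.append(best_residue)
        if best_prob < min_top:
            # Keep the two leading candidates for manual inspection.
            ambiguous.append((index, ranked[:2]))
    return "".join(sequence), ambiguous


# Toy posteriors, invented purely for illustration.
toy = [
    {"A": 0.97, "S": 0.03},
    {"S": 0.55, "G": 0.45},
    {"G": 0.90, "A": 0.10},
]
ancestor, flagged = most_probable_ancestor(toy)
print(ancestor)  # ASG
print(flagged)   # [(1, [('S', 0.55), ('G', 0.45)])]
```

Sites that land in the flagged list are the ones that would then need the kind of manual decision the authors describe.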
Check out the legend to Supplementary Figure 3:
Supplementary Figure 3. An advanced guide to ancestral sequence inference with no headaches (or fewer, at least): representative example of the decision making process to determine the amino acid sequence in ambiguously inferred positions. Due to the high level of divergence within and between the three protein families, sequence-based alignments give ambiguous and inconsistent results, i.e. different programs place gaps differently. We therefore performed a structure-based alignment (Armougom, F. et al., Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res 34, W604-608 (2006)), which preferentially places gaps in loop regions and improved the quality of the alignment. The next challenge was that by default, ancestral inference places an amino acid at each position in the alignment, even if it consists mostly of gaps. In other words, the underlying model assumes that the ancestor was of maximum length and that every gap is a deletion, while in reality the opposite is often true. Therefore, we inspected all ambiguously inferred positions manually and corrected them to the best of our knowledge. Our approach was that even if our decision-making was flawed, this should have little or no effect on ancestral protein structure and function because indels typically occur in loop regions of high divergence. In total, eleven loop regions were manually revised as illustrated in this figure. First, the structure-based alignment (trimmed only at the termini) was used to determine the consensus number of amino acids - three in the example at hand. Second, the most probable ancestral sequences were added to the trimmed alignment (that was used for generation of the phylogenetic tree and ancestral inference) and reduced to the consensus number of amino acids. 
In cases where a particular amino acid could not be decided on (in the example, both S and G in the third position of seem plausible), additional information such as phylogeny, structural information, and chemical intuition were used to make the decision. In the above example, the structural context in a helix kink led us to choose G due to its frequent occurrence in turns. Additionally, N- and C-terminal adaptor sequences were added to all ancestors as shown in Supplementary Figure 1.
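The loop-revision procedure in that legend can be sketched in a few lines of Python. This is only my reading of it, under stated assumptions: I take the "consensus number of amino acids" to be the most common ungapped loop length among the extant sequences, and I reduce the ancestor by keeping the alignment columns that are least gappy. The legend does not spell out either rule, the authors made these calls by hand, and the sequences below are invented.

```python
from collections import Counter

GAP = "-"


def consensus_loop_length(extant_segments):
    """Most common number of residues in the loop (assumed definition)."""
    lengths = [sum(1 for c in seg if c != GAP) for seg in extant_segments]
    counts = Counter(lengths)
    top = max(counts.values())
    # On a tie, prefer the shorter loop (an arbitrary, hypothetical choice).
    return min(length for length, n in counts.items() if n == top)


def column_occupancy(extant_segments):
    """Fraction of extant sequences with a residue in each column."""
    total = len(extant_segments)
    width = len(extant_segments[0])
    return [
        sum(seg[i] != GAP for seg in extant_segments) / total
        for i in range(width)
    ]


def reduce_ancestor(ancestor_segment, extant_segments, target_len):
    """Trim the inferred ancestor to target_len residues.

    Keeps the columns with the highest occupancy, in their original
    order (assumed rule, not the authors' documented method).
    """
    occupancy = column_occupancy(extant_segments)
    ranked = sorted(range(len(occupancy)), key=lambda i: (-occupancy[i], i))
    keep = sorted(ranked[:target_len])
    return "".join(ancestor_segment[i] for i in keep)


# Invented loop region: four extant sequences, five alignment columns.
extant = ["AS-G-", "AT-G-", "A--GK", "ASPG-"]
# Default inference fills every column, even the mostly-gap ones.
inferred = "ASPGK"

n = consensus_loop_length(extant)
print(n)                                      # 3
print(reduce_ancestor(inferred, extant, n))   # ASG
```

A rule like this only handles the length of the loop. Choosing between two plausible residues at one position, as in the S versus G case, still needs the structural and chemical judgment the authors describe.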
Another (to me) interesting fact about this article is that the authors find not just multiple pathways by which functional descendant enzyme states could have evolved from the inferred ancestor, but also multiple possible ancestral starting points that could have given rise to enzyme activity.
Overall, the above results reinforce the conclusion of facile emergence of CHI despite its origin from a catalytically inactive ancestor: multiple founder mutations and subsequent trajectories are available with unexpectedly weak functional epistasis. In other words, the evolutionary landscape underlying CHI’s emergence is smooth rather than rugged.
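One way to picture "smooth rather than rugged" is to count how many orderings of a set of mutations raise activity at every step. The Python sketch below does that on toy landscapes. The mutation names (m1, m2, m3) and all activity values are hypothetical placeholders, not measurements from the paper, and strictly increasing activity is my own simplifying definition of an accessible trajectory.

```python
from itertools import combinations, permutations


def additive_landscape(effects):
    """Build a landscape with no epistasis: effects simply add up.

    effects: dict mapping a mutation name to its (toy) contribution
             to log activity.
    Returns a dict mapping each frozenset of mutations to an activity.
    """
    names = list(effects)
    landscape = {}
    for k in range(len(names) + 1):
        for combo in combinations(names, k):
            landscape[frozenset(combo)] = sum(effects[m] for m in combo)
    return landscape


def accessible_paths(mutations, activity):
    """Return the mutation orders along which activity rises at each step."""
    paths = []
    for order in permutations(mutations):
        state = frozenset()
        uphill = True
        for mutation in order:
            next_state = state | {mutation}
            if activity[next_state] <= activity[state]:
                uphill = False
                break
            state = next_state
        if uphill:
            paths.append(order)
    return paths


mutations = ["m1", "m2", "m3"]

# Smooth toy landscape: every mutation helps in every background.
smooth = additive_landscape({"m1": 1.0, "m2": 2.0, "m3": 3.0})
print(len(accessible_paths(mutations, smooth)))  # 6

# Rugged variant: m1 is harmful on its own but fine in other backgrounds.
rugged = dict(smooth)
rugged[frozenset({"m1"})] = -1.0
print(len(accessible_paths(mutations, rugged)))  # 4
```

In the smooth case all six orderings work, which is the flavor of the paper's claim that multiple founder mutations and trajectories are available. A single sign-epistatic step already blocks the two orderings that start with m1.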