Winston Ewert develops his dependency graph model further

That’s what I write in the next part of the same post you are quoting from. I responded the way I did because your writing was unclear. So to return to my original point, which spurred this exchange: there is no test of the inferred dependency relationships. That is to say, it isn’t being tested whether the inferred dependency relationships are real.

Thinking about it, I can see there are some more complications with the idea of trying to infer a dependency relationship over common descent: the two patterns (a tree and a dependency graph) are not actually mutually exclusive.

There are going to be some dependency relationships in real protein sequences that have some correspondence to a tree that reflects their evolutionary history.

Consider epistatic effects in the same protein sequence, where the effect of a new mutation in a protein depends on prior mutations. That means some mutations that occur later really did depend, in context, on mutations that occurred earlier in order to be allowed in the sequence. We can see how we can get an actual tree of dependency relationships among individual mutations. One mutation can open the way for a set of other mutations, so if that one mutation is passed on along both branches of a bifurcation, and each new lineage evolves a distinct mutation that depends on that ancestral one, we have a dependency relationship that follows a tree. Having homoplasies occur here or there by chance, or more significantly by selection, just makes the whole thing worse of course.

That statement is misleading. He starts by saying what would be ideal for producing convergence, but then says we don’t have the ideal situation. True enough that the factors he mentions would raise the probability of convergence, but it’s not that black and white. You can’t just say that because the situation is not ideal, then convergence is out the window.

Just because we don’t have the best possible situation that would most favor convergence doesn’t mean we have no, or bad, or insufficient reasons to expect convergence.

Some selection effects are stronger than others. Strongly beneficial and deleterious mutations occur in mammals, and the effects of every one of them come in degrees. There is also a continuum between large and small populations, large and small genomes, low and high diversity.

In reality every factor varies considerably from case to case. You can’t just dismiss the entire thing with a paragraph like that. Nobody should be convinced by that kind of dichotomous thinking.

Yes it is unexpected, but it is also rare. It’s why prestin got its own publication and most other proteins don’t. It’s invoked rarely exactly because strong convergence of the type seen in prestin is rare and most other protein sequences do not exhibit the same pattern prestin does.

The type of incongruence discussed for prestin really is unusual, and there are biochemical explanations for why it seems to have occurred (in the articles 12 and 13 Ewert references.)

But it isn’t invoked to explain a “widespread pattern of phylogenetic incongruence”. What on Earth is Ewert talking about? Incongruent phylogenies are usually explained by incomplete lineage sorting, multiple speciation events being close in time, or loss of signal due to extreme distance of time. We also can’t consider chance homoplasies the same thing as convergence.

Another point here is something I don’t ever see ID-proponents or creationists really deal with: Degree of incongruence. Ewert cites Penny 1982:

Penny et al. (1982), in an early attempt to statistically verify common descent using genetic sequence data, wrote [3]:

The theory of evolution predicts that similar phylogenetic trees should be obtained from different sets of character data.

More recent papers do not make this prediction. They instead simply state that phylogenetic trees inferred from different genes or proteins are often in conflict [6–10].

Notice Penny et al. don’t write “identical”. They write “similar”. The degree matters. Ewert’s statement that more recent papers “do not make this prediction” is confused because the papers he cites aren’t written to “test” that prediction at all, nor do they abandon it. But Ewert’s statement reads as if that’s what has happened, that the prediction has been abandoned. It hasn’t.

Theobald dealt explicitly with this confusion in his “29+ Evidences for Macroevolution” article:

When two independently determined trees mismatch by some branches, they are called “incongruent”. In general, phylogenetic trees may be very incongruent and still match with an extremely high degree of statistical significance (Hendy et al. 1984; Penny et al. 1982; Penny and Hendy 1986; Steel and Penny 1993). Even for a phylogeny with a small number of organisms, the total number of possible trees is extremely large. For example, there are about a thousand different possible phylogenies for only six organisms; for nine organisms, there are millions of possible phylogenies; for 12 organisms, there are nearly 14 trillion different possible phylogenies (Table 1.3.1; Felsenstein 1982; Li 1997, p. 102). Thus, the probability of finding two similar trees by chance via two independent methods is extremely small in most cases. In fact, two different trees of 16 organisms that mismatch by as many as 10 branches still match with high statistical significance (Hendy et al. 1984, Table 4; Steel and Penny 1993). For more information on the statistical significance of trees that do not match exactly, see “Statistics of Incongruent Phylogenetic Trees”.

The stunning degree of match between even the most incongruent phylogenetic trees found in the biological literature is widely unappreciated, mainly because most people (including many biologists) are unaware of the mathematics involved (Bryant et al. 2002; Penny et al. 1982; Penny and Hendy 1986). Penny and Hendy have performed a series of detailed statistical analyses of the significance of incongruent phylogenetic trees, and here is their conclusion:

"Biologists seem to seek the ‘The One Tree’ and appear not to be satisfied by a range of options. However, there is no logical difficulty in having a range of trees. There are 34,459,425 possible [unrooted] trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10^6." (Penny and Hendy 1986, p. 414):

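The tree counts in the quoted passage can be checked with a short sketch. It assumes the standard counting formulas for labeled bifurcating trees, (2n - 3)!! rooted and (2n - 5)!! unrooted; the function names are illustrative and not taken from any cited paper. Under that assumption the 11-taxon unrooted count matches Penny and Hendy's 34,459,425 exactly, and the rooted counts for 6 and 9 taxa match "about a thousand" and "millions". For 12 taxa the same formula gives roughly 13.7 billion, so the "trillion" in the quoted text may be a slip in units.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labeled trees: (2n - 3)!!, for n >= 2."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n_taxa - 3)


def unrooted_tree_count(n_taxa: int) -> int:
    """Number of unrooted, bifurcating, labeled trees: (2n - 5)!!, for n >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n_taxa - 5)


if __name__ == "__main__":
    for n in (6, 9, 12):
        print(f"{n} taxa, rooted:   {rooted_tree_count(n):,}")
    print(f"11 taxa, unrooted: {unrooted_tree_count(11):,}")
```

Narrowing roughly 34 million candidate trees to 10-50 leaves about one candidate per million, which is where the "one part in 10^6" comparison comes from.
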
Ewert also has this confused statement in the abstract:

Prestin sequences from some echolocating bats show similarities with prestin sequences from echolocating whales. Conventional analyses interpret this as convergence, not because convergence is known to be evolutionarily feasible, but because this preserves the presumed phylogenetic tree.

Ehh no, conventional analyses infer convergence because phylogenetic trees inferred from alignments of multiple concatenated protein sequences produce the conventional tree, where bats are correctly grouped together and dolphins are among the whales as they should be. And because, in fact, convergence is known to be evolutionarily feasible in many cases. It is against this background, where most protein sequences reflect the canonical phylogeny, that the prestin phylogeny stands out. It’s not that anyone is trying to “preserve the presumed phylogeny”; it’s that when more data is included, that is the overwhelming signal in the data and thus the topology the algorithm spits out.

Relationships have neither structures nor topologies.

No, we still expect it, because the timescales are much longer. Funny how Ewert omitted time as a factor there.

True, but a requirement for prealignment presents major issues too.

Hmmm. How does it know which sections of the string are indels?

Those are marked in the sequence strings as hyphens by the alignment software.
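As a rough illustration of that convention, the sketch below scans an already-aligned sequence string for runs of hyphens and reports where each gap starts and how long it is. The sequences are made-up toy strings, not real prestin data, and `gap_runs` is a hypothetical helper name. It says nothing about how Ewert's own software handles gaps.

```python
import re


def gap_runs(aligned: str, gap_char: str = "-") -> list[tuple[int, int]]:
    """Return (start_column, length) for every run of gap characters.

    Columns are 0-based positions in the aligned string.
    """
    pattern = re.escape(gap_char) + "+"
    return [(m.start(), m.end() - m.start()) for m in re.finditer(pattern, aligned)]


if __name__ == "__main__":
    # Toy aligned sequences (invented for illustration only).
    alignment = {
        "seq_a": "MKT--LLVAG",
        "seq_b": "MKTAALLV-G",
    }
    for name, seq in alignment.items():
        print(name, gap_runs(seq))
```
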

Thanks.

That’s such an absurd sticking point among these people. To them, all phylogenetic analyses must always give “perfect” statistical support for one topology and one topology only.

If they applied this thinking to statistics as a whole, they would reject any result that is affected by confounding variables or random error (basically ALL of statistics). For example, it’s like seeing a trend line on a scatter plot with an R^2 of 0.999 and a P-value of <<0.001 and dismissing it because it’s not a 100% match, because almost no data points are EXACTLY on the trend line. It’s completely absurd.
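The scatter-plot analogy can be made concrete with a small sketch. It uses invented data (a straight line plus a fixed alternating error of ±0.5, chosen only so the result is reproducible), fits an ordinary least-squares line, and then counts how many points sit exactly on the fitted line. The function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data: y = 2x + 1 with a deterministic +/-0.5 error on every point.
xs = list(range(100))
ys = [2 * x + 1 + (0.5 if x % 2 == 0 else -0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)

# Count points that fall on the fitted line (within floating-point tolerance).
exact_hits = sum(
    1 for x, y in zip(xs, ys) if abs(y - (slope * x + intercept)) < 1e-9
)

if __name__ == "__main__":
    print(f"R^2 = {r2:.6f}, points exactly on the line: {exact_hits} of {len(xs)}")
```

Every point misses the line, yet the fit explains nearly all of the variance, which is the sense in which an imperfect match can still be overwhelmingly significant.
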

The settings (specifically, gap penalty) one uses in the alignment can easily lead to weird results; for example, the two sequences one is aligning may be produced by different splicing patterns. That produces a huge gap that is not an indel. I wouldn’t trust anyone in the ID movement to do that right or to detect such a problem.

This seems unlikely to happen. Protein sequences are generally not determined by sequencing proteins but by sequencing the genes, removing the introns and other untranslated bits. Generally one doesn’t leave out any exons. Protein variants can be found in databases but they’re usually labeled. And anyway, I think the creationists would be likely to use protein alignments they found on the web rather than aligning them themselves. And Ewert is probably not comparing indels either.

For what it is worth, the prestin alignment for the species in question is pretty straightforward. The small (3-4 residues) differences in length are readily resolved by putting a single small indel at the ends where needed. And the final few positions are not critical to the assessment; I’d expect an alignment that was truncated to the ~730 positions present in all sequences would yield the same result.
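Truncating to the positions present in all sequences amounts to dropping every alignment column that contains a gap in any sequence. A minimal sketch, assuming the alignment is held as a name-to-string mapping with hyphens for gaps; the sequences below are toy strings, not the real prestin alignment, and `ungapped_columns` is a hypothetical helper.

```python
def ungapped_columns(alignment: dict[str, str], gap_char: str = "-") -> dict[str, str]:
    """Keep only the columns in which no sequence has a gap character."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    names = list(alignment)
    # zip(*rows) walks the alignment column by column.
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if gap_char not in col]

    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


if __name__ == "__main__":
    # Toy alignment with small gaps at the ends (invented for illustration).
    toy = {
        "seq_a": "MKTLLV--",
        "seq_b": "MKTALVAG",
        "seq_c": "-KTLLVAG",
    }
    for name, seq in ungapped_columns(toy).items():
        print(name, seq)
```

Running any downstream comparison on both the full and the trimmed alignment would be one way to check the expectation that the end positions do not change the result.
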

Forgot to include this reference in a previous post. It shows that strong convergence sometimes actually happens in experiments:
https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/syst.202300006#syst202300006-bib-0029

The emergence of the canonical ATP aptamer recognition loop motif in several separate selections (from distinct seed sequences) suggests that it may constitute privileged molecular solution for ATP binding. Indeed, it has been previously discovered multiple times in independent selection experiments including for binding to ATP containing cofactors such as NAD [14a] and SAM [14b, 26] and was furthermore identified in several bacterial and eukaryotic genomes [15, 17b]. This parallels the case of the Hammerhead ribozyme motif [27] and suggests that both of these motifs may represent minimal optimal solutions in RNA sequence space. The isolation of the canonical ATP aptamer motif in the T5 selection may also indicate its resilience to high mutation rates.

None of the reasons you give are valid.

For starters, I didn’t claim they were, and generally, you’re wrong. They are far more often determined by sequencing RNAs, not genes.

Then you’d miss any alternative splicing. Do you not realize that it leaves out exons?

Then that would be a ridiculous way to do it, because not all exons are present in every transcript. Again, you’re denying the very existence of alternative splicing.

For the well-characterized organisms. Not for all. But that has nothing to do with your point.

Hard to tell, since there’s no relevant information provided that I can see.

It was, however, trivial to find prestin for one of the organisms mentioned:
https://www.ncbi.nlm.nih.gov/protein/24666186

And where did it come from? Protein sequences are not yet (and may never be) simply derived from genomic sequences, but are still from mRNAs:

/note="Derived by automated computational analysis using gene prediction method: Gnomon. Supporting evidence includes similarity to: 21 mRNAs, 45 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"

Any claim to know a protein sequence from the genomic sequence alone is suspect. Only minor differences can be reliably inferred.

It sounds very techy. :grin:

I don’t think I’ve ever heard “graph” as a modifier of “model” in real science, either.

I have not had a chance to review this yet but I’m glad to see he continues the work on this.

Graphs are often used to represent models. In real science.

Alternative splicing is rare. Most of the variant processed RNAs detected are splicing errors. I have no idea whether prestin has any functional alternative splicing in any of the species included, but it doesn’t seem likely.

Not denying, quite. But very few splice variants would seem to be functional. Only a handful of genes have functional variants.

Why would this be so? There are clear signatures of intron boundaries. I’ve sequenced a lot of introns, and I didn’t need to sequence any RNAs to tell me where the intron starts and the exon ends. Note that the protein sequence in question was in fact derived from the genomic sequence, though with help for annotation from both protein and mRNA data.

Is it? How rare?

That doesn’t really tell us how rare functional alternative splicing is.

That likelihood would depend on your definition of “rare.”

Pretty much, yes.

That’s not the relevant number and “very few” is not a number; the relevant number is the proportion of genes with functional splice variants, which you seem to be dancing around.

“Handful” is not very quantitative. The lowest estimates I’ve seen are ~25% of mammalian genes. I don’t call 5000 a “handful.”

Because of alternative splicing.

But you do need RNAs to tell you which exons are included and excluded in which transcripts.

Yes, that’s the evidentiary basis of my point quoted above, but I wouldn’t denigrate it as mere “help.”

AFAIK, no one has ever claimed to have made an estimate of the proportion of functionally alternatively spliced genes from genomic data alone. Do you know of any such estimate?

I didn’t say anything about using a graph as a representation of a model. He’s using it to describe the model itself, which makes no sense.

Couple of things. First, on the rarity of alternative splicing: Sandwalk: Alternative splicing: function vs noise

In his book What’s in Your Genome?, Larry puts the figure at around 5% of human genes having one or more (almost always one) alternative forms. He uses as his source Bhuiyan et al. 2018. Systematic evaluation of isoform function in literature reports of alternative splicing. BMC Genomics 19:637.

But why are we interested in this? Do you think that Ewert’s alignment is compromised by undetected isoforms? As far as I know, he’s looking only at amino acid substitutions.