Winston Ewert develops his dependency graph model further

2 Likes

Yup, I’ll say he is definitely digging further.

Calling anything in bio-complexity ‘peer reviewed’ is a laugh.

3 Likes

So there is no actual test of the dependency graph. No dependency relationship is demonstrated or tested. For the protein sequence he chose, he doesn’t show that any of the differing amino acid residues he orders into a dependency relationship actually depend on each other.

And then this outright obvious falsehood:

In contrast, the standard evolutionary framework provides no insight into how echolocation is implemented in different mammals. The commonly accepted evolutionary tree requires echolocation to have evolved independently three times as illustrated in Ewert (2023), Figure 1. Yet the maximum possible time for the evolution of a fully aquatic mammal or a bat is insufficient for more than two coordinated mutations to appear. Any evolutionary scenario for echolocation would require far more than two coordinated mutations (here,here), so evolution fails to explain this trait’s origin even once, let alone multiple times.

3 Likes

And there’s still no actual ideal dependency graph. Ewert doesn’t predict a dependency graph as it would ideally look. All he does is provide some rationalization for why there isn’t a perfect tree, by positing the ad hoc explanation that things differ between species because each difference is what’s best for that species. Again, no actual test is done.

It’s hard to imagine how the data could fail to be consistent with an explanation that consists entirely of not being a perfect tree combined with it’s the way it is because that’s best.

Is there some statistical measure or test of dependency-graph-ism that isn’t just not being a perfect tree?

5 Likes

Ewert’s new paper is about AminoGraph, a tool he developed to infer the relationship/topology between aligned protein sequences. And contra your claim, he did test AminoGraph, as Miller explains below:

*To visualize the interrelationships, Ewert created a program called AminoGraph that identifies modules as specific amino acid alterations from an archetypal sequence that is a representative version of the protein. One module depends on another if it includes the latter’s alterations plus additional alterations. The program takes as input amino acid sequences for the same protein in different species, and it creates from this data the most consistent graph of module dependences.*

The program displays sets of sequences that are not related as each input sequence directly linking to the archetypal sequence (Figure 2a). It displays data that best fits a common ancestry model as a standard evolutionary tree (Figure 2b), and it displays data that corresponds to modules with complex interrelationships as a dependency graph (Figure 2c). Ewert tested the program on simulated data corresponding to unrelated sequences, sequences connected by common ancestry, and sequences connected through a specific dependency graph. In each case, the program properly identified the correct structure and accurately identified most of the modules and their interrelationships, thus confirming the reliability of the program.
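The dependency rule quoted above (one module depends on another if it includes the other’s alterations plus additional ones) can be sketched in a few lines. This only illustrates the stated rule; it is not AminoGraph’s actual inference procedure, and the module names and alterations below are invented for the example.

```python
# Hypothetical modules: each is a set of (position, amino acid) alterations
# relative to an assumed archetypal sequence. All names and values are invented.
modules = {
    "A": {(10, "T")},
    "B": {(10, "T"), (42, "S")},
    "C": {(10, "T"), (77, "L")},
    "D": {(10, "T"), (42, "S"), (77, "L"), (90, "G")},
}


def depends_on(a, b):
    """True if module a contains all of b's alterations plus at least one more."""
    return modules[b] < modules[a]  # proper-subset test


def direct_dependencies(name):
    """Dependencies of `name` that are not already implied by another dependency."""
    deps = [m for m in modules if depends_on(name, m)]
    return sorted(d for d in deps if not any(depends_on(o, d) for o in deps))


edges = {name: direct_dependencies(name) for name in modules}
for name, deps in edges.items():
    print(name, "depends directly on", deps)
```

In this toy data module D has two direct dependencies (B and C), which is what turns the picture from a tree into a more general graph. Note that the rule is purely about set containment in the data; nothing in it checks a functional dependency.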

I have some difficulty understanding what you mean here.

This is far from being an obvious falsehood

Yes, there is. See appendix 7 of Ewert’s new paper

I got curious and saw that the AminoGraph software is available, so I thought I’d see how far I could get with it.

First I went to NCBI Protein database and used BLAST to get the prestin protein sequence for all the species in the paper (except for Catagonus wagneri which was not available). I downloaded the multiple sequence alignment provided; I figured the proteins were similar enough that the alignment shouldn’t need a lot of fine tuning.

Then cargo install aminograph worked just fine. (But O my, are there a lot of dependencies; I’d like to see the dependency graph for the software itself.)

Running aminograph aligned-fasta.fa output-directory presented a small hiccup: it turns out AminoGraph supports multiple commands. After a quick review of the provided help, I opted for aminograph infer aligned-fasta.fa output-directory. That generated some output about initial probabilities before crashing with an index-out-of-bounds error. Maybe I need to double check that my FASTA actually contains an MSA with equal-length sequences and not just a set of sequences. Beyond that, I’m not sure how to proceed, especially since the link to the GitHub repository for the source yields a 404 (not that I’m fluent in Rust anyway).
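For anyone hitting the same crash, here is a minimal sketch of the sanity check mentioned above: confirm that every record in the FASTA has the same length before handing it to an alignment-based tool. It is plain Python with no dependencies; the function names and the toy records are my own and are not part of AminoGraph.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into {record name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def length_outliers(records):
    """Return (name, length) for records whose length differs from the most common one."""
    lengths = {name: len(seq) for name, seq in records.items()}
    if not lengths:
        return []
    expected = Counter(lengths.values()).most_common(1)[0][0]
    return [(name, n) for name, n in lengths.items() if n != expected]


# Toy input: seq3 was left unaligned (no gap characters), so it is shorter.
example = """>seq1
MKT-AILV
>seq2
MKTQAI-V
>seq3
MKTAIV
"""
problems = length_outliers(read_fasta(example))
print(problems or "all sequences have the same length")
```

An empty result only shows that the lengths agree; it does not show that the alignment itself is any good.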

Anyone else try and get any further? Ultimately I was thinking of trying it out on output from different simulations. Do the results scale directly with the number of homoplasies or is there a more complex relationship? Does it show any ability to distinguish between homoplasies from incomplete lineage sorting vs functional convergence?

Similar to @Rumraket’s concerns, I’m unclear on why the analogy of a dependency graph is being invoked specifically. What is described sounds more like a series of revisions or patches, which can also form a graph. Of course, I suppose those can also have a dependency relationship, in the sense that they need to be applied in a specific order or the later patches won’t have the intended result. That then makes me think of epistasis which I expect would also induce a directed graph on series of changes. I wonder to what extent that could explain the results.

1 Like

No. A test of the dependency relationship would require showing, functionally (that means physically/biochemically), that the “modules” inferred from the data really do, physically, depend on each other in the way it is depicted in the dependency graph. That is to say that this one part of the protein really does depend on this other specific part of the protein to have that exact sequence to perform that function. That would be a test of the inferred dependency relationship.

That not even two “coordinated” mutations can evolve in millions of years? To arrive at that conclusion you must be going through some impressive mental gymnastics.

What is meant by a coordinated set other than that the phenotypic effect of one depends on the other(?)
Are they retrospectively considered to be pre-specified (committing the Texas sharpshooter fallacy)?
What is this imagined-to-be-coordinated set of mutations that would have had to evolve in the origin of aquatic mammals, or bats, anyway?

Ehh that seems to describe merely an attempt to find the best dependency graph that fits the data, not a test of whether there is one in the data or not, or how well it is supported.

5 Likes

Yep, I did need to do that. I’ve now got an actual MSA and aminograph infer is cranking away. It should take about 24 hours to complete the default 1000 iterations. If the end results are something I can make sense of, I’ll try some other simulated data and see what the results are. At that speed, it’ll probably take a few days at least to (hopefully) get something more interesting than a repeat of the analysis in the paper.

4 Likes

(Apologies if the following is redundant after moderation reveals other replies and/or if I’m stepping on toes.)

I started down this road a little bit earlier, but I want to take another pass now that I’ve thought about it some more. What Miller goes on to describe sounds more like software revisions and patches–sequential change sets to be applied to transform a given code file from one version to another–than software dependency, where separate code modules refer to each other. (A closer molecular biological analogy to the latter might be protein interaction networks or metabolic pathways.) So let’s elaborate on the analogy using the language of software version control.

Assuming the AminoGraph github link gets sorted out (and assuming the licensing makes this legally permissible), you or I or anyone can use the underlying git version control tool to fork the AminoGraph code, meaning make a separate copy which can be edited without impacting the original. For example, suppose I wanted to modify it so that instead of just crashing with an index-out-of-bounds error, it catches that error and provides the user with a gentle reminder to check whether their sequences are all the same length. I could make that change in my forked version while Ewert’s version retains the current behavior.

Then suppose someone liked my error handling, and they also wanted an additional feature. They could fork my code and add their feature. Meanwhile, suppose maybe Miller wants yet a different feature and so he forks Ewert’s original code and makes his changes. And so on and so forth. Eventually we’d get a bunch of different versions of AminoGraph. And git is keeping a record of the changes made to get to each of those versions so that it can undo them and redo them. So for any given forked version, it could start with the original code and apply a series of changes in order to produce the current version. To work properly, each of those changes would require all the previous ones.

In the story I’ve told so far, we could represent the process with a tree structure, where each fork operation splits off a new limb. (I’m deliberately avoiding the ‘b’ word because it has a specific meaning in git.) Crucially, if things proceeded as we’ve described so far, then the changes in any given limb would only depend on previous changes from its own limb and “parent” limbs, but not any “sibling” or “cousin” limbs. In other words, a graph of the requirements of the change sets would form a tree, the same tree represented by the forking processes.

Hopefully it is apparent that this could also be a description of speciation, where instead of code and developers we’ve got genomes that get forked by different populations to form separate species. The change sets there can also be represented by a tree.

Now, hosting services built on git, such as GitHub, add other features, including the ability to create a pull request. Suppose I think my error handling and reminder about sequence lengths would be a good addition to the original AminoGraph, or to your forked version, or Miller’s, etc. I could create a pull request which would be a request to pull my changes from one limb to another. Assuming it is accepted, we’ve now got a sequence of changes that goes between “sibling” or “cousin” limbs, creating a graph that is no longer a tree but instead a more general directed acyclic graph (a category of which trees are a subset).

(The biological equivalent of a pull request is horizontal gene transfer, which does occur but which I doubt many would invoke in the specific case of the prestin protein from bats and whales.)
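The fork and pull request story above can be made concrete with a small sketch. Each version records which version(s) its changes were taken from; forks alone give every version at most one parent (a tree), while an accepted pull request gives a version a second parent (a general directed acyclic graph). The version names are hypothetical, and the code assumes the history has no cycles.

```python
# Hypothetical histories: version -> list of versions its changes build on.
fork_only = {
    "ewert_original": [],
    "error_handling_fork": ["ewert_original"],
    "feature_fork": ["error_handling_fork"],
    "miller_fork": ["ewert_original"],
}

# Same history after miller_fork accepts a pull request from error_handling_fork.
with_pull_request = {**fork_only, "miller_fork": ["ewert_original", "error_handling_fork"]}


def is_tree(parents):
    """True if there is one root and every other version has exactly one parent.

    Assumes the history is acyclic, as version histories are.
    """
    roots = [v for v, ps in parents.items() if not ps]
    return len(roots) == 1 and all(len(ps) <= 1 for ps in parents.values())


def required_changes(parents, version):
    """All earlier versions whose change sets are needed to build `version`."""
    needed, stack = set(), list(parents[version])
    while stack:
        v = stack.pop()
        if v not in needed:
            needed.add(v)
            stack.extend(parents[v])
    return needed


print(is_tree(fork_only), is_tree(with_pull_request))
print(sorted(required_changes(with_pull_request, "miller_fork")))
```

With forks only, miller_fork requires nothing from its sibling limb; after the pull request it does, and the tree check fails.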

Once again, git can keep track of all of this and actually show you the graphs. But imagine we’ve lost all the git metadata and just have the source code of the different versions of AminoGraph. We can look at the differences between the code and attempt to reconstruct the graph. If we only knew about the fork operation, we’d attempt to construct a tree. After we do, we notice some error handling show up on multiple limbs. Knowing my skills, it is probably very simple code, so it is probably plausible that different developers separately made the same change. If we then learned about the pull request operation, we might suggest that could explain the code we see and draw a different graph.

And that’s the situation we have with the prestin protein. Some differences in the proteins follow the speciation tree that is generally agreed upon, with all bats grouped more closely to other bats and all cetaceans grouped more closely to other cetaceans. And some differences don’t follow that tree and instead group with the echolocators and non-echolocators. In biology these are homoplasies, and the proposed explanation in the prestin case (without pull requests or horizontal gene transfer) is convergent evolution, meaning the same changes occurred independently in separate lineages. But Ewert seems to be suggesting that maybe there is a pull request mechanism–not horizontal gene transfer, something else, possibly not a molecular biological mechanism at all–which explains the data.
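One way to make “follows the tree” versus “doesn’t follow the tree” concrete is pairwise character compatibility. If each difference is treated as a derived state relative to an assumed ancestral sequence, two differences can both sit on a single tree without homoplasy only if the sets of species carrying them are nested or disjoint. The sketch below applies that check to invented data; the site names and species groupings are placeholders, not actual prestin results.

```python
from itertools import combinations

# Invented example: for each variable site, the set of species carrying the
# derived residue (relative to an assumed ancestral sequence).
carriers = {
    "bat_site": {"echolocating_bat", "non_echolocating_bat"},
    "whale_site": {"dolphin", "baleen_whale"},
    "echo_site": {"echolocating_bat", "dolphin"},
}


def compatible(a, b):
    """Two derived-state sets fit one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


conflicts = sorted(
    (x, y)
    for x, y in combinations(sorted(carriers), 2)
    if not compatible(carriers[x], carriers[y])
)
print(conflicts)
```

Here the echolocation-grouped site conflicts with both lineage-grouped sites. The check only flags that a conflict exists; it says nothing about which mechanism (convergence, incomplete lineage sorting, or something else) produced it.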

Getting back to the original point, I think there are two related questions. Firstly, does a relationship described by the particular non-tree, non-star directed acyclic graphs inferred from some of the data sets imply the specific relationship of dependency, or is the dependency concept something which is being added onto the graphs inferred from the data because dependency graphs can also have the same structure? On that point, I’d note that software dependencies and revision/patch relationships can take the form of tree graphs and star graphs. Secondly, do homoplasies in the data always suggest the same types of directed acyclic graphs, or do they follow different patterns which induce distinguishable graphs depending on the mechanisms that produce them (e.g. incomplete lineage sorting, functional convergence, external pull request type copying)? I have an idea what the answer to the second question is, but hopefully I can figure out how to test it with some simulations.

6 Likes

Sorry, but Ewert did test the ability of AminoGraph to identify the correct structure/topology of aligned “artificial” protein sequences. This is an important check before using the tool to question the structure/topology of real protein sequences. It is only at this stage, when AminoGraph infers a dependency graph for real proteins, that you can try to find physical/biochemical evidence supporting the inferred relationship. IOW, AminoGraph produces hypotheses regarding topology/structure that can then be tested further. Not so bad for a beginning. Now, you ask for some biological validation for the output of AminoGraph, but do you ask the same thing for convergence? IOW, when evolutionists infer convergence, do they biologically test their hypothesis?

Trying on various hats:

As a software engineer, this tells me that Winston Ewert’s code is poorly written and hasn’t been properly tested.

As a science major, this tells me that AminoGraph won’t be much use, as very few genetic sequences will be the same length in a variety of organisms.

As both, I’m wondering how AminoGraph could possibly determine a (shared) dependency based only on genome or protein sequences. Software modules with shared dependencies don’t have to have a lot in common, and may have virtually nothing in common.[1]


  1. The shared features may literally consist of the two strings “import mod” and “mod.” While that can be identified by a source code static analyser, it’s done by looking for known features, in a way that isn’t applicable to genomes or proteins.

3 Likes

SARS-CoV-2 has independent lineages converge on the exact same mutations to antigenic regions. Does that count?

2 Likes

What does that mean, “correct” structure/topology? What is correctness here and why would you question the structure of a protein sequence? A protein structure is usually measured by various physical measurements, such as x-ray crystallography, NMR, Circular Dichroism, or whatever. That’s how you get an atomic-scale model of what structure the protein adopts (at least under the conditions that the measurement takes place.) Why would you question that with AminoGraph?

I think you have misunderstood what Ewert is doing.

I see in the paper the “structure” you speak about is not the protein’s structure, but the inferred structure in the data. The pattern thought to be best characterized as a dependency graph. This has nothing to do with the structure of the protein (or any protein’s sequence) but the pattern in the similarities and differences between shared similar protein sequences in the alignment.

While technically very challenging, yes. Now they have certainly only tested a minuscule minority of all known examples where convergence is inferred, but they do test them when they can. Here are some examples:

Those are tests both of inferred examples of historical convergence, and tests of whether strong selection under particular conditions can also favor convergence. Sometimes the tests fail, in that sequences thought to have independently converged under selection on a common solution are found to have numerous other, just as functional, non-convergent possibilities. That undermines convergence as the explanation for similarity, in which case common descent is inferred instead.

In other tests, proteins are directly evolved under similar constraints, both from a common ancestral starting point and from entirely dissimilar starting points, to see whether and to what degree they will converge. Sometimes they do, but more often they don’t. That shows that convergence is a real possibility, but expected to be rare. That is to say, there is reason to find convergence to be possible, but generally more unexpected the stronger it is.

There are numerous other ways in which convergence by selection has been demonstrated inadvertently.

Ewert’s discussion of the possibility of convergence on protein evolution is superficial and dismissive in a way I find misleading.

1 Like

As a biologist who writes code, while I recognize that the error handling could be better, the issues I encountered are typical of research software.

And to clarify, while you are right that genes can vary in length between organisms, the sequence lengths in a multiple sequence alignment should all be the same, with insertions and deletions included to make everything come out equal. Once I discovered that I had in fact made an error with the file I used as input, things went much smoother.

The report.html that it generates is actually a fairly slick interactive page that shows the computed graph and the sequences of all the inferred nodes and a variety of other information I’m still exploring.

3 Likes

Thanks for this very illuminating piece.

Most alignments are not included, as others have noted that his sequences must be the same length. That alone makes the whole thing a waste of time.

Of identical length, no?

No, people produce hypotheses and inferences. And we can bet that neither Ewert nor anyone else in the ID movement will ever test an ID hypothesis.

By looking at sequences representing intermediates, of course. The hypothesis makes very clear predictions about them.

So you know more about genetics, virology, and now software engineering than geneticists, virologists, and software engineers. How do you manage that, Gil?

O dear. I have really created a lot of confusion.

Trying again - AminoGraph expects the input multiple sequence alignment to be a properly constructed multiple sequence alignment. I thought that was what I had, but I subsequently found an issue and corrected it.

It is not limited to analyzing proteins which natively all have the same number of residues; it just needs them to be aligned first with indels as necessary for each sequence string to have the same number of characters.

2 Likes

The structure/topology I am referring to has nothing to do with the structure of an individual protein but to the structure/topology of the relationships holding together aligned amino acid sequences. AminoGraph is a tool devised to uncover the structure/topology of these relationships. This should be obvious to anyone having read Ewert’s paper.

Yes, it certainly counts. Note that in his paper, Ewert recognizes this point. Here is what he said:
Convergent evolution due to natural selection is undoubtedly a real process that explains some biological similarities. For example, convergent evolution has been observed in the ongoing evolution of SARS-CoV-2 [19]. However, this is the ideal circumstance to enable convergent evolution: an enormous population size, small genome, high uniformity, and large selection effects. In the case of the evolution of complex lifeforms, such as mammals, we have small populations, large genomes, high diversity and small selection effects. It is unexpected that convergent evolution would apply to these situations. Indeed, the papers which published the molecular convergence in prestin describe it as surprising or unexpected [12, 13]. They invoke convergent evolution not because it is an expected outcome but because it the only remaining evolutionary option. Convergent evolution does not seem a viable account of a widespread pattern of conflicting phylogenetic signals.

1 Like

@AndyWalsh at 15 seems to disagree with you on this

1 Like