JeffB and Swamidass: Understanding Evidence for Phylogeny

jeffb · February 8, 2021, 3:30am

Greetings all, wanting to follow up on my original post: Phylogeny - Help me see what you see.

First off, thank you to those who have replied. It has been fruitful. So far I’ve already gained some of that “seeing what you see” that I needed. I mentioned this before, and am willing to repeat it: I see that there is more evidence for common descent than I was previously privy to.

I’ve got a few other questions though, so I’ll call this part 1(a).

The first reply I received was from @swamidass: His testimony (which I’ve listened to twice, and taken notes on).
And also this post:

A Test of Common Descent vs. Common Function Conversation

This post is reposted from BioLogos, where I first wrote it Jan 2017. It is a direct test of common descent that appears in the literature. Here, we can test two alternate hypotheses: (1) Similar proteins share a common history (common descent), and functions are gained and lost along this history. (2) Proteins are similar due to common function, not because of any common history, and this creates the illusion of common history. At a genetic level we see a great deal of similarity between organisms. This similarity is a fundamental feature of life that has to be explained by any theory. Evolution explains this similarity as largely caused by shared history, by way of common descent. We see function often aligning in sensible ways on phylogenetic trees of sequences and organisms. Of course, this is not perfect, because we also know that life changes over time. As the YEC Walter ReMine accurately points out in The Biotic Message, function will not always follow a nested clade pa…

I read through it, and have a question. (I posted this in the previous thread, but didn’t hear a response)

To ensure I’m understanding this correctly, would this be similar to homology/homoplasy? Where SIFTER would equate to homology and BLAST to homoplasy?
Is that a valid assessment?

swamidass · February 8, 2021, 3:37am

Sorry. I missed it.

No. Sorry.

Homoplasy is similarity that is NOT consistent with a tree, and therefore NOT well explained by common ancestry alone. These are the exception to the rule of phylogeny.

Homology is similarity that is consistent with a tree, and therefore well explained by common ancestry alone. These are the rule.

Both Homoplasy and Homology are types of similarity. If common descent were true, we would expect to see mainly Homology, with some Homoplasy, and that is what we see.
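To make the distinction concrete, here is a minimal sketch of one standard way to check whether a trait is consistent with a given tree: Fitch parsimony, which counts the fewest changes needed to explain the trait on that tree. The tree, the species names `A` to `D`, and the two traits are hypothetical toy data, not taken from any paper. A two-state trait that needs only one change fits the tree (homology); one that needs more than one change does not (homoplasy).

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum number of changes).

    `tree` is either a leaf name (str) or a pair (left, right) of subtrees.
    `states` maps each leaf name to its observed trait value.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Both subtrees can agree on a state: no extra change needed here.
        return shared, left_changes + right_changes
    # Subtrees disagree: at least one change on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def is_homoplastic(tree, states):
    """A trait needing more than one change on the tree is homoplastic."""
    _, changes = fitch(tree, states)
    return changes > 1


# Hypothetical tree: A and B are sisters, C and D are sisters.
tree = (("A", "B"), ("C", "D"))

# Toy trait shared by A and B only: fits the tree with a single change.
trait_homology = {"A": 1, "B": 1, "C": 0, "D": 0}

# Toy trait shared by A and C only: needs two changes on this tree.
trait_homoplasy = {"A": 1, "B": 0, "C": 1, "D": 0}

print(is_homoplastic(tree, trait_homology))   # False
print(is_homoplastic(tree, trait_homoplasy))  # True
```

The sketch only handles strictly binary trees, and it takes the tree as given rather than inferring it from data.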

BLAST is a way of measuring similarity between sequences, but it does not use a tree to do so. So, effectively, it groups Homoplasy and Homology together and makes no distinction between them.

SIFTER groups sequences by a tree, so it tries to ignore Homoplasy and focus on Homology.

If similarity were explained merely by common design for common purposes, we would expect BLAST to work better than SIFTER. However, we see that SIFTER works better than BLAST. That is what we expect if common descent is true.

Of course, in this case, we are looking at (mainly) microbes. There is no clear reason why creationists should reject common ancestry of microbes. It is the common ancestry of humans and apes that is really difficult to work through. The evidence for that was in the video I linked you.

Rumraket · February 8, 2021, 4:13pm

Similarly to how phylogeny better predicts function than mere similarity, phylogenetic reconstruction of inferred ancestral proteins outperforms consensus sequences for creating more thermostable and conformationally flexible proteins.
That is, proteins which have been constructed based on phylogenies are more adaptable (can perform a wider range of functions), can tolerate more mutations, and remain functional at higher temperatures, than proteins constructed based on mere similarities between extant sequences.

jeffb · February 8, 2021, 8:07pm

Although that sounds interesting, given how technical it is, I’m going to have to put it at the bottom of my queue.

And just a note to everyone else: @swamidass has given me one good reply, and I can think of at least three follow-up questions to just that post (when I get a chance). Plus my other thread has multiple posts I have yet to reply to.

I know making these requests doesn’t always work, but can everyone please hold off on any more posts to this thread? I’m certainly not trying to be rude, it’s just that I’m triaging these enough as it is.

swamidass · February 8, 2021, 8:07pm

Yes, I agree. Let’s keep the pace slower here, and restricted largely to you and me.

jeffb · February 8, 2021, 10:15pm

Ah yes, makes more sense.
BLAST would include both homology and homoplasy, and SIFTER should be more aligned with just homology.

If I have that right, then on to this statement:

“We find that phylogeny informed predictions are two times better than similarity informed predictions.”

I’m trying to understand “better” here. Is it the case that both find the same similarity, only SIFTER culls data with an informed phylogeny, thereby finding it sooner (fewer iterations, searches,…)? Or better in what it finds?

John_Harshman · February 9, 2021, 12:52am

Surely one ought to have at least one phylogeneticist involved?

swamidass · February 9, 2021, 12:53am

Sure. Stay focused and on topic, without moving too fast. Let’s do one thing at a time.

swamidass · February 9, 2021, 7:19am

So, let’s look at the task here.

Given a particular DNA sequence (or a protein sequence), we want to know, what is the function of this protein?

We are going to answer this question using a database of proteins where we know the answer to the question. We will guess that the unknown protein has the same function as the proteins that are “close” to it. The key thing though is that there are many methods of computing “closeness.”

We can measure the performance of different methods by measuring how accurate they are at guessing the function of an unknown protein.

We are now considering two methods of measuring closeness: SIFTER (tree-based closeness) and BLAST (similarity-based closeness). We can directly measure which one is more accurate at predicting the function of unknown proteins. We find that SIFTER works much better than BLAST.
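The task described above can be sketched in a few lines. This is only a toy illustration of the evaluation idea, not how SIFTER or BLAST actually work: the sequences, the function labels, and the names `identity`, `predict_function`, and `leave_one_out_accuracy` are all made up for the example. Each protein is hidden in turn, its function is guessed from the closest remaining protein, and the fraction of correct guesses is reported. Any closeness measure, similarity-based or tree-based, can be plugged in and scored the same way.

```python
def identity(a, b):
    """Toy closeness: fraction of matching positions (equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_function(query_seq, database, closeness):
    """Guess that the query has the same function as its closest known protein."""
    best = max(database, key=lambda p: closeness(query_seq, p["seq"]))
    return best["function"]


def leave_one_out_accuracy(proteins, closeness):
    """Hide each protein in turn and check whether its function is guessed right."""
    correct = 0
    for i, held_out in enumerate(proteins):
        database = proteins[:i] + proteins[i + 1:]
        guess = predict_function(held_out["seq"], database, closeness)
        if guess == held_out["function"]:
            correct += 1
    return correct / len(proteins)


# Hypothetical database of proteins with known functions (toy data).
proteins = [
    {"seq": "AAAAAA", "function": "kinase"},
    {"seq": "AAAAAT", "function": "kinase"},
    {"seq": "GGGGGG", "function": "deaminase"},
    {"seq": "GGGGGC", "function": "deaminase"},
]

print(leave_one_out_accuracy(proteins, identity))
```

Comparing two methods then amounts to calling `leave_one_out_accuracy` once per closeness function on the same database and comparing the two numbers.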

Make sense now?

cdods · February 9, 2021, 3:29pm

First a thanks to Jeff for his questions & those who have answered. As someone who is biology and phylogeny illiterate, this has been a very helpful discussion. I hope a couple of questions are appropriate for this discussion.

I am still unclear with the definition of “function” in this context. Can you point me to any somewhat non-technical definition, or even some examples of functions in this context?

How can we directly measure similarity? Or more specifically what objective measure do we use for similarity?

swamidass · February 9, 2021, 3:29pm

The paper cited here answers all these questions and more: A Test of Common Descent vs. Common Function.

T_aquaticus · February 9, 2021, 4:03pm

I’m not an expert on these algorithms, so don’t take my word as gospel. Hopefully others will correct any mistakes I make.

The two main categories of protein function are enzyme activity and binding to other molecules. Enzymes are proteins that catalyze chemical reactions, such as breaking apart sugars to drive another chemical reaction that adds a phosphate group to ADP to make ATP. Antibodies bind to foreign substances to guide immune reactions, and also have domains that other proteins from the immune system can bind to.

In one of the SIFTER papers the authors looked at deaminase activity:

This enzyme replaces an amine group with an oxygen on an adenosine molecule (ignore the bars below the molecules):

There are many algorithms that try to align sequences and then determine which positions in the sequence are different. The two algorithms we are discussing in this thread are BLAST and SIFTER. BLAST is pretty straightforward in that it only looks at the sequences themselves and tries to create the best alignment possible. For example, here is an alignment of human and chicken MMP3 using BLAST:

Score 490 bits
expect 7e-176
Identity 254/481(53%)
Positives 322/481(66%)
Gaps 8/481(1%)

Query  1    MKSLPILLLLCVAVCSAYPLDGAARGEDTSMNLVQKYLENYYDLKKDVKQFVRRKDSGPV  60
            MK+L  LLLL  A+  A+P     + E+  M L+QKYLENYY   KD + F+ + +S  +
Sbjct  1    MKNLQFLLLLYAALSHAFPAHTRQK-EEEGMQLIQKYLENYYSFTKDGESFIWKTNSA-M  58

Query  61   VKKIREMQKFLGLEVTGKLDSDTLEVMRKPRCGVPDVGHFRTFPGIPKWRKTHLTYRIVN  120
             KKI+EMQ+F GLEVTG+ DS  L++++K RCG PDV  F TF G PKW K  LTYRI+N
Sbjct  59   AKKIKEMQEFFGLEVTGRPDSSILDLVQKRRCGFPDVAGFSTFAGEPKWAKQVLTYRILN  118

Query  121  YTPDLPKDAVDSAVEKALKVWEEVTPLTFSRLYEGEADIMISFAVREHGDFYPFDGPGNV  180
            YTPDL    V++A++KA  +W  VTPL F +   G+ADIMISFA   H DF PFDGPG  
Sbjct  119  YTPDLRPADVNAAIKKAFSIWSSVTPLKFIKRDRGDADIMISFATGGHNDFIPFDGPGGS  178

Query  181  LAHAYAPGPGINGDAHFDDDEQWTKDTTGTNLFLVAAHEIGHSLGLFHSANTEALMYPLY  240
            +AHAYAPG    GDAHFD+DE WTK T G NLF VAAHE GHSLGLFHS    ALMYP+Y
Sbjct  179  VAHAYAPGKDFGGDAHFDEDETWTKSTEGANLFYVAAHEFGHSLGLFHSKEPNALMYPIY  238

Query  241  HSLTDLTRFRLSQDDINGIQSLYGPPP----DSpetplvptepvppepgtpANCDPALSF  296
                D + F L QDDINGIQSLYGP P    D  ++  +       EP  P +C P L+F
Sbjct  239  RKF-DPSVFPLHQDDINGIQSLYGPSPNTSNDQKDSAEIKDPTESKEPVLPNSCGPDLTF  297

Query  297  DAVSTLRGEILIFKDRHFWRKSLRKLEPELHLISSFWPSLPSGVDAAYEVTSKDLVFIFK  356
            DAV+T RGEI+ FKD+HFWRK     E +  L+S FWP LPSGVDAAYE+  +D + +FK
Sbjct  298  DAVTTFRGEIIFFKDKHFWRKHPAVREVDFDLVSLFWPRLPSGVDAAYEIPEEDKILLFK  357

Query  357  GNQFWAIRGNEVRAGYPRGIHTLGFPPTVRKIDAAISDKEKNKTYFFVEDKYWRFDEKRN  416
            GN+FW +RG  +  GYP+ ++ LGF   V KIDAA  D+ K K Y+F  +K+W +D++  
Sbjct  358  GNEFWVVRGETIPPGYPQKLYVLGFSKDVAKIDAAFYDRNKGKAYYFTANKFWSYDKRNQ  417

Query  417  SMEPGFPKQIAEDFPGIDSKIDAVfeefgffyffTGSSQLEFDPNAKKVTHTLKSNSWLN  476
            S++   P+ I + FPGI++ IDAVF+   F YFF G  Q EFDP+ K+VT  LK+N W +
Sbjct  418  SVDRK-PRLIKDAFPGINANIDAVFQYENFLYFFQGRKQFEFDPDKKRVTRLLKTNFWFS  476

Query  477  C  477
            C
Sbjct  477  C  477

You have the two sequences and the shared amino acids listed in the middle of them. You will notice that there is a gap in the sequence marked by dashes (GPPP----DSpetplvpte). The algorithm introduced that gap to increase the overall match between sequences. The algorithm also tells you how many positive matches there are (including chemically similar amino acids marked by +) and what the overall similarity is. The score (if I understand it correctly) is the algorithm’s measurement of how close the two sequences are to one another.
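As a rough sketch of how the identity and gap counts in that summary are tallied, the snippet below walks down the columns of a short stretch of the alignment above (the part around the `GPPP----DS` gap). The function name `alignment_stats` is made up for this example. It does not compute “Positives” or the bit score, since those depend on a substitution scoring matrix that is not reproduced here.

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(query, subject))
    # A column is an identity when both rows show the same residue.
    identities = sum(q == s and q != "-" for q, s in columns)
    # A column is a gap when either row has a dash.
    gaps = sum(q == "-" or s == "-" for q, s in columns)
    return {"identities": identities, "gaps": gaps, "length": len(columns)}


# A 16-column stretch copied from the human (Query) and chicken (Sbjct) rows.
query = "GIQSLYGPPP----DS"
subject = "GIQSLYGPSPNTSNDQ"

stats = alignment_stats(query, subject)
print(stats)  # {'identities': 10, 'gaps': 4, 'length': 16}
```

Dividing `identities` by `length` gives the kind of fraction reported on the “Identity” line, here 10 of 16 columns for this short stretch.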

SIFTER also measures closeness but it includes phylogenetic information which BLAST does not.

swamidass · February 9, 2021, 4:09pm

That phylogenetic information is an inferred history of common descent.

cdods · February 9, 2021, 5:01pm

Thanks, that clarifies things a lot. I appreciate you taking the time to teach what I probably should have learned in high school. I find it a lot more interesting now than I did then.

jeffb · February 9, 2021, 10:31pm

Thank you. That does make it a little clearer, getting closer to grasping it. I still feel like I need a more solid understanding in order to assess its potential for evaluating descent vs design. Btw that’s not any kind of ‘dismissal’, because I really am trying to follow this.

Also I’m sorry it’s taking me a while to get back to this. And I feel like I may fall behind in class here…
(BTW I’ve given up on that other thread of mine, I just can’t keep up).

@swamidass, you were gracious to try and keep this thread paced for me, but since I’ve been real busy lately (especially this week), I’m probably going to be going too slow for others. Not sure what to do about that. Should I drop off and let others discuss it?

Perhaps this could help me: Is there an analogy here within my domain: software development, that might help me follow this one more accurately?

Given:

There are many different types of software applications (and functions within them): reading/writing to DBs, communicating over networks, many varieties of user interfaces, desktop, phone, web. All these within a nested hierarchy.
Software lines of code compile into machine code (similar to DNA).

With that in mind, would this be similar: needing to search through existing machine code to find the function of various sections within it, and using other existing apps to assist. One (like BLAST) does pure machine-code matching; another (like SIFTER) uses commonality with a nested hierarchy to try and find matches.

So, a good analogy or not?

swamidass · February 9, 2021, 10:48pm

No. This thread is for you and me. Take your time.

In general, computer code is so limited a parallel that it quickly becomes misleading. One issue is that, unlike the DNA we find in biology, computer code just does not follow a tree-like structure at all. So no one has proposed something like SIFTER to find similar code, and no one expects that such an algorithm would work terribly well.

To reiterate, computer software does not fit in nested hierarchy at all really.

jeffb · February 10, 2021, 10:17pm

Very gracious of you, thank you. I suppose I need to step it up then…

Ok, so I have an idea here. Instead of me trying to gain a much better understanding of the models (which would really slow things down), let me provide my initial feelings contrary to this approach. I’m sure you’ve thought these through, so I’ll let you respond to them as follows:

1. Modeling God accurately: According to the article at OUP.com, the creators of SIFTER modeled their algorithm on phylogeny. I’m assuming they did their best to model it accurately. Then the article says: “BLAST (27) is a sequence matching method”. And the obvious question is: Does “sequence matching” represent God accurately? Followed by: even if it might, did these creators model God accurately? To me, that’s a big pill to swallow. I don’t know if I can answer that, or if I can take other people’s word that “yes it does”.
2. Correlation vs. causality & algorithm performance: I can certainly see how SIFTER performs better than BLAST. It applies an added heuristic in order to improve search results. But this is not causality, it’s simply correlation. Grouping items together by similarity, then running a search algorithm based on that, will almost always perform better, even if it’s simply a correlation. All developers know that linear sequence-matching algorithms are the least performant searches.
3. Circularity: Using phylogeny to prove phylogeny seems a bit circular to me. But I am open to other views on that.
4. Only (or mainly?) for microbes: As compelling as it might be, I would like to see this broadened to more than microbes. Technology is certainly advancing, at quite a rapid pace. It’s great to see. I’m not being dismissive here, but I am saying that I don’t mind holding off on a conclusion on something like this, simply because more data is continually coming in.
5. Validates ID?? Just a final comment: if BLAST did accurately model God in a testable way, and the test thereby supported common descent over design, doesn’t that also validate the testability of design?

To summarize: it makes sense to me that SIFTER performs better even with design (based on that heuristic), and I don’t feel like BLAST represents God accurately.

So…I’m hoping posting that will help speed things up a bit. Tell me your thoughts.

Lastly real quick: although there’s not a lot gained to debate it, I have to respectfully disagree. We can define a nested hierarchy of software. This goes back to point #2 a bit. If I did want to create a similar application for searching machine code, first creating a nested hierarchy of software applications, then using that as a heuristic in the search algorithm would improve its performance.

swamidass · February 10, 2021, 10:28pm

Is it possible you are misunderstanding this test?

We are not considering God vs not God, or designed vs. not designed. Rather, we are considering whether similarity in protein sequences is better explained by common descent or common function.

None of this rules out God’s role in common descent. None of this models God. None of this rules out design.

Is it possible that you do not know what a nested hierarchy is?

Let me give you an example to consider. Imagine several software programs that do or do not import specific modules:

| | M1 | M2 | M3 |
|---|---|---|---|
| P1 | X | X | X |
| P2 | X | X | |
| P3 | | | X |
| P4 | | X | X |

You can extend this table, even with real data. Though I have not specified specific programs and modules here, you should know this is hardly an unreasonable example.

Can you put these programs into nested clades, without any discordant features? If you can, they fit into nested clades. If you can’t, they don’t.

You will find that you can’t put them into nested clades. And, if we increased the number of modules and the number of programs, the problem would get harder. More and more features would be discordant in more and more places.
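One way to check this mechanically is sketched below. It is a simplified test that rests on an assumption not stated in the thread: each module is gained once at a single point in the tree and never lost afterwards. Under that assumption, the programs sharing a module form a clade, so any two modules must cover sets of programs that are either nested or disjoint. The function name `discordant_pairs` is made up for this example.

```python
from itertools import combinations

# Which programs import each module, read off the table above.
modules = {
    "M1": {"P1", "P2"},
    "M2": {"P1", "P2", "P4"},
    "M3": {"P1", "P3", "P4"},
}


def discordant_pairs(features):
    """List feature pairs whose program sets overlap without being nested.

    Assumes each feature is gained once and never lost, so two features fit
    the same nested hierarchy only if their sets are nested or disjoint.
    """
    conflicts = []
    for (a, set_a), (b, set_b) in combinations(sorted(features.items()), 2):
        nested = set_a <= set_b or set_b <= set_a
        disjoint = not (set_a & set_b)
        if not (nested or disjoint):
            conflicts.append((a, b))
    return conflicts


print(discordant_pairs(modules))  # [('M1', 'M3'), ('M2', 'M3')]
```

M1 and M2 nest cleanly, but M3 overlaps both of them without being nested, so no single tree of P1 to P4 accounts for all three modules under this assumption. Allowing modules to be lost again would call for a different test.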

John_Harshman · February 11, 2021, 1:28am

Don’t you mean whether common function is better explained by common descent or sequence similarity?

swamidass · February 11, 2021, 1:29am

Nope. I was pretty clear, and I think that’s what I meant!
