How evolution builds genes from scratch

Nice news piece (open access) in Nature about the developing research field of de novo / orphan / taxonomically restricted (TRG) / lineage-specific genes. The article highlights a point I’ve made elsewhere on this discussion board: the methodological youth of the field and, hence, the widely varying answers about how many orphans or TRGs a species possesses:

The study gets at one of the field’s biggest preoccupations: how to tell whether a gene is truly de novo. Answers vary wildly, and approaches are still evolving. For example, an early study found 15 de novo genes in the whole primate order; a later attempt found 60 in humans alone. One option for finding candidate de novo genes is to use an algorithm to search for similar genes in related species. If nothing shows up, then it’s possible that the gene arose de novo. But failing to find a relative doesn’t mean no relative is there: the gene could have been lost along the way, or might have shape-shifted far away from its kin.
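To make the search strategy in that passage concrete, here is a toy Python sketch. It is not how real pipelines work (those typically use tools such as BLAST with statistical cutoffs); the k-mer Jaccard score, the 0.3 threshold, the function names, and the sequences are all made up for illustration.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of two sequences' k-mer sets (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def candidate_orphans(focal_genes, relative_genes, threshold=0.3, k=3):
    """Names of focal genes with no sufficiently similar gene in any relative."""
    candidates = []
    for name, seq in focal_genes.items():
        has_hit = any(
            similarity(seq, other, k) >= threshold
            for genes in relative_genes.values()
            for other in genes
        )
        if not has_hit:
            candidates.append(name)
    return candidates


# Hypothetical sequences, invented purely for illustration.
focal = {
    "geneA": "MKTAYIAKQRQISFVKSHFSRQ",
    "geneB": "MWWPLCCNNDDEEHHGGYYTTS",
}
relatives = {
    "species1": ["MKTAYIAKQRQISFVKAHFSRQ"],  # geneA with one substitution
    "species2": ["MLLAVVGGSSPPRRKKQQEEDD"],
}
print(candidate_orphans(focal, relatives))
```

With these toy inputs only geneB is flagged. Drop species1 from the comparison set and geneA is flagged too, which is the caveat the article raises: no hit may mean the relative was lost or never sampled, not that the gene arose de novo.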

Worth a look.

Mod edit, adding link: How evolution builds genes from scratch


Is it too much to ask for a link? Never mind, I’ll do it myself. But you really should have.


Ow! What a goofball I am – sorry, John. Link:


Note that the message there seems antithetical to what you want to think: these de novo genes don’t emerge from nowhere; they emerge from sequences that were non-genic in the ancestor and are still present in close relatives. They’re only new as genes, not as sequences in the genome. No poofing, no big mystery.


From the article, a figure that is a good template for discussion:


Run that story backwards to the common ancestors of groups (and then, to their common ancestors). Think about the total content and sequence diversity of accumulating non-genic DNA.

The mystery doesn’t go away. It just moves to a different room in the house.

In the context of the figure above, I don’t see the problem.

Again, referring to the figure, where is this room?


I thought about it, but I can’t figure out what you think it means. I don’t see a mystery.


Science has only recently acquired the tools needed to approach these types of questions, so the youth of the field is to be expected. When I first started working in a lab over 20 years ago, we were still using radioactive nucleotides to do Northern blots, and DNA chips were all the rage. We were still 5 years off from the first draft of the human genome (which still isn’t 100% complete).

One of the most interesting changes over this time is that molecular biologists have gone from being starved for data to being swamped by data. Much of the focus now is on how to analyse all of the sequence and transcript data, what it means, and where to draw certain lines. The line that matters here is the line between irrelevant non-specific transcription and real genes. If you look hard enough you can find a transcript for nearly every base in the human genome, but science is quickly learning that this is most likely background noise.

So what is a gene? How many are there? Where did they come from? All good questions that are being tackled by scientists with new and amazing tools.


It’s in the common ancestor of this species and its nearest relatives, when those species are linked with their relatives (as clades grow in size and rise in taxonomic rank). Total sequence diversity (coding and non-genic DNA) climbs steeply as one moves to deeper branching nodes in any cladogram. Instead of going down, the DNA census goes WAY up.

Easy analogy: natural language. Dictionaries do not get smaller as one samples texts within a language, although one expects the curve to flatten as the relevant search space is saturated. The problem with common ancestry, however, is that one wants to build the gene and non-genic count for any species from a smaller count in its common ancestors.

Orphan and TRG curves do not flatten. They grow with every new genome sequenced.

A mutation creates a putative promoter sequence that increases transcription of a previously non-genic region of DNA. If that mutation results in an increase in fitness the mutation is selected for. Neutral mutations can reach fixation through drift. It doesn’t seem to be much of a mystery.


It is the mutations accumulated since the lineages split that result in orphan genes.

Orphan genes are detected in transcript data, not sequence data.


I think you’re using “sequence diversity” in a non-standard and, to me, indecipherable way there. Are you saying, perhaps, that reconstructed genome sizes increase as you go back in time? I’d like to see a source for that claim.

As they should if new genes are arising all the time. Still not seeing the mystery.


The de novo hypothesis proposed holds that new genes arise from fortuitous mutations in non-genic DNA.

For that scenario to work, non-genic DNA must be available to mutate. Lots of it, in fact, with high initial sequence diversity.

Fortunately for the scenario, there’s plenty of non-genic DNA. Still waiting for the mystery. Are you asking how non-genic DNA arises? There’s plenty of literature on that too. Have you looked?


Do a PubMed search on the “genome of Eden” problem.

What is it about creationists and the substitution of coy hints for real discussion?


Here’s a cite to get you started – I’m not coy, I’m just lazy:

"In any single such case, the doubter might still go on to claim (unparsimoniously) that the gene was present in the last common bacterial/archaeal ancestor and lost independently at least three and maybe five times (depending on what we take to be the tree relating sequenced archaea). However, this line of reasoning when applied consistently to all cases of patchy gene distribution assigns more and more genes to ancient common ancestral genomes, with every new genome sequenced. In the end, it requires a last universal common ancestor with an enormous range of metabolic capabilities—essentially any gene that is now seen in at least one contemporary archaean, and one contemporary bacterium must find a home in that ancient cell. Such a ‘genome of Eden’ hypothesis seems, to us, unappealing."

(p. 46, my emphasis)
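The bookkeeping behind that quoted argument can be sketched in a few lines of Python. This is only a toy under stated assumptions: the gene labels and genomes are invented, and the rule (assign to the ancestor any gene seen in at least one archaeon and at least one bacterium) is taken from the wording of the quote, not from any real reconstruction method.

```python
def inferred_ancestor(archaea, bacteria):
    """Genes seen in at least one archaeon and at least one bacterium."""
    archaeal = set().union(*archaea)
    bacterial = set().union(*bacteria)
    return archaeal & bacterial


# Hypothetical gene content per sequenced genome (labels are made up).
archaea = [{"g1", "g2"}, {"g1", "g3"}, {"g4", "g5"}]
bacteria = [{"g1", "g6"}, {"g2", "g3"}, {"g5", "g7"}]

# Size of the inferred ancestral gene set as genomes are added one pair at a time.
sizes = []
for n in range(1, len(archaea) + 1):
    sizes.append(len(inferred_ancestor(archaea[:n], bacteria[:n])))
print(sizes)
```

Because both unions can only grow as genomes are added, the inferred ancestral set never shrinks under this rule, which is the trend the quoted authors find unappealing.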

Open access paper here:

@pnelson, are you saying that the genome of some hypothetical ancestor (for a lineage of interest that possesses orfans) is inevitably going to be so small that it cannot have given rise to the large swaths of non-genic regions we see in most eukaryotic genomes? This doesn’t make any sense to me, because it doesn’t seem to agree with what we know about genome dynamics. (Or maybe I am just not getting the point…)

Also, I am having a hard time seeing how your scenario relates to bacteria.


In other words, Doolittle et al. are arguing for a scenario that we are all too familiar with here at PS - namely that new protein-coding genes arise de novo, and not by mixing and matching of extant protein-coding genes in some hypothetical ancestor.

De novo genes (with biological function) arise all the time. There really isn’t any doubt about this.

I am still thinking I am missing something here. Thanks for being patient with me, @pnelson.