Mechanisms of the origination of new protein-coding genes

Rapid evolution of protein diversity by de novo origination in Oryza.
Zhang L, Ren Y, Yang T, Li G, Chen J, Gschwend AR, Yu Y, Hou G, Zi J, Zhou R, Wen B, Zhang J, Chougule K, Wang M, Copetti D, Peng Z, Zhang C, Zhang Y, Ouyang Y, Wing RA, Liu S, Long M.
Nat Ecol Evol. 2019 Apr;3(4):679-690. doi: 10.1038/s41559-019-0822-5. Epub 2019 Mar 11.

A recent study provides important insight into, and mechanistic confirmation of, the origins of new protein-coding genes. The study capitalized on the availability of complete genomes for 13 closely related rice (Oryza) species. The abstract (below) gives an excellent summary of the study and its results.

New protein-coding genes that arise de novo from non-coding DNA sequences contribute to protein diversity. However, de novo gene origination is challenging to study as it requires high-quality reference genomes for closely related species, evidence for ancestral non-coding sequences, and transcription and translation of the new genes. High-quality genomes of 13 closely related Oryza species provide unprecedented opportunities to understand de novo origination events. Here, we identify a large number of young de novo genes with discernible recent ancestral non-coding sequences and evidence of translation. Using pipelines examining the synteny relationship between genomes and reciprocal-best whole-genome alignments, we detected at least 175 de novo open reading frames in the focal species O. sativa subspecies japonica, which were all detected in RNA sequencing based transcriptomes. Mass spectrometry-based targeted proteomics and ribosomal profiling show translational evidence for 57% of the de novo genes. In recent divergence of Oryza, an average of 51.5 de novo genes per million years were generated and retained. We observed evolutionary patterns in which excess indels and early transcription were favoured in origination with a stepwise formation of gene structure. These data reveal that de novo genes contribute to the rapid evolution of protein diversity under positive selection.

Fig. 4a of this paper provides a nice overview. Basically, the authors find evidence for three different pathways by which new protein-coding genes may arise. One involves the origination of an open reading frame, followed by some means of transcription “activation” (not to be confused with the term as typically used in molecular biology). Another involves near-simultaneous evolution of a complete, transcriptionally active protein-coding region. The third entails the evolution of protein-coding capacity in transcription units that encode non-coding RNAs. As the figure shows, the vast majority of new genes arise by this third pathway. This casts the genome-wide transcriptional activity seen in ENCODE and other genome projects in an interesting light.

The paper is behind a paywall, but I am happy to email a pdf to anyone who asks in a message.


Hi Art,

Please send me the pdf. Neither the University of Chicago nor Northwestern (my usual sources) carries this title, which means Springer has made the journal inexplicably costly. Paywall publishing can’t die soon enough for me.



P.S. OOPS – have it already. [embarrassed face here] Never mind, thanks for the offer.


Too late. More clutter in your email box.


Clarification: the vast majority of de novo genes arise by that pathway. The vast majority of new genes arise by duplication and divergence of prior genes. Right?

Another question: how do you separate translation of functional products from accidental transcription and translation of non-functional sequences?


That is correct. Thanks for catching this. Sorry for confusing people.

An excellent question. From the paper:

Evidence for the functionality of de novo genes. Although the de novo gene candidates were identified from the strictly annotated genes in the 13 Oryza genomes using a uniform annotation pipeline that minimized potential methodological artefacts (ref. 38), we further examined these gene candidates for their functionality and, especially, their translational evidence. We examined the potential functionality of these de novo gene candidates with several lines of evidence by characterizing their structure, expression and evolutionary constraints. First, all candidates have intact ORFs that are, on average, 137 amino acids long, with GC content typical of protein-coding genes (59.1% compared with a 56.8% genome average; Supplementary Table 4). Second, every de novo gene candidate has evolved a tissue-biased or -specific expression pattern, just like the other functional genes in the genome (Supplementary Table 6). Third, all candidates have intact gene structures.

We analysed the sequence evolution to detect substitution signals and determine the functionality of de novo gene sequences. We applied the branch model in PAML (ref. 60) to 236 candidate de novo genes that have 3 orthologous sequences, with the aim of identifying genes that showed a signal of natural selection as detected in their sequence substitutions at synonymous sites (dS), non-synonymous sites (dN) and ω = dN/dS. The likelihood ratio test and Akaike information criterion (AIC) identified 28 candidate de novo genes that are incompatible with the model of neutrality, with ω either significantly lower than 1 (22 genes) or higher than 1 (6 genes), suggesting that they may undergo negative or positive selection (Supplementary Table 9). In the 45 candidate de novo genes that have only 2 orthologues, we detected 2 genes with ω significantly lower or higher than 1, whereas most of them had ω ratios lower than 1. Together, we detected 30 candidate de novo genes with substitution signals of negative or positive selection. The remaining genes had lower statistical power due to small numbers of substitutions (Supplementary Table 9). These results support the coding potential of the candidate de novo genes, prompting us to further explore experimental evidence for their translation.
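
For anyone who wants to see the shape of the test described in that passage, here is a minimal Python sketch of a likelihood ratio test and AIC comparison for ω. It is an illustration under stated assumptions, not the authors' pipeline: it assumes a null model with ω fixed at 1 and an alternative with ω estimated freely (one extra parameter), with the log-likelihoods supplied by a program such as PAML. The function names and all numbers in the example are hypothetical.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test with one extra free parameter.

    Assumes the test statistic 2 * (lnL_alt - lnL_null) follows a
    chi-square distribution with 1 degree of freedom under the null.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def aic(lnl: float, n_params: int) -> float:
    """Akaike information criterion; lower values indicate a preferred model."""
    return 2.0 * n_params - 2.0 * lnl


def classify(omega: float, lnl_null: float, lnl_alt: float,
             alpha: float = 0.05) -> str:
    """Label a gene by whether its estimated omega departs from 1."""
    if lrt_pvalue(lnl_null, lnl_alt) >= alpha:
        return "compatible with neutrality"
    return "negative selection" if omega < 1.0 else "positive selection"


if __name__ == "__main__":
    # Made-up log-likelihoods and omega estimates, for illustration only.
    examples = [
        ("geneA", 0.2, -1000.0, -995.0),
        ("geneB", 2.5, -1000.0, -995.0),
        ("geneC", 0.8, -1000.0, -999.5),
    ]
    for name, omega, lnl0, lnl1 in examples:
        print(name, classify(omega, lnl0, lnl1))
```

Note that a gene with few substitutions will often land in the "compatible with neutrality" bin simply because the two log-likelihoods barely differ, which is the low-power situation the quoted passage mentions.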


I’m concerned about using tissue-specific expression as evidence of function, since it happens in non-functional sequences too. The evidence of selection is better.


I will remember this research the next time I put some stevia into a cup of tea.


This paper is on the same topic and addresses some of the issues already raised in this thread (I think).


It’s fascinating to see how much the field has shifted. As recently as 10 years ago, de novo origination was still quite speculative, and there was little evidence for the mechanisms by which it can happen. Sequencing technology, international public databases, and search tools have made an enormous difference to the development of this field.


Of interest (really beautiful new paper):


What lesson do you derive from that paper?


That’s an interesting paper, @sfmatheson. On a quick reading, it may give pause to excessive reliance on signatures of selection as an indication of function. Maybe @John_Harshman can weigh in on this.