Mechanisms of the origination of new protein-coding genes

Art · August 12, 2019, 2:50pm

Rapid evolution of protein diversity by de novo origination in Oryza.
Zhang L, Ren Y, Yang T, Li G, Chen J, Gschwend AR, Yu Y, Hou G, Zi J, Zhou R, Wen B, Zhang J, Chougule K, Wang M, Copetti D, Peng Z, Zhang C, Zhang Y, Ouyang Y, Wing RA, Liu S, Long M.
Nat Ecol Evol. 2019 Apr;3(4):679-690. doi: 10.1038/s41559-019-0822-5. Epub 2019 Mar 11.

A recent study provides important insight into, and mechanistic confirmation of, the origins of new protein-coding genes. The study capitalized on the availability of complete genomes of many related rice species. The abstract (below) gives an excellent summary of the study and its results.

New protein-coding genes that arise de novo from non-coding DNA sequences contribute to protein diversity. However, de novo gene origination is challenging to study as it requires high-quality reference genomes for closely related species, evidence for ancestral non-coding sequences, and transcription and translation of the new genes. High-quality genomes of 13 closely related Oryza species provide unprecedented opportunities to understand de novo origination events. Here, we identify a large number of young de novo genes with discernible recent ancestral non-coding sequences and evidence of translation. Using pipelines examining the synteny relationship between genomes and reciprocal-best whole-genome alignments, we detected at least 175 de novo open reading frames in the focal species O. sativa subspecies japonica, which were all detected in RNA sequencing based transcriptomes. Mass spectrometry-based targeted proteomics and ribosomal profiling show translational evidence for 57% of the de novo genes. In recent divergence of Oryza, an average of 51.5 de novo genes per million years were generated and retained. We observed evolutionary patterns in which excess indels and early transcription were favoured in origination with a stepwise formation of gene structure. These data reveal that de novo genes contribute to the rapid evolution of protein diversity under positive selection.

Fig. 4a of this paper provides a nice overview. Basically, the authors find evidence for three different pathways by which new protein-coding genes may arise. One involves the origination of an open reading frame, followed by some means of transcription “activation” (not to be confused with the term as typically used in molecular biology). Another involves near-simultaneous evolution of a complete, transcriptionally-active protein coding region. The third entails the evolution of protein-coding capacity in transcription units that encode non-coding RNAs. As the figure shows, the vast majority of new genes arise by this third pathway. This casts the genome-wide transcriptional activity seen in ENCODE and other genome projects in an interesting light.

The paper is behind a paywall, but I am happy to email a pdf to anyone who asks in a message.

pnelson · August 12, 2019, 2:59pm

Hi Art,

Please send me the pdf. Neither the University of Chicago nor Northwestern (my usual sources) carries this title, which means Springer has made the journal inexplicably costly. Paywall publishing can’t die soon enough for me.

Thanks!

PN

P.S. OOPS – have it already. [embarrassed face here] Never mind, thanks for the offer.

Art · August 12, 2019, 3:10pm

Too late. More clutter in your email box.

John_Harshman · August 12, 2019, 4:36pm

Clarification: the vast majority of de novo genes arise by that pathway. The vast majority of new genes arise by duplication and divergence of prior genes. Right?

Another question: how do you separate translation of functional products from accidental transcription and translation of non-functional sequences?

Art · August 12, 2019, 5:06pm

That is correct. Thanks for catching this. Sorry for confusing people.

An excellent question. From the paper:

Evidence for the functionality of de novo genes. Although the de novo gene candidates were identified from the strictly annotated genes in the 13 Oryza genomes using a uniform annotation pipeline that minimized potential methodological artefacts [38], we further examined these gene candidates for their functionality and, especially, their translational evidence. We examined the potential functionality of these de novo gene candidates with several lines of evidence by characterizing their structure, expression and evolutionary constraints. First, all candidates have intact ORFs that are, on average, 137 amino acids long, with GC content typical of protein-coding genes (59.1% compared with a 56.8% genome average; Supplementary Table 4). Second, every de novo gene candidate has evolved a tissue-biased or -specific expression pattern, just like the other functional genes in the genome (Supplementary Table 6). Third, all candidates have intact gene structures.

We analysed the sequence evolution to detect substitution signals and determine the functionality of de novo gene sequences. We applied the branch model in PAML [60] to 236 candidate de novo genes that have ≥ 3 orthologous sequences, with the aim of identifying genes that showed a signal of natural selection as detected in their sequence substitutions at synonymous sites (dS), non-synonymous sites (dN) and ω = dN/dS. The likelihood ratio test and Akaike information criterion (AIC) identified 28 candidate de novo genes that are incompatible with the model of neutrality, with ω either significantly lower than 1 (22 genes) or higher than 1 (6 genes), suggesting that they may undergo negative or positive selection (Supplementary Table 9). In the 45 candidate de novo genes that have only 2 orthologues, we detected 2 genes with ω significantly lower or higher than 1, whereas most of them had ω ratios lower than 1. Together, we detected 30 candidate de novo genes with substitution signals of negative or positive selection. The remaining genes had lower statistical power due to small numbers of substitutions (Supplementary Table 9). These results support the coding potential of the candidate de novo genes, prompting us to further explore experimental evidence for their translation.
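
For anyone unfamiliar with the kind of test that passage describes, here is a minimal Python sketch of a likelihood ratio test and AIC comparison between a neutral model and a model with free ω. It is not PAML and not the authors' pipeline. It assumes the null model fixes ω = 1 and the alternative estimates one extra ω parameter (so one degree of freedom); the function names, parameter counts, and log-likelihood values are hypothetical and would in practice come from the model-fitting software.

```python
import math


def lrt_pvalue_df1(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one free parameter.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value.
    For 1 degree of freedom the chi-square survival function is
    erfc(sqrt(x / 2)), so no external statistics library is needed.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


def aic(lnl, n_params):
    """Akaike information criterion: lower values indicate a preferred model."""
    return 2.0 * n_params - 2.0 * lnl


def classify(omega_hat, lnl_null, lnl_alt, alpha=0.05):
    """Label a gene by whether neutrality (omega = 1) is rejected, and in which direction."""
    _, p = lrt_pvalue_df1(lnl_null, lnl_alt)
    if p >= alpha:
        return "neutral not rejected"
    if omega_hat < 1:
        return "negative (purifying) selection"
    return "positive selection"


# Hypothetical inputs for illustration only (not values from the paper).
if __name__ == "__main__":
    print(classify(omega_hat=0.3, lnl_null=-1000.0, lnl_alt=-990.0))
    print(classify(omega_hat=0.8, lnl_null=-1000.0, lnl_alt=-999.5))
    # Assumed parameter counts: 1 for the null model, 2 for the alternative.
    print(aic(-1000.0, 1), aic(-990.0, 2))
```

Note the second example: an estimated ω below 1 is not enough by itself. With few substitutions the likelihood gain is small and neutrality cannot be rejected, which is the low statistical power the authors mention for the remaining genes.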

John_Harshman · August 12, 2019, 5:27pm

I’m concerned about using tissue-specific expression as evidence of function, since it happens in non-functional sequences too. The evidence of selection is better.

Chris_Falter · August 12, 2019, 9:16pm

I will remember this research the next time I put some stevia into a cup of tea.

sfmatheson · August 13, 2019, 11:28pm

This paper is on the same topic and addresses some of the issues already raised in this thread (I think).

https://www.genetics.org/content/212/4/1353

Rumraket · August 14, 2019, 8:43am

It’s fascinating to see how the field has shifted so much. As little as 10 years ago the possibility of de novo origination was still quite speculative and little evidence for the mechanisms by which it can happen existed. Sequencing technology, international public databases, and search tools have really made an enormous difference for the development of this field.

pnelson · August 14, 2019, 1:00pm

Of interest (really beautiful new paper):

John_Harshman · August 14, 2019, 1:05pm

What lesson do you derive from that paper?

Art · August 16, 2019, 12:56am

That’s an interesting paper, @sfmatheson. On a quick reading, it may give pause to excessive reliance on signatures of selection as an indication of function. Maybe @John_Harshman can weigh in on this.
