Rapid evolution of protein diversity by de novo origination in Oryza.
Zhang L, Ren Y, Yang T, Li G, Chen J, Gschwend AR, Yu Y, Hou G, Zi J, Zhou R, Wen B, Zhang J, Chougule K, Wang M, Copetti D, Peng Z, Zhang C, Zhang Y, Ouyang Y, Wing RA, Liu S, Long M.
Nat Ecol Evol. 2019 Apr;3(4):679-690. doi: 10.1038/s41559-019-0822-5. Epub 2019 Mar 11.
A recent study provides important insight into the origins of new protein-coding genes, along with mechanistic confirmation of how they arise. The study capitalized on the availability of high-quality genomes for 13 closely related rice (Oryza) species. The abstract (below) gives an excellent summary of the study and its results.
New protein-coding genes that arise de novo from non-coding DNA sequences contribute to protein diversity. However, de novo gene origination is challenging to study as it requires high-quality reference genomes for closely related species, evidence for ancestral non-coding sequences, and transcription and translation of the new genes. High-quality genomes of 13 closely related Oryza species provide unprecedented opportunities to understand de novo origination events. Here, we identify a large number of young de novo genes with discernible recent ancestral non-coding sequences and evidence of translation. Using pipelines examining the synteny relationship between genomes and reciprocal-best whole-genome alignments, we detected at least 175 de novo open reading frames in the focal species O. sativa subspecies japonica, which were all detected in RNA sequencing based transcriptomes. Mass spectrometry-based targeted proteomics and ribosomal profiling show translational evidence for 57% of the de novo genes. In recent divergence of Oryza, an average of 51.5 de novo genes per million years were generated and retained. We observed evolutionary patterns in which excess indels and early transcription were favoured in origination with a stepwise formation of gene structure. These data reveal that de novo genes contribute to the rapid evolution of protein diversity under positive selection.
Fig. 4a of this paper provides a nice overview. Basically, the authors find evidence for three different pathways by which new protein-coding genes may arise. One involves the origination of an open reading frame, followed by some means of transcription “activation” (not to be confused with the term as typically used in molecular biology). Another involves the near-simultaneous evolution of a complete, transcriptionally active protein-coding region. The third entails the evolution of protein-coding capacity in transcription units that previously produced only non-coding RNAs. As the figure shows, the vast majority of new genes arise by this third pathway. This casts the genome-wide transcriptional activity seen in ENCODE and other genome projects in an interesting light.
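For readers who think in code, the three pathways can be restated as a simple ordering rule: which appeared first along the lineage, the open reading frame or transcription? The sketch below is only an illustration of that logic, not the authors' pipeline. The function name, the "age" inputs (how many ancestral nodes back a feature is first inferred to be present) and the example values are all hypothetical.

```python
from collections import Counter
from enum import Enum


class Pathway(Enum):
    ORF_FIRST = "ORF first, transcription later"
    SIMULTANEOUS = "ORF and transcription arise together"
    TRANSCRIPTION_FIRST = "transcription first, ORF later"


def classify_pathway(orf_age: int, transcription_age: int) -> Pathway:
    """Classify one de novo gene by the relative age of its two features.

    Both ages are hypothetical inputs: the number of ancestral nodes,
    counted back from the focal species, at which the feature is first
    inferred to be present. A larger number means an older feature.
    """
    if orf_age > transcription_age:
        return Pathway.ORF_FIRST
    if orf_age < transcription_age:
        return Pathway.TRANSCRIPTION_FIRST
    return Pathway.SIMULTANEOUS


# Made-up (orf_age, transcription_age) pairs, for illustration only.
example_genes = [(1, 4), (2, 5), (3, 3), (4, 1), (1, 2)]
tally = Counter(classify_pathway(orf, tx) for orf, tx in example_genes)
for pathway, count in tally.items():
    print(f"{pathway.value}: {count}")
```

In a real analysis the two ages would have to come from ancestral-state inference across the Oryza phylogeny, which is the hard part and is not attempted here.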
The paper is behind a paywall, but I am happy to email a pdf to anyone who asks in a message.