Discovering Orphan Genes

Guerzoni and McLysaght (2016) claimed to have identified 35 orphan genes specific to the great apes. There’s something I don’t understand about the methodology used to find these genes. Here’s what they did:

We compared the complete human proteome with that of chimpanzee, gorilla, orangutan, and macaque using BLAST. Candidate de novo genes were those where none of the potential proteins of the gene had hits in orangutan or macaque. […] This resulted in 734 candidate de novo genes from EnsEMBL v60 and an additional 67 genes from subsequent EnsEMBL versions (v61–69). […] In order to unambiguously show that a given gene has arisen de novo it is necessary to demonstrate that the ancestral sequence was noncoding. We used tBLASTn to search for the orthologous DNA in outgroup primate genomes. The outgroup orthologous DNA was identifiable for 233 genes.

This implies that there are ~500 genes that do not have orthologous DNA sequences in orangutan and macaque, quite a large number if you ask me. But why did they exclude all of them from their analysis? They write:

The method we use here builds on the approach of Knowles and McLysaght (2009) where initially plausible de novo genes are examined for evidence of the absence of the gene in the ancestor, as well as for supporting evidence. This approach requires the detection of the orthologous DNA sequence in the outgroup lineage, otherwise the gene is excluded as ambiguous.

But wouldn’t it be much more interesting to find out more about those genes that do not even have orthologous sequences in closely related lineages? Why don’t they qualify as orphan genes?
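For reference, the "~500" figure can be checked against the counts in the quoted methods text. This is only a rough sketch: it assumes that no further filtering happened in the elided ("[…]") steps between the candidate list and the tBLASTn search, which the excerpt does not confirm. The function and variable names are illustrative.

```python
# Counts copied from the quoted methods excerpt.
CANDIDATES_V60 = 734        # candidate de novo genes from EnsEMBL v60
CANDIDATES_V61_69 = 67      # additional candidates from EnsEMBL v61-69
WITH_OUTGROUP_DNA = 233     # candidates with identifiable outgroup orthologous DNA


def count_ambiguous(initial, later, resolved):
    """Return (total candidates, candidates without identifiable outgroup DNA).

    Assumes every candidate entered the tBLASTn step, i.e. that the elided
    parts of the excerpt removed nothing in between (an assumption).
    """
    total = initial + later
    return total, total - resolved


total_candidates, ambiguous = count_ambiguous(
    CANDIDATES_V60, CANDIDATES_V61_69, WITH_OUTGROUP_DNA
)
print(f"total candidates: {total_candidates}")
print(f"without identifiable outgroup DNA: {ambiguous}")
```

Under that assumption the excluded set is somewhat larger than 500, so "~500" is a loose rounding.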


@Ignostic, it isn’t as interesting as you might think. This is often caused by sequencing gaps. Did you know that we don’t have 100% of the chimp, orangutan and macaque genomes sequenced yet?

We work by looking for all possible causes for anomalies like this, but they did not have the data to rule out several likely but boring causes. That is why they are labeled ambiguous.

@roohif @glipsnort @evograd @davecarlson


They were labeled as ambiguous and then excluded from analysis at that time. This is not at all the same as saying they “don’t qualify as orphan genes.” This should have been obvious from the choice of the word ‘ambiguous.’


That’s just a matter of semantics. My point still stands: why not look at the large number of orphan genes that apparently lack any orthologous counterpart in the macaque/orangutan genome? From an evolutionary perspective those are far more interesting, although this may just be an artifact of incomplete data.

I am sure they will look, when they have more data available.


Nope. It’s careful science.

Your point doesn’t stand, because your question (“Why not look…?”) is silly. You are ignoring the English words in the preceding posts that answer the question.


Because. . .

Or rather, most of them are very likely artifacts of incomplete data. Which makes them quite uninteresting to researchers. There are enough interesting things to study that are probably real.


It would be a huge waste of time to do a lot of work on putative orphan genes only to have most of it overturned later by better data collection and annotation. It’s better to wait for more complete data and THEN start analyzing it.


This is a young field, with unstable terminology, messy categories, and enough uncertainty and open questions to motivate a few dozen PhD dissertations per year:

“…it is important to note that there is a lack of consensus about what constitutes a genuine de novo gene birth event. One reason for this is a lack of agreement on whether or not the entirety of the newly genic sequence must be non-genic in origin. With respect to protein-coding de novo genes, it has been proposed that de novo genes be divided into subtypes corresponding to the proportion of the ORF in question that was derived from previously non-coding sequence [48]. Furthermore, for de novo gene birth to occur, the sequence in question must not just have emerged de novo but must in fact be a gene. Accordingly, the discovery of de novo gene birth has also led to a questioning of what constitutes a gene, with some models establishing a strict dichotomy between genic and non-genic sequences, and others proposing a more fluid continuum (see below). All definitions of genes are linked to the notion of function, as it is generally agreed that a genuine gene should encode a functional product, be it RNA or protein. There are, however, different views of what constitutes function, depending in part on whether a given sequence is assessed using genetic, biochemical, or evolutionary approaches [48, 68, 69].”

“…a questioning of what constitutes a gene…” – that goes deep, into a murky forest of puzzles.

From here (p. 9; bold emphasis above is mine), open access:


But do we actually have supporting evidence to conclude that this is just an artifact? To assume that ~500 potential orphan genes lack any orthologous non-genic sequences just due to incomplete data banks… that appears to be a bit of a stretch. But maybe I’m just being foolish. How much of the macaque/orangutan genome has actually been sequenced? Is it even possible to estimate such a percentage? I couldn’t find any information on this.

@Ignostic, no, we are not yet sure it is an artifact. As we stated, and they stated, these cases are “ambiguous.” We do not know from evidence yet what they are. When we get more evidence, I am sure someone will look again.

No one has assumed anything. We said “we do not know yet, but think it is missing data.”

Probably less than 90%, as the human genome is around 90%. Just recently new technology is allowing for sequencing of full chromosomes. So this could change in the near future: First Complete Sequence of a Human Chromosome.


According to NCBI, the ungapped assembly lengths for the current orangutan and Rhesus macaque genomes are 3,043,444,524 and 2,936,892,733 bp, respectively.

According to the Animal Genome Size Database, the c-values for orangutan range from 3.60 to 4.10 picograms. Likewise the c-values for rhesus macaque range from 3.14 to 3.59 picograms.

If we arbitrarily take the smallest of these values, and apply the standard conversion of 1 pg = 978 Mb, this gives us estimated total genome sizes of 3.52 and 3.07 Gb for the orangutan and rhesus macaque.

Assuming that all the above numbers are approximately correct, then their respective assemblies represent roughly 86.4% and 95.6% of the “true” genome sizes.
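The arithmetic above can be reproduced with a short script. This is a minimal sketch using only the numbers quoted in this post; the function name and the dictionary layout are illustrative, and the result is only as good as the C-value measurements. It also shows how far the estimate moves if the largest C-value is used instead of the smallest.

```python
MB_PER_PG = 978  # standard conversion: 1 pg of DNA is about 978 Mb


def assembly_fraction(ungapped_bp, c_value_pg):
    """Fraction of the estimated genome covered by the ungapped assembly."""
    estimated_bp = c_value_pg * MB_PER_PG * 1_000_000
    return ungapped_bp / estimated_bp


# (ungapped assembly length in bp, smallest C-value in pg, largest C-value in pg)
genomes = {
    "orangutan": (3_043_444_524, 3.60, 4.10),
    "rhesus macaque": (2_936_892_733, 3.14, 3.59),
}

for name, (ungapped_bp, c_min, c_max) in genomes.items():
    # The smallest C-value gives the smallest genome, hence the highest coverage.
    best_case = assembly_fraction(ungapped_bp, c_min)
    worst_case = assembly_fraction(ungapped_bp, c_max)
    print(f"{name}: {worst_case:.1%} to {best_case:.1%} of the estimated genome")
```

Because the smallest C-values were chosen, the percentages quoted above are the optimistic end of the range.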


You’ve been told how researchers view these: most are likely artifacts, since there are bound to be many artifacts given the incomplete state of the genomes. There are probably some real ones in there, too. Sorting them out is really not possible until better data comes along, and time and resources are too scarce to waste on what is mostly a whole lot of nothing. Just what do you think should be done with them, anyway? Doesn’t it make more sense to pursue better understanding of confirmed cases than mostly spurious ones?


Just what do you think should be done with them, anyway?

I think we should have a look into it. I don’t believe de novo emergence of genes to be impossible, although it is most certainly a rare occurrence. But orphan genes don’t just pop into existence out of thin air. Even if those candidate genes are not real “orphans” after all, and even if they are not functional, they must have a precursor sequence. If that sequence cannot be found, then we have to invoke gene rearrangements, extremely high rates of sequence divergence, or loss of genetic material. If that cannot be proven either… then I don’t know what could possibly account for their existence. It would just be a mystery. And I think that mysteries should be solved.

This all seems fair enough in my view and I agree. All I would add is a wet blanket on the fire. We want answers!, but science takes time. :slight_smile:


By all means, go for it. What form would “looking into it” take, though? As I said, what are you proposing that researchers actually do?

You’ve given no reason to think this mystery even exists. Given the state of the genomes, we know that false positives will occur. Quite a lot of them. So what’s your reason for thinking that these aren’t mostly false positives, along with a few additional genuine cases like the ones already in the books?


So I suppose I was right in the ball park there.

I apologize for what may be a repetitious theme in my replies, but there exists an abundance of excellent open access papers on orphan / de novo / taxonomically restricted genes, for anyone who wants to pursue the matter in greater depth. Most authors take the orphans puzzle seriously; in other words, orphans are not simply artifacts of poor sampling or incomplete genome annotation:

One of the surprises of the genomic era was that gene birth is not a dead process. The prior paradigm, that proteins evolve only by gradual “tinkering” with existing material [1], was contradicted when the sequencing of the first genomes uncovered many species-specific “orphan” genes [2]. Most researchers argued then that the uniqueness of these genes was an artifact of sparse sampling or bad gene prediction, and that when enough genomes were sequenced, all correctly annotated genes would cluster into large, ancient families. But more sequencing proved exactly the opposite. Researchers have shown that not only can genes encoding novel proteins arise de novo [2, 3], but they do so often, as shown, for example, in animals [4–6], plants [7], protists [8], and yeast [9]. In addition to arising de novo, orphan genes could be derived from a very rapid mutation of existing CDSs beyond recognition [10], although we are unaware of specific evidence for this phenomenon. Although most of the approximately several billion orphan genes in extant eukaryotes [11] have never been studied, functions are being shown for a growing minority.

From here: fagin : synteny-based phylostratigraphy and finer classification of young genes | BMC Bioinformatics | Full Text

This group’s estimate of “several billion orphan genes in extant eukaryotes” is based on their calculations in another publication (unfortunately, not open access, but I’ll send the pdf to anyone interested):

“Protein-coding orphan genes comprise ∼1–10% of genes in most eukaryotic genomes [47]. Using a conservative estimate of 15,000 genes/eukaryotic species [49], then there will be around 100–1000 orphan genes per species. Assuming there are around 10 million eukaryotic species on earth [50], then the total number of unique orphan protein sequences will be on the order of 1–10 billion.”

From here: Raising orphans from a metadata morass: A researcher's guide to re-use of public 'omics data. - PubMed - NCBI
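The back-of-envelope estimate in that quote can be written out as follows. This is just a sketch of the quoted arithmetic, not the authors' own calculation; the function name is illustrative, and all inputs are the rough figures from the quote. Integer percentages are used to avoid floating-point rounding.

```python
def orphan_estimate(genes_per_species, pct_low, pct_high, n_species):
    """Return ((low, high) orphans per species, (low, high) orphans overall)."""
    per_species = (
        genes_per_species * pct_low // 100,
        genes_per_species * pct_high // 100,
    )
    overall = (per_species[0] * n_species, per_species[1] * n_species)
    return per_species, overall


per_species, overall = orphan_estimate(
    genes_per_species=15_000,   # "conservative estimate" from the quote
    pct_low=1,                  # orphans as 1% of genes
    pct_high=10,                # orphans as 10% of genes
    n_species=10_000_000,       # ~10 million eukaryotic species
)
print(f"orphans per species: {per_species[0]:,} to {per_species[1]:,}")
print(f"orphans overall: {overall[0]:,} to {overall[1]:,}")
```

Taken literally, 1–10% of 15,000 is 150–1,500 per species, so the quote's "100–1000" and "1–10 billion" are order-of-magnitude roundings rather than exact products.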


I don’t believe anyone has suggested that orphan genes are simply artifacts of poor sampling or incomplete genome annotation.


Glipsnort wrote:

“I don’t believe anyone has suggested that orphan genes are simply artifacts of poor sampling or incomplete genome annotation.”

From the BMC Bioinformatics paper (Arendsee et al. 2019) cited in my post above:

“Most researchers argued then that the uniqueness of these genes was an artifact of sparse sampling or bad gene prediction, and that when enough genomes were sequenced, all correctly annotated genes would cluster into large, ancient families.”