Guerzoni and McLysaght (2016) claimed to have identified 35 orphan genes specific to the great apes. There’s something I don’t understand about the methodology used to find these genes. Here’s what they did:
“We compared the complete human proteome with that of chimpanzee, gorilla, orangutan, and macaque using BLAST. Candidate de novo genes were those where none of the potential proteins of the gene had hits in orangutan or macaque. […] This resulted in 734 candidate de novo genes from EnsEMBL v60 and an additional 67 genes from subsequent EnsEMBL versions (v61–69). […] In order to unambiguously show that a given gene has arisen de novo it is necessary to demonstrate that the ancestral sequence was noncoding. We used tBLASTn to search for the orthologous DNA in outgroup primate genomes. The outgroup orthologous DNA was identifiable for 233 genes.”
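This implies that orthologous DNA could not be identified in the outgroups for roughly 570 of the candidates (734 + 67 = 801 candidates, of which 233 had identifiable outgroup DNA, leaving 568, assuming no genes were dropped in the elided steps), quite a large number if you ask me. But why did they exclude all of them from their analysis? They write: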
“The method we use here builds on the approach of Knowles and McLysaght (2009) where initially plausible de novo genes are examined for evidence of the absence of the gene in the ancestor, as well as for supporting evidence. This approach requires the detection of the orthologous DNA sequence in the outgroup lineage, otherwise the gene is excluded as ambiguous.”
But wouldn’t it be much more interesting to find out more about those genes that do not even have orthologous sequences in closely related lineages? Why don’t they qualify as orphan genes?
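To make sure I am reading the pipeline correctly, here is a minimal sketch of the filtering logic as I understand it from the quoted passages. It is not the authors' code. The names `classify`, `Outcome` and `excluded_count` are hypothetical, and I am assuming two things the quotes do not state explicitly: that orthologous DNA in either outgroup is enough to retain a gene, and that no genes were removed in the elided steps between the 801 candidates and the 233 retained ones.

```python
from enum import Enum

# Counts quoted from Guerzoni and McLysaght (2016).
CANDIDATES_V60 = 734      # candidates from EnsEMBL v60
CANDIDATES_V61_69 = 67    # additional candidates from v61-69
WITH_OUTGROUP_DNA = 233   # candidates with identifiable outgroup DNA

# Outgroup species named in the quoted methods.
OUTGROUPS = {"orangutan", "macaque"}


class Outcome(Enum):
    NOT_CANDIDATE = "protein hit in an outgroup"
    RETAINED = "no protein hit, orthologous outgroup DNA found"
    AMBIGUOUS = "no protein hit, no orthologous outgroup DNA found"


def classify(protein_hits, orthologous_dna):
    """Classify one gene.

    protein_hits: species in which any protein of the gene has a BLAST hit.
    orthologous_dna: species in which tBLASTn located orthologous DNA.
    Assumption: DNA found in either outgroup is enough to retain the gene.
    """
    if OUTGROUPS & set(protein_hits):
        return Outcome.NOT_CANDIDATE
    if OUTGROUPS & set(orthologous_dna):
        return Outcome.RETAINED
    return Outcome.AMBIGUOUS


def excluded_count():
    """Candidates without identifiable outgroup DNA.

    Assumption: the elided steps removed no genes.
    """
    total = CANDIDATES_V60 + CANDIDATES_V61_69
    return total - WITH_OUTGROUP_DNA


if __name__ == "__main__":
    print(classify({"chimpanzee"}, {"orangutan"}))
    print(classify({"chimpanzee", "gorilla"}, set()))
    print(excluded_count())
```

The genes my question is about are the ones that land in the `AMBIGUOUS` branch: under the assumptions above that would be 568 of the 801 candidates.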