James Tour on Orphan Genes

@evograd is working on it now. When we have it concisely and clearly explained, I’ll pass it on to him. We will show him the data. We will see what he thinks of it.

I have to say that’s a really bad statement of neutral drift, though it could perhaps be just bad writing rather than misunderstanding.

He is a chemist. That is pretty good for a chemist. (Though @jordan is making some serious headway)

First, that’s insulting to chemists. Second, he’s a chemist presuming to judge evolutionary biology, which gives him a responsibility to get it right.

Let’s chalk it up to bad writing.

In the past, too, he has not “judged” evolutionary biology. He has stated, accurately, that it made no sense to him. That was true. No one had, I found out, actually sat down and explained it to him.

Well, Tour is a synthetic organic chemist so it’s probably not that much of an insult :slight_smile: I have higher expectations of biochemists :wink:

I really do get the concern about chemists. I feel like we’re kind of left out of the discussion so much of the time. It feels like the physicists and biologists get the monopoly on the “big questions”, so it’s natural for those of us interested in big questions to want to be a part of the conversation. Tour has done a lot of very good and important work in nanotechnology, but I do think scientists have to be careful not to speak confidently outside our own discipline.

I’m just trying to learn from the experts at this point; I want to know more about what it means to be human.


Ok, so I’ve taken a look at the data from Ruiz-Orera et al. (2015):

They identified 634 candidate de novo genes in the human genome, based on the fact that they found 1,029 transcripts in humans that weren’t present in chimpanzees.
They link to a GTF file containing the information about those human-specific transcripts here: http://dx.doi.org/10.6084/m9.figshare.1604892
but note that this file contains both the human-specific transcripts found in humans, as well as the hominoid-specific transcripts found in humans. This wasn’t apparent to me at first, and I think @roohif missed it too, as he included the entire file in his analysis. I separated out the species-specific transcripts only (1,029).

A second complication is that the GTF file contains separate entries for the different exons and CDSs in each transcript, so while there are only 1,029 transcripts (1,029 transcript IDs), there are ~4,000 individual sequences specified in the file. As these 1,029 transcripts correspond to just 634 candidate genes, there is a certain amount of overlap between some transcripts. For example, transcript 1 and transcript 2 might both contain the exact same exon 1, but have different exon 2s. In this case, there are 2 entries for that exon 1 in the GTF file. In other words, there are a few duplicated sequences in there.
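To make the transcript-versus-entry distinction concrete, here is a minimal sketch of how one might count transcript IDs, feature entries, and duplicated intervals in a GTF file. It assumes the standard 9-column tab-separated GTF layout with a `transcript_id "..."` attribute; the example lines and the `summarise_gtf` helper are hypothetical and are not taken from the Ruiz-Orera et al. file or from @roohif’s code.

```python
import re
from collections import Counter

# Hypothetical GTF lines (9 tab-separated columns); NOT from the real file.
# Transcripts t1 and t2 share the same first exon but differ in the second.
EXAMPLE_GTF = "\n".join([
    'chr1\tsketch\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsketch\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsketch\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
    'chr1\tsketch\texon\t500\t650\t.\t+\t.\tgene_id "g1"; transcript_id "t2";',
])

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def summarise_gtf(text):
    """Return (number of transcript IDs, number of entries, duplicated intervals)."""
    transcripts = set()
    intervals = Counter()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        chrom, feature, strand = fields[0], fields[2], fields[6]
        start, end = int(fields[3]), int(fields[4])
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            transcripts.add(match.group(1))
        # The same interval listed under two transcripts is counted twice.
        intervals[(chrom, feature, start, end, strand)] += 1
    duplicated = {key: n for key, n in intervals.items() if n > 1}
    return len(transcripts), sum(intervals.values()), duplicated


n_transcripts, n_entries, duplicated = summarise_gtf(EXAMPLE_GTF)
print(n_transcripts, n_entries, duplicated)
```

In this toy input there are 2 transcript IDs but 4 entries, and the shared exon at 100-200 shows up twice, which is the same kind of duplication described above.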

Anyway, now to the analysis. I used @roohif’s code, described in his blog post:

It takes the coordinates from the GTF file and extracts the corresponding sequences from the human genome (I used assembly GRCh37.p13, since that’s what Ruiz-Orera et al. used as their reference), giving a series of multi-FASTA files, one for each chromosome. Sequences less than 30bp long were excluded because they’ll find matches just by chance. Then the sequences in these FASTA files are BLASTed against the chimp genome (I used the latest assembly: PTRv2).
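The extraction step can be sketched roughly as follows. This is not @roohif’s actual code, just an illustration under stated assumptions: GTF coordinates are 1-based and inclusive, the chromosome is already loaded as a plain string, strand is ignored, and the names `extract_features` and `to_fasta` are hypothetical.

```python
MIN_LENGTH = 30  # sequences shorter than this are excluded, as described above


def extract_features(chrom_seq, features, min_length=MIN_LENGTH):
    """Slice (name, start, end) features out of a chromosome string.

    Coordinates are assumed to be 1-based and inclusive (GTF convention).
    """
    records = []
    for name, start, end in features:
        seq = chrom_seq[start - 1:end]
        if len(seq) < min_length:
            continue  # too short: likely to match somewhere by chance
        records.append((name, seq))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as a multi-FASTA string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Toy 100 bp "chromosome" and made-up features, for illustration only.
toy_chromosome = "ACGT" * 25
toy_features = [("exon_a", 1, 40), ("exon_b", 50, 60), ("exon_c", 61, 100)]

kept = extract_features(toy_chromosome, toy_features)
fasta_text = to_fasta(kept)
print(fasta_text)
```

Here `exon_b` is only 11bp long, so it is dropped before the FASTA is written; the other two features are kept.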

The results are .csv files containing all the vital statistics about each individual BLAST search. Each row looks something like this:

qseqid qstart qend sseqid sstart send qlen length nident pident evalue
263 1 71 NC_036880.1 94512337 94512407 71 71 71 100.00 1.74E-30

In this case, you can see that the query sequence was 71bp long (qlen), and a match was found in the chimp genome that covered all 71bp of the query (length), and that 71bp (out of 71bp) of this chimp sequence matches the human query sequence (nident), meaning that the percentage identity is 100% (pident).

There are 3973 rows like this in the final .csv file, covering all of the sequences that make up the 1,029 transcripts. Here are some general statistics:

The total number of bases analysed was 843,105.
Of these, 833,023bp were included in BLAST hits in the chimp genome.
820,460 bases were identical between the human and chimp genomes, meaning that 97.3% of the total analysed bases were found to have a perfect match, and 98.5% of the bases that were included in the BLAST hits have a perfect match.
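The two percentages follow directly from the three totals quoted above. A quick check, using only those numbers (the variable names are mine):

```python
total_bases = 843_105      # all bases analysed
bases_in_hits = 833_023    # bases covered by BLAST hits in the chimp genome
identical_bases = 820_460  # bases identical between human and chimp

pct_of_total = 100 * identical_bases / total_bases
pct_of_hits = 100 * identical_bases / bases_in_hits

print(f"identical / all analysed bases: {pct_of_total:.1f}%")
print(f"identical / bases in hits:      {pct_of_hits:.1f}%")
```

The first figure treats every base outside a BLAST hit as a mismatch, so it is the more conservative of the two.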

BLAST reports hits for subsets of sequences, so in some cases the query sequence could be 1000bp long, and BLAST could find a 100% match for 100bp of this sequence and no match for the other 900bp. In that case pident would still be reported as 100%, which is not very helpful for our purposes, so I didn’t rely much on pident. Instead, I divided the number of identical bases in the BLAST hit (nident) by the total number of bases in the query sequence (qlen). In the 1000bp example just given, this would return a match of 10%, which is more representative.
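The difference between pident and nident/qlen can be shown with two rows. This is a sketch only: it assumes tab-separated rows whose columns follow BLAST’s tabular field names (with `qend` for the query end), the first row is the 71bp example from above, and the second row is made up to mirror the 1000bp scenario.

```python
import csv
import io

COLUMNS = ["qseqid", "qstart", "qend", "sseqid", "sstart", "send",
           "qlen", "length", "nident", "pident", "evalue"]

# Row 1: the 71bp example shown earlier. Row 2: hypothetical 1000bp query
# where only 100bp were aligned (values invented for illustration).
ROWS = (
    "263\t1\t71\tNC_036880.1\t94512337\t94512407\t71\t71\t71\t100.00\t1.74E-30\n"
    "hypothetical\t1\t100\tchrN\t1\t100\t1000\t100\t100\t100.00\t1E-40\n"
)


def query_identity(row):
    """Identical bases as a percentage of the WHOLE query (nident / qlen)."""
    return 100 * int(row["nident"]) / int(row["qlen"])


reader = csv.DictReader(io.StringIO(ROWS), fieldnames=COLUMNS, delimiter="\t")
results = {row["qseqid"]: (float(row["pident"]), query_identity(row))
           for row in reader}

for qseqid, (pident, identity) in results.items():
    print(f"{qseqid}: pident={pident:.1f}%  nident/qlen={identity:.1f}%")
```

Both rows report a pident of 100%, but only the first one is a full-length match; the nident/qlen figure separates them.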

The average of all the sequences is 97.3% (as I mentioned earlier). The distribution of these percentages is shown below: 3773 out of 3973 sequences had at least 95% of their bases identical in the chimp genome. Of the remaining 200, just 96 sequences had below 90%, and 56 sequences had below 50%.
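For anyone re-running the analysis, the threshold counts can be tallied along these lines. The percentages below are toy values, not the real per-sequence results, and `threshold_counts` is a hypothetical helper.

```python
def threshold_counts(percentages):
    """Count sequences at or above 95%, and below 95%, 90% and 50%."""
    return {
        "at_least_95": sum(p >= 95 for p in percentages),
        "below_95": sum(p < 95 for p in percentages),
        "below_90": sum(p < 90 for p in percentages),
        "below_50": sum(p < 50 for p in percentages),
    }


# Toy nident/qlen percentages, for illustration only.
toy_percentages = [100.0, 99.2, 96.5, 94.0, 89.9, 42.0, 10.0]
counts = threshold_counts(toy_percentages)
print(counts)

# Consistency check on the reported figures: 3973 total, 3773 at >= 95%.
remaining = 3973 - 3773
print(remaining)
```

Note that the "below" categories are nested: every sequence below 50% is also counted among those below 90%, so the 56 are a subset of the 96.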


Of these, for how many do you find strong matches for different halves of the sequence, perhaps indicating rearrangements?

The average pident of these 96 sequences was 93.1%, but the BLAST search was done using “-max_hsps 1”, so only 1 BLAST result was returned per query, making it difficult to infer rearrangements.

Thanks for working on this, @evograd. I certainly appreciate the effort. @swamidass, what does your intuition tell you about those 56? Do you suspect a large portion would be due to recombination?

(from Guayaquil, Ecuador – may lose internet access soon)

This is the paper Jim Tour was referring to:

In particular, this argument by Clamp et al.:

The Clamp et al. (2007) paper is now 12 years old, so should be revisited in the light of better annotations for both human and chimp genomes. If anyone reading this wants to try their luck at getting the Clamp et al. data, please give it a shot. When our ORFanBase & ORFanID research group tried some of the data links in the paper (several months ago), they had gone dead.

Also of interest – this group working on human ORFans at Iowa State:

This presentation occurred in the Comparative Genomics and Evolution track at the ISMB / ISCB annual meeting last summer in Chicago. These researchers may be understandably protective of their preliminary data at this point, however.

Above, T. aqua pointed out that syntenic sequence often exists in another species (call it X) for a reading frame annotated as coding in species Y. These aren’t really orphans by the strictest criteria, and indeed raise a puzzle about the arrow (direction) of causality: might sequence X, which may lack a promoter, actually represent a once-transcribed and functional sequence now drifting away into pseudogene status (whereas the Y sequence is still transcribed and producing a functional product)?

If I am following this discussion correctly (far from a given), the analysis of the Ruiz-Orera paper by evograd above goes a long way towards addressing this.

Looks like @evograd has already done the heavy lifting.


That may not be the case for all of them.

How many were discarded?

If I am reading it correctly, they started with 352 candidates in the human genome, and of those, 66 had orthologous DNA in the chimp and orangutan genomes. This is from Fig. 1 in the paper:

My intuition is that either sequencing errors/omissions, deletions, or rearrangements broke or affected these genes in the chimpanzee or human genome. Each of these hypotheses is testable with data. It is hard to do in an automated system (but maybe @NLENTS has one in the works?). With so few sequences, manual inspection might resolve it for all of them.

Put this in perspective too.

There are about 20,000 genes in the human genome. Those 56 amount to ~0.28% of that number.

There were 1777 genes initially being claimed as “orfans,” so 56 is just 3.2% of these.

Either way you look at it, these genes are the exceptions of the exceptions. They are so rare that their identification as having “no homologue” in @evograd’s analysis is probably invalid and is likely to be an artifact of the analysis. Anyone who reports these 1777 genes as having no homologues in chimpanzees does not understand the data or is misrepresenting the data.