James Tour on Orphan Genes

Ok, so I’ve taken a look at the data from Ruiz-Orera et al. (2015):

They identified 634 candidate de novo genes in the human genome, based on the fact that they found 1,029 transcripts in humans that weren’t present in chimpanzees.
They link to a GTF file containing the information about those human-specific transcripts here: http://dx.doi.org/10.6084/m9.figshare.1604892
but note that this file contains both the human-specific transcripts found in humans, as well as the hominoid-specific transcripts found in humans. This wasn’t apparent to me at first, and I think @roohif missed it too, as he included the entire file in his analysis. I separated out the species-specific transcripts only (1,029).

A second complication is that the GTF file contains separate entries for different exons and CDSs in each transcript, so while there are only 1,029 transcripts (1,029 transcript IDs), there are ~4,000 individual sequences specified in the file. As these 1,029 transcripts correspond to just 634 candidate genes, there is a certain amount of overlap between some transcripts. For example, transcript 1 and transcript 2 might both contain the exact same exon 1, but have different exon 2s. In this case, there is 2 entries for that exon 1 in the GTF file. In other words, there are a few duplicated sequences in there.

Anyway, now to the analysis. I used @roohif’s code, described in his blog post:

It takes the coordinates from the GTF file and extracts the corresponding sequences in the human genome (I used assembly GRCh37.p13, since that’s what Ruiz-Orera et al. used as their reference), giving a series of multi-fasta files - 1 for each chromosome (Sequences less than 30bp long were excluded because they’ll find matches just by chance). Then the sequences in these FASTA files are BLASTed against the chimp genome (I used the latest assembly: PTRv2).

The results are .csv files containing all the vital statistics about each individual BLAST search. Each row looks something like this:

qseqid qstart end sseqid sstart send qlen length nident pident evalue
263 1 71 NC_036880.1 94512337 94512407 71 71 71 100.00 1,74E-30

In this case, you can see that the query sequence was 71bp long (qlen), and a match was found in the chimp genome that covered all 71bp of the query (length), and that 71bp (out of 71bp) of this chimp sequence matches the human query sequence (nident), meaning that the percentage identity is 100% (pident).

There are 3973 rows like this in the final .csv file, covering all of the sequences that make up the 1,029 transcripts. Here are some general statistics:

The total number of bases analysed was 843,105.
Of these, 833,023bp were included in BLAST hits in the chimp genome.
820,460 bases were identical between the human and chimp genomes, meaning that 97.3% of the total analysed bases were found to have a perfect match, and 98.5% of the bases that were included in the BLAST hits have a perfect match.

BLAST reports hits for subset of sequences, so in some cases the query sequence could be 1000bp long, and BLAST could find a 100% match for 100bp of this sequence, and a 0% match for the other 900bp. That’s not very helpful for our purposes, so, I didn’t rely much on pident. Instead, I divided the number of identical bases in the BLAST hit (nident) by the total number of bases in the query sequence (qlen). In the example earlier, this would return a match of 10%, which is more representative.

The average of all the sequences is 97.3% (as I mentioned earlier). The distribution of these percentages is shown below: 3773 out of 3973 sequences had at least 95% of their bases identical in the chimp genome. Of the remaining 200, just 96 sequences had below 90%, and 56 sequences had below 50%.

7 Likes