Maybe things can be understood by looking at how the set of positively-selected genes was arrived at:
We compared levels of polymorphism and divergence between the 18 polar bear and 10 brown bear samples to identify genes under positive selection in the polar bear lineage. We analyzed a total of 19,822 genes. We focused our analyses only on the coding regions of each gene.
Homogeneity Test for Reduction of Polymorphic Sites
Under neutral evolution, we expect the amount of within- and between-species diversity to be correlated. If, however, selection events occur on the branch of a gene, we expect a reduction of polymorphic sites in one species only, compared to the levels of genetic variation and divergence in the other.
We therefore recorded the following quantities for each gene:
A = number of polymorphic sites in the polar bear samples;
B = number of polymorphic sites in the brown bear samples;
C = number of fixed differences between polar bear and both brown bear and the giant panda sequence;
D = number of fixed differences between brown bear and both polar bear and the giant panda sequence;
We performed a homogeneity test for the null hypothesis A/C = B/D equivalent to A/B = C/D. We used a Pearson’s chi-square test on the 2x2 contingency table. We notice that because of LD, the resulting ‘P-values’ are not accurate and do not have the interpretation expected under the multinomial model underlying the assumption of the Pearson’s chi-square test. The issue resembles the well-known issues relating to HKA tests in population genetics, in which simulations under specific demographic models are needed to assign P-values. In this paper, we do not attempt to present valid P-values based on demographic simulations, due to concerns regarding underlying parameters such as recombination rates. Instead, we only provide ranked lists of genes and rely on enrichment analyses, and arguments regarding lack of symmetry between polar and brown bears, to provide statistical evidence for an effect reflecting what could be expected by chance in the absence of selection. To avoid misunderstandings regarding the interpretation of the ‘P-value’, we convert them to a score using -log10(P-value). We report this score rather than the ‘P-value’ itself to avoid misunderstandings of the interpretation of this statistic
Distribution of the homogeneity test scores for the top-50 genes in polar bear and brown bear are presented in Figure 4A. We also computed the expected distribution of homogeneity test scores in polar bear and brown bear under neutrality (Fig. 4B), using the demographic model estimated with the IBS tract method (Table S3).
Hudson-Aguade’-Kreitman (HKA) Test
We also performed the Hudson-Aguadé-Kreitman (HKA) (Hudson et al., 1987) test on each gene to verify that selection acted specifically on the polar bear lineage and not on the brown bear lineage. The rationale behind this test is that polymorphism levels depend on local mutation rates, measured from divergence values using an outgroup species, in our case the giant panda sequence. The HKA test is commonly used to verify this expectation and tests whether the decrease of polymorphism observed at a locus is due to positive selection and genetic hitchhiking.
We performed the HKA for polar bear by comparing the ratio of A/C (Extended Experimental Procedures Section Homogeneity Test for Reduction of Polymorphic Sites) for each gene to the genome-wide average, computed as the sum of A and C values across all genes analyzed. We therefore tested for the null hypothesis A(gene)/C(gene) = A(genome-wide)/C(genome-wide). We used a Pearson’s chi-square test on the 2x2 contingency table. We similarly tested for the null hypothesis C(gene)/D(gene) = C(genome-wide)/D(genome-wide) for the brown bear lineage. As in the previous case, we converted the ‘P-value’ to a log transformed score and avoid a probabilistic interpretation of this score.
Population Genetic Differentiation
We computed a measure of population genetic differentiation, FST, between species. The rationale for this test is that adaptive differentiation can be captured from differences in allele frequencies between polar bear and brown bear. Outliers in the FST distribution indicate positive selection. We computed a method-of-moments estimator of FST (Reynolds et al., 1983) for all tested genes.
(skipping a section about analysis of NGS data)
Identification of Genes Under Positive Selection in the Polar Bear Lineage
We ranked all tested genes based on their homogeneity test score. We did not perform this test for genes with values of both C and D below 1. To characterize signatures of natural selection in the polar bear lineage, we ranked genes based on their homogeneity test score and imposed that the ratio of C/A was greater than D/B. To characterize selection in the brown bear lineage, we ranked genes based on their homogeneity test score and imposed that the ratio of C/A was less than D/B.
We specifically aimed at identifying genes under positive selection in the polar bear lineage. We only retained genes with a significant p -value of the HKA test in the polar bear lineage (nominal p -value < 0.05), but not in the brown bear lineage (nominal p -value > 0.05). In this way we ensured that the inferred selection was specific to the polar bear lineage. We also selected only genes showing high differentiation between polar bear and brown bear samples, specifically with F ST > the 90th percentile of the empirical distribution obtained across all genes; the F ST distribution had a mode of 0.6. We finally ranked the remaining genes based on their homogeneity test score. Values for the three statistical tests performed for each gene are provided in Table S6.
The strength of positive selection is greater in the polar bear lineage than in the brown bear lineage. Indeed, the distribution of the homogeneity score for genes showing an excess of fixed mutations in the polar bear lineage (C/A > D/B) was largely greater than the distribution of the homogeneity test score for genes showing an excess of fixed mutations in the brown bear lineage (C/A < D/B) (Fig. 4A). Under neutrality, we show that there is only a marginal difference in the distribution of the homogeneity test score when accounting for differences in sample size and population size between polar and brown bears (Fig. 4B). For this purpose, we simulated 50 loci under the estimated neutral demographic model (Table S3), imposing a locus-specific mutation rate calculated from the observed data of polar and brown bear from the top 50 genes (Fig. 4A,B).
Analysis of Genes under Positive Selection in the Polar Bear Lineage
We checked whether genes under selection in the polar bear lineage were enriched with SNPs associated with metabolic traits and diseases from human genome-wide association studies. We obtained a catalog of reported genes associated to a metabolic traits or diseases from http://www.genome.gov/gwastudies/ (Hindorff et al., 2009) (updated on 02/03/2013). We sampled with replacement an equal number of genes and counted the frequency of association with a metabolic traits or diseases.
We finally checked whether fixed missense mutations specific to the polar bear lineage, compared to the brown bear and panda lineage, were associated with human conditions from the Human Gene Mutation Database http://www.hgmd.cf.ac.uk.
We find no fixed missense mutations specific to the polar bear lineage associated with human diseases according in the Human Gene Mutation Database. However, the top 20 genes are significantly enriched with genes previously associated with metabolic diseases and traits and humans ( p -value = 0.042) from the GWAS catalog, discussed in the main text.
We predicted the functional impact of polar bear protein substitutions, compared to the ancestral state defined as the common amino acid shared by brown bear, panda, and human protein sequences. We used PolyPhen-2 (PolyPhen-2: prediction of functional effects of human nsSNPs) (Adzhubei et al. 2010) to predict the impact of polar bear changes in human protein function. PolyPhen-2 generates posterior probabilities of substitutions that are damaging from chemical and comparative information, and summarizes such predictions into two scores: HumanDivPred and HumanVarPred. Substitutions are predicted to be “probably damaging” if the probability of being damaging is higher than 0.9, “possibly damaging” if it is between 0.5 and 0.9, and “benign” when it is lower than 0.500. We used pre-computed predictions from WHESS database (Whole Human Exome Sequence Space) and recorded the functional effect of 48 amino acid substitutions occurring in the top 20 genes under selection in the polar bear lineage (Table S7). Results show that a large proportion of polar bear substitutions (approx. 50% using the HumanDivPred score) are predicted to be functionally damaging in the human protein (Fig. 4C,D).
I find this a bit confusing, but it occurs to me that maybe the mutations shown in Table S7 are just the variants seen in the 18 polar bear samples? These would not necessarily be included in the NCBI accessions.
Comments, suggestions, any help would be appreciated.