Repeating analysis of human APOB

While studying how PolyPhen-2 works, I noticed that one of its features is a sequence comparison that helps assess the effect of a mutation in a human protein.

This made me wonder whether the polar bear gene was included as a homolog in the analysis before the 2014 Liu paper was published. I also wondered whether any new data or updates since then would change the PolyPhen-2 prediction for the exact same polar bear mutations. Basically, let’s repeat the PolyPhen-2 analysis for the following “damaging” mutations:

Gene  Position  Ancestral  Polar bear  HDivPred
APOB  716       N          K           possibly damaging
APOB  749       D          E           possibly damaging
APOB  2623      D          N           probably damaging
APOB  3920      T          P           possibly damaging
APOB  4418      L          H           probably damaging


The predictions come out exactly the same, down to the same probability to 3 decimal places, so it seems nothing has changed. We can see the homologs used in the alignment, but I did not find the polar bear among them. This is curious, since BLASTing these two proteins gives about 80% identity.

However, I noticed something when I compared this to my own alignment of the human APOB (NP_000375.2) and polar bear APOB (XP_008698812.1) for this first position:

Gene  Position  Ancestral  Polar bear  HDivPred
APOB  716       N          K           possibly damaging


At position 716, the polar bear sequence from NCBI has “N” just like the human protein. Where is the fixed mutation? I checked all the other fixed mutations in Table S7 but none of them can be found in the sequence for polar bear APOB in NCBI (XP_008698812.1). The polar bear sequence in NCBI matches the human sequence at all these positions!
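For anyone who wants to repeat this kind of check, here is a minimal sketch in Python. It assumes you already have the two proteins as gapped strings from a pairwise alignment; the function name `residue_at` and the toy sequences are made up for illustration and are not the real APOB sequences.

```python
def residue_at(aligned_ref: str, aligned_other: str, ref_pos: int) -> tuple[str, str]:
    """Return (reference residue, other residue) at a 1-based, ungapped
    position of the reference sequence in a pairwise alignment."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have equal length")
    seen = 0  # non-gap reference residues counted so far
    for ref_aa, other_aa in zip(aligned_ref, aligned_other):
        if ref_aa != "-":
            seen += 1
            if seen == ref_pos:
                return ref_aa, other_aa
    raise IndexError(f"reference has only {seen} residues")


# Toy alignment (hypothetical sequences, not APOB).
human = "MKT-NLDE"
bear = "MKTANL-E"
for pos in (4, 6):
    print(pos, residue_at(human, bear, pos))
```

Counting only non-gap reference residues matters because table positions such as 716 refer to the human protein, not to alignment columns.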

What am I missing? I’m sure this must be some mistake on my part? Can someone help?

@swamidass @T_aquaticus


It is possible, even likely, that PolyPhen-2 uses only a subset of GenBank, perhaps only the human genome, or only the sequences available at the time of publication. Otherwise it would require recalibration on an ongoing basis.

Have you looked at the brown bear and panda bear sequences?


Looking closer at that output, my instinct seems wrong.

I’m curious about the brown and panda bear sequences too.

I asked about the other two bear sequences because, after a quick look, it seems as if they chose sites that differed between bear sequences to run in PolyPhen2. Someone needs to check this, though.

Gene  Position  Ancestral  Polar bear  HDivPred
APOB  716       N          K           possibly damaging
APOB  749       D          E           possibly damaging
APOB  2623      D          N           probably damaging
APOB  3920      T          P           possibly damaging
APOB  4418      L          H           probably damaging

Here are screenshots of the alignment at each position.
NCBI accessions are NP_000375.2, XP_002930154.1, XP_008698812.1, XP_026375362.1
Identity matrix (different ordering)



I found the same. With those sequences, it seems that polar bears have a completely different set of fixed differences relative to brown bears and pandas, with different PolyPhen effect scores:

Ancestral  Polar Bear  Position  Effect             Score
N          S           390       Benign             0.029
V          I           533       Benign             0.111
R          G           845       Probably Damaging  0.986
M          V           1193      Benign             0.000
F          L           1258      Benign             0.002
I          M           2515      Possibly Damaging  0.908
T          N           3214      Probably Damaging  0.992
E          R           4016      Benign             0.071

For the PolyPhen predictions here, I used the change from the human amino acid state to the polar bear state, although in 2 cases the human state also differed from the ancestral “bear” state.

(These fixed differences are in addition to the cluster of mutations in the first ~15 amino acids of the protein, which appear to be highly variable between even closely related species.)

The authors also say in their paper:

In contrast with brown bear, which has no fixed APOB mutations compared to the giant panda genome…

But I can clearly see in the alignment of these sequences that there are several brown bear-specific mutations relative to the giant panda (and polar bear) consensus.

Ancestral (Panda)  Brown Bear  Position
D                  G           268
L                  F           759
I                  T           836
D                  Y           3609
Q                  H           3992
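
A rough sketch of how species-specific differences like the ones in these tables can be pulled out of a multiple alignment, in Python. It assumes the alignment is already available as equal-length gapped strings; `lineage_specific` and the toy sequences are hypothetical. With one sequence per species it can only show differences between these particular accessions, not whether a change is fixed in the population.

```python
def lineage_specific(aligned: dict[str, str], focal: str, others: list[str]):
    """List (alignment column, shared residue, focal residue) for columns
    where all `others` agree and `focal` carries a different residue.
    Columns are 1-based alignment columns, not human protein positions."""
    length = len(aligned[focal])
    if any(len(aligned[name]) != length for name in others):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for col in range(length):
        shared = {aligned[name][col] for name in others}
        focal_aa = aligned[focal][col]
        # Skip columns with gaps or disagreement among the other species.
        if len(shared) != 1 or "-" in shared or focal_aa == "-":
            continue
        (shared_aa,) = shared
        if focal_aa != shared_aa:
            hits.append((col + 1, shared_aa, focal_aa))
    return hits


# Toy alignment (hypothetical sequences, not APOB).
toy = {"panda": "MDLIQ", "brown": "MGLIQ", "polar": "MDLTQ"}
print(lineage_specific(toy, "polar", ["panda", "brown"]))
print(lineage_specific(toy, "brown", ["panda", "polar"]))
```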

Maybe things can be understood by looking at how the set of positively-selected genes was arrived at:


We compared levels of polymorphism and divergence between the 18 polar bear and 10 brown bear samples to identify genes under positive selection in the polar bear lineage. We analyzed a total of 19,822 genes. We focused our analyses only on the coding regions of each gene.

Homogeneity Test for Reduction of Polymorphic Sites

Under neutral evolution, we expect the amount of within- and between-species diversity to be correlated. If, however, selection events occur on the branch of a gene, we expect a reduction of polymorphic sites in one species only, compared to the levels of genetic variation and divergence in the other.
We therefore recorded the following quantities for each gene:

A = number of polymorphic sites in the polar bear samples;
B = number of polymorphic sites in the brown bear samples;
C = number of fixed differences between polar bear and both brown bear and the giant panda sequence;
D = number of fixed differences between brown bear and both polar bear and the giant panda sequence;

We performed a homogeneity test for the null hypothesis A/C = B/D equivalent to A/B = C/D. We used a Pearson’s chi-square test on the 2x2 contingency table. We notice that because of LD, the resulting ‘P-values’ are not accurate and do not have the interpretation expected under the multinomial model underlying the assumption of the Pearson’s chi-square test. The issue resembles the well-known issues relating to HKA tests in population genetics, in which simulations under specific demographic models are needed to assign P-values. In this paper, we do not attempt to present valid P-values based on demographic simulations, due to concerns regarding underlying parameters such as recombination rates. Instead, we only provide ranked lists of genes and rely on enrichment analyses, and arguments regarding lack of symmetry between polar and brown bears, to provide statistical evidence for an effect reflecting what could be expected by chance in the absence of selection. To avoid misunderstandings regarding the interpretation of the ‘P-value’, we convert them to a score using -log10(P-value). We report this score rather than the ‘P-value’ itself to avoid misunderstandings of the interpretation of this statistic
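
To make the homogeneity test above concrete, here is a minimal Python sketch of a Pearson chi-square test on the 2x2 table of A, B, C, D, converted to a -log10 score. Assumptions on my part: no continuity correction, the table is laid out as [[A, B], [C, D]], and the counts in the example are made up. The quoted methods do not give these details.

```python
import math


def pearson_chi2_2x2(a: float, b: float, c: float, d: float) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].
    Returns (chi-square statistic, nominal p-value with 1 degree of freedom)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a row or column sums to zero")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square distribution with 1 df.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def homogeneity_score(A: int, B: int, C: int, D: int) -> float:
    """-log10 of the nominal p-value for the null A/C = B/D."""
    _, p = pearson_chi2_2x2(A, B, C, D)
    return math.inf if p == 0 else -math.log10(p)


# Made-up counts: few polymorphic sites and many fixed differences in polar bear.
chi2, p = pearson_chi2_2x2(2, 20, 9, 3)
print(chi2, p, homogeneity_score(2, 20, 9, 3))
```

The statistic is zero exactly when A·D = B·C, which is the same condition as A/C = B/D.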

Distribution of the homogeneity test scores for the top-50 genes in polar bear and brown bear are presented in Figure 4A. We also computed the expected distribution of homogeneity test scores in polar bear and brown bear under neutrality (Fig. 4B), using the demographic model estimated with the IBS tract method (Table S3).

Hudson-Aguadé-Kreitman (HKA) Test

We also performed the Hudson-Aguadé-Kreitman (HKA) (Hudson et al., 1987) test on each gene to verify that selection acted specifically on the polar bear lineage and not on the brown bear lineage. The rationale behind this test is that polymorphism levels depend on local mutation rates, measured from divergence values using an outgroup species, in our case the giant panda sequence. The HKA test is commonly used to verify this expectation and tests whether the decrease of polymorphism observed at a locus is due to positive selection and genetic hitchhiking.

We performed the HKA for polar bear by comparing the ratio of A/C (Extended Experimental Procedures Section Homogeneity Test for Reduction of Polymorphic Sites) for each gene to the genome-wide average, computed as the sum of A and C values across all genes analyzed. We therefore tested for the null hypothesis A(gene)/C(gene) = A(genome-wide)/C(genome-wide). We used a Pearson’s chi-square test on the 2x2 contingency table. We similarly tested for the null hypothesis C(gene)/D(gene) = C(genome-wide)/D(genome-wide) for the brown bear lineage. As in the previous case, we converted the ‘P-value’ to a log transformed score and avoid a probabilistic interpretation of this score.

Population Genetic Differentiation

We computed a measure of population genetic differentiation, FST, between species. The rationale for this test is that adaptive differentiation can be captured from differences in allele frequencies between polar bear and brown bear. Outliers in the FST distribution indicate positive selection. We computed a method-of-moments estimator of FST (Reynolds et al., 1983) for all tested genes.

(skipping a section about analysis of NGS data)

Identification of Genes Under Positive Selection in the Polar Bear Lineage

We ranked all tested genes based on their homogeneity test score. We did not perform this test for genes with values of both C and D below 1. To characterize signatures of natural selection in the polar bear lineage, we ranked genes based on their homogeneity test score and imposed that the ratio of C/A was greater than D/B. To characterize selection in the brown bear lineage, we ranked genes based on their homogeneity test score and imposed that the ratio of C/A was less than D/B.

We specifically aimed at identifying genes under positive selection in the polar bear lineage. We only retained genes with a significant p-value of the HKA test in the polar bear lineage (nominal p-value < 0.05), but not in the brown bear lineage (nominal p-value > 0.05). In this way we ensured that the inferred selection was specific to the polar bear lineage. We also selected only genes showing high differentiation between polar bear and brown bear samples, specifically with FST > the 90th percentile of the empirical distribution obtained across all genes; the FST distribution had a mode of 0.6. We finally ranked the remaining genes based on their homogeneity test score. Values for the three statistical tests performed for each gene are provided in Table S6.
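
The filtering and ranking steps described above can be sketched roughly like this in Python. This is only my reading of the quoted text: the `GeneStats` record, the field names, the percentile method, and the toy values are all hypothetical, and C/A > D/B is compared by cross-multiplication to avoid dividing by zero.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class GeneStats:
    name: str
    A: int  # polymorphic sites in polar bear samples
    B: int  # polymorphic sites in brown bear samples
    C: int  # fixed differences on the polar bear lineage
    D: int  # fixed differences on the brown bear lineage
    hka_p_polar: float  # nominal HKA p-value, polar bear lineage
    hka_p_brown: float  # nominal HKA p-value, brown bear lineage
    fst: float
    homogeneity: float  # -log10 homogeneity test score


def polar_candidates(genes: list[GeneStats]) -> list[GeneStats]:
    """Apply the filters from the quoted methods, then rank by homogeneity score."""
    # 90th percentile of FST across all genes (interpolation method is an assumption).
    fst_cutoff = quantiles([g.fst for g in genes], n=10, method="inclusive")[-1]
    kept = [
        g for g in genes
        if not (g.C < 1 and g.D < 1)   # test skipped when C and D are both below 1
        and g.C * g.B > g.D * g.A      # C/A > D/B, cross-multiplied
        and g.hka_p_polar < 0.05       # HKA significant in polar bear
        and g.hka_p_brown > 0.05       # but not in brown bear
        and g.fst > fst_cutoff
    ]
    return sorted(kept, key=lambda g: g.homogeneity, reverse=True)


# Toy data: 18 unremarkable background genes plus two candidates.
genes = [GeneStats(f"bg{i}", 10, 10, 2, 2, 0.5, 0.5, 0.05 * i, 0.1) for i in range(1, 19)]
genes += [
    GeneStats("geneX", 2, 20, 9, 3, 0.001, 0.40, 0.95, 4.1),
    GeneStats("geneY", 3, 15, 6, 2, 0.020, 0.30, 1.00, 2.5),
]
print([g.name for g in polar_candidates(genes)])
```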

The strength of positive selection is greater in the polar bear lineage than in the brown bear lineage. Indeed, the distribution of the homogeneity score for genes showing an excess of fixed mutations in the polar bear lineage (C/A > D/B) was largely greater than the distribution of the homogeneity test score for genes showing an excess of fixed mutations in the brown bear lineage (C/A < D/B) (Fig. 4A). Under neutrality, we show that there is only a marginal difference in the distribution of the homogeneity test score when accounting for differences in sample size and population size between polar and brown bears (Fig. 4B). For this purpose, we simulated 50 loci under the estimated neutral demographic model (Table S3), imposing a locus-specific mutation rate calculated from the observed data of polar and brown bear from the top 50 genes (Fig. 4A,B).

And finally:

Analysis of Genes under Positive Selection in the Polar Bear Lineage

We checked whether genes under selection in the polar bear lineage were enriched with SNPs associated with metabolic traits and diseases from human genome-wide association studies. We obtained a catalog of reported genes associated to a metabolic traits or diseases from (Hindorff et al., 2009) (updated on 02/03/2013). We sampled with replacement an equal number of genes and counted the frequency of association with a metabolic traits or diseases.
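
A hedged sketch of what such a resampling check could look like in Python. The quoted text does not say how many resamples were drawn or how the p-value was computed, so the number of draws, the (hits + 1) / (draws + 1) estimate, the function name, and the toy gene lists are all assumptions.

```python
import random


def resampling_p(all_genes, associated, top_genes, n_draws=10_000, seed=0):
    """Empirical p-value for seeing at least as many trait-associated genes
    in a random draw (with replacement) as in the top-ranked gene set."""
    rng = random.Random(seed)
    observed = sum(g in associated for g in top_genes)
    k = len(top_genes)
    hits = 0
    for _ in range(n_draws):
        draw = rng.choices(all_genes, k=k)  # sampling with replacement
        if sum(g in associated for g in draw) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)


# Toy example: 1000 genes, 5% trait-associated, 4 of the top 20 associated.
all_genes = [f"g{i}" for i in range(1000)]
associated = set(all_genes[:50])
top_genes = all_genes[:4] + all_genes[100:116]
observed, p = resampling_p(all_genes, associated, top_genes)
print(observed, p)
```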

We finally checked whether fixed missense mutations specific to the polar bear lineage, compared to the brown bear and panda lineage, were associated with human conditions from the Human Gene Mutation Database

We find no fixed missense mutations specific to the polar bear lineage associated with human diseases according in the Human Gene Mutation Database. However, the top 20 genes are significantly enriched with genes previously associated with metabolic diseases and traits and humans (p-value = 0.042) from the GWAS catalog, discussed in the main text.

We predicted the functional impact of polar bear protein substitutions, compared to the ancestral state defined as the common amino acid shared by brown bear, panda, and human protein sequences. We used PolyPhen-2 (PolyPhen-2: prediction of functional effects of human nsSNPs) (Adzhubei et al. 2010) to predict the impact of polar bear changes in human protein function. PolyPhen-2 generates posterior probabilities of substitutions that are damaging from chemical and comparative information, and summarizes such predictions into two scores: HumanDivPred and HumanVarPred. Substitutions are predicted to be “probably damaging” if the probability of being damaging is higher than 0.9, “possibly damaging” if it is between 0.5 and 0.9, and “benign” when it is lower than 0.500. We used pre-computed predictions from WHESS database (Whole Human Exome Sequence Space) and recorded the functional effect of 48 amino acid substitutions occurring in the top 20 genes under selection in the polar bear lineage (Table S7). Results show that a large proportion of polar bear substitutions (approx. 50% using the HumanDivPred score) are predicted to be functionally damaging in the human protein (Fig. 4C,D).
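
For reference, the labelling rule stated in that paragraph amounts to the following Python sketch. The cutoffs are the ones given in the quoted methods; the labels reported by the PolyPhen-2 server itself may be based on different cutoffs, and how the exact boundary values 0.5 and 0.9 are handled is my assumption.

```python
def classify(prob: float) -> str:
    """Label a PolyPhen-2 probability using the cutoffs quoted above."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if prob > 0.9:
        return "probably damaging"
    if prob >= 0.5:
        return "possibly damaging"
    return "benign"


for prob in (0.986, 0.7, 0.029):
    print(prob, classify(prob))
```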

I find this a bit confusing, but it occurs to me that maybe the mutations shown in Table S7 are just the variants seen in the 18 polar bear samples? These would not necessarily be included in the NCBI accessions.

Comments, suggestions, any help would be appreciated.


This is a great argument against these mutations being damaging. I wonder why they don’t mention this in the main text after the PolyPhen-2 prediction. They basically argue against what PolyPhen-2 predicts but never explicitly say so.

The methods could explain the differences between the NCBI accession and the paper, but in my mind “fixed” means it is in every individual, or at least in the genome that you deposit in NCBI! These differences mean that, at the very least, some polar bears have an APOB very similar to that of brown bears, while other polar bear individuals have a diverging APOB. This would suggest that their main adaptations to living in the Arctic are in other genes (if not all polar bears have the fixed APOB mutations). Maybe we should be looking at the other fixed mutations in cardiovascular function, etc., as the primary adaptation to Arctic living. Doing an alignment for all the other genes mentioned in the paper and comparing the sequences deposited in NCBI for all of them sounds like work…

In case anyone wants to play with the APOB sequence that Liu et al. report in their assembled genome (amino acid sequence; this is not the final NCBI entry, as far as I can tell):