Endogenous Retroviruses and Common Descent

There are many types of genetic evidence that support common ancestry. One of those pieces of evidence is the remnants of past retroviral infections that are shared by different species.

What are retroviruses, what are these remnants, and why do they point to common ancestry? First, let’s get a handle on what retroviruses are.

Retroviruses are viruses with an RNA genome that is reverse transcribed into DNA (hence the retro portion of the name) and then inserted into the host genome. There are strong promoters in the inserted viral genome that force the host cell to transcribe the viral genome back into RNA and to also translate some of that viral genome into the proteins needed for new viral capsids. The virus works its way to the cell surface, detaches, binds to a new cell, injects its genome into the cell, and the process repeats.

So where in the host genome do retroviruses insert? As it turns out, all over the place. A group of scientists infected human cells with 3 different retroviruses, and then they mapped where in the human genome these viruses inserted. This is their data:

Each bar is a chromosome, and each lollipop marker is a mapped viral integration. There were insertions down the length of every chromosome. Some viruses did show preference for general features, but these areas cover massive portions of the genome:

"For HIV the frequency of integration in transcription units ranged from 75% to 80%, while the frequency for MLV was 61% and for ASLV was 57%. For comparison, about 45% of the human genome is composed of transcription units (using the Acembly gene definition)."
reference above

So HIV favors about 1.35 billion bases (45%) of the 3 billion base genome. Even then, it inserts into the unfavored part of the genome 20% to 25% of the time.

To sum up, we can directly observe that retroviruses insert into the host genome, and they do so all over the place. In the next post, we will go over the total number of ERVs in the human genome, how we know they are the result of viral infections, and how many of those insertions are shared with the chimp genome.


If a retrovirus creates an insertion in the genome of an egg or sperm then it has a chance to be passed on to the next generation. Those insertions are called endogenous retroviruses. How many are found in the human genome? According to the 2001 human genome paper, there are over 200,000 endogenous retroviruses in the human genome (ERV-classI-III):


How do we know that they are endogenized versions of retroviruses? Because that is what they look like, and that is how they act. A viral genome will have flanking long terminal repeats that act as promoters (LTRs), and then viral genes like reverse transcriptases and capsid proteins between the LTRs. Many ERVs are solo LTRs due to recombination between the similar sequence at the bookends of the viral genome, but we also have matching sequence between viral LTRs and ERV LTRs, and we have a known mechanism to produce them.

Also, if you line up a bunch of closely related ERVs from the human genome and remove the mutations from them you get a functional retrovirus:

Here is where we get to the evidence for common ancestry. Johnson and Coffin put it best in their paper:

Since retroviruses insert all over the place, the chance that two insertions will happen at the same place in two separate genomes is close to zero. If we are talking about hundreds of thousands of ERVs, this probability gets really, really close to zero.

So how many ERVs do we share with chimps? More than 99% of them. Of the 200,000+ human ERVs, fewer than 100 are not found in the same place in the chimp genome. Of the 200,000+ chimp ERVs, fewer than 300 are not found at the same place in the human genome. This info can be found in the chimp genome paper:


This is smoking gun evidence for common ancestry. We have a process that creates genetic markers all over the place, and two genomes that share more than 99% of those markers. Independent insertion of retroviruses can’t explain this pattern. The only observed and evidenced mechanism we have for explaining this pattern is viral insertion into the genome of a shared ancestor.


Excellent posts @T_aquaticus. The science here is solid.


T_aquaticus - What is the largest possible number this can be? Can you share the math? We are doing research on this exact topic for an educational video we have been working on for months now.


Nice to meet you @ColtCorrea. Tell us about yourself.

Sorry if I posted this twice.

Hello Swamidass,

At 50 years old, I have spent most of my life as a fundamentalist YEC Christian. For the past year, I have been researching and reading books on common descent, including your book.

Here is my LinkedIn:

I have funded the making of a video that gives an overview of the ERV evidence for common descent and we need the % chance that ERVs shared between Chimps and Humans happened by chance (that is without common descent).

Can you help?

It takes a lot of assumptions to come up with a probability estimate. But suppose an ERV can insert anywhere, i.e. between any two bases in the genome, with equal probability. That gives 3 x 10^9 possible insertion sites. Now of course that isn’t true. Insertions at some sites will be strongly deleterious and will be eliminated by selection, but that would likely be only a few percent of sites. The probability of insertion is not invariant across the genome either. And it might not be possible to distinguish insertions within one or two bases of each other, given the possibility of further mutations making alignment ambiguous. But I suspect a rough calculation will not be too wildly off. The probability that a given insertion will happen at the same spot in two species is around 1 in 3 x 10^9. The probability, given n insertions, that no two will happen at the same spot would seem to me to be (1 - (1/(3 x 10^9)))^n. Anybody want to disagree? Unfortunately, I have no way to evaluate that number, given the limits of floating point math.
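One way around the floating point limits is to work with logarithms. The following is only a minimal sketch, assuming the same 3 x 10^9 equally likely sites as above; the function names are made up for illustration.

```python
import math

GENOME_SITES = 3e9  # assumption from the post: equally likely insertion sites


def log10_prob_no_match(n: int, sites: float = GENOME_SITES) -> float:
    """log10 of (1 - 1/sites)**n, computed in log space to avoid rounding to 1."""
    return n * math.log1p(-1.0 / sites) / math.log(10)


def prob_at_least_one_match(n: int, sites: float = GENOME_SITES) -> float:
    """1 - (1 - 1/sites)**n, using expm1 so tiny values keep their precision."""
    return -math.expm1(n * math.log1p(-1.0 / sites))


for n in (1, 5_000, 200_000):
    print(n, log10_prob_no_match(n), prob_at_least_one_match(n))
```

log1p and expm1 are in Python's standard math module and exist for exactly this kind of "1 minus something tiny" calculation.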

Everyone - Thank you for this information and interacting here. I know it is complicated with many assumptions. We are hoping to give a “largest possible % chance” given reasonable assumptions.

In many books and online, scientists say “the ERV evidence is overwhelming” and as this post says “probability is very very near zero”. It would be great to have the largest possible number. Saying “very very near zero” implies “we don’t know what we are doing”. It is too vague.

Here is a good reference on the subject (and even cited by some creationists who are under the impression it supports their arguments):

The authors did find one 500 bp region that saw 280 times more insertions than would be expected from random, so let’s use this as the best case scenario for the creationist argument for orthologous insertions being caused by independent insertions. I will use the numbers most favorable to creationists within any ranges given.

The random rate in the paper is 3 insertions in each orientation in any 500 bp region, for a total of 6 insertions per 500 bp. This is out of 10 million insertions total. Therefore, the probability of a given insertion happening at a given base is (6/1E7)/500 = 1.2E-9, or roughly 1 in 830 million (the order of magnitude you would expect for a genome of a billion or two bases). The authors did discover one 500 bp sequence that saw a 280 times higher rate of insertion than would be expected from random, so that would be 280 x 1.2E-9, or roughly 1 in 3 million. My math and model may be a bit off, but nowhere near the amount to make a difference (see below).
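For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the expression is read as a per-base probability, (6/1E7)/500, and the variable names are mine, not the paper's.

```python
TOTAL_INSERTIONS = 10_000_000  # insertions mapped in the study, per the post
PER_WINDOW = 6                 # random expectation per 500 bp window (3 per orientation)
WINDOW_BP = 500                # window size in base pairs
HOTSPOT_FOLD = 280             # enrichment reported for the strongest hotspot

# Chance that one insertion lands on one particular base under the random rate.
per_base = PER_WINDOW / TOTAL_INSERTIONS / WINDOW_BP
# Same chance inside the hotspot, scaled by the reported enrichment.
hotspot_per_base = per_base * HOTSPOT_FOLD

print(f"random:  {per_base:.2e} per base, about 1 in {1 / per_base:,.0f}")
print(f"hotspot: {hotspot_per_base:.2e} per base, about 1 in {1 / hotspot_per_base:,.0f}")
```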

I remember reading other papers that saw much higher rates of shared but independent insertions, around the ballpark of 1 in every 40,000. Can’t seem to find those at the moment, but others may be able to find them.

Nonetheless, no one has ever seen anything approaching the rates creationists would need. Insertion preference would need to be multiple orders of magnitude stronger in order to explain why nearly all of the 200,000 ERVs in the human genome are found at the same position in the chimp genome. From what we have observed of retroviruses, the odds against independent insertions producing the ERVs we see in the human and chimp genomes are so large that the number would have thousands of zeros after it.


Actually, the relevant number wouldn’t be the probability that at least one pair of convergent insertions happened. It would be the probability that the matching insertions are all convergent. If we consider the human genome insertions as a target template and the chimp genome as a series of darts thrown at that target, we would like to know the chance that all the darts hit a bullseye. For the first dart (insertion) the probability would be n/(3 x 10^9); for the second dart, (n-1)/(3 x 10^9), since hitting the first dart’s target wouldn’t count. That would make the probability of every dart hitting equal to n!/(3 x 10^9)^n. Again, way too small for me to calculate.
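The dart formula can be evaluated if you only ask for its base-10 logarithm. This is a rough sketch under the same assumption of 3 x 10^9 equally likely sites; the function name is hypothetical, and lgamma(n + 1) is used as the natural log of n!.

```python
import math


def log10_all_darts_hit(n: int, sites: float = 3e9) -> float:
    """log10 of n! / sites**n, using lgamma so nothing overflows or underflows."""
    log10_factorial = math.lgamma(n + 1) / math.log(10)
    return log10_factorial - n * math.log10(sites)


for n in (100, 5_000, 200_000):
    print(f"n = {n:>7,}: probability is about 10^{log10_all_darts_hit(n):,.0f}")
```

The printed exponent is the number of zeros to quote, so the result can be stated as "1 in 10 to the power of ..." without ever forming the number itself.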

However, there are a few known cases of convergent insertions. See for example Han, K.-L., E.L. Braun, R.T. Kimball, S. Reddy, R.C.K. Bowie, M.J. Braun, J.L. Chojnowski, S.J. Hackett, J. Harshman, C.J. Huddleston, B.D. Marks, K.J. Miglia, W.S. Moore, F.H. Sheldon, D.W. Steadman, C.C. Witt, and T. Yuri. 2011. Are transposable element insertions homoplasy free?: An examination using the avian tree of life. Systematic Biology , 60: 375-386.


Way too small to calculate? Can we make a comparison? Compared to the number of atoms in the known universe or number of seconds since the big bang?

Is there a suggestion of what we could put in our educational video?

Another classic paper cited in these discussions:

They saw thousands and thousands of insertions spread across all chromosomes and along the lengths of all chromosomes. For the creationist argument to work we would need to see nearly all of them occurring in just a handful of spots, but that’s not the case.

The important thing to note is that scientists aren’t assuming retroviruses insert randomly into the genome (with slight preference for some features that are found in large portions of the genome). Scientists observe that retrovirus insertion is random.


Let’s say the chance of two insertions occurring at the same base is 1 in 10,000, which is a very generous probability. The probability of nearly all 200,000 insertions occurring at the same base would be 1 in 10,000^200,000. Needless to say, 10,000 to the power of 200,000 is pretty big.
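To put a size on that, here is a small sketch that works in powers of ten. The 10^80 figure for atoms in the observable universe is a commonly quoted rough estimate and is my assumption, not something from the papers above.

```python
import math

PER_SITE_ODDS = 10_000       # the generous "1 in 10,000" per shared insertion
SHARED_INSERTIONS = 200_000  # roughly the number of shared ERVs
ATOMS_LOG10 = 80             # assumed: about 10^80 atoms in the observable universe

# log10 of 10,000^200,000, i.e. the number of zeros after the 1.
exponent = SHARED_INSERTIONS * math.log10(PER_SITE_ODDS)
# How many shared insertions it takes for the odds to reach 1 in 10^80.
insertions_to_match_atoms = ATOMS_LOG10 / math.log10(PER_SITE_ODDS)

print(f"odds: 1 in 10^{exponent:,.0f}")
print(f"about {insertions_to_match_atoms:.0f} shared insertions already give 1 in 10^{ATOMS_LOG10}")
```

Under those assumptions, a couple dozen shared insertions already match the atoms comparison, and the genome has hundreds of thousands.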


Hi @T_aquaticus I just skimmed your post for now. I looked to see if any creationists had written about these. I’m just going to link these for now, and I’ll come back to this later. It looks interesting. If you or anyone else wants to respond to any, feel free. Otherwise I’ll see if I have specific questions when I have time to read.

Second question here is a follow-up to the above article: https://creation.com/living-fossils-erv-function

AIG had more articles - I’ll just link three

Have to scroll to the specific section.

I am curious about this. What would be the explanation? It was lost in chimps?

The ERVs are not always consistent with evolutionary expectations. For example, scientists analyzed the complement component C4 genes (an aspect of the immune system) in a variety of primates. Both chimpanzees and gorillas had short C4 genes. The human gene was long because of an ERV. Interestingly, orangutans and green monkeys had the same ERV inserted at exactly the same point. This is especially significant because humans are supposed to have a more recent common ancestor with both chimpanzees and gorillas and only more distantly with orangutans. Yet the same ERV in exactly the same position would imply that humans and orangutans had the more recent common ancestor. Here is a good case where ERVs do not line up with the expected evolutionary progression. Nonetheless, they are still held up as evidence for common ancestry.



Incomplete lineage sorting is most likely. These are a minority of cases that are expected not to follow the overall pattern.

Are there any good studies that show what proportion of ERVs fit the pattern, and what proportion do not?


Hopefully you’re not asking me :) I think you hit Submit too early as that “what” is lonely.

I think you did too. Instead of adversarially citing things you haven’t even bothered to read first, why not start with the evidence itself?

The overwhelming majority fit the pattern in primates:

We screened a total of 3,410,175 Alu-like presence/absence patterns among the three lineages, human (1,167,145 loci), chimpanzee (1,160,650 loci), and rhesus macaque (1,082,380 loci). Selecting the setting Display Perfect in the results part of GPAC yielded the following phylogenetic signals that contradicted the established HCR (++−) relationships: 450 cases for HCR (+−+) and 538 cases for HCR (−++) (Fig. 2). We then manually inspected all loci and added data from blast screens of related species and consensus sequences of diagnostic Alu elements to identify Alu types to help confirm our analysis. Of these, we identified 22 cases of true homoplasy: nine precise parallel insertions, 12 precise deletions (Fig. 3), and one locus with a complex evolutionary scenario that we cannot clearly attribute to either parallel insertion or deletion.

Thus, in accordance with general knowledge (e.g., Perelman et al. 2011), our analyses revealed high support for the human–chimpanzee sister group relationships (HC:HR:CR 27,327:0:0; KKSC insertion significance test P < 10^−298; zeros reflect the absence of ILS and ancestral hybridization signals in the HCR group, according to our assumption; Kuritzin et al. 2016). Given the above data, we estimated the frequency of precise parallel insertions to be 0.01% in both human–rhesus macaque and chimpanzee–rhesus macaque pairs (4/(4215 + 33,954 + 4) × 100% and 5/(1881 + 33,954 + 5) × 100%, respectively). We estimated the frequency of precise deletions in human to be 0.001% (3/(544,034 + 3) × 100%) and in chimpanzee to be 0.002% (9/(544,034 + 9) × 100%).
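The percentages in that quote can be reproduced directly from the counts it gives. A minimal sketch follows; the helper name and variable names are mine, not the paper's.

```python
def percent(events: int, *others: int) -> float:
    """Events as a percentage of events plus all other counted loci."""
    return 100.0 * events / (events + sum(others))


# Precise parallel insertions, using the counts quoted above.
parallel_human_rhesus = percent(4, 4215, 33_954)
parallel_chimp_rhesus = percent(5, 1881, 33_954)

# Precise deletions, using the counts quoted above.
deletion_human = percent(3, 544_034)
deletion_chimp = percent(9, 544_034)

for label, value in [
    ("parallel insertions, human/rhesus", parallel_human_rhesus),
    ("parallel insertions, chimp/rhesus", parallel_chimp_rhesus),
    ("precise deletions, human", deletion_human),
    ("precise deletions, chimp", deletion_chimp),
]:
    print(f"{label}: {value:.4f}%")
```

Each value rounds to the figure quoted in the paper (0.01%, 0.01%, 0.001%, 0.002%).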


I didn’t say it was way too small to calculate, just way too small for me with the software available to me. If we assume that n! contributes little, then how about 1 in 10^(9n) as a first approximation? If n is 5000 that’s 1 in 10^45,000.