How does the large diversity in humanity’s genome get used in a reference genome? It seems like comparing everyone with an “average” or “special” genome wouldn’t be that useful. What am I missing?
The reference genome is a way of compressing and aligning genomes. When we get a bunch of genetic data, we need to know what part of the genome each read maps to, and in what ways it differs from the reference. Often we then store only the ways in which our new genome differs from the reference. If we can’t align a read to the reference genome, we might just throw that read away.
So expanding the reference makes it less likely that we will throw genetic data away, and making it more “average” might make our compressed genomes (the VCF file, which stores only the differences) more compact. Of these two effects, the first is the more consequential.
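To make the "store only the differences" idea concrete, here is a toy sketch in Python. It assumes the sample is already aligned to the reference and is the same length, so it only handles single-base substitutions; real variant callers and the VCF format also deal with insertions, deletions, genotypes and quality scores. The function names `call_variants` and `apply_variants` are made up for illustration and are not from any library.

```python
def call_variants(reference, sample):
    """Return (1-based position, ref base, alt base) for every mismatch.

    Toy assumption: both sequences are already aligned and equal in length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    return [
        (pos + 1, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, sample))
        if ref != alt
    ]


def apply_variants(reference, variants):
    """Rebuild the sample sequence from the reference plus the stored differences."""
    seq = list(reference)
    for pos, ref, alt in variants:
        if seq[pos - 1] != ref:
            raise ValueError(f"reference mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

variants = call_variants(reference, sample)
print(variants)  # [(5, 'A', 'T'), (10, 'C', 'A')]
print(apply_variants(reference, variants) == sample)  # True
```

The closer the reference is to the sample, the shorter the list of differences, which is the sense in which a more "average" reference could give more compact files.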
For some parts of the genome with the most complicated large-scale variation, alternative reference segments are available as part of the reference.
Who wants to try and align these new sequences to the Chimp genome assembly?
Where is @roohif when we need him? Also, remember we have new ape genomes now too, constructed without a human reference. Great opportunity to test Thompkin’s Hypothesis about human and chimp similarity.
I’m happy to give it a crack! Pretty snowed under at the moment (preparing to move house right before Christmas!) so I won’t really get much of a chance until January. I’ll stick a reminder for myself in the calendar.