This surprised me from the article:
Specifically, the world’s reference genome was assembled from the nucleic acid sequences of a handful of anonymous volunteers. Other researchers later determined that 70 percent of the reference genome derives from a single individual who was half European and half African, and the rest derives from multiple individuals of European and Chinese descent, according to Salzberg.
My opinion means nothing but I would have thought that a genome database should include DNA from the San people, who carry some of the oldest and most divergent Y-chromosome and mitochondrial DNA haplogroups.
@AJRoberts will love this.
This is a low-level advance but fundamental for the field. The human reference genome is like the kilogram for physics, and we need to get it right.
A good question to ponder: what percentage of the genome is 300 million bits? Those paying attention should have a very good estimate without reading the paper.
About 10%?
How did you compute that number?
300 million out of 3 billion = 10%
Depends on whether they mean 300 million pieces of DNA or 300 million bits of information.
Math error here. You forgot a unit conversion. In this context it is best to declare your assumptions too.
@glipsnort I didn’t even think of that meaning too. @AndyWalsh will be vectoring on that one.
Judging from the linked piece, they mean base-pairs.
I agree. I think the author means “bits” as in portions of the genome, not the unit of information. It seems that the reference genome is missing about 10% of the human genome, and that missing 10% could be filled in by looking at the genomes of Africans.
Here is the actual paper. It is about base pairs and the number of reads over the genome.
https://www.nature.com/articles/s41588-018-0273-y
Up to 300 million base pairs are missing from the reference genome.
I may be getting confused but it seems to me that 3 billion base pairs requires 6 billion bits of computer storage because each base pair is one of four possibilities (ATGC), so that’s 2 bits for each.
Whatever the case, “bits of DNA” seems potentially confusing.
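To make the two readings concrete, here is a rough Python sketch of the arithmetic. It assumes a round 3 billion base pairs for the genome and a naive 2 bits per base pair (four possible bases, no compression). The function names are just made up for illustration.

```python
# Rough arithmetic for the two readings of "300 million bits".
# Assumptions (not from the paper): a round 3 billion base pairs,
# and a naive encoding of 2 bits per base pair (A, C, G, T).

GENOME_BP = 3_000_000_000   # assumed round genome size in base pairs
BITS_PER_BP = 2             # log2(4) for four possible bases
MISSING = 300_000_000       # the "300 million" from the article


def fraction_if_base_pairs(missing_bp, genome_bp=GENOME_BP):
    """Share of the genome if '300 million' counts base pairs."""
    return missing_bp / genome_bp


def fraction_if_bits(missing_bits, genome_bp=GENOME_BP, bits_per_bp=BITS_PER_BP):
    """Share of the genome if '300 million' counts bits of information."""
    missing_bp = missing_bits / bits_per_bp
    return missing_bp / genome_bp


# Storage needed for the whole genome under the 2-bit assumption.
genome_bits = GENOME_BP * BITS_PER_BP
genome_megabytes = genome_bits / 8 / 1_000_000

# 300 million bits expressed in bytes (8 bits per byte).
missing_bytes = MISSING / 8

print(f"Read as base pairs: {fraction_if_base_pairs(MISSING):.0%} of the genome")
print(f"Read as bits:       {fraction_if_bits(MISSING):.0%} of the genome")
print(f"Whole genome:       {genome_bits:,} bits (~{genome_megabytes:,.0f} MB)")
print(f"300 million bits:   {missing_bytes:,.0f} bytes")
```

So the answer is about 10% if the figure means base pairs and about 5% if it means bits of information, which is the unit conversion that is easy to forget.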
Perhaps it is a matter for CPAs and they are missing $37.5M in grant funding.
But the real question is: how much bigger would the human genome need to be before its entropy eclipses that of the Sun?
The paper had nothing to do with “bits”. It was poor reporting by Science Digest. The paper has to do entirely with base pairs.
Then this looks just about right:
The paper is very good. It really shows how incomplete the “reference” genome is, and how much diversity it lacks. By adding in diversity from African genomes and from other parts of the world, we can start doing analysis with all of the diversity that we have from being admixes of admixes of admixes.
Meanwhile, I was under the impression that the human genome was something like 3.3 million base pairs. I’m not trying to be pedantic but if we are concerned with 10% out of the total number of base pairs, then that 0.3 million would seem quite significant.
Or did I remember incorrectly about the 3.3 million? (No, I haven’t read the article yet because Joshua’s challenge was “without reading the article” or whatever.)
Yeah, well, sorta. The problem is that there really is no such thing as “the human genome”. The reference genome is missing a bunch of bits that some humans have.
Heh – this takes me back. I may have been the first person to make that estimate, or at least the first person known to those running the project. Eric Lander asked me to figure out what the ancestry of the primary donor to the reference genome was. According to an email I sent him in 2002, I came up with two estimates (based on small sets of informative markers): (1) 60% African, 40% non-African, and (2) 55% African, 45% non-African. Others did a much better job later, of course.
This is the kind of fascinating, first-hand anecdote which brings me daily to the Peaceful Science forum. (One sure doesn’t find this kind of post on the AIG Facebook page.)
Collectively, the PS posters have had a great many interesting experiences in life.
The human genome is estimated to be over 3 billion base pairs, but there are still sequencing gaps in it. This paper found that as many as 300 million base pairs are missing from the reference genome, which is more European than it should be. That will bias any analysis that uses the reference genome. What is needed is many reference genomes across time and location.
I’m not sure about this. That is not how the reference genome is used.