The coronavirus outbreak has been accompanied by a huge amount of sequencing data, as well it should be. Nextstrain.org is a great place to see this in action: region by region, the spread of the infection can be tracked, often with enough detail to say where the virus must have come in from and how many different starting points it’s had. That all depends on how many different strains are detected in the first place, of course, and in regions (like the US!) where we’re not even doing enough basic RT-PCR swab tests to know the prevalence of the virus as a whole, we’re surely missing a lot of information about deeper things like viral sequence, the number of different mutations, and how they’re distributed. GISAID is another large repository of such data, and it’s growing day by day.
The historical example that this inevitably calls to mind is from World War II: mathematician Abraham Wald was given the job of analyzing the patterns of damage seen in returning combat planes, with an eye to where armoring could be improved. The initial idea was that the areas with the most holes were perhaps getting hit often and should be shored up – but Wald pointed out the survivorship bias problem: these places actually indicated where a plane could take damage and still be able to return. He believed that the distribution of projectiles was probably fairly even, meaning that regions on the aircraft where no shell holes were ever detected were probably the crucial ones to armor! (Note: the accounts of this have been embellished over time, but the fundamental story is accurate – see this post at the American Mathematical Society about the math behind Wald’s work, and note especially the postscript). In those plots above, we are seeing the places that you can shoot through the coronavirus genome and still return a working pathogen.
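If you want to convince yourself of the survivorship effect, a toy simulation does the job. What follows is only a sketch under made-up assumptions, not Wald's actual method: the four aircraft sections, the per-hit loss probabilities in `LOSS_PROB`, and the three hits per plane are all hypothetical numbers picked for illustration. Hits land evenly across the sections, and the holes are then counted only on the planes that make it back.

```python
import random
from collections import Counter

# Hypothetical chance that a single hit in each section downs the plane.
# These values are invented for illustration, not historical data.
LOSS_PROB = {"engine": 0.6, "cockpit": 0.5, "fuselage": 0.1, "wings": 0.05}


def simulate(n_planes=20000, hits_per_plane=3, seed=0):
    """Return (hits on all planes, hits seen on returning planes) by section."""
    rng = random.Random(seed)
    sections = list(LOSS_PROB)
    all_hits = Counter()
    survivor_hits = Counter()
    for _ in range(n_planes):
        # Assumption: every section is equally likely to be hit.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        all_hits.update(hits)
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() >= LOSS_PROB[s] for s in hits)
        if survived:
            survivor_hits.update(hits)
    return all_hits, survivor_hits


all_hits, survivor_hits = simulate()
for section in LOSS_PROB:
    print(f"{section:9s} all hits: {all_hits[section]:6d}   "
          f"holes on returning planes: {survivor_hits[section]:6d}")
```

The hits are spread about evenly over the whole fleet, but the returning planes show far fewer holes in the sections where a hit is most often fatal. The same logic applies to a genome: the positions where you see few or no mutations in circulating virus are plausibly the ones where a mutation keeps the virus from "returning" at all.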