I just published a blog post on this subject, citing 3 lines of evidence pointing to mutations as the cause of intra- and inter-species differences. The subject of this thread is #2.
Really good. Did you use @glipsnort’s graphs or do some of your own? If the former, be sure to cite him. If the latter, can you share source code in git?
It’s the latter. I basically re-did the relevant parts of @glipsnort’s analysis on a larger scale (with his help). I can compile a document describing all the methods and give all the code I used (with Steve’s permission, since he gave me some of his), but to be honest it’s embarrassing how janky it is. It works, but it ain’t pretty. A Frankenstein’s monster of Perl, Python, and parsing using a long series of individual bash commands. Unfortunately, I’m not good enough with this sort of thing to be able to make a single executable script that takes the raw .vcf file as input, does its thing, and then gives the mutation counts as an output.
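For anyone wondering what the core of such a single script might look like, here is a minimal Python sketch. It is not the pipeline described above, just an illustration under a few assumptions: the input is a plain-text, tab-separated VCF, only biallelic single-nucleotide variants are counted, and the function name `count_snv_types` is made up for this example. Note that REF>ALT is not the same as ancestral>derived, so a real analysis would still need to polarize the changes.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}


def count_snv_types(vcf_lines):
    """Count REF>ALT classes among biallelic SNVs in VCF-formatted lines.

    Assumption: REF is column 4 and ALT is column 5, as in the VCF format.
    Indels and multi-allelic sites (ALT such as "G,T") are skipped.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # malformed or empty line
        ref, alt = fields[3].upper(), fields[4].upper()
        if ref in BASES and alt in BASES and ref != alt:
            counts[f"{ref}>{alt}"] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory example instead of a real file.
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\n",
        "1\t100\t.\tC\tT\n",
        "1\t200\t.\tA\tG\n",
        "1\t300\t.\tC\tT\n",
    ]
    for change, n in sorted(count_snv_types(demo).items()):
        print(change, n)
```

To run it on a real file, pass an open file handle instead of the demo list, e.g. `count_snv_types(open("variants.vcf"))`, where `variants.vcf` is a placeholder name. A compressed `.vcf.gz` would need to be opened with `gzip.open(path, "rt")` instead.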
I think building this into an easy to use and understand module has high importance for the conversation. It would give people a chance to look at the data for themselves. Perhaps we make a “summer of code” project for Google? Whatever the case, I think this is important.
I’d be happy to work with any of the more coding-savvy members of the forum to transform my mess of a pipeline into a streamlined, easy to use piece of code that people with minimal expertise could use.
The first step is to make it available. If it is shamefully bad, put it in a private Bitbucket repository and share it with us.
Sure, I’ll work on that and see if I can share it here in the next few days.
I really liked this figure:
An r-squared of 0.993 is freaking amazing, as is the extension of the data into 3 base sequences. One of the problems I saw in the graphs in my thread was the lack of statistical analysis, so it is really awesome to see it here.
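For readers who have not computed an r-squared before, here is a minimal sketch in plain Python. It assumes the figure plots one set of per-category values against another (for example, the share of each mutation type in two different comparisons) and that the reported value is the squared Pearson correlation. That is my assumption, not something stated in the post, and the function name `r_squared` is just for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when one sequence is constant")
    return (sxy * sxy) / (sxx * syy)
```

A value near 1 means the points fall almost exactly on a straight line, and a value near 0 means there is no linear relationship between the two sets of values.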
@AndyWalsh this is another place where some coding and simulation work would be really helpful, to allow average people to understand this graph.
In the name of community peer review, there are a few nits to pick in this paragraph from the ancestral allele section:
It is also possible that both alleles were already present in the common ancestor of chimps and humans, i.e. the site was polymorphic. One allele then reached fixation in the chimp lineage while the site remained polymorphic in the human lineage. As you mention, incorporating other ape genomes can clear up a lot of the noise. It may be a bit confusing, but you could work incomplete lineage sorting (ILS) into the discussion as well. @swamidass is always on me for not mentioning the sources of noise in phylogenies, so I thought I would continue the PS tradition.
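To make the outgroup point concrete, here is a rough Python sketch of a conservative parsimony rule: call an allele ancestral only when the chimp and a second ape (say, gorilla) carry the same base and that base is one of the human alleles. Everything else is left ambiguous. The function name `infer_ancestral` and the choice of a single extra outgroup are assumptions for illustration, not the method used in the blog post, and the rule does not model ILS or recurrent mutation.

```python
def infer_ancestral(human_alleles, chimp, outgroup):
    """Return the inferred ancestral base, or None when the call is ambiguous.

    human_alleles: set of bases segregating in humans, e.g. {"C", "T"}
    chimp, outgroup: the single base seen in each reference genome
    """
    if chimp == outgroup and chimp in human_alleles:
        return chimp
    # Chimp and outgroup disagree, or their shared base is not a human allele:
    # old polymorphism, ILS, or a second mutation could all explain this.
    return None
```

For example, `infer_ancestral({"C", "T"}, "C", "C")` returns `"C"`, while `infer_ancestral({"C", "T"}, "C", "T")` returns `None` because the two apes disagree.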
I’d be happy to work with any of the more coding-savvy members of the forum to transform my mess of a pipeline into a streamlined, easy to use piece of code that people with minimal expertise could use.
I can’t promise that I’m any more savvy than you are (my first response when I see perl is to run in the opposite direction), but I’m happy to attempt to help with this.