Larry Moran has a good blog post on scientific controls as they relate to detecting function in the genome. I think Moran has done a great job describing why controls are important, and what those controls could be when looking for function in the human genome. Larry Moran cites a paper by Sean Eddy:
What are your thoughts? Would random DNA sequence have function according to the criteria set out in the original ENCODE paper? Could random DNA sequence have actual function, and if so, what impact would it have on this control?
Of course random unselected DNA could also turn out, merely by chance, to have a useful biological function. That's pretty much how junk-DNA evolves de novo into functional protein coding genes.
I'm not sure if I've read that paper before or not, but I've used that exact thought experiment in conversations before. It does a great job of getting into the nature of biochemical interactions that most people just don't intuitively understand. The sorts of biochemical interactions ENCODE describes are not directly indicative of function, at least not in any way that is interesting.
Not only would I expect to see such "ENCODE-functionality" from random sequences, I'd be extremely surprised by its absence.
Random sequences could certainly have function in a general sense, in that they could do "something" that we might find interesting or otherwise alter fitness.
I don't think it would impact the utility of the control in any way we care about. It is still a baseline comparison: On average 1Mb of random DNA has X biochemical interactions and Y "actual" functions compared to W biochemical interactions and Z "actual" functions in 1Mb of some biological sequence.
I would expect upper limits on the length of synthesized DNA fragments, so you would have to clone one or two segments at a time. I would expect an upper limit on BAC size, but I'm sure there are ways around that as well.
If we want to use random sequence as a "no function" control then it could be problematic if random sequence has actual function. However, many negative controls in other data sets have real hits, and significant results are often defined as "number of hits over a random control". Also, if ENCODE would consider almost any random sequence to be 80% functional then it certainly calls their methods into question.
The challenge here is to look at function during all phases of a eukaryotic animal life span starting from embryo development. Certain transcription activity is different depending on the phases from initial embryo development to the adult animal.
That's not a challenge, especially if you use a model organism like mice. The larger challenge is covering all tissue types within each age range.
Let's say we do as you suggest with our random chunk of DNA and get good coverage across tissue types. We will also use the methods and criteria set out in the ENCODE paper. We find that 80% of the random DNA sequence has function according to those criteria. Would this indicate that function is easy to find in DNA sequence?
In our paper in this week's PNAS (http://dx.doi.org/10.1073/pnas.1307449110), we take a stab at answering this question with one of the largest sets of randomly generated DNA sequences ever included in an experimental test of function. We tested 1,300 randomly generated DNAs (more than 100 kb total) for regulatory activity. It turns out that most of those random DNA sequences are active. Conclusion: distinguishing function from non-function is very difficult.
To test DNA for function, we used a new technique to measure whether a piece of DNA can regulate a downstream gene (a barcoded DsRed reporter gene). One way to define functional DNA in the context of this experiment is "any piece of DNA that reproducibly regulates the reporter gene."
We tested about 2,000 native sequences from the genome (more about that in my next post), and, as a negative control, we also tested random DNAs, DNAs created by scrambling the sequences of genomic DNA.
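For anyone who wants to play with the idea, here is a minimal sketch of what "scrambling" a genomic sequence could look like. It assumes a plain shuffle of the bases, which keeps length and base composition but destroys the order; the quoted post does not say which scrambling procedure was actually used, and the `native` fragment and the `scramble` helper are made up for illustration.

```python
import random
from collections import Counter


def scramble(seq: str, rng: random.Random) -> str:
    """Shuffle the bases of seq, preserving length and base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


# Hypothetical "native" fragment, invented for this example only.
native = "ATGCGTACGTTAGCCGATAGCTAGGCTAACGT"
rng = random.Random(42)  # fixed seed so the control is reproducible
control = scramble(native, rng)

# The control has the same length and base counts; only the order differs.
print(len(native) == len(control), Counter(native) == Counter(control))
```

A plain shuffle like this does not preserve dinucleotide frequencies, so a real experiment might want a more careful scrambling scheme.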
As do the numerous random sequence experiments with both RNA and peptides. Sequence space is lousy with function, if you use a sufficiently broad definition of "function". Of course, ID proponents insist on using the narrowest possible definition when discussing random sequence experiments and the broadest possible when discussing junk DNA, because otherwise their entire argument would fall apart.
Either it does something sufficiently important that the region will show up on GWAS, or it is insufficiently important to be relevant to any discussion of ID. And guess what: We already have the answer! Most of the genome is non-functional at the level relevant to ID.
We don't want to use them as "no function" controls, because "function" is nearly impossible to define in a way that is universally useful. We want to use them as exactly what they are: a random control. Or, as your original reference calls it, a "noise control". In any given noisy system some of the noise will, by chance, be "meaningful" by whatever standard is relevant for the system in question. The question is: Does the sequence of interest differ quantitatively or qualitatively from a random sequence in some way, be it biochemical interactions or any other definable metric?
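To make the "random control" comparison concrete, here is a rough sketch under some loud assumptions: the "metric" is just a count of exact matches to a short motif, the random controls are shuffled copies of the same sequence, and the test sequence is invented. None of this is ENCODE's method; `count_hits` and `compare_to_random` are hypothetical helpers that only show the shape of an "observed vs. noise" comparison.

```python
import random


def count_hits(seq: str, motif: str) -> int:
    """Count (possibly overlapping) exact occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def compare_to_random(seq: str, motif: str, n_controls: int = 1000, seed: int = 0):
    """Compare the hit count in seq with hit counts in shuffled controls."""
    rng = random.Random(seed)
    observed = count_hits(seq, motif)
    control_counts = []
    for _ in range(n_controls):
        bases = list(seq)
        rng.shuffle(bases)  # random control with the same base composition
        control_counts.append(count_hits("".join(bases), motif))
    mean_control = sum(control_counts) / n_controls
    # Empirical p-value: fraction of controls with at least as many hits,
    # with a +1 correction so the estimate is never exactly zero.
    p_value = (1 + sum(c >= observed for c in control_counts)) / (n_controls + 1)
    return observed, mean_control, p_value


# Invented sequence of interest: five copies of a motif plus GC-rich filler.
seq = "TATAAA" * 5 + "GCGCGC" * 5
observed, mean_control, p_value = compare_to_random(seq, "TATAAA")
print(observed, mean_control, p_value)
```

The point is that the controls are allowed to have hits. What matters is whether the sequence of interest stands out from that noise distribution.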
Exactly so.
I think we are well past the point of ENCODE's methods for "functional" annotation being called "into question". But then, that is primarily a problem of the immense difficulty of defining "function" in the first place.