I’m not quite clear on what you mean here, as it isn’t the sort of language we use in phylogenetics to describe what you may be talking about. We would expect a certain amount of difference between trees built by different methods, largely because the methods have different assumptions about the nature of the data which may be violated to greater or lesser degree by the data sets. That’s quite different from what you describe in coin flips.

Now, if I understand the coin flip analogy here, it’s similar to the various sites of a DNA sequence being thought of as drawn from some ideal distribution, which methods do usually assume. Purely by chance, we may get a data set that isn’t representative of that distribution, and analysis of such data would produce a false tree. Is that what you mean? Of course that’s possible, though the probability decreases both as sample size increases and as the extent of the required bias increases. In other words, 100 heads in a row is less probable than 10 heads in a row, and 10 out of 10 coins coming up heads is less probable than 7 out of 10. I think the data in our croc paper are well past any significant probability of that error on both counts.
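If it helps to make that intuition concrete, here is a quick binomial calculation (just a sketch of the coin-flip arithmetic, nothing from the paper itself):

```python
from math import comb

def p_heads(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a fair coin (binomial)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Runs of all heads: the probability halves with every extra flip.
print(p_heads(10, 10))    # 0.0009765625 (1/1024)
print(p_heads(100, 100))  # ~7.9e-31

# A milder bias is far more likely than a perfect one.
print(p_heads(7, 10))     # 0.1171875 (120/1024)
```

So both moves, adding more flips and demanding a more extreme outcome, drive the probability of a purely chance result down very fast.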

Incomplete lineage sorting is quite another question. We do expect it to happen, with a frequency inversely related to the branch lengths between divergence events and directly related to population size. The way to deal with it is to assess many independently assorting loci. That’s where that mention of prior publications comes in.
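For a rough sense of how those two factors trade off, the standard three-taxon coalescent result (Hudson; Pamilo and Nei) puts the chance of a gene tree disagreeing with the species tree at (2/3)·exp(−t/2N), where t is the internal branch length in generations and N the effective population size. A sketch with made-up numbers:

```python
import math

def p_discordance(t_generations, N):
    """Probability that a gene tree disagrees with a 3-taxon species tree
    because of incomplete lineage sorting: (2/3) * exp(-t / (2N)),
    the classic Hudson / Pamilo-Nei result."""
    return (2.0 / 3.0) * math.exp(-t_generations / (2.0 * N))

# Longer internal branches -> less ILS; larger populations -> more ILS.
print(p_discordance(1_000_000, 10_000))  # essentially zero
print(p_discordance(100_000, 10_000))    # ~0.0045
print(p_discordance(100_000, 100_000))   # ~0.40
```

Which is exactly why sampling many independently assorting loci works: individual gene trees can disagree, but the species tree signal dominates across loci.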

Not through any formal test, but I’d say the null expectation would be that the data support no tree over any other, so the various tests of support we performed should serve. Check the Kishino-Hasegawa tests and the non-parametric phylogenetic bootstraps.
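To be clear about what the non-parametric bootstrap does: it resamples alignment columns with replacement, reanalyses each pseudo-replicate exactly as the original data, and reports a clade’s support as the fraction of replicates that recover it. A toy sketch of the resampling step (the taxa and sequences here are invented):

```python
import random

def bootstrap_alignment(alignment, rng=None):
    """Build one bootstrap pseudo-replicate by sampling alignment
    columns with replacement.

    alignment: dict mapping taxon name -> sequence string,
               all sequences the same length.
    """
    rng = rng or random.Random()
    length = len(next(iter(alignment.values())))
    # Choose `length` column indices, with replacement.
    cols = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[i] for i in cols)
            for taxon, seq in alignment.items()}

aln = {"TaxonA": "ACGTACGT", "TaxonB": "ACGTACGA", "TaxonC": "ACGAACGA"}
replicate = bootstrap_alignment(aln)
# Each replicate keeps the same taxa and alignment length, but some
# columns appear more than once and others not at all.
```

In practice hundreds or thousands of replicates are analysed, so a clade recovered in, say, 98% of them is robust to which particular sites happened to end up in the data set, which is precisely the sampling worry the coin-flip analogy raises.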