Foundations of Concordance Views of Phylogeny
Joel Velasco
A common picture in evolutionary biology is that of a ‘species tree’ that describes the evolution of populations through time, with genes tracing their ancestry within and between the ‘fat branches’ of this tree. But a natural thought is that a phylogeny should represent the flow of genetic information and that this is, in some sense, more fundamentally what evolutionary history is about. In this picture, the ‘dominant history’ is defined by looking directly at the history of the genes themselves rather than that of the organisms.
Here we consider phylogenies as defined by ‘concordance methods’, which are an attempt to make precise the notion of a phylogeny as representing the dominant genetic history. Concordance views get their start from Baum and Shaw (1995) who sought to define species as genealogical units of phylogeny using the concept of ‘an exclusive group’: ‘A group of organisms is exclusive if their genes coalesce more recently within the group than between any member of the group and any organisms outside the group’ (p. 296). It is difficult to precisely define this notion.
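One way to see why the notion resists precise definition is to write down the simplest reading of it. The sketch below checks exclusivity at a single locus only, and it assumes that pairwise coalescence times are already known. The function name `is_exclusive` and the `coal_time` table are hypothetical, introduced purely for illustration. How to combine such verdicts across many loci is exactly what the single-locus reading leaves open.

```python
from itertools import combinations


def is_exclusive(group, coal_time):
    """Single-locus reading of exclusivity (an illustrative assumption).

    coal_time maps frozenset({x, y}) to the time, measured back from the
    present, at which the gene copies of x and y coalesce at this locus.
    The group counts as exclusive if every pair inside it coalesces more
    recently than any pair made of one member and one outsider.
    """
    group = set(group)
    everyone = {x for pair in coal_time for x in pair}
    outside = everyone - group

    within = [coal_time[frozenset((a, b))]
              for a, b in combinations(sorted(group), 2)]
    between = [coal_time[frozenset((a, c))]
               for a in group for c in outside]

    if not within or not between:
        return True  # singletons and the whole sample are trivially exclusive
    return max(within) < min(between)


# Hypothetical coalescence times matching the gene tree (((a, b), c), d).
times = {
    frozenset("ab"): 1.0,
    frozenset("ac"): 3.0, frozenset("bc"): 3.0,
    frozenset("ad"): 5.0, frozenset("bd"): 5.0, frozenset("cd"): 5.0,
}

print(is_exclusive({"a", "b"}, times))       # a and b coalesce before either meets c or d
print(is_exclusive({"a", "c"}, times))       # a coalesces with b before it coalesces with c
print(is_exclusive({"a", "b", "c"}, times))
```

With more than one locus the verdicts can conflict, which is where concordance-based definitions come in.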
Velasco (2010) suggests that for a and b to be in the same taxon, they have to be more closely related to each other than to any organism outside the group. In practice, this has turned into a definition based on the ‘reciprocal monophyly of gene trees’, which is a concordance-factor-based definition. Many authors assume that the two definitions are the same (for example, Hudson and Coyne (2002) and Velasco (2010)), but in fact they can come apart.
We show that this ‘triplet-based’ definition of exclusivity turns out to be equivalent to the R* consensus method discussed in several places, including Degnan and Rosenberg (2009). It can be shown that the R* clusters form a unique tree, but that this tree can differ from the plurality consensus tree. Both refine the majority consensus tree, and both are reasonable candidates for the definition of a clade or of an exclusive group. Further complicating matters, Ané et al. (2007) implement a particular procedure for estimating ‘the primary concordance tree’, but their program, BUCKy (updated in Larget et al. (2010)), uses a ‘greedy consensus’ algorithm to construct the concordance tree. This procedure produces a greedy consensus tree, which can fail to correspond to either the R* or the plurality consensus tree.
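To make the triplet-based construction concrete, here is a minimal sketch of R* clusters under the usual statement of the method: a set of taxa is a cluster when, for every pair a, b inside it and every c outside it, the rooted triplet ab|c appears in strictly more gene trees than either ac|b or bc|a. Gene trees are assumed to be rooted and written as nested tuples. All function names are hypothetical, and the search over all subsets is exponential, so this is suitable only for very small examples.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, set of clades) for a rooted nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def triplet_counts(trees):
    """Count, for each rooted triplet xy|z, the gene trees that display it."""
    counts = {}
    for tree in trees:
        leaves, found = clades(tree)
        for a, b, c in combinations(sorted(leaves), 3):
            for (x, y), z in (((a, b), c), ((a, c), b), ((b, c), a)):
                # xy|z is displayed if some clade holds x and y but not z.
                if any(x in k and y in k and z not in k for k in found):
                    key = (frozenset((x, y)), z)
                    counts[key] = counts.get(key, 0) + 1
    return counts


def favored(counts, x, y, z):
    """True if xy|z strictly outnumbers both xz|y and yz|x."""
    n = counts.get((frozenset((x, y)), z), 0)
    return (n > counts.get((frozenset((x, z)), y), 0)
            and n > counts.get((frozenset((y, z)), x), 0))


def r_star_clusters(trees, taxa):
    """Brute-force R* clusters: every inside pair beats every outsider."""
    counts = triplet_counts(trees)
    taxa = sorted(taxa)
    result = []
    for size in range(2, len(taxa)):
        for group in combinations(taxa, size):
            outside = [t for t in taxa if t not in group]
            if all(favored(counts, a, b, c)
                   for a, b in combinations(group, 2) for c in outside):
                result.append(set(group))
    return result


# Three hypothetical gene trees: two agree, one groups a with c instead of b.
gene_trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "c"), "d"),
    ((("a", "c"), "b"), "d"),
]
print(r_star_clusters(gene_trees, ["a", "b", "c", "d"]))
```

In this small example the clusters found are {a, b} and {a, b, c}. The group {a, c} fails because the triplet ac|b appears in only one tree while ab|c appears in two.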
Once we have formally defined concordance factors and concordance trees, we can ask what role they might play in systematics. Numerous authors treat the primary concordance tree as an estimate of the species tree, and it has been shown that, so understood, the concordance tree is not always the best estimate we can make. For example, Leaché and Rannala (2011) use simulated data to argue that in a range of circumstances, BUCKy is outperformed by BEST (Bayesian Estimation of Species Trees, as developed in Liu 2008).
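As a rough illustration of the objects under discussion, the sketch below treats the concordance factor of a clade as the proportion of sampled gene trees that contain it, and builds a tree greedily by accepting clades in order of decreasing concordance factor whenever they are compatible with those already accepted. Both choices are simplifying assumptions made here for illustration. This is not BUCKy's implementation, which works with Bayesian estimates rather than raw counts of fixed gene trees, and all names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return (leaf set, set of clades) for a rooted nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def concordance_factors(trees):
    """Proportion of gene trees containing each nontrivial clade."""
    counts = Counter()
    for tree in trees:
        leaves, found = clades(tree)
        counts.update(c for c in found if c != leaves)  # skip the root clade
    return {c: n / len(trees) for c, n in counts.items()}


def compatible(x, y):
    """Two clades fit on one tree if they are nested or disjoint."""
    return x <= y or y <= x or not (x & y)


def greedy_consensus(trees):
    """Accept clades by decreasing concordance factor when compatible."""
    cf = concordance_factors(trees)
    # Ties are broken alphabetically here, which is an arbitrary choice.
    ranked = sorted(cf, key=lambda c: (-cf[c], sorted(c)))
    accepted = []
    for clade in ranked:
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted


# Three hypothetical gene trees: two agree, one groups a with c instead of b.
gene_trees = [
    ((("a", "b"), "c"), "d"),
    ((("a", "b"), "c"), "d"),
    ((("a", "c"), "b"), "d"),
]
for clade in greedy_consensus(gene_trees):
    print(sorted(clade), concordance_factors(gene_trees)[clade])
```

Because the greedy pass depends on the order in which clades are considered, ties in concordance factor can change which clades are accepted, which is one reason the result need not match either the R* or the plurality consensus tree.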
There are also theoretical worries. For example, the results of Degnan and Rosenberg (2006, 2009) can be used to construct arguments that, at least in theory, concordance analysis is not a statistically consistent estimator of the species tree and that the concordance tree is not the best estimate of the species tree. But much of the above line of thinking is based on a fundamental mistake. On the concordance view, the primary concordance tree cannot possibly be a bad estimate of the species tree. It just is the species tree. Baum (2007, 2009) and Velasco (2010) are all clear on the point that the very meaning of ‘taxon’, ‘clade’ and ‘species’ is tied to the primary concordance tree. So the primary concordance tree is meant to be an object of interest in itself and not merely (or even primarily) a useful means to infer the species tree as defined by the population history. In addition, it is important to note that the assumptions built into the multispecies coalescent framework are quite extreme. It is not even clear what the species tree is or how to define it when some of these assumptions are violated (such as when there is hybridization across lineages) or when not all of the taxa under examination have the same set of genes. By contrast, we can give natural generalizations of the plurality consensus and R* methods to the supertree setting, and these turn out to have a number of very desirable properties.
And while a series of authors have shown or accepted claims linking concordance trees to species trees, it is important to understand the limits of what has been shown. The theorems mentioned here all relate concordance trees to species trees that give rise to gene trees in exact accordance with strict multispecies coalescent models, which assume random mating within lineages, instantaneous splitting of lineages into new species, no selection, and no sources of discordance other than incomplete lineage sorting. Even under such strict assumptions, species trees may differ from concordance trees. When they do, this is a case of expectations differing from outcomes due to chance factors. We think there are reasons to favor the view that phylogenies should track the outcomes rather than the expectations. When the assumptions of the model are violated, it might not even be clear what ‘the species tree’ refers to, and when it is clear, there are good reasons to take concordance trees to be a better interpretation of phylogeny. For example, it is worth noting that genetic history is epistemologically tractable while organismal history often is not.
But even setting this aside, we agree with Baum (2009) who argues that, “When the realized patterns of genetic relatedness deviate from expectation, I think we should recognize taxa based on what actually happened rather than worry about what should have happened. In particular, it seems preferable to adopt gene genealogical exclusivity, which is sensitive to a history of selection, rather than organismic exclusivity, which is not.”