Hidden forests in the trees: erroneous species boundaries from genomic approaches
Lacey Knowles & Jeet Sukumaran
Scientists rely upon the accurate detection of species to address questions about biodiversity and biology more generally. The unprecedented amount of DNA sequence data made available by recent technological advances is changing how biologists identify species. Specifically, applications of genomic data have great power to reveal the boundaries separating species. However, this technological advance for delimiting species was only made possible by a fundamental shift in how DNA is conceptualized in the context of different species: a focus on the lineages of species themselves, as opposed to the lineages of individual genes. In this framework, species lineages may contain shared gene lineages, and differences in the ancestries of gene lineages do not confound efforts to delimit species. For example, under the model used for inferring species boundaries from genomic data (i.e., the multispecies coalescent), statistical statements about the probabilities of species boundaries and the number of species represented in a collection of genomic sequences from different individuals are based on patterns of discord across genes. This contrasts with a tradition of relying on a concordance principle for inferences about what are and are not species, whether the criterion was monophyly (i.e., concordance between the splits in gene trees and a tree of species relationships) or concordance across independent data (i.e., seeking corroborating evidence based on concordance across genes).
These conceptual (and analytical) shifts away from concordance were critical for avoiding the contrivance that concordance criteria impose on species delimitation. That is, DNA could be used to delimit species without having to wait for concordance to accrue over evolutionary time through genetic drift, which happens only long after the time of speciation (i.e., when new species are formed). However, this shift away from concordance, and the reliance upon statistical evaluation of the expected amount of discord in genomic data under different models of putative species boundaries, have created their own unique set of challenges. Specifically, with increased amounts of sequence data, the genetic differences that are detected are not just associated with species boundaries, but include genetic differences among populations within species. That is, for applications of genomic data to species delimitation, the theoretical ideals of the methods currently being applied are clashing with the biological realities of how new species form: speciation is not an instantaneous event but is protracted over time. Our approaches that aim to harness the power of genomic data are missing the mark when it comes to accurately detecting species boundaries. This has profound implications across biology, because species are the basic unit of reference for framing biological questions.