RADseq is a powerful method for gathering genomic data to answer biological questions across many scales, from phylogenetics to population and landscape genetics. It is fast, inexpensive, and requires no prior knowledge of the species’ genomic architecture. However, with this flexibility come challenges. In this paper we develop and bench-test an approach to address what may be the biggest RADseq challenge: how to choose the sequence similarity threshold that determines whether two non-identical sequencing reads are treated as arising from the same or different genomic locations. This problem goes to the heart of evolutionary genetics. If two sequences are homologous, that is, derived from the same ancestral genomic location with subsequent modification through time, then they tell us a great deal about evolutionary history. If they are paralogous, and map to separate locations, then they lack that shared evolutionary history. Getting this straight is perhaps the single most important step in using genomic data for evolutionary inference.
Studies that include relatively distantly related samples, such as those asking phylogenetic or biogeographical questions, should expect that homologous sequences will have diverged over time, and therefore require lower similarity thresholds that allow for that divergence. However, if the threshold is set too low, paralogs will be falsely assigned to the same genomic locus, leading to problems ranging from inflated missing data rates to inaccurate measures of genetic diversity. Conversely, if the threshold is set too high, divergent alleles of the same locus are split into separate presumptive loci. Rather than relying on the rough guesses that are preset in software packages, our approach attempts to balance these two competing forces by quantifying the relationship between pairwise genetic relatedness (as estimated directly from the data) and summaries of the RADseq dataset, including pairwise data missingness and the slope of isolation by distance among samples. The relationship between pairwise genetic distance and pairwise data missingness is particularly informative. Some positive correlation is expected, because mutations accumulate in the restriction enzyme recognition sites that RAD relies on, but there is often a clear increase in pairwise missingness at the point where the most divergent homologous allelic variants begin to be erroneously oversplit into different presumptive loci. By explicitly looking for this breakpoint as a function of clustering threshold, researchers can choose a value that maximizes the number of genomic regions recovered while minimizing the erroneous oversplitting of highly divergent but homologous loci.
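The relationship between pairwise genetic distance and pairwise missingness can be illustrated with a minimal Python sketch. It is not the published pipeline. It assumes one genotype matrix per clustering threshold (samples by loci, alternate-allele counts 0, 1, 2, with -1 for a locus not recovered in a sample). The function names, the missing-data code, and the simple definitions of distance and missingness are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

MISSING = -1  # assumed code for a genotype that was not recovered


def pairwise_summaries(genotypes):
    """Return (distance, missingness) arrays over all sample pairs.

    genotypes: 2D integer array, samples x loci, holding alternate-allele
    counts (0, 1, 2) and MISSING where a locus is absent in a sample.
    """
    genotypes = np.asarray(genotypes)
    distances, missingness = [], []
    for i, j in itertools.combinations(range(genotypes.shape[0]), 2):
        both = (genotypes[i] != MISSING) & (genotypes[j] != MISSING)
        # Fraction of loci missing in at least one sample of the pair.
        missingness.append(1.0 - both.mean())
        # Mean allele-count difference over loci called in both samples,
        # scaled to the range 0..1 (an illustrative distance, by assumption).
        if both.any():
            diff = np.abs(genotypes[i, both] - genotypes[j, both])
            distances.append(diff.mean() / 2.0)
        else:
            distances.append(np.nan)
    return np.array(distances), np.array(missingness)


def missingness_slope(genotypes):
    """Least-squares slope of pairwise missingness on pairwise distance."""
    dist, miss = pairwise_summaries(genotypes)
    keep = ~np.isnan(dist)
    # A slope is undefined with fewer than two pairs or no spread in distance.
    if keep.sum() < 2 or dist[keep].max() == dist[keep].min():
        return float("nan")
    slope, _intercept = np.polyfit(dist[keep], miss[keep], 1)
    return float(slope)


def slopes_by_threshold(matrices):
    """Map each clustering threshold to its missingness-vs-distance slope.

    matrices: dict of {threshold: genotype matrix assembled at that threshold}.
    """
    return {t: missingness_slope(g) for t, g in sorted(matrices.items())}
```

Plotting the values returned by `slopes_by_threshold` against the threshold is one way to look for the point at which missingness starts to rise sharply with genetic distance. Choosing the breakpoint is left to visual inspection here, since the sketch does not attempt to automate that judgment.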
Citation: McCartney‐Melstad, E, Gidiş, M, Shaffer, HB. An empirical pipeline for choosing the optimal clustering threshold in RADseq studies. Mol Ecol Resour. 2019; 19: 1195–1204. https://doi.org/10.1111/1755-0998.13029