Methods summary: Addressing (one of) the challenges of RADseq

Article by Evan McCartney-Melstad and Brad Shaffer from University of California at Los Angeles

RADseq is a great method for gathering genomic data to answer biological questions across many different scales, from phylogenetics to population and landscape genetics. It is fast, inexpensive, and requires no previous knowledge about the species’ genomic architecture. However, with this flexibility come challenges. In this paper we develop and bench test an approach to address what may be the biggest RADseq challenge: how to choose the right sequence similarity threshold that defines whether two non-identical sequencing reads arose from the same or different genomic locations. This problem goes to the heart of evolutionary genetics: if two sequences are homologous, that is, derived from the same ancestral genomic location with subsequent modification through time, then they tell us a great deal about evolutionary history. If they are paralogous, and map to separate locations, then they lack that shared evolutionary history. Getting this straight is perhaps the single most important step in using genomic data for evolutionary inference.

Heat maps showing pairwise data missingness at clustering thresholds of 88% (a) and 99% (b). 

Studies that include relatively distantly related samples, such as those asking phylogenetic or biogeographical questions, should expect that homologous sequences will have diverged over time and therefore require lower similarity thresholds that allow for that divergence. However, if the threshold is set too low, paralogs will be falsely assigned to the same genomic locus, leading to problems ranging from inflated missing data rates to inaccurate measures of genetic diversity. Rather than relying on rough guesses that are preset in software packages, our approach attempts to balance these two competing forces by quantifying the relationship between pairwise genetic relatedness (as estimated directly from the data) and summaries of the RADseq dataset, including pairwise data missingness and the slope of isolation by distance among samples. The relationship between pairwise genetic distance and pairwise data missingness is particularly informative. Although some positive correlation is expected as mutations accumulate in the restriction enzyme sites that RAD relies on, there is often a clear pattern of increased pairwise missingness that occurs when the most divergent homologous allelic variants begin to be erroneously oversplit into different presumptive loci. By explicitly looking for this breakpoint as a function of clustering threshold, researchers can choose a value that allows them to maximize the number of genomic regions recovered while minimizing the erroneous oversplitting of highly divergent but homologous loci.
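The comparison between pairwise genetic distance and pairwise data missingness can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the input layout (one list of genotype calls per sample, with None marking a missing locus), and the definition of pairwise missingness as the fraction of loci not genotyped in both samples are all hypothetical choices made for the example.

```python
from itertools import combinations
from math import sqrt


def pairwise_missingness(calls):
    """Fraction of loci NOT genotyped in both samples, for every sample pair.

    calls: dict mapping sample name -> list of genotype calls, where None
    marks a missing locus (assumed layout; all lists have equal length).
    """
    out = {}
    for a, b in combinations(sorted(calls), 2):
        n_loci = len(calls[a])
        shared = sum(x is not None and y is not None
                     for x, y in zip(calls[a], calls[b]))
        out[(a, b)] = 1 - shared / n_loci
    return out


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def missingness_vs_distance(calls, dist):
    """Correlation between pairwise genetic distance and pairwise missingness.

    dist: dict mapping (sample_a, sample_b), names in sorted order, to a
    genetic distance estimated from the same data (assumed to be precomputed).
    """
    miss = pairwise_missingness(calls)
    pairs = sorted(miss)
    return pearson([dist[p] for p in pairs], [miss[p] for p in pairs])
```

In practice one would call missingness_vs_distance once per candidate clustering threshold, using the genotype calls assembled at that threshold, and look for the threshold at which the correlation rises sharply.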

Citation: McCartney‐Melstad, E, Gidiş, M, Shaffer, HB. An empirical pipeline for choosing the optimal clustering threshold in RADseq studies. Mol Ecol Resour. 2019; 19: 1195–1204. https://doi.org/10.1111/1755-0998.13029

As genomic and ecological data sets grow larger, researchers are flooded with far more information than was available when many conventional model-based approaches were designed. To deal with these massive amounts of data, many researchers have turned to machine learning techniques, which promise to help find signals within the noise of the complex data sets generated by modern sequencing approaches. Applications for machine learning in molecular ecology are broad and include global studies of biodiversity patterns, species delimitation studies, and studies of the genomic architecture of adaptation, among many others. Here at Molecular Ecology Resources, we are excited to highlight research that applies supervised and unsupervised machine learning algorithms to answer questions of interest to the readership of molecular ecology. This special issue will also highlight the nuances and limitations of machine-learning techniques. Rather than focusing on the supposed differences between machine-learning and model-based approaches, this issue aims to highlight the broad spectrum of machine-learning approaches, many of which can incorporate model-based expectations and predictions.

We are soliciting original research that presents novel, robust applications of machine learning methods to molecular data to address questions across ecological disciplines.

Details

Manuscripts should be submitted in the usual way through the Molecular Ecology Resources website. Submissions should clearly state in the cover letter accompanying the submission that you wish the manuscript to be considered for publication as part of this special issue. Pre-submission inquiries are not necessary, but any questions can be directed to: manager.molecol@wiley.com

Special issue editors: Nick Fountain-Jones, Megan Smith & Frédéric Austerlitz

Intra-specific variation and the algal microbiome

Individuals within a species vary, and this variation can have important implications for the role a species may play within ecosystems. We compared the relative importance of variation within a species due to genetic changes within its own genome versus symbiotic interactions between the focal species and its associated bacteria, also called its microbiome. We focused on Microcystis aeruginosa, a globally distributed photosynthetic cyanobacterium, also known as blue-green algae, that often dominates freshwater harmful algal blooms.

Colony of Microcystis aeruginosa from Gull Lake. Colony photographed by O. Sarnelle of Michigan State University and image prepared by John Megahan of University of Michigan.

These blooms have recently become more common and intense worldwide, causing major economic and ecological damage. We studied Microcystis and their associated microbiomes from lakes in Michigan, USA that vary in phosphorus content, which is the primary limiting nutrient in lakes. We found genomic changes among strains of Microcystis along this phosphorus gradient that indicated increased efficiency in the use of phosphorus and nitrogen. Intriguingly, we found that genotypes adapted to different nutrient environments co-occurred in phosphorus‐rich lakes. This co-occurrence may have critical implications for understanding how Microcystis blooms persist for many months, long after nutrients become depleted within lakes. Similar to previous findings in, for example, the human microbiome, we found that the bacteria comprising the microbiomes of Microcystis varied in community composition but were more stable at the level of functional contributions to their hosts across the phosphorus gradient. Finally, while our work was mostly focused on unraveling the genomic underpinnings of nutrient adaptation, we also observed consequences of these differences in Microcystis genome and microbiome composition at a physiological level. In particular, when nutrients were provided in abundance, Microcystis (and its microbiome) that had evolved to thrive in low-phosphorus environments could not grow as rapidly as strains from high-phosphorus environments.

Sara Jackrel, Postdoctoral Fellow, University of Michigan.

Read the full article here.

Citation: Jackrel, SL, White, JD, Evans, JT, et al. Genome evolution and host‐microbiome shifts correspond with intraspecific niche divergence within harmful algal bloom‐forming Microcystis aeruginosa. Mol Ecol. 2019; 28: 3994–4011. https://doi.org/10.1111/mec.15198

Interview with the Author: Conservation of old individual trees and small populations is integral to maintain species’ genetic diversity of a historically fragmented woody perennial

What is the unit of conservation? Is it similar for different types of plants? How can the reproductive biology of an organism inform best practices in conserving threatened species? In her doctoral research, Nicole Bezemer is studying Eucalyptus species from south-western Australia to better understand population dynamics in long-lived organisms and how this can lead to better management of their populations. Surprisingly, many of the small and fragmented populations of the two subspecies of E. caesia she studied are genetically differentiated at a fine spatial scale, and high levels of heterozygosity persist even in populations with a dozen individuals. Nicole and colleagues suggest that the clonal and perennial nature of E. caesia might contribute to these unusual patterns of genetic diversity and divergence, and that traditional conservation genetic approaches might be detrimental for naturally fragmented species with these life-history characteristics. Read here about her experience in developing this research.

A multi-stemmed genet of Eucalyptus caesia at Mocardy Hill, Western Australia. Photo by NB.

What led to your interest in this topic / what was the motivation for this study? 
Eucalyptus caesia is an intriguing study species, given the combination of a distribution on scattered granite outcrops, a long history of geographic and genetic insularity, a capacity for individual longevity via lignotuber re-sprouting, a lack of recent recruitment in most known stands, and adaptation for pollination by nectarivorous birds. After completing my Honours research at the Boyagin stand of E. caesia, I was hooked. The present study came to fruition upon discovering that one of my PhD experiments, involving 6 months of controlled cross-pollinations, was killed by a series of frosts. I had already genotyped two large stands of E. caesia and I was curious about what patterns of genetic structure might exist in other stands, and across the species’ landscape distribution.

What difficulties did you run into along the way? 
Some stands of E. caesia are located on immense granite outcrops, often hidden in hard-to-access gullies or behind thick barricades of vegetation. The first challenging aspect of the project was to find the sub-populations of E. caesia at each new location. For many populations, I did so by embarking on a Google Earth tour led by my supervisor, Steve Hopper, who has worked on the granite outcrop flora of south-west Australia and on E. caesia for nearly four decades. Nonetheless, I spent many hours traversing granite outcrops, sometimes in circles, which occasionally led to finding additional plants or, in the case of the E. caesia at Old Muntadgin, a previously undocumented population of several hundred plants.

What is the biggest or most surprising innovation highlighted in this study? 
I was surprised by the apparent lack of genetic interconnection between some stands over relatively small spatial scales. Given the long history of population fragmentation and the reproductive biology of E. caesia (multiple modes of reproduction and gravity-dispersed seed), I anticipated that high levels of genetic differentiation would feature. Even so, it was surprising to find that, in some instances, the level of genetic differentiation within stands exceeded that among stands. Another interesting result revealed by comprehensive genotyping was the very small census population size of some stands. Seven stands comprised fewer than ten unique multi-locus genotypes, and three locations had only one or two genotypes. Localised clonal reproduction is clearly of paramount importance to the persistence of these stands.
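Counting unique multi-locus genotypes per stand can be illustrated with a short Python sketch. It assumes that each sampled stem is recorded as a (stand, genotype) pair and that two stems belong to the same clone only when their genotypes match exactly. Real clone assignment usually tolerates some genotyping error, so this is a simplification, and the function name and stand labels are hypothetical.

```python
from collections import defaultdict


def genotypes_per_stand(samples):
    """Count distinct multi-locus genotypes in each stand.

    samples: iterable of (stand, genotype) pairs, where genotype is a
    sequence of allele calls across loci (assumed layout). Exact matching
    is used, so any single differing call creates a new genotype.
    """
    unique = defaultdict(set)
    for stand, genotype in samples:
        unique[stand].add(tuple(genotype))
    return {stand: len(genotypes) for stand, genotypes in unique.items()}


if __name__ == "__main__":
    demo = [
        ("stand_1", (101, 103, 220)),
        ("stand_1", (101, 103, 220)),  # same clone sampled twice
        ("stand_1", (101, 105, 220)),
        ("stand_2", (99, 103, 218)),
    ]
    print(genotypes_per_stand(demo))
```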

Moving forward, what are the next steps in this area of research?
The next step is to further test the genetic integrity of the two subspecies, E. caesia subsp. caesia and E. caesia subsp. magna, by genotyping plants from additional stands. Walyamoning and Yanneymooning are geographical outliers to other stands of subsp. caesia and occur within relatively close proximity to the group of subsp. magna populations located in the north-east of the species distribution. We propose to genotype a sample of individuals from the two outlier populations of subsp. caesia stands, and at three additional locations of subsp. magna, to test whether the two subspecies are genetically distinct even when populations are sympatric, and to determine if hybridisation has occurred.

What would your message be for students about to start developing or using novel techniques in Molecular Ecology? 
My message to other young or early-career researchers is to have a clear research outcome in mind before exploring the application of novel techniques. Avoid putting yourself in the position of having to come up with a hypothesis after the fact.

What have you learned about methods and resources development over the course of this project? 
Comprehensive genotyping at multiple spatial scales may provide a more complete picture of spatial genetic structure compared to studies where sampling efforts are focused on few individuals from many populations, or on many individuals from few populations. There is still much to be gained from population genetic studies, especially in understudied, biodiverse, endemism hotspots such as granite outcrops, and in understudied systems such as small, historically fragmented populations of long-lived trees.

Describe the significance of this research for the general scientific community in one sentence.
Anciently fragmented plant populations may be adept at persisting as small populations with low genetic diversity and limited genetic interconnection, and therefore attempts to connect such populations may be ineffective or even harmful.

Describe the significance of this research for your scientific community in one sentence.
Small populations of long-lived woody perennial plants, even those comprising a handful of individuals, may contain unique genotypes that contribute to overall species genetic diversity, and are worthy of conservation.

Enjoying the afternoon light from my field base camp underneath Eucalyptus caesia at Boyagin Rock. Photo by NB.

Interview with the author: A guide to the application of Hill numbers to DNA based diversity analyses

Diversity assessment procedures in traditional and DNA sequencing‐based approaches. Recorded entities need to be classified into types, before each type is weighed according to its relative abundance and the order of diversity (q). Note the example refers to an abundance‐based, rather than incidence‐based, approach

What are Hill numbers? What do they have to do with estimating biodiversity? How can you use them as a molecular ecologist? Read the recent review in Molecular Ecology Resources by Antton Alberdi and Thomas Gilbert on this topic, and read the interview with Antton below to learn how they think about Hill numbers and their applications to metabarcoding. Also, check out hilldiv, “an R package to assist analysis of diversity for diet reconstruction, microbial community profiling or more general ecosystem characterisation analyses based on Hill numbers, using OTU tables and associated phylogenetic trees as inputs. The package includes functions for (phylo)diversity measurement, (phylo)diversity profile plotting, (phylo)diversity comparison between samples and groups, (phylo)diversity partitioning and (dis)similarity measurement. All of these grounded in abundance-based and incidence-based Hill numbers.”

What led to your interest in this topic / what was the motivation for this study? 
Measuring, estimating and contrasting biological diversity are central operations in most ecological studies. In recent decades, dozens of diversity indices and metrics have been proposed, each with its own strengths, weaknesses and mathematical assumptions. The values that many of them yield are difficult to interpret, because they might refer to abstract units that lack a straightforward interpretation for non-specialists. We believe that the statistical framework developed around the Hill numbers overcomes many of these problems, and provides a statistical toolset that is extremely useful for ecologists. Besides, Hill numbers make it possible to incorporate complementary information, such as phylogenetic dissimilarities across organisms, which is really handy for molecular ecologists, who can easily build phylogenetic trees from metabarcoding data.

What difficulties did you run into along the way? 
We are a molecular ecologist and an evolutionary biologist who use many different mathematical tools, but we are not expert mathematicians. Hence, one of the main challenges was to make sure that all the statements and mathematical interpretations were correct!

What is the biggest or most surprising innovation highlighted in this study?
The aim of our review was to demonstrate to ecologists, who like us might have a limited mathematical background, that implementing the framework developed around the Hill numbers is not difficult, and has big potential gains. In our review we gathered information and tools generated by others, mainly Lou Jost, Anne Chao and Chun-Huo Chiu, and presented them in a comprehensive way for molecular ecologists. We have tried to explain complex mathematical formulations in layman’s terms, exactly as we would like others to explain to us topics we are not familiar with. We have provided examples and pieces of code that we hope will encourage other researchers to use these tools.

Moving forward, what are the next steps in this area of research?
Our article mainly focuses on diversity measurement from data generated using DNA metabarcoding. While bioinformatic methods to generate metabarcoding data have received much attention in the last decade, the impact of the statistical approaches used to analyse diversity has been less studied. Assessing their impact and providing guidelines for selecting the tool best suited to address specific questions with specific types of data will be an important next step in the area of metabarcoding-based diversity analyses.

What would your message be for students about to start developing or using novel techniques in Molecular Ecology? 
Despite the fact that they might at first seem complex and abstract, bioinformatic and statistical tools are necessary to address ecological questions. Hence, we would encourage students to try to understand the basic bioinformatic and statistical procedures, so as to be able to select the best tools to address their research questions.

Differences between abundance‐based and incidence‐based Hill numbers. The Hill numbers yielded for the entire system are different depending on the approach employed. In abundance‐based approaches, the DNA sequence is the unit that the diversity is computed on, while in incidence‐based approaches, the sample is the unit upon which the diversity is measured. (*) The asterisk indicates that the equations are undefined for q = 1, thus in practice either the 1D formula shown in Table 1 or a value of q approaching unity must be used, for example, q = 0.9999. However, q = 1 is used for the sake of simplicity
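To make the caption concrete, here is a minimal Python sketch of the standard Hill number of order q, D = (sum of p_i^q)^(1/(1-q)), with the q = 1 case handled through its limit, the exponential of Shannon entropy. It is an illustration only and is not code from the review or from hilldiv: the function names and input layouts are assumptions. The same formula is applied to read counts (abundance-based) or to the number of samples in which each OTU is detected (incidence-based).

```python
from math import exp, log


def hill_number(counts, q):
    """Hill number of order q for a list of non-negative counts.

    counts may be read counts per OTU (abundance-based) or the number of
    samples in which each OTU occurs (incidence-based); zeros are ignored.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    p = [c / total for c in counts if c > 0]
    if abs(q - 1) < 1e-9:
        # The general formula is undefined at q = 1; use its limit,
        # the exponential of Shannon entropy.
        return exp(-sum(pi * log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1 / (1 - q))


def incidence_counts(samples):
    """Number of samples in which each OTU is detected.

    samples: list of sets of OTU names, one set per sample (assumed layout).
    """
    counts = {}
    for detected in samples:
        for otu in detected:
            counts[otu] = counts.get(otu, 0) + 1
    return list(counts.values())


if __name__ == "__main__":
    reads = [90, 5, 5]  # hypothetical read counts for three OTUs
    for q in (0, 0.9999, 1, 2):
        print(q, hill_number(reads, q))
```

For a perfectly even community the result equals the number of types at every q, and for uneven communities the value falls as q increases, because higher orders give more weight to abundant types.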

What have you learned about methods and resources development over the course of this project?
That it’s not always the most broadly employed tools that are the best way to address scientific questions!

Describe the significance of this research for your scientific community in one sentence.
Hill numbers provide powerful, solid and versatile tools with which to carry out most of the analyses that are needed to assess biological diversity within a common statistical framework.

Interview with the author: Killer whale genomes reveal a complex history of recurrent admixture and vicariance

By Robert Pittman – NOAA (http://www.afsc.noaa.gov/Quarterly/amj2005/divrptsNMML3.htm), Public Domain, https://commons.wikimedia.org/w/index.php?curid=1433661. Two killer whales jump above the sea surface, showing their black, white and grey colouration. The closer whale is upright and viewed from the side, while the other whale is arching backward to display its underside.

In this study, Foote et al. investigate the complex demographic history of killer whales and show how episodic gene flow is ubiquitous in their natural populations. This observation adds to the increasing recognition that the traditional geographical characterization of populations (i.e., allopatry, parapatry, and sympatry) is dynamic over time. Although in general it is difficult to perform deep sampling across the range of a species, cut through artificial taxonomic boundaries, and access enough genomic resources for a taxon, their journey is a great example of how to do this, and of how powerful population genetic methods can reveal the history of vagile and widely distributed species.

What led to your interest in this topic / what was the motivation for this study? 
I’ve been working together with Phil Morin at Southwest Fisheries Science Centre for the last ten years, using genetic data to try and unravel the complex demographic and evolutionary history of killer whales. One of the key questions has been whether killer whale ecotypes arose from independent founder events and secondary contact, or through gradual divergence in sympatry. This study started out trying to model those processes (in collaboration with Laurent Excoffier) using genomes we had previously sequenced for a subset of the well-described killer whale ecotypes. We struggled to find a good model to fit the data, and it eventually became clear that we just had too few pieces of the jigsaw to be able to see the complete picture. We decided to cast a wider net and looked back at our previous global study published in Molecular Ecology in 2015, to select a dataset of samples that was representative of the global genetic variation in killer whales for genome sequencing. Having worked in the Centre for GeoGenetics, Copenhagen and the CMPG, Bern, both largely focused on human genetic variation, and being a keen follower of that literature, I saw a great opportunity to apply methods developed in that field to killer whales.

What difficulties did you run into along the way? 
Arguably, the biggest hurdle to overcome was bringing clarity to the very complex relationships between these killer whale populations. This was exacerbated by trying to include too many analyses in earlier drafts. We had a draft manuscript ready almost a year ago, which consisted of two parts: the demographic and evolutionary history of these populations; and the genomic consequences of these different demographic histories. However, this manuscript had become a behemoth! Thankfully, Jochen Wolf, one of the first coauthors to tackle a full read-through of this weighty tome, suggested this might be better digested in separate sittings. So the paper became focused on the evolutionary history and hopefully is an easier read…thanks to Jochen.

What is the biggest or most surprising finding from this study? 
The ghost ancestry in the Antarctic types, which was something I had suspected we might find, was only really possible to test for due to methods being released as we were writing up the paper. Clearly, we weren’t the only ones thinking along these lines, as several other studies on species including seabass and bonobos released similar findings of ghost ancestry around the same time – this is really nicely highlighted in the perspective by Jacobs and Therkildsen, in the same issue of Molecular Ecology.

Moving forward, what are the next steps for this research?
A key interest is how variation in the genomic architecture, principally local recombination rate, influences the frequency of different ancestry components within a population and how that relates to past demographic history. As alluded to above, we have results on the impacts of these complex demographic histories in a study we are just finishing up. As a follow-up, we will explore further the history of the ghost ancestry, to find out if it conveys any benefits (adaptive variants) or costs (mutation load), such as we see in Neanderthal ancestry in modern humans. And ultimately we hope to better understand the underlying processes determining the genetic differentiation between sympatric killer whale ecotypes.

What would your message be for students about to start their first research projects in this topic?
I’d recommend having a good understanding of the concepts, methods and models commonly used in population genetics. I’ve been reading Matt Hahn’s Molecular Population Genetics book and Graham Coop’s Population Genetic Notes, which is freely available to download from Graham’s brilliant blog – gcbias.org. Often methods will give seemingly contradictory results, and so it is important to be able to understand how those analyses work to be able to puzzle out the different signals from different methods. The two resources above will also help you design your sampling scheme and plan your study out ahead of time, so that it is best suited to the question you are trying to address.  

What have you learned about science over the course of this project?
I feel I’ve learned a lot. It has been a labour of love, the sequencing even being partly funded by my Swiss pension scheme which I cashed in when I left Bern. So, I didn’t feel like I had to please anyone but myself, and to be honest, I thought it was such a complex story and quite species-focused that it wouldn’t be of broad interest. But in fact, it is the paper that I’ve had the most direct and positive feedback on from colleagues. So that has been both surprising and satisfying. The lesson I take from that is to always try and work on something that you are passionate about.

I also feel that as I was learning to better understand the methods and the analyses, I was trying really hard to pass that on to the reader, assuming they may be as naïve as I was before I delved into this study. And based on the feedback, that is something that folk appreciate, and which makes the paper more intuitive and transparent. I have tried to expand upon this in a YouTube video.

(a) Sampling locations of the individuals for which 26, 5× coverage genomes were generated (global data set). Marker colours are as per the PCA legend. An additional 20 low coverage genomes (ecotype data set) were used in some analyses, see Foote et al. (2016) for sampling locations. (b) PCA plots of the combined global and ecotype data sets, and (c) the global data set (one sample per population). (d) Individual admixture proportions, conditional on the number of genetic clusters (K = 2 and K = 3), for the combined global and ecotype data sets, and for (K = 2) (e) when only one 5× coverage genome per population from the global data set is included

Describe the significance of this research for the general scientific community in one sentence.
Genome sequences are a record of the many genealogies that comprise our ancestry. Our study highlights how a relatively small number of genomes can reveal the complex relationship among populations, past and present, across the globe.

Describe the significance of this research for your scientific community in one sentence.
Our study highlights that marine scientists need to consider connectivity through time, to past populations, as well as through space, to better understand the genetic composition of present-day populations.