Interview with the author: Creating the SPIKEPIPE metagenomic pipeline

Obtaining reliable abundance estimates is a significant challenge for eDNA metagenomic studies. One important issue is that sequencing introduces multiple sources of noise that can substantially reduce the accuracy of abundance estimates. Here we interview Douglas Yu, a professor at the University of East Anglia, about the SPIKEPIPE pipeline recently published in Molecular Ecology Resources. This method is particularly exciting because it can use either short-read barcodes or mitogenome data to estimate species abundances, accounting for sequencing noise with correction factors. The authors test this eDNA pipeline on arthropod samples taken from the High Arctic in Greenland and show, using samples of known composition, that the approach can produce remarkably accurate species abundance estimates. Read the full article here and get the code to run this pipeline here.

[Figure: The 5 steps of SPIKEPIPE.]

What led to your interest in this topic / what was the motivation for this study? 

We very much want to know how a heating climate is affecting biodiversity. Greenland is a direct window into this, both because heating has progressed very fast here, and because local species richness is manageable for study:  375 known aboveground arthropod species at the Zackenberg research station. Equally important, the Danish research station at Zackenberg had had the foresight to systematically collect arthropods starting in 1996, and those samples were sitting in ethanol in a warehouse in Denmark. The main obstacle to using them had been that no one could identify the hundreds of thousands of individuals to species level. Luckily, Helena Wirta and Tomas Roslin had in parallel carried out a DNA barcoding campaign at Zackenberg. Put together, the samples and the barcode library gave us a complete time series of community dynamics over a stretch of time during which summer had almost doubled in length.

What difficulties did you run into along the way? 

When we started, we were all set to use metabarcoding. However, we soon learned (not surprisingly) that the sample-handling protocols had not been designed with molecular methods in mind:  the trap water was reused across time periods, the collecting net was used across traps, and the sorting trays were not bleached between samples. We thus needed a protocol that would be robust to cross-sample contamination and would ideally return quantitative information, since we wanted to detect change in population dynamics. This is why we turned to mitochondrial metagenomics (Tang et al. 2015, Crampton-Platt et al. 2016) and came up with SPIKEPIPE, which combines read-mapping, a percent-coverage detection threshold, and a spike-in to correct for pipeline stochasticity. 
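As a rough illustration of how those three elements fit together, here is a minimal Python sketch. It is not the SPIKEPIPE code itself: the names MappingResult and spike_corrected_table, the 0.5 coverage threshold, and the read counts are all hypothetical, and the real pipeline applies further correction factors. The sketch assumes reads have already been mapped to reference sequences, keeps a species only if enough of its reference is covered by reads, and divides each species' read count by the spike-in read count so that differences in sequencing depth cancel out.

```python
from dataclasses import dataclass


@dataclass
class MappingResult:
    """Hypothetical per-species summary of a read-mapping run."""
    species: str
    mapped_reads: int   # reads mapped to this species' reference
    covered_bases: int  # reference positions covered by at least one read
    ref_length: int     # length of the reference sequence


def spike_corrected_table(results, spike_reads, min_coverage=0.5):
    """Return {species: mapped_reads / spike_reads} for accepted species.

    min_coverage is an illustrative detection threshold (an assumption):
    a species is accepted only if at least this fraction of its reference
    is covered by mapped reads.
    """
    if spike_reads <= 0:
        raise ValueError("spike-in reads are required for the correction")
    table = {}
    for r in results:
        coverage = r.covered_bases / r.ref_length
        if coverage >= min_coverage:
            table[r.species] = r.mapped_reads / spike_reads
    return table


# Same hypothetical sample sequenced twice; run B is twice as deep as run A.
run_a = [
    MappingResult("species_X", mapped_reads=1000, covered_bases=600, ref_length=658),
    # Few reads piled on a short stretch of the reference: likely contamination.
    MappingResult("species_Y", mapped_reads=30, covered_bases=50, ref_length=658),
]
run_b = [
    MappingResult("species_X", mapped_reads=2000, covered_bases=640, ref_length=658),
    MappingResult("species_Y", mapped_reads=60, covered_bases=55, ref_length=658),
]

table_a = spike_corrected_table(run_a, spike_reads=200)
table_b = spike_corrected_table(run_b, spike_reads=400)
```

In this toy case both runs give the same corrected value for species_X even though the raw read counts differ twofold, and species_Y is dropped in both because its reads cover only a small part of the reference.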

What is the biggest or most surprising innovation highlighted in this study? 

The individual elements of SPIKEPIPE were reasonably well known, but what we hadn’t anticipated was just how accurate the results would be once those elements were combined in a single pipeline. With mock samples, we found no false-positive species detections (when the percent-coverage threshold is applied) and recovered highly accurate estimates of intraspecific abundances (in terms of DNA mass). With resequenced environmental samples, we found high repeatability of abundance estimates across sample repeats, even though DNA extraction and Illumina library prep, sequencing, and base-calling all inject stochasticity into datafile sizes.

Also very gratifying was finding that SPIKEPIPE returned useful data even when mapping reads only to short DNA barcodes, as originally presaged by Xin et al. (2013). This means that we can make use of the existing vast DNA-barcode reference library.

Moving forward, what are the next steps in this area of research?

SPIKEPIPE is of course only the means to an end, and our next goal is the statistical analysis of community change in a rapidly heating ecosystem. Nerea Abrego and Otso Ovaskainen are now applying joint species distribution modelling (with the R package Hmsc, Tikhonov et al. 2019) to the dataset of 712 pitfall-trap samples. One important question is how much of the year-to-year variation in species abundances can be attributed to species interactions, as opposed to climate variables.

More broadly, the result that SPIKEPIPE can be used with DNA barcodes makes possible an intriguing strategy:   one may now generate both the species reference database and the sample-by-species table from the same set of samples. We are using Greenfield et al.’s (2019) Kelpie software to carry out targeted assembly of DNA barcodes from shotgun-sequenced bulk samples, which we compile into a single DNA-barcode reference database, against which we then map reads from each sample to generate the data table. 

What would your message be for students about to start developing or using novel techniques in Molecular Ecology? 

Build in a lot of testing:  multiple, complex mock samples for pipeline development, repeat environmental samples to measure repeatability, realistically complex positive controls, many negative controls, and many sanity checks as you work through your bioinformatic code. 

You are likely to be learning to code at the same time that you write your first pipelines. Take the extra time *now* to learn and apply robust coding techniques, even if there are easier but less robust methods available. 

Read Jenny Bryan’s tutorial on file naming:  https://speakerdeck.com/jennybc/how-to-name-files

What have you learned about methods and resources development over the course of this project? 

A great way to inspire new methods is to talk with non-molecular researchers about their scientific questions, currently used methods, and available sample types. Our team includes arctic ecologists, molecular ecologists, and a mathematician.

For one’s method to have impact, it will need to be useful for years after one first thinks of it. Stay up to date with technology trends, including costs, to avoid rapid obsolescence.

Describe the significance of this research for the general scientific community in one sentence.

We can use DNA sequencing to quantify how insect and spider communities respond to environmental change.

Describe the significance of this research for your scientific community in one sentence.

Mitochondrial metagenomics is a viable alternative to amplicon sequencing for characterising arthropod communities.