Reduced representation sequencing (e.g. RAD and GBS) is becoming ever more popular, but for species which lack a reference genome, little work has been done to assess which software may be best suited to building de novo assemblies from these data. Here, we speak to Melanie LaCava of the University of Wyoming about her recent Molecular Ecology Resources article, which explores the accuracy of de novo assemblies built by various software programs using DNA sequences generated from double-digest libraries. Melanie and her co-authors found that the accuracy of assemblies varied widely among six different software programs, and they discuss which programs are best suited to this application. They also highlight the importance of optimising parameter settings within any given software. Read on to get a behind-the-scenes view of this study.
What led to your interest in this topic / what was the motivation for this study?
This study began as a research project in a graduate-level course on computational biology at the University of Wyoming led by the senior author on the paper, Alex Buerkle. Dr. Buerkle initiated the project and worked with the rest of the coauthors to pursue this de novo assembly software comparison. As reduced representation genotyping-by-sequencing has become more popular, new and repurposed software programs have been applied to each step in the bioinformatics pipeline. When a reference genome is unavailable for a study species, de novo assembly is essential, yet we recognized a gap in the evaluation of software used for this important step.
What difficulties did you run into along the way?
Technology and software associated with genotyping-by-sequencing and de novo genome assembly are rapidly changing. During the course of our project, some of the software programs we tested were significantly updated, so we chose to rerun our analyses using the new software versions to ensure we were providing up-to-date information in our manuscript.
What is the biggest or most surprising finding from this study?
We were surprised to find such a substantial difference in performance among these assembly programs. We were especially surprised at the variation in performance among software for our first simulation where no mutations were introduced. In this scenario, we made many identical copies of genome fragments and then performed de novo assembly using each software program. Without any mutations introduced, the job is basically to generate a list of unique sequences – it should be very straightforward. In some cases, however, these genome fragments were broken into shorter sequences and rearranged beyond recognition, leading to incorrect reconstruction of the simple, unmutated data.
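To make the no-mutation scenario concrete, here is a minimal Python sketch of the idea. It is not the pipeline used in the study: the function names, fragment counts, and read lengths are all hypothetical, and the "assembler" is a naive baseline that simply collapses identical reads, standing in for whichever program is being evaluated. It illustrates why the expected output for unmutated data is just the list of unique input sequences, and how an assembly could be scored against that known truth.

```python
import random


def simulate_fragments(n_loci, length, seed=0):
    """Generate n_loci distinct random DNA fragments (hypothetical stand-ins
    for genome fragments from a double-digest library)."""
    rng = random.Random(seed)
    fragments = set()
    while len(fragments) < n_loci:
        fragments.add("".join(rng.choice("ACGT") for _ in range(length)))
    return sorted(fragments)


def make_reads(fragments, copies, seed=0):
    """Make many identical copies of each fragment (no mutations) and shuffle."""
    rng = random.Random(seed)
    reads = [frag for frag in fragments for _ in range(copies)]
    rng.shuffle(reads)
    return reads


def collapse_identical(reads):
    """Naive baseline 'assembly': the list of unique read sequences.
    This is an assumption for illustration, not any published assembler."""
    return sorted(set(reads))


def score_assembly(truth, contigs):
    """Compare assembled contigs with the known, simulated fragments."""
    truth_set, contig_set = set(truth), set(contigs)
    return {
        "recovered": len(truth_set & contig_set),
        "missing": len(truth_set - contig_set),
        "spurious": len(contig_set - truth_set),
    }


truth = simulate_fragments(n_loci=50, length=100)
reads = make_reads(truth, copies=20)
contigs = collapse_identical(reads)
result = score_assembly(truth, contigs)
print(result)
```

In a real comparison, the reads would be written to a file, passed to each assembly program, and the resulting contigs scored in the same way. Contigs that are shorter than, or rearranged relative to, the input fragments would show up as missing and spurious sequences.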
Moving forward, what are the next steps for this research?
For our study, we selected a sample of assemblers from peer-reviewed literature that use different assembly algorithms, are freely available, and have updated user resources available online. However, this was not a comprehensive evaluation of all software capable of de novo assembly. Therefore, the evaluation of other programs would be valuable. Additionally, as new software programs are introduced or existing programs are updated, continued efforts to evaluate de novo assembly performance are warranted.
What would your message be for students about to start their first research projects in this topic?
Reduced representation genotyping-by-sequencing is becoming less expensive and more accessible, making it a viable option for more research projects. While it is exciting to apply these emerging technologies and methods, it is important to recognize that approaches to filter and analyze these large datasets are still in development. Doing your background research to ensure you are applying the best available tools and using the most appropriate methods for your study is essential to doing good research in this field, as in any other.
What have you learned about science over the course of this project?
Doing this study has reaffirmed the importance of simulations to test how software works. Testing analyses on simulated data and altering parameters of the simulation or analysis can provide immense insight into how the software works and how variation in real data may affect software performance. Larger simulation projects like our study can provide information that many people can use, but I also find it incredibly helpful to run a simulated dataset through an analysis before analyzing my own data to ensure I understand what the software is doing. Taking advantage of simulated datasets available in vignettes for software is a great tool to get acquainted with the analyses you plan to do.
Describe the significance of this research for the general scientific community in one sentence.
Our study demonstrates the importance of ensuring that the software you use is really doing what you think it is doing, and shows that simulations can help evaluate software performance.
Describe the significance of this research for your scientific community in one sentence.
Researchers who need to perform de novo assembly of reduced representation genotyping-by-sequencing data can use our study as a guide to choosing software and to understanding the importance of different parameter settings for assembly.
LaCava, M. E., Aikens, E. O., Megna, L. C., Randolph, G., Hubbard, C., & Buerkle, C. A. (2019). Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.13108