Interview with the authors: Evaluation of model fit of inferred admixture proportions

Admixture models are widely used in population genetics, but they make several simplifying assumptions, which, if violated, could result in misleading estimates of individual ancestry proportions. In a recent paper published in Molecular Ecology Resources, Garcia-Erill and Albrechtsen introduce evalAdmix, a program for detecting poor fit of admixture models to empirical data. evalAdmix uses the correlations of the residual differences between true and predicted genotypes to detect poor fit; when the assumptions of the model are not violated, the residuals of a pair of individuals should be uncorrelated. In simulation studies and analyses of empirical datasets, evalAdmix was useful in identifying model violations due to gene flow from unsampled ghost populations, continuous variation, population bottlenecks, and an incorrect assumed number of ancestral populations. Read the full article here, and read below for an exclusive interview with lead author Genís Garcia-Erill.

Full text: Garcia-Erill G. and Albrechtsen A. Evaluation of model fit of inferred admixture proportions. Mol Ecol Resour. 2020;20:936–949. https://doi.org/10.1111/1755-0998.13171.

Admixture model and evaluation with our method applied to worldwide human genetic variation. A. Admixture proportions inferred with ADMIXTURE assuming K=5 for all human populations from the 1000 Genomes Project. B. Evaluation of the admixture model with the correlation of residuals, performed with evalAdmix. Positive correlations are indicative of a bad model fit. The correlation of residuals shows that modelling one ancestral population for each of the 5 major continental groups leads to a bad fit within most populations, and it also gives additional information. For example, populations that are more genetically distant from the rest of the group they are assigned to, like Luhya in Webuye, Kenya (LWK) or Finnish in Finland (FIN), have higher correlations of residuals, and the correlations also indicate the presence of substructure in some populations, like the Gujarati Indians in Houston, TX (GIH).
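As a rough illustration of why positively correlated residuals point to a bad fit, the Python sketch below simulates genotypes for two differentiated populations and compares residual correlations under a correct model and under a model that wrongly pools both populations (K=1). It is not the evalAdmix implementation: it uses known allele frequencies instead of estimated ones, so it skips the frequency bias correction described in the paper, and every name and parameter value in it is an assumption made for this example.

```python
import numpy as np


def residual_correlation(genotypes, predicted_freqs):
    """Correlation across sites of the residuals G - 2*pi for every pair of individuals.

    Rows are individuals, columns are sites. Illustrative helper, not part of evalAdmix.
    """
    residuals = genotypes - 2.0 * predicted_freqs
    return np.corrcoef(residuals)


rng = np.random.default_rng(1)
n_sites, n_per_pop = 5000, 10

# Hypothetical allele frequencies for two differentiated populations, A and B.
freq_a = rng.uniform(0.1, 0.9, n_sites)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, n_sites), 0.05, 0.95)

# True individual allele frequencies: first 10 individuals from A, last 10 from B.
true_freqs = np.vstack([
    np.tile(freq_a, (n_per_pop, 1)),
    np.tile(freq_b, (n_per_pop, 1)),
])
genotypes = rng.binomial(2, true_freqs)  # diploid genotypes coded 0, 1, 2

# Well-specified model: predicted frequencies equal the true ones.
corr_good = residual_correlation(genotypes, true_freqs)

# Misspecified model: one pooled population (K=1) for everyone.
pooled_freqs = np.tile((freq_a + freq_b) / 2.0, (2 * n_per_pop, 1))
corr_bad = residual_correlation(genotypes, pooled_freqs)

# Summaries: mean correlation for pairs within population A and between A and B.
upper = np.triu_indices(n_per_pop, k=1)
good_within = corr_good[:n_per_pop, :n_per_pop][upper].mean()
bad_within = corr_bad[:n_per_pop, :n_per_pop][upper].mean()
bad_between = corr_bad[:n_per_pop, n_per_pop:].mean()

print(f"correct model, within A:  {good_within:+.3f}")
print(f"pooled model, within A:   {bad_within:+.3f}")
print(f"pooled model, A versus B: {bad_between:+.3f}")
```

Under the correct model the within-population correlations should be close to zero, while the pooled model should leave positive correlations within each population and negative ones between them, because individuals from the same population share the same unmodelled frequency difference.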

What led to your interest in this topic / what was the motivation for this study?

The admixture model is one of the most used methods in population genetics, but it has been known for some time that there are many potential issues with it. Specifically, a recent study described very nicely different scenarios that can lead to wrong conclusions when applying the admixture model (Lawson et al. 2018). For example, they showed how multiple scenarios can lead to the same admixture results, and they also presented a method, badMixture, that can distinguish between those scenarios and evaluate model fit. However, badMixture is quite difficult to apply, so we thought it would be interesting to develop an alternative method that could help in guiding the interpretation of admixture model results.

What difficulties did you run into along the way?

My background is in Biology and I had limited experience in computer science and statistics when I started with this project, so most of the difficulties were related to learning how to work in these two disciplines. The method itself was relatively straightforward, but in order for it to work properly we needed to find a way to correct the bias caused by the frequency estimation. The frequency correction is only a small part of the main article, but it was where we put most of the work during the development of the method; that ended up as a few pages full of equations in the supplementary material. Another aspect that took considerable effort was the implementation, since, again, I did not have much experience in developing software that would (hopefully) be used by other people. That made me consider things I would not usually think about.

What is the biggest or most surprising innovation highlighted in this study?

I think the method itself is the main result of the study. As I said, there is already a method to evaluate the admixture model fit, badMixture. However, that method is rarely used, because it requires performing additional analyses with CHROMOPAINTER and also requires having data of good enough quality to at least call genotypes. The method we present is more generally accessible since it is based on information unique to the admixture model itself, meaning one can directly apply it to any data set to which the admixture model has been applied. So it provides what we think is a simple way, both in the application and in the interpretation, to evaluate the admixture model results.

Moving forward, what are the next steps in this area of research?

There are several directions in which this work could be expanded. Something we already spent some time on is trying to develop a firmer theoretical foundation for the correlation of residuals as a measure of model fit, for example expressing it in terms of individual-specific Fst and the distance between the populations from which the individuals are sampled, in a framework similar to that in Ochoa and Storey (2018). In the end we could not figure out the math and left it as a short mention in the discussion, but that would be something very nice to do. We also could not find a good way to use the residuals to develop some sort of measure of model fit at a purely individual level (instead of depending on the relationship between pairs of individuals, as it does right now), and that would also be very nice to do. Moreover, individual frequencies can also be calculated using principal component analysis, so this method could be expanded to work as an evaluation of a PCA as a description of population structure. Finally, what we are looking forward to the most is to see how the method is applied to different datasets and how that helps gain new scientific insights.

What would your message be for students about to start developing or using novel techniques in Molecular Ecology?

I am myself a student who has very recently started developing and using novel techniques in Molecular Ecology, so I am not sure if I have enough experience and perspective to give any useful advice. But based on my limited experience, I would say that it is important not to be afraid to jump into new areas or fields where we feel like we might have too limited experience, and that often what at first seems very difficult will become more and more accessible and doable as we work on it.

What have you learned about methods and resources development over the course of this project?

I started working on this study during my Master's studies, so it has been one of my first research experiences. Basically all I know about method development I learned during the course of this project, from the more practical skills related to developing and implementing a method to how to explain it and make it accessible to the community that might be interested in using it. I realized that this can actually be very important, since it will affect how many people end up using it. Also, as a user of bioinformatics methods, I really appreciate it when a new method is easy to use and does not create too many problems.

Describe the significance of this research for the general scientific community in one sentence.

It is important to consider the assumptions of the methods we use, since relevant violations of the assumptions might result in misleading or even meaningless results.

Describe the significance of this research for your scientific community in one sentence.

It makes it possible and easy to evaluate the model fit of the admixture model at the individual level in almost any context in which the admixture model is currently used, so it can be applied before concluding a population is a mixture of others, or it can help to choose a meaningful number of ancestral populations.

References

Lawson, D. J., Van Dorp, L., and Falush, D. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nature Communications 2018;9.1: 1-11.

Ochoa, A. and Storey, J. D. FST and kinship for arbitrary population structures I: Generalized definitions. bioRxiv 2016: 083915.