Abstract:
Machine learning is a rapidly growing scientific field with increasing impact on our lives; it has already revolutionized areas such as speech recognition, natural language processing, and image classification. It is also of great interest to population genetics, especially as next-generation sequencing methods provide ever-larger genomic data sets that challenge traditional model-based estimators. In addition, new simulation software allows efficient generation of training data. Although machine learning is already widely applied in population genetics, it also poses challenges, particularly with respect to robustness and interpretability. A promising strategy to address these challenges is to incorporate theoretical concepts and models from population genetics into machine learning methods. In this thesis, we present two such approaches and demonstrate their advantages. First, using a neural network that estimates the scaled mutation rate, we show how well-established model-based estimators can be integrated into the loss functions of supervised methods. Second, we incorporate key population genetic concepts, such as the fixation index and Hardy-Weinberg equilibrium, into an unsupervised hierarchical soft clustering method to infer population ancestry. These methods demonstrate that combining theoretical insights with data-driven learning not only makes it possible to process large data sets and to capture and exploit the underlying dynamics of the data, but also improves robustness and interpretability. With the ever-increasing availability of genetic data, such approaches have considerable potential to deepen our understanding of genetic variability and evolution.