Abstract:
Machine learning has become an essential tool for analyzing, predicting, and understanding biological properties and processes. Machine learning models can substantially support the work of biologists by reducing the number of expensive and time-consuming experiments. They are able to uncover novel properties of biological systems and can be used to guide experiments. Machine learning models have been successfully applied to tasks ranging from gene prediction to the prediction of three-dimensional protein structures. However, due to their lack of interpretability, many biologists place little trust in the predictions of computational models.
In this thesis, we show how to overcome the typical "black box" character of machine learning algorithms by presenting two novel interpretable approaches for classification and regression.
In the first part, we introduce YLoc, an interpretable classification approach for predicting the subcellular localization of proteins. YLoc is able to explain why a prediction was made by identifying the biological properties with the strongest influence on the prediction. We show that the interpretable predictions made by YLoc help to understand a protein's localization and, moreover, can assist biologists in engineering the subcellular location of proteins. Furthermore, YLoc returns confidence scores, enabling biologists to assess how much trust to place in individual predictions.
In the second part, we show how our two novel confidence estimators, CONFINE and CONFIVE, can improve the interpretability of MHC-I-peptide binding prediction. In contrast to the plain affinity values returned by standard regression models, CONFINE and CONFIVE estimate affinity intervals, which provide a natural interpretation of confidence: low-confidence predictions exhibit wide intervals, whereas reliable predictions yield a narrow range of affinities. We show that distinguishing reliable from unreliable predictions is important for discovering and engineering epitopes for vaccines.
The interpretable approaches presented in this thesis are a significant step towards making machine learning methods more transparent to users and, thus, towards improving the acceptance of computational methods.