Abstract:
Machine learning has enabled striking technological advances over the last decades and has the potential to transform many aspects of our lives. Its application is especially promising in the health domain, where it can improve our understanding of increasingly complex health data, accelerate processes such as diagnosis or risk assessment while also making them more objective, and enable a more personalized approach to medicine. At the same time, machine learning for health faces particular challenges. Health data is often temporal and heterogeneous, distributed across many institutions, and accessible only in modest amounts for a specific machine learning application. Consequently, machine learning for health requires both generally robust methods that can handle heterogeneous and limited data, and models that are well tailored to the task at hand. This thesis contributes to both of these aspects. It includes new methods for unsupervised domain adaptation, which were designed for high-dimensional molecular health data and improved prediction across heterogeneous datasets. As a concrete application example, these methods were applied to the problem of age prediction from DNA methylation data across tissues, where they improved age prediction on a tissue not used for model training compared to a non-adaptive reference model. In addition, this thesis includes robust models for the analysis of data from an early clinical trial evaluating the use of broadly neutralizing antibodies for the treatment of HIV. These models were able to account for heterogeneity between patient groups despite a limited sample size. Another application-specific contribution was the development of robust models for the time-dependent prediction of mortality and early cytomegalovirus reactivation after hematopoietic cell transplantation.
These models were validated in a prospective non-interventional clinical trial and performed similarly to experienced physicians in a pilot comparison. Finally, this thesis supported the development of the XplOit platform, a software platform that facilitates robust machine learning for health by semantically integrating heterogeneous datasets.