Computational Methods for Interpretable Analysis of Uncertain and Incomplete High-Dimensional Biological Data

Citable link (URI): http://hdl.handle.net/10900/172023
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1720235
Document type: Dissertation
Date of publication: 2025-11-10
Language: English
Faculty: 7 Mathematisch-Naturwissenschaftliche Fakultät
Department: Computer Science
Reviewer: Nieselt, Kay (Prof. Dr.)
Date of oral examination: 2025-10-17
DDC classification: 004 - Computer science
500 - Natural sciences
570 - Life sciences, biology
License: http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en

Abstract:

The rapidly growing availability of extensive "omics" data—including genomics, transcriptomics, proteomics, and metagenomics—opens new perspectives for systems-oriented biological research. Concurrently, characteristic properties of these data, such as high dimensionality, measurement noise, and incomplete datasets, pose significant analytical challenges as they can obscure biological signals and complicate the application of statistical methods. Although machine learning provides effective methods for modeling complex relationships, its efficacy is often limited by these data challenges. Issues like the "curse of dimensionality", problems with model interpretability, and susceptibility to systematic errors further exacerbate this situation. The reliable identification of biologically relevant patterns therefore ranks among the central challenges of modern bioinformatics. This dissertation addresses these challenges through the development and application of novel computational strategies based on the fundamental principle of targeted information refinement and reduction, aiming to enable robust biological conclusions from complex omics data. A particular focus lies on accounting for measurement uncertainties and missing values during the dimensionality reduction (DR) of high-dimensional data. Common DR methods, often primarily used for visualization, typically neglect uncertainties in the input data, even though biological measurements are frequently affected by technical noise or missing data. Acknowledging and propagating these uncertainties is therefore crucial for ensuring that conclusions drawn from low-dimensional representations, particularly visual interpretations of low-dimensional scatter plots, are robust and trustworthy. 
First, this work presents extensions to established DR methods: VIPurPCA and an uncertainty-aware t-SNE framework facilitate the error propagation of measurement uncertainties through Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), respectively. Both approaches rely on approximate Gaussian error propagation for computational efficiency; the necessary derivatives are computed using automatic differentiation. To handle the iterative nature of t-SNE, the implicit function theorem is additionally employed. Complementary intuitive visualizations, such as animated scatter plots, are developed to enable a direct assessment of the reliability of low-dimensional embeddings. Beyond measurement noise, the complete absence of data—missing values—presents a distinct and often more severe challenge for analysis. A domain where sparse data is particularly pronounced is ancient genomics. Ancient DNA (aDNA) samples often exhibit low quality and quantity, resulting in significant data gaps and thus incomplete genotype information. However, the impact of these missing data points on the stability and reliability of the resulting PCA projection is often overlooked or not formally quantified. To address this, TrustPCA, a specialized probabilistic method and web tool, is introduced. TrustPCA quantifies and visualizes the uncertainties in PCA projections that specifically arise from such missing genotypes. By providing confidence ellipses for these PCA projections, TrustPCA directly enhances the conclusiveness and trustworthiness of population genetic analyses conducted on sparse aDNA. Beyond individual data uncertainties, the overall structure and distribution of training data also significantly influence the performance of machine learning methods in general. 
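As an illustration of the approximate Gaussian error propagation mentioned above, the following minimal sketch propagates an input covariance through a PCA projection using the first-order rule Sigma_out ≈ J Sigma_in J^T, where J is the Jacobian of the embedding with respect to the input data. It is not the VIPurPCA implementation: the abstract states that derivatives are obtained by automatic differentiation, whereas this sketch swaps in central finite differences so that it only needs NumPy. All function names (`pca_embedding`, `numerical_jacobian`, `propagate_covariance`) and the toy data are assumptions made for illustration.

```python
import numpy as np


def pca_embedding(flat_x, n_samples, n_features, n_components=2):
    """Project centered data onto its leading principal axes (flattened output)."""
    X = flat_x.reshape(n_samples, n_features)
    Xc = X - X.mean(axis=0)
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, vecs = np.linalg.eigh(Xc.T @ Xc / (n_samples - 1))
    W = vecs[:, ::-1][:, :n_components]
    # Resolve the sign ambiguity of eigenvectors so the map is locally continuous:
    # the entry with the largest magnitude in each axis is made positive.
    idx = np.argmax(np.abs(W), axis=0)
    W = W * np.sign(W[idx, np.arange(n_components)])
    return (Xc @ W).ravel()


def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian (stand-in for automatic differentiation)."""
    y0 = f(x)
    J = np.empty((y0.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J


def propagate_covariance(f, mean, cov, eps=1e-6):
    """First-order Gaussian error propagation: J @ cov @ J.T evaluated at the mean."""
    J = numerical_jacobian(f, mean, eps)
    return J @ cov @ J.T


# Toy data (assumption): 6 samples, 3 features with clearly separated variances.
rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.3])

# Assumed measurement uncertainty: independent noise with standard deviation 0.1.
input_cov = 0.1 ** 2 * np.eye(n * d)

emb_cov = propagate_covariance(
    lambda v: pca_embedding(v, n, d), X.ravel(), input_cov
)
```

Because the embedding is flattened row by row, the 2x2 block `emb_cov[2*i:2*i+2, 2*i:2*i+2]` approximates the covariance of sample `i` in the projected plane and could be drawn as an uncertainty ellipse around that point. The approximation is only as good as the local linearization, so it should be read as a sketch for small input uncertainties.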
This work addresses a previously underappreciated problem in the taxonomic classification of DNA sequences: despite a balanced distribution of classes in the training data, imbalances can occur within the feature space—i.e., densely and sparsely covered regions—which particularly impair the performance of classification models. To address this issue, a method for feature space balancing is proposed. This involves a targeted, strategic subsampling of the training data with the aim of achieving a more uniform distribution in the feature space. This approach significantly improves the generalization capability and classification performance of simple, resource-efficient ML models—in specific use cases, even surpassing more complex deep learning approaches. Complementing such sequence classification, understanding the broader functional and evolutionary context of specific genes often requires dedicated analytical tools. Therefore, BLASTphylo is presented as an interactive web tool that automates the extensive pipeline from one of the most commonly used methods in bioinformatics, BLAST, to the refined visualization of taxonomic distributions and phylogenetic relationships of bacterial homologous genes. This enables more efficient, accessible, and insightful comparative analyses. In their entirety, the methods and tools developed herein demonstrate how the guiding principle of strategic information refinement and reduction can be effectively applied to overcome central bioinformatics challenges. By improving the trustworthiness, reproducibility, and interpretability of computational analyses, this dissertation contributes to more precise and insightful biological discoveries from diverse high-dimensional datasets.
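The idea of feature space balancing described above can be sketched as follows. The abstract does not specify how the subsampling is carried out, so this example assumes one simple strategy: partition a low-dimensional feature space into a regular grid and keep at most a fixed number of randomly chosen samples per cell, which thins densely covered regions while leaving sparse regions untouched. The names `cell_index` and `balance_feature_space`, the grid resolution, and the per-cell cap are hypothetical, and a regular grid is only practical for few dimensions.

```python
import numpy as np


def cell_index(X, n_bins):
    """Assign every sample to a cell of a regular grid over the data range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    cells = np.minimum(((X - lo) / span * n_bins).astype(int), n_bins - 1)
    # Collapse the per-dimension bin numbers into a single integer cell id.
    return np.ravel_multi_index(tuple(cells.T), (n_bins,) * X.shape[1])


def balance_feature_space(X, n_bins=5, max_per_cell=3, seed=0):
    """Return sorted indices of a subsample with at most max_per_cell samples per cell."""
    rng = np.random.default_rng(seed)
    ids = cell_index(X, n_bins)
    keep = []
    for c in np.unique(ids):
        members = np.flatnonzero(ids == c)
        if members.size > max_per_cell:
            members = rng.choice(members, size=max_per_cell, replace=False)
        keep.extend(members.tolist())
    return np.sort(np.array(keep))


# Toy data (assumption): one dense cluster plus a few uniformly scattered points.
rng = np.random.default_rng(1)
dense = rng.normal(loc=0.5, scale=0.05, size=(200, 2))
sparse = rng.uniform(0.0, 1.0, size=(20, 2))
X = np.vstack([dense, sparse])

keep = balance_feature_space(X, n_bins=5, max_per_cell=3)
X_balanced = X[keep]
```

In practice the same index array would also be applied to the class labels before training. Whether a grid, a clustering, or a density estimate is the better way to define "regions" depends on the feature representation, so the choice made here should be treated as a placeholder.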
