Abstract:
Mass spectrometry coupled to liquid chromatography (LC-MS) is an analytical technique that is becoming increasingly popular in biomedical research. It is widely used, especially in high-throughput proteomics and metabolomics, because it provides both qualitative and quantitative information about analytes. In the standard protocol, complex analyte mixtures are first separated by liquid chromatography and then analyzed by mass spectrometry. Finally, computational tools extract the relevant information from the large amounts of data produced. This thesis aims at improving the computational analysis of LC-MS data: we present two novel computational methods and a software framework for the development of LC-MS data analysis tools.
In the first part of this thesis we present a quantitation algorithm for peptide signals in isotope-resolved LC-MS data. Exact quantitation of all peptide signals (so-called peptide features) is an essential step in most LC-MS data analysis pipelines. Our algorithm detects and quantifies peptide features in centroided peak maps using a multi-phase approach. In the first phase, putative feature centroid peaks, so-called seeds, are determined based on signal properties that are typical for peptide features. In the second phase, the seeds are extended to feature regions, which are compared to a theoretical feature model in the third phase. Features that show a high correlation between the measured data and the theoretical model are added to a feature candidate list. In the final phase, contradicting feature candidates are detected and the contradictions are resolved. In a comparative study, we show that our algorithm outperforms several state-of-the-art algorithms, especially on complex datasets with many overlapping peaks.
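The four phases can be illustrated with a deliberately simplified Python sketch. It is not the algorithm of this thesis: the function name `find_features`, all thresholds, the fixed isotope intensity vector passed as `model`, and the assumption that every seed is the monoisotopic peak of its feature are hypothetical choices made only to show how seeding, extension, model comparison, and conflict resolution fit together.

```python
from math import sqrt

# Approximate m/z spacing between isotopic peaks of a singly charged ion (Da).
ISOTOPE_SPACING = 1.00335


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def find_features(peaks, model, charge=1, min_seed=100.0,
                  rt_win=5.0, mz_tol=0.02, min_corr=0.9):
    """peaks: list of (rt, mz, intensity); model: relative isotope intensities.

    All parameter names and default values are illustrative assumptions.
    """
    # Phase 1: seeding. Intense peaks are tried first.
    seeds = sorted((i for i, p in enumerate(peaks) if p[2] >= min_seed),
                   key=lambda i: -peaks[i][2])

    candidates = []
    for s in seeds:
        rt0, mz0, _ = peaks[s]

        # Phase 2: extension. Collect the peaks of each expected isotope
        # trace within a retention time window around the seed.
        traces = [[] for _ in model]
        for i, (rt, mz, _) in enumerate(peaks):
            if abs(rt - rt0) > rt_win:
                continue
            for k in range(len(model)):
                expected_mz = mz0 + k * ISOTOPE_SPACING / charge
                if abs(mz - expected_mz) <= mz_tol:
                    traces[k].append(i)

        # Phase 3: compare the summed trace intensities with the model.
        observed = [sum(peaks[i][2] for i in trace) for trace in traces]
        corr = pearson(observed, model)
        if corr >= min_corr:
            members = frozenset(i for trace in traces for i in trace)
            candidates.append({"rt": rt0, "mz": mz0, "corr": corr,
                               "intensity": sum(observed),
                               "members": members})

    # Phase 4: conflict resolution. Candidates that share peaks contradict
    # each other; greedily keep the one with the better model fit.
    accepted, used = [], set()
    for cand in sorted(candidates, key=lambda c: -c["corr"]):
        if used.isdisjoint(cand["members"]):
            accepted.append(cand)
            used |= cand["members"]
    return accepted
```

In this sketch, a seed placed on the second isotopic peak of a feature can still pass the correlation threshold. The resulting candidate shares peaks with the better-fitting candidate seeded on the monoisotopic peak and is discarded in the last phase.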
The second part of this thesis introduces a novel machine learning approach for modeling chromatographic retention of DNA in ion-pair reverse-phase liquid chromatography. The retention time of DNA is of interest for many biological applications, e.g., for quality control of DNA synthesis and DNA amplification. Most existing models use only the base composition to model chromatographic retention of DNA. Our model complements the base composition with secondary structure information to improve the prediction performance. A second difference from previous models is the use of a support vector regression model instead of simple linear or logarithmic models. In a thorough evaluation, we show that these changes significantly improve the prediction performance, especially at temperatures below 60°C. As a by-product, our approach allows the creation of a temperature-independent model, which can predict DNA retention times not only for a fixed temperature, but for all temperatures within the temperature range of the training data.
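As a rough illustration of the kind of input such a model could use, the following Python sketch encodes one oligonucleotide as a numeric feature vector. The specific descriptors are assumptions made for this example and are not taken from the thesis: the secondary structure is reduced to the fraction of paired bases in a dot-bracket string, and the temperature is appended as an additional input. The function name `encode_oligo` is hypothetical. The regression step itself is omitted; a vector like this would be passed to any support vector regression implementation.

```python
def encode_oligo(sequence, dot_bracket, temperature_c):
    """Encode a DNA oligonucleotide as a feature vector (illustrative only).

    sequence      -- DNA sequence over the alphabet A, C, G, T
    dot_bracket   -- predicted secondary structure, one character per base:
                     '(' or ')' for a paired base, '.' for an unpaired base
    temperature_c -- column temperature in degrees Celsius
    """
    if not sequence or len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must be non-empty and equally long")

    seq = sequence.upper()

    # Base composition: number of A, C, G and T in the sequence.
    composition = [seq.count(base) for base in "ACGT"]

    # Assumed secondary structure descriptor: fraction of paired bases.
    paired_fraction = sum(ch in "()" for ch in dot_bracket) / len(seq)

    return composition + [paired_fraction, temperature_c]
```

For example, `encode_oligo("GCGCAAAAGCGC", "((((....))))", 50.0)` describes a hairpin with four base pairs and a four-base loop at 50°C.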
Finally, we present OpenMS - a framework for computational mass spectrometry. OpenMS provides data structures and algorithms for the rapid development of mass spectrometry data analysis software. Rapid software prototyping is especially important in this area of research because both instrumentation and experimental procedures are evolving quickly, so new analysis tools have to be developed frequently. OpenMS facilitates software development for mass spectrometry by providing rich functionality that ranges from support for many file formats, through customizable data structures and data visualization, to sophisticated algorithms for all major data analysis steps. The peptide feature quantitation algorithm presented in the first part of this thesis is one of many algorithms provided by OpenMS.
We demonstrate the benefits of using OpenMS through the development of TOPP - The OpenMS Proteomics Pipeline. TOPP is a collection of command line tools, each of which performs one atomic data analysis step, typically one of the OpenMS data analysis algorithms. The individual TOPP tools are used as building blocks for customized analysis pipelines. This flexibility, together with a graphical user interface for the visual creation of analysis pipelines, makes TOPP a versatile instrument for LC-MS data analysis.