Abstract:
Machine learning has evolved continuously since its first appearance, driving progress in science and technology and increasingly attracting interdisciplinary attention beyond traditional computer science. A number of studies have applied machine learning to bioinformatics, obtaining valuable findings and validating the feasibility of using machine learning algorithms to solve biological problems. The difficulties in applying machine learning techniques to biological problems arise from the differences in data and tasks between computer science and biology. Training a sophisticated learning-based model is a systematic task that involves a series of steps. These steps can be broadly categorized into data and task aspects, each influencing the overall performance of the model. Appropriately adjusting the model to the specific task during training is essential for adapting machine learning algorithms to domain-specific challenges. This motivates us to develop advanced methods tailored to the biological field, so that machine learning can be applied more successfully to problems that were originally solved using traditional computational methods in bioinformatics. This thesis presents our studies addressing biological problems across three topics: epigenomics, metagenomics, and infectious diseases. For the problems in each area, we developed advanced machine learning frameworks that take the corresponding biological characteristics into account, thereby improving the performance of machine learning algorithms. First, we address DNA methylation status identification within epigenomics by developing two new frameworks inspired by natural language processing techniques. These frameworks detect DNA methylation sites and provide biological insights through model interpretation.
Second, in metagenomics, we present a study that predicts the source of microbiome samples among ten different origins by training a sophisticated ensemble model on taxonomic and functional profiles generated by whole-genome shotgun metagenomic sequencing. Finally, we introduce a study on an infectious disease problem in the context of the COVID-19 pandemic, focusing on predicting patient mortality and exploring factors associated with disease severity. This study comprises multiple models trained on different types of datasets. Together, these models demonstrated the feasibility of predicting patients' status from their diverse features and provided valuable insights during the early stages of a new infectious disease. In summary, this cumulative thesis brings together studies on multiple biological problems, advancing current machine learning algorithms toward more practical applications in bioinformatics.