Abstract:
The overall aim of biomedical research is to understand disease mechanisms and, ultimately, to provide drugs that cure the disease. This challenging endeavour requires an early research phase in which target genes or proteins that play an important role in the disease are identified. At this stage, animal models that mimic human disease are used to determine differences between healthy and diseased animals. Once potential drug targets have been found, compounds are screened, and promising compounds enter the preclinical phase, where their efficacy and, most importantly, their safety are assessed. Compounds that prove efficacious and safe proceed to toxicology, where the maximum tolerable dose is assessed, mainly in non-rodent species. According to the Bundesministerium für Ernährung und Landwirtschaft (the German Federal Ministry of Food and Agriculture), more than 2 million animals were used for animal testing in German laboratories in 2017. The majority of these animals were mice and rats, but dogs, cats, and monkeys are also used as model organisms. While it is commonly accepted that other mammalian species resemble human biology to a great extent, one has to bear in mind that there are species-specific differences. One aim of this thesis was to investigate how similar widely used model species are to humans and to each other at the molecular level. For this purpose, we assessed the relationship between protein sequence identity and gene expression correlation, with an emphasis on mouse and rat. We found that the majority of genes are highly similar at both the sequence and the gene expression level. There were, however, cases with low sequence identity but high expression correlation. Investigating these cases in greater detail confirmed the hypothesis that sequences annotated in widely used databases such as Ensembl, UniProt, or RefSeq may contain errors or be incomplete.
Therefore, we investigated whether sequence information from related species can be used to derive a target’s sequence in a species with poor annotation. The a&o-tool was developed to exploit sequence similarity between related species, together with short-read RNA-Seq data, to refine or validate target sequences. Because long-read RNA-Seq data, in which entire transcripts are sequenced as a whole, would greatly improve the results, we conducted a pilot study comparing short- and long-read sequencing data. Even though PacBio’s SMRT sequencing technology still shows some issues with respect to data quality, it is a very promising approach that is likely to prove valuable for sequence refinement. Another important goal of this thesis was to develop a score that assesses the conservation of a human target across several model species. Publicly available data on the homology relationships between genes, together with RNA-Seq data, form the basis for this score. Using a set of presumably highly conserved genes in human and mouse, we found that the proposed score yields reasonable results. An enrichment of Gene Ontology terms further strengthened our confidence in the conservation score.