Transcript identification from deep sequencing data

Behr, Jonas


dc.contributor.advisor	Rätsch, Gunnar (Prof. Dr.)
dc.contributor.author	Behr, Jonas
dc.date.accessioned	2014-05-19T09:17:53Z
dc.date.available	2014-05-19T09:17:53Z
dc.date.issued	2014-04-22
dc.identifier.other	406520526	de_DE
dc.identifier.uri	http://hdl.handle.net/10900/52948
dc.identifier.uri	http://nbn-resolving.de/urn:nbn:de:bsz:21-dspace-529483	de_DE
dc.identifier.uri	http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-529486	de_DE
dc.description.abstract	Ribonucleic acid (RNA) sequences are polymeric molecules ubiquitous in every living cell. RNA molecules mediate the flow of information from the DNA sequence to most functional elements in the cell. Therefore, it is of great interest in biological and biomedical research to associate RNA molecules with a biological function and to understand the mechanisms of their regulation. The goal of this study is the characterization of the RNA sequence composition of biological samples (the transcriptome) to facilitate the understanding of RNA function and regulation. Traditionally, a similar task has been addressed by algorithms called gene finding systems, which predict RNA sequences (transcripts) from features of the genomic DNA sequence. Lacking sufficient experimental evidence for most of the genes, these systems learn sequence patterns on a few genes with direct evidence to identify many additional genes in the genome. High-throughput sequencing of RNA (RNA-Seq) has recently become a powerful technology for studying the transcriptome. This technology identifies millions of short RNA fragments (reads of ≈100 letters in length), holding direct evidence for a large fraction of the genes. However, the analysis of RNA-Seq data faces profound challenges. Firstly, the distribution of RNA-Seq reads is highly uneven among genes, resulting in a considerable fraction of genes with very few reads, and the stochastic nature of the technology leads to gaps even for well-covered genes. To accurately predict transcripts in cases with incomplete evidence, we need to combine RNA-Seq evidence with features derived from the genomic DNA sequence. We therefore developed a method to learn the integration of both information sources and implemented this strategy as an extension of the gene finder mGene. 
The system, now called mGene.ngs, determines close approximations of potentially non-linear transformations for all features on the training set, such that the prediction performance is maximized. With this ability, which is to our knowledge unique among gene finding systems, mGene.ngs can not only learn complex relationships between the two mentioned information sources, but also gains the flexibility to take many additional information sources into account. mGene.ngs has been independently evaluated within the context of an international competition (RGASP) for RNA-Seq-based reannotation and has shown very favourable performance for two out of three model organisms. Moreover, we generated and analyzed RNA-Seq-based annotations for 20 Arabidopsis thaliana strains to facilitate a deeper understanding of phenotypic variation in this natural plant population. A second major challenge in transcriptome reconstruction lies in the complexity of the transcriptome itself. A process called alternative splicing generates multiple mature RNA sequences from a single primary RNA sequence by cutting out so-called introns, typically in a tightly regulated manner. The inference algorithms of almost all gene finding systems are limited to predicting transcripts that do not overlap in their genomic region of origin. To overcome this limitation, purely RNA-Seq-based approaches have been developed. However, biologically implausible assumptions or the neglect of available information have often led to unsatisfactory results. A major contribution of this study is the integer optimization-based transcriptome reconstruction approach MiTie. MiTie utilizes a biologically motivated loss function, can take advantage of a priori known genome annotations, and gains predictive power by considering multiple RNA-Seq samples simultaneously. 
Based on simulated data for the human genome as well as on an extensive RNA-Seq data set for the model organism Drosophila melanogaster, we show that MiTie predicts transcripts significantly more accurately than state-of-the-art methods like Cufflinks and Trinity.	en
dc.language.iso	en	de_DE
dc.publisher	Universität Tübingen	de_DE
dc.rights	ubt-podok	de_DE
dc.rights.uri	http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=de	de_DE
dc.rights.uri	http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=en	en
dc.subject.classification	Bioinformatik , Genanalyse , Maschinelles Lernen	de_DE
dc.subject.ddc	004	de_DE
dc.subject.ddc	500	de_DE
dc.subject.ddc	570	de_DE
dc.subject.other	Genefinding	en
dc.subject.other	Transcript identification	en
dc.title	Transcript identification from deep sequencing data	en
dc.type	PhDThesis	de_DE
dcterms.dateAccepted	2014-04-22
utue.publikation.fachbereich	Informatik	de_DE
utue.publikation.fakultaet	7 Mathematisch-Naturwissenschaftliche Fakultät	de_DE

Files:	Dissertation_Jonas_Behr.pdf 6.72 MB PDF
