Abstract:
Gene regulation plays a pivotal role at all stages of organism development, in cell differentiation, and in maintaining homeostasis. Controlled spatial and temporal gene expression is achieved by means of complex and robust regulatory networks. A key event in maintaining such networks is sequence-specific protein-DNA recognition, which enables transcription factors to identify their respective binding sites.
Computational and structural biologists face intriguing challenges at three different levels when investigating gene regulation. First, the involvement of gene regulation in disease can be addressed by studying the global effects of gene regulatory networks, which are visible at the level of systems. Second, the prediction of transcription factor binding sites (TFBSs) and the delineation of functional regulatory modules are conducted at the level of sequences, where detecting the often short and variable TFBSs in genomic DNA is not a trivial task. Finally, understanding the factors governing transcription factor-DNA recognition requires information collected at the molecular level, where structure-based methods provide detailed information about protein-DNA interactions at atomic resolution.
This work presents a versatile approach for the computational analysis of the different levels of gene regulation, gradually zooming in from the global level of systems to the molecular level. Linking information related to gene regulation from the different levels can help clarify phenomena that are hard to explain using only one source of information. First, the influence of gene regulation is analyzed at the level of systems. A set of cancer-related target genes is identified using a novel integrative analysis pipeline. Microarray data, immunological data, and curated biological knowledge are brought together, enabling extensive analysis of the underlying mechanisms controlling gene expression in cancer tissue. The transcription factor AP2 is suggested to play a key regulatory role in controlling a set of over-expressed melanoma-related genes. The computational results presented are supported by previously reported experimental evidence.
Zooming in to the level of sequences, transcription factors orchestrating the expression of functionally related genes are identified in yeast and in plants, two important model systems for studying gene regulation. The pattern-finding algorithm Gibbs sampling is employed for discovering putative functional TFBSs in functionally related genes. The response element ACGCGT is found to be over-represented in DNA-repair genes in yeast, which supports the idea that the transcription factor MBP1 is involved in blocking replication of damaged DNA. The vital regulation of stem cells is explored in plants, providing preliminary computational evidence for TFBSs critical to stem cell differentiation.
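To make the notion of over-representation concrete, the following is a minimal sketch of how enrichment of a fixed hexamer such as ACGCGT in a gene set could be checked against a background set. It is not the Gibbs sampling procedure used in this work, which discovers motifs without knowing them in advance; it only tests a motif that is already known. The function names, the toy sequences, and the choice of a hypergeometric test are assumptions made for illustration.

```python
from math import comb


def count_sequences_with_motif(sequences, motif="ACGCGT"):
    """Count sequences that contain the motif at least once.

    ACGCGT is its own reverse complement, so scanning one strand
    is sufficient for this particular motif.
    """
    motif = motif.upper()
    return sum(1 for seq in sequences if motif in seq.upper())


def hypergeom_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: total number of background sequences
    K: background sequences containing the motif
    n: size of the gene set of interest (a subset of the background)
    k: sequences in the gene set containing the motif
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical toy promoter sequences, for illustration only.
gene_set = ["TTACGCGTAA", "GGACGCGTCC", "AAAAAAAAAA"]
other_promoters = ["AAAAAAAAAA"] * 7
background = gene_set + other_promoters

k = count_sequences_with_motif(gene_set)
K = count_sequences_with_motif(background)
p_value = hypergeom_tail(k, len(gene_set), K, len(background))
print(f"{k}/{len(gene_set)} gene-set promoters contain the motif, p = {p_value:.4f}")
```

A small p-value indicates that the motif occurs in the gene set more often than expected from the background; with real promoter sets, correction for multiple testing would be needed if many candidate motifs are examined.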
The final transition is the step from analyzing gene regulation at the levels of systems and sequences to studying protein-DNA interactions in atomic detail. Structural data provide an additional source for gaining insight into the thermodynamic properties of sequence-specific binding, which eventually directs gene regulation. A computational protocol for analyzing the effects that small base modifications have on the overall binding free energy is described. The computationally obtained results for mutating thymine to uracil in transcription factor-DNA complexes agree well with previously reported experimental results, illustrating the applicability of the protocol. This is a first step towards using molecular modeling for constructing structure-based models of TFBSs.
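The effect of a base modification on binding is commonly expressed as a relative binding free energy. The relations below are the standard thermodynamic definitions and are given only as a sketch; the symbols (wt for the unmodified complex, mod for the modified one, such as thymine replaced by uracil) are notation assumed here, and the specific terms evaluated by the protocol are not restated in this abstract.

```latex
\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - G_{\mathrm{protein}} - G_{\mathrm{DNA}}

\Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bind}}^{\mathrm{mod}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
```

Under this sign convention, a positive value of the relative binding free energy means that the modification weakens binding, and a negative value means that it strengthens binding.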
Each individual level of this step-wise analysis provides crucial information needed to gain insight into the different aspects underlying complex regulatory control mechanisms. Analysis at the level of systems and networks is crucial for understanding the global effects of gene regulation and its implications in disease, and for identifying sets of target genes. Sequence-based methods are used for discovering the functional binding sites, responsible for directing gene expression, in the regulatory regions of such sets of related genes. Finally, structural analysis can explain ambiguities observed in sequence-based models, but it can be applied only to a limited number of protein-DNA complexes because of its high computational requirements. An improved understanding of all aspects of gene regulation is indispensable for identifying key factors influencing organism development and disease.