The Sequence Space of Natural Proteins



Document type: Dissertation
Date: 2020-06-19
Language: English
Faculty: 7 Mathematisch-Naturwissenschaftliche Fakultät
Department: Informatik
Advisor: Lupas, Andrei (Prof. Dr.)
Day of Oral Examination: 2020-05-12
DDC Classification: 004 - Data processing and computer science
570 - Life sciences; biology
Keywords: Proteins, Evolution, Homology, Convergence
Other Keywords: random sequence
protein sequence
License: Publishing license excluding print on demand


Proteins carry out the majority of functions at the molecular level in all organisms. They are composed of amino acid sequences that, upon folding, assume specific structures, which are essential for their function. In contrast, the great majority of randomly generated amino acid sequences fail to fold into a defined structure and are not functional. To better understand functional proteins, this thesis aims to determine general features of natural protein sequences by contrasting them with random sequence models. Three different approaches are applied.

The first approach focuses on sequence features that are shared among all proteins, giving a global view of natural proteins. To this end, the pairwise similarity between sequence fragments derived from a large data set of bacterial proteomes is analyzed. These similarities are interpreted as distances that indicate how sequences are distributed over the space of all possible sequences. The results show that the great majority of distances between natural sequences coincide with those between random sequences of the same amino acid composition. The global occupation of sequence space by natural proteins is thus almost random, an observation that contrasts with the widespread concept of sequences organized into dense clusters defined by common descent. In fact, most related sequences share only the similarity expected from the random sequence model. They are thus no more similar than random sequences, which results in their wide distribution across sequence space. Most distances between natural sequences that remain unaccounted for by the random sequence model can be associated with the different use of amino acids in individual proteins. Only a few distances are found to be affected by common sequence motifs in non-related proteins.
With this, the amino acid composition of individual proteins is demonstrated to be the most distinctive feature that characterizes natural protein sequences globally. Furthermore, common descent and divergent evolution are demonstrated to have no impact on the global occupation of sequence space, while convergent evolution is responsible for specific sequence motifs that are common in natural proteins.

The second approach analyzes the range of sequence similarities that is associated with common descent. Whereas the first approach studies the global occupation of sequence space, this one addresses the local occupation. To this end, sequences in close proximity to individual query sequences are studied. With increasing distance from the query, the likelihood of common descent decreases, becoming uncertain in a range that has been termed the ‘twilight zone’. Previous studies validated common descent by structural similarity in order to estimate the boundaries of the twilight zone. The approach applied in this thesis determines these boundaries from the statistical significance of sequence similarity, thereby refining the definition of the twilight zone.

In the third approach, the characteristic amino acid composition of individual proteins is studied further at a local level. Given that proteins are generally composed of distinct structural and functional parts, their amino acid composition was expected to fluctuate accordingly along the sequence. However, the results of a random model based on the amino acid composition of domain-sized fragments are comparable to those of the model based on the composition of whole proteins. Contrary to the initial expectation, this finding suggests a homogeneous amino acid composition along individual protein sequences. Several possible reasons for this homogeneity are considered, such as fold-specific recombination, topology, and genomic context, but none could be associated with this finding.
Analysis of the codon composition of protein domains shows that this homogeneity of amino acids is correlated with a homogeneous usage of codons. This suggests that amino acid composition may be modulated by codon bias, an effect that has been associated with expression level and translation efficiency in other studies. With this approach, structural constraints on amino acid composition could be contrasted with the constraints that cause codon bias, two features of proteins that have each been analyzed extensively before and are studied jointly here.
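
The comparison between natural sequences and random sequences of the same amino acid composition can be illustrated with a minimal sketch. The abstract does not state which similarity measure or random model implementation the thesis uses, so everything here is an assumption made for illustration: the random model is a simple shuffle of each sequence, the distance is the number of mismatching positions between equal-length fragments, and the function names and the toy sequences are hypothetical.

```python
import random
from itertools import combinations


def shuffle_sequence(seq, rng):
    """Return a random sequence with exactly the same amino acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def fragments(seq, length):
    """Split a sequence into non-overlapping fragments of a fixed length."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]


def mismatch_distance(a, b):
    """Count mismatching positions between two equal-length fragments.

    Assumption: a stand-in for the similarity-derived distance in the thesis.
    """
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(frags):
    """Distances between all unordered pairs of fragments."""
    return [mismatch_distance(a, b) for a, b in combinations(frags, 2)]


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    # Hypothetical toy sequences, not taken from any real proteome.
    natural = [
        "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
        "MSLNELTLAKGAEVDFHHHAVLNAKGKKVAMLE",
    ]
    frag_len = 10
    natural_frags = [f for s in natural for f in fragments(s, frag_len)]
    random_frags = [
        f for s in natural for f in fragments(shuffle_sequence(s, rng), frag_len)
    ]
    print("mean distance, natural:", mean(pairwise_distances(natural_frags)))
    print("mean distance, shuffled:", mean(pairwise_distances(random_frags)))
```

On a real data set, the two distance distributions would be compared as a whole rather than by their means; the shuffle keeps each protein's composition fixed, which is what lets composition be separated from other sources of similarity.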
