Abstract:
Developing frameworks that allow grammars for natural languages to be stated in a mathematically precise way is a core task of computational linguistics. The same holds for parsing, that is, the development of techniques for finding the syntactic structure of a sentence given a grammar. The focus of this thesis lies on data-driven parsing. In this area, one uses probabilistic grammars that are extracted from manually analyzed sentences in a treebank. The probability model can be used for disambiguation, i.e., for finding the best analysis of a sentence.
In recent decades, enormous progress has been made in data-driven parsing. Many current parsers are nevertheless still limited in an important respect: they cannot handle discontinuous structures, a phenomenon that occurs especially frequently in languages with free word order. This is because those parsers are based on Probabilistic Context-Free Grammar (PCFG), a framework that cannot model discontinuities.
In this thesis, I propose the use of Probabilistic Simple Range Concatenation Grammar (PSRCG), a natural extension of PCFG, for data-driven parsing. In doing so, I bring together developments from different areas, namely research on parsing German, on the quantification of discontinuity in treebanks, and on formalisms that can model discontinuous structures. The thesis does not treat theoretical aspects only. For the first time, all techniques for direct data-driven parsing of discontinuities have been implemented and tested in a real-world data-driven parsing setting. The parser output quality and the parsing speed are encouraging and support the central point of this work: exploring the landscape of formal grammars beyond Context-Free Grammar with regard to data-driven parsing is worth the effort and opens the way for many new developments, both in parsing and beyond.