Incremental and Surface-compositional Parsing of Coordination Ellipses

Kapfer, Jörg

Inkrementelles und oberflächenkompositionales Parsen von Koordinationsellipsen

Files

1821_JoergKapferDissertation.pdf (3.22 MB)

Language

de

Document Type

Doctoral Thesis

Issue Date

2011-07-28

Issue Year

2011

Authors

Kapfer, Jörg

Abstract

The term coordination ellipsis is used here for want of a better alternative, although it runs counter to the aim of this thesis: to present an alternative to non-surface-compositional approaches, which postulate omitted elements in the elliptical conjunct and try to reconstruct them so that all conjuncts can be assigned the same structure. Incremental parsing is crucial for a surface-compositional analysis of coordination ellipses because the structural representation of the first conjunct should be complete by the time the second conjunct is analysed. The elliptical conjunct then does not have to be a constituent itself. Rather, a conjunct can consist of multiple constituents that need not be connected to one another. It suffices to connect each constituent to its equivalent constituent in the first conjunct, where equivalence is defined in terms of substitution tests. In other words, the conjuncts need not be connected by a single coordination relation; instead, each pair of equivalent constituents (called primary coordinates) in the respective conjuncts can be connected by its own coordination relation. Linking two conjuncts by more than one coordination relation is only possible if the structural description provides sufficient detail to locate the respective primary coordinate in the analysis result of the first conjunct. An example of an unsuitable structural description is one in which the constituents are recursively embedded, because in languages with a rich morphology and a gender system, such as German, primary coordinates have to be accessible without knowing the category of the constituent that contains them. Moreover, in German it is possible to leave out multiple consecutive words that do not belong to the same constituent. For example, if a noun phrase (NP) contains an adjective, the final noun of the NP can be left out together with the verb following that NP.
A parser (JSLIM) has been implemented for a suitable grammar formalism (SLIM; Hausser 2006) with the author's participation. The syntactic-semantic structure is represented not by recursive embedding but by a set of interconnected non-recursive feature structures, called proplets. Each proplet corresponds to a content word. The relations to other proplets are coded as attributes (symbolic links) of the word concerned. When a SLIM rule is executed, the attributes of the latest word received by the hearer or reader are read, and from them conditions are derived for matching at most two of the proplets in the structural description built so far (called the sentence start). If these conditions are not met, the new word's proplet cannot be integrated. All proplets in the sentence start are accessible to the matching procedure. The rules developed in this thesis make it possible to formulate conditions covering more than two proplets in the sentence start if necessary. The matching between rules and proplets is organized so that the conditions for proplets of words at a greater distance from the new word are constrained by the attributes of proplets of words closer to it. One motivation for investigating coordination ellipses was to provide demanding test cases for the accessibility of proplets in the sentence start. The JSLIM grammar for coordination ellipses in German presented here contributes to improving both the formalism and the parser by showing not only which proplets should be directly accessible but also how the matching of unwanted proplets can be avoided. The latter not only speeds up processing but also helps to avoid errors in grammar development that arise when overly general conditions cause the wrong proplets to match.
In addition, it is shown how intermediate results, grammar rules, and other constraints have to be designed so that pairs of primary coordinates can be recognized even if they are far apart, while the construction of invalid edges to distant nodes is prevented.
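
As a rough illustration of the idea that each remnant of an elliptical conjunct is linked to its own primary coordinate in a flat set of proplets, the following minimal Python sketch may help. It is not JSLIM code and does not reproduce the SLIM rule syntax: the class `Proplet`, the attribute names `core`, `pos`, `case`, `prev_coord` and `next_coord`, and the function `attach_remnant` are all hypothetical, and equivalence is approximated by equal part of speech and case rather than by real substitution tests.

```python
from dataclasses import dataclass


@dataclass
class Proplet:
    """Flat feature structure for one content word (no nested constituents).

    All attribute names are illustrative assumptions, not JSLIM's own.
    """
    core: str             # lemma of the content word
    pos: str              # coarse part of speech, e.g. "noun" or "verb"
    case: str = ""        # morphological case (nouns only)
    prev_coord: str = ""  # symbolic link to the coordinate in the earlier conjunct
    next_coord: str = ""  # symbolic link to the coordinate in the later conjunct


def attach_remnant(sentence_start, new):
    """Link `new` to the closest equivalent proplet not yet coordinated onward.

    Equivalence is approximated by equal part of speech and case, a crude
    stand-in for the substitution tests mentioned in the abstract.
    Returns True if a partner was found and `new` was added.
    """
    for candidate in reversed(sentence_start):
        if (candidate.pos == new.pos and candidate.case == new.case
                and not candidate.next_coord):
            candidate.next_coord = new.core
            new.prev_coord = candidate.core
            sentence_start.append(new)
            return True
    return False


# "Der Mann liest ein Buch und die Frau eine Zeitung."
mann = Proplet("Mann", "noun", "nom")
lesen = Proplet("lesen", "verb")
buch = Proplet("Buch", "noun", "acc")
sentence_start = [mann, lesen, buch]

# The second conjunct contributes two remnants and no verb.
frau = Proplet("Frau", "noun", "nom")
zeitung = Proplet("Zeitung", "noun", "acc")
for remnant in (frau, zeitung):
    attach_remnant(sentence_start, remnant)

for proplet in sentence_start:
    print(proplet.core, "<-", proplet.prev_coord or "-")
```

In this sketch the two remnants are never combined into a constituent of their own; each carries a separate coordination link into the first conjunct, and the lookup scans a flat list instead of descending into embedded phrases.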

Abstract

Die Bezeichnung Koordinationsellipse wird in Ermangelung einer besseren Alternative verwendet, auch wenn sie der Zielsetzung widerspricht, eine Alternative zu Ansätzen aufzuzeigen, die im elliptischen Konjunkt weggelassene Elemente postulieren und diese soweit zu rekonstruieren versuchen, dass für sämtliche Konjunkte die gleiche Struktur angenommen werden kann. Das inkrementelle Parsen ist insofern für eine oberflächenkompositionale Analyse von Koordinationsellipsen unerlässlich, als die Strukturbeschreibung des ersten Konjunkts bereits vollständig vorliegen sollte, sobald das zweite analysiert wird. Dies ermöglicht es, das elliptische Konjunkt nicht selbst als Konstituente zu analysieren. Ein Konjunkt kann somit aus mehreren Konstituenten bestehen, die nicht untereinander verbunden sein müssen. Diese werden nicht untereinander, sondern mit den äquivalenten Konstituenten im ersten Konjunkt verbunden. Die Äquivalenz wird hierbei durch Substitutionstests definiert. Die unterhalb der Konjunktebene bewerkstelligte Koordinierung von Konstituenten - im Folgenden primäre Koordinate genannt - setzt jedoch eine geeignete Strukturbeschreibung voraus, in der die im elliptischen Konjunkt verfügbaren Informationen ausreichen, um das jeweilige primäre Koordinat im ersten Konjunkt zu finden. Insbesondere ist eine Strukturbeschreibung ungeeignet, in der die Konstituenten rekursiv ineinander eingebettet sind, da in Sprachen, die wie das Deutsche über eine ausgeprägte Morphologie und ein Genussystem verfügen, ein Zugriff auf primäre Koordinate auch dann möglich sein muss, wenn die Kategorie der Konstituente, die sie enthält, nicht bekannt ist. Im Deutschen ist es zudem möglich, mehrere aufeinander folgende Wörter wegzulassen, die nicht derselben Konstituente angehören, wie z. B. das Substantiv am Ende einer Nominalphrase mit adjektivischem Attribut zusammen mit dem darauf folgenden Verb. 
Für ein geeignetes Grammatikmodell (SLIM; Hausser, 2006) wurde unter Mitwirkung des Verfassers ein Parser (JSLIM) implementiert. Die syntaktisch-semantische Struktur wird nicht durch rekursive Einbettung repräsentiert, sondern durch eine Menge miteinander verzeigerter nicht-rekursiver Merkmalstrukturen, genannt Proplets, wobei jedes Proplet einem Inhaltswort entspricht. Die Relationen zu anderen Proplets werden gleich den Eigenschaften des betreffenden Wortes in den Attributen angegeben. Bei Ausführung der SLIM-Regeln werden die Merkmale des jeweils neu eingelesenen Wortes erfasst und daraus Bedingungen für maximal zwei der Proplets in der bisher erstellten Strukturbeschreibung, genannt Satzanfang, abgeleitet, die erfüllt sein müssen, um das Neuwortproplet integrieren zu können. Grundsätzlich sind alle Proplets des Satzanfangs dieser Überprüfung zugänglich. Die im Rahmen der vorliegenden Arbeit entwickelten Regeln für Ellipsen formulieren Bedingungen bezüglich mehr als zwei Proplets. Der Abgleich zwischen Regeln und Proplets ist so organisiert, dass die Bedingungen für die Proplets der weiter vom zuletzt eingelesenen Wort entfernten Wörter durch die Merkmale weniger weit entfernter Proplets zusätzlich eingeschränkt werden können. Koordinationsellipsen werden u. a. deshalb untersucht, weil sie hohe Ansprüche bezüglich der Zugänglichkeit der Proplets im Satzanfang stellen. Mit der vorgelegten JSLIM-Grammatik für Koordinationsellipsen im Deutschen wird ein Beitrag zur Weiterentwicklung des Formalismus und des Parsers geleistet, indem gezeigt wird, welche der Proplets direkt und welche nur indirekt zugänglich sein sollten, um einerseits eine schnellere Verarbeitung zu ermöglichen und andererseits die Grammatikentwicklung zu erleichtern, indem Fehler vermieden werden, die dadurch entstehen, dass aufgrund zu allgemeiner Bedingungen in den Regeln auf die falschen Proplets zugegriffen wird. 
Zudem wird gezeigt, wie sowohl die Zwischenresultate als auch die Grammatikregeln und sonstigen Beschränkungen beschaffen sein müssen, um einerseits weit auseinanderliegende primäre Koordinatepaare erkennen zu können, und andererseits zu verhindern, dass unzulässige Kanten zu weit entfernten Knoten konstruiert werden.
