Building Computational Resources : The URDU.KON-TB Treebank and the Urdu Parser

Files
Abbas_290530.pdf (size: 3.01 MB, downloads: 1521)
Date
2014
Open access publication
Open Access Green
Publication type
Dissertation
Publication status
Published
Abstract

This work presents the development of the URDU.KON-TB treebank, the evaluation of its annotation, its annotation guidelines, and the construction of a parser for Urdu, a South Asian language. Urdu is comparatively under-resourced, and a reliable treebank and parser will have a significant impact on the state of the art in automatic Urdu language processing. The work includes the construction of a raw corpus of 1400 sentences collected from Urdu Wikipedia and the Jang newspaper. The corpus covers local and international news, social stories, sports, culture, finance, religion, travel, and other topics. The hierarchical annotation scheme combines phrase structure with hyper dependency structure. A semi-semantic part-of-speech tag set, a semi-semantic syntactic tag set and a functional tag set are proposed and were revised further during the annotation of the raw corpus. The sentences were annotated manually. The annotation scheme is linguistically rich because it encodes morphological, part-of-speech, syntactic, semantic, clausal, grammatical and miscellaneous features. The annotation resulted in a treebank for Urdu called URDU.KON-TB, which is presented in Chapter 3.



Krippendorff's alpha coefficient, a statistical measure of inter-annotator agreement, was selected to evaluate the annotation scheme. A random sample of 100 sentences from the URDU.KON-TB treebank was given to five trained annotators for annotation, and the annotated sentences were then evaluated with Krippendorff's alpha. The alpha values obtained for part-of-speech, syntactic and functional annotation are 0.964, 0.817 and 0.806, respectively. All three values lie in the range of perfect agreement. The evaluation is presented in Chapter 4. The annotation guidelines devised during the development of the URDU.KON-TB treebank were revised during and after this evaluation. The updated version is presented in Chapter 2.
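As an illustration of the agreement measure named above, the following is a minimal sketch of Krippendorff's alpha for nominal labels. It is not the thesis's evaluation code: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated item) and the toy labels are assumptions made for this example, and the abstract does not state which distance metric was used, so the nominal variant is assumed here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that the
    annotators assigned to one item (None marks a missing label).
    This layout is a hypothetical choice for the sketch.
    """
    # Build the coincidence matrix: every ordered pair of labels given to
    # the same item by different annotators counts 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two labels cannot be paired
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("no pairable labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Toy data (invented): four items, two annotators, labels "a" and "b".
toy_units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
print(krippendorff_alpha_nominal(toy_units))
```

With five annotators, each inner list would simply hold five labels. The function would be run once per annotation layer (part of speech, syntactic, functional) to obtain one alpha value per layer.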



For the development of the Urdu parser, the 1400 annotated sentences of the URDU.KON-TB treebank are divided into 80% training data and 20% test data. A context-free grammar is extracted from the training data and supplied to the Urdu parser once the parser has been developed. The 20% test portion is further divided into held-out data and final test data, each amounting to 10% of the treebank. The final test data thus contains 140 sentences with an average length of 13.73 words per sentence, and the held-out data is used during the development of the parser. The Urdu parser is an extended version of the Earley parsing algorithm, a dynamic programming algorithm. The extensions are discussed in Chapter 5 along with the issues faced during development. All items that can occur in normal text are considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora within a sentence, and others. The PARSEVAL measures are used to evaluate the parser's results. By applying a sufficiently rich grammar along with the extended parsing model, the parser achieves an f-score of 87% and outperforms the multi-path shift-reduce parser for Urdu, a two-stage Hindi dependency parser and a simple Hindi dependency parser, with increases in recall of 4.8%, 12.48% and 22%, respectively.
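To make the PARSEVAL evaluation mentioned above concrete, here is a minimal sketch of labeled bracket precision, recall and f-score. It is an illustration under stated assumptions, not the thesis's evaluation setup: the nested-tuple tree format, the function names `labeled_spans` and `parseval`, the decision to skip part-of-speech nodes, and the English placeholder words are all hypothetical.

```python
from collections import Counter


def labeled_spans(tree):
    """Collect (label, start, end) for every phrasal node.

    A tree is a nested tuple (label, child, child, ...); a leaf is a plain
    string holding one word. This format is an assumption of the sketch.
    """
    spans = Counter()

    def walk(node, start):
        if isinstance(node, str):  # a word covers one position
            return start + 1
        label, *children = node
        end = start
        for child in children:
            end = walk(child, end)
        # Skip part-of-speech nodes (a tag directly over a single word),
        # a common PARSEVAL convention that is assumed here.
        if not (len(children) == 1 and isinstance(children[0], str)):
            spans[(label, start, end)] += 1
        return end

    walk(tree, 0)
    return spans


def parseval(gold_trees, predicted_trees):
    """Return (precision, recall, f_score) over labeled brackets."""
    matched = gold_total = predicted_total = 0
    for gold, predicted in zip(gold_trees, predicted_trees):
        g, p = labeled_spans(gold), labeled_spans(predicted)
        matched += sum((g & p).values())  # multiset intersection
        gold_total += sum(g.values())
        predicted_total += sum(p.values())
    precision = matched / predicted_total if predicted_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented toy trees with placeholder words; labels are illustrative only.
gold = ("S", ("NP", ("N", "w1")),
        ("VP", ("V", "w2"), ("NP", ("N", "w3"))))
predicted = ("S", ("NP", ("N", "w1")),
             ("VP", ("V", "w2")), ("NP", ("N", "w3")))
print(parseval([gold], [predicted]))
```

Recall is the share of gold brackets that the parser reproduces and precision is the share of predicted brackets that are correct, so the recall comparisons reported above are comparisons of the first quantity.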



The URDU.KON-TB treebank and the Urdu parser are a contribution to the overall computational resources of Urdu. By-products of this work are a semi-semantic part-of-speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar that encodes sufficient information for parsing the morphologically rich language Urdu, and a part-of-speech-tagged corpus that can be used to train part-of-speech taggers. These resources will be enhanced further and can be used in natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification and pattern matching, and as sources for linguistic inquiry and psychological modeling.

Subject (DDC)
004 Computer science
Keywords
URDU.KON-TB Treebank, Urdu Treebank Statistical Evaluation, Urdu Parser, Semi-Semantic Part of Speech Tagset, Semi-Semantic Syntactic Tagset, Functional Tagset
Cite
ISO 690
ABBAS, Qaiser, 2014. Building Computational Resources : The URDU.KON-TB Treebank and the Urdu Parser [Dissertation]. Konstanz: University of Konstanz
BibTeX
@phdthesis{Abbas2014Build-29053,
  year={2014},
  title={Building Computational Resources : The URDU.KON-TB Treebank and the Urdu Parser},
  author={Abbas, Qaiser},
  address={Konstanz},
  school={Universität Konstanz}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/29053">
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29053/1/Abbas_290530.pdf"/>
    <dc:contributor>Abbas, Qaiser</dc:contributor>
    <dcterms:abstract xml:lang="eng">This work presents the development of the URDU.KON-TB treebank, its annotation evaluation &amp; guidelines and the construction of the Urdu parser for a South Asian language Urdu. Urdu is comparatively an under-resourced language and the development of a reliable treebank and a parser will have significant impact on the state-of-the-art for automatic Urdu language processing. The work includes the construction of the raw corpus containing 1400 sentences collected from Urdu Wikipedia and the Jang newspaper. The corpus contains text of local &amp; international news, social stories, sports, culture, finance, religion, traveling, etc. The hierarchal annotation scheme adopted has a combination of phrase structure and hyper dependency structure. A semi-semantic part of speech tag set, a semi-semantic syntactic tag set and a functional tag set are proposed, which are further revised during the annotation of the raw corpus. The annotation of the sentences was performed manually. Due to the addition of morphology, part of speech, syntactical, semantical, clausal, grammatical and miscellaneous features, the annotation scheme is linguistically rich. The annotation resulted in a treebank for Urdu, called the URDU.KON-TB. This is presented in Chapter 3.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For an evaluation of the annotation scheme, Krippendorff's Alpha coefficient is selected. This is a statistical measure to evaluate inter-annotator agreement. Randomly selected 100 sentences from the URDU.KON-TB treebank were given to five trained annotators for annotation. The annotated sentences then evaluated using the Krippendorff's Alpha coefficient. The alpha values of inter-annotator agreement obtained for part of speech, syntactical and functional annotation are 0.964, 0.817 and 0.806, respectively. The evaluation is presented in Chapter 4. All of the three values lie in the range of perfect agreement. 
The annotation guidelines devised in the development of the URDU.KON-TB treebank were revised during and after this annotation evaluation. The updated version is presented in Chapter 2.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;For the development of an Urdu parser, 1400 annotated sentences in the URDU.KON-TB treebank are divided into 80% training data and 20% test data. A context free grammar is extracted from this training data, which is then given to the Urdu parser after its development. The test data is divided into 10% held out data and 10% test data. The test data then contains 140 sentences with an average length of 13.73 words per sentence. The held out data is used during the development of the Urdu parser. Urdu parser is an extended version of dynamic programming algorithm known as the Earley parsing algorithm. The extensions made are discussed in Chapter 5 along with the issues faced during the development. All items which can occur in a normal text are considered, e.g., punctuation, null elements, diacritics, headings, regard titles, Hadees (the statements of prophets), anaphora with in a sentence, and others. The PARSEVAL measures are used to evaluate the results of the Urdu parser. By applying a sufficiently rich grammar along with the extended parsing model, the parser gives 87% of f-score and outperforms the multi-path-shift-reduce parser for Urdu, a two stage Hindi dependency parser and a simple Hindi dependency parser with 4.8%, 12.48% and 22% increase in recall, respectively.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The URDU.KON-TB treebank and the Urdu parser is a contribution to the overall computational resources of Urdu. 
By products of this work are a semi-semantic part of speech tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar with sufficient encoded information for parsing of morphologically rich language Urdu and a part of speech tagged corpus, which can be used for the training of part of speech taggers. These resources will be enhanced further and can be used for natural language processing such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development, language identification, sources for linguistic inquiry and psychological modeling, or pattern matching.</dcterms:abstract>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/>
    <dc:rights>terms-of-use</dc:rights>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-09-29T12:42:31Z</dcterms:available>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/29053"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-09-29T12:42:31Z</dc:date>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <dcterms:title>Building Computational Resources : The URDU.KON-TB Treebank and the Urdu Parser</dcterms:title>
    <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29053/1/Abbas_290530.pdf"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:creator>Abbas, Qaiser</dc:creator>
    <dc:language>eng</dc:language>
    <dcterms:issued>2014</dcterms:issued>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/45"/>
  </rdf:Description>
</rdf:RDF>
Date of dissertation examination
September 17, 2014