Verteilungsansätze von großen Datenmengen

Graf, Sebastian

Files

Verteilungsansaetze_von_grossen_Datenmengen.pdf (Size: 5.89 MB, Downloads: 462)

Date

2008

Open Access publication

Open Access Green

Collections

Computer and Information Science

Title in another language

Distributing large amounts of data

Publication type

Master's thesis/Diploma thesis

Publication status

Published

Abstract

The era of single-core processors is coming to an end. Few modern computer systems have fewer than two cores. To make optimal use of these newly available parallel resources, the way data is handled must be adapted. This adaptation includes the distribution of the data. This thesis addresses that aspect with respect to the evaluation of text-based data formats. More precisely, it presents distributed queries on Comma Separated Values (CSV), on Extensible Markup Language (XML)-based data in its string representation, and on XML-based data with respect to its structure. Multiple variants for partitioning the data are presented for each approach. In particular, the fragmentation of XML-based data according to its structure reveals the dependency between the structure itself and the different partitioning approaches. A method is therefore presented for generating a consistent fragmentation that is independent of the structure. Distributed queries on well-known, fragmented XML databases such as wikipedia, treebank, xmark and dblp show the benefits of these approaches. Depending on the fragmentation and the available resources, distributed XPath queries need less than half the time of a non-distributed query. Based on these results, further optimizations are possible. In particular, query evaluation could be improved by pipelining on XPath.

Abstract in another language (translated from German)

The era of single-core processors is drawing to a close. Hardly any modern computer system today has fewer than two cores. To make optimal use of these newly available parallel resources, the handling of data must be adapted accordingly. This adaptation includes the distribution of the data to be evaluated. This thesis deals with that aspect with regard to the evaluation of text-based data formats. Specifically, it presents distributed queries on Comma Separated Values (CSV), on Extensible Markup Language (XML)-based data in its string representation, and on XML-based data taking the structure into account. For each approach, several partitioning variants are presented and compared. The dependency between the different distribution approaches and the structure becomes especially apparent when XML-based data is fragmented on the basis of that structure. The thesis therefore proposes a way to achieve a uniform fragmentation independent of the structure of the data. Distributed queries on fragmented, well-known XML databases such as wikipedia, treebank, xmark and dblp demonstrate the advantages of these approaches. Depending on the partitioning and the available resources, distributed XPath queries need less than half the time of local evaluations. Based on these results, further optimizations can be made. In particular, query evaluation could be improved further by using pipelining on XPath.
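
The partition-then-query idea summarized in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the sample data, the round-robin scheme and the function names (`partition_round_robin`, `query_partition`, `distributed_query`) are all assumptions made for illustration, and a thread pool stands in for whatever distribution mechanism the thesis actually uses.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data, invented for this sketch (not from the thesis).
SAMPLE_CSV = """id,city,population
1,Konstanz,85000
2,Berlin,3600000
3,Hamburg,1800000
4,Lindau,25000
5,Munich,1500000
"""


def partition_round_robin(rows, n_parts):
    """Assign row i to partition i % n_parts (one of many possible schemes)."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(rows):
        parts[i % n_parts].append(row)
    return parts


def query_partition(rows, min_population):
    """Local query on one partition: cities with at least min_population."""
    return [r["city"] for r in rows if int(r["population"]) >= min_population]


def distributed_query(csv_text, n_parts, min_population):
    """Partition the CSV rows, query each partition, and merge the results."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    parts = partition_round_robin(rows, n_parts)
    # Threads only illustrate the control flow; real multi-core speedups in
    # Python would need processes or separate machines.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = pool.map(lambda p: query_partition(p, min_population), parts)
        # Sort the merged result so it does not depend on the partitioning.
        return sorted(city for result in partial for city in result)
```

Calling `distributed_query(SAMPLE_CSV, 2, 1_000_000)` evaluates the filter on two partitions and merges the partial answers. The merged result should be the same for any number of partitions, which is the property a partitioning scheme has to preserve.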

Subject (DDC)

004 Computer science

Keywords

Verteilung (distribution), Distribution

Cite

ISO 690

GRAF, Sebastian, 2008. Verteilungsansätze von großen Datenmengen [Master thesis]

BibTeX

@mastersthesis{Graf2008Verte-5557,
  year={2008},
  title={Verteilungsansätze von großen Datenmengen},
  author={Graf, Sebastian}
}

RDF

<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/5557">
    <dc:language>deu</dc:language>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/5557/1/Verteilungsansaetze_von_grossen_Datenmengen.pdf"/>
    <dcterms:issued>2008</dcterms:issued>
    <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/5557/1/Verteilungsansaetze_von_grossen_Datenmengen.pdf"/>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dc:contributor>Graf, Sebastian</dc:contributor>
    <dcterms:title>Verteilungsansätze von großen Datenmengen</dcterms:title>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-03-24T15:56:24Z</dcterms:available>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dc:creator>Graf, Sebastian</dc:creator>
    <dcterms:rights rdf:resource="http://creativecommons.org/licenses/by-nc-nd/2.0/"/>
    <dcterms:abstract xml:lang="eng">The era of single-core processors is coming to an end. Few modern computer systems have fewer than two cores. To make optimal use of these newly available parallel resources, the way data is handled must be adapted. This adaptation includes the distribution of the data. This thesis addresses that aspect with respect to the evaluation of text-based data formats. More precisely, it presents distributed queries on Comma Separated Values (CSV), on Extensible Markup Language (XML)-based data in its string representation, and on XML-based data with respect to its structure. Multiple variants for partitioning the data are presented for each approach. In particular, the fragmentation of XML-based data according to its structure reveals the dependency between the structure itself and the different partitioning approaches. A method is therefore presented for generating a consistent fragmentation that is independent of the structure. Distributed queries on well-known, fragmented XML databases such as wikipedia, treebank, xmark and dblp show the benefits of these approaches. Depending on the fragmentation and the available resources, distributed XPath queries need less than half the time of a non-distributed query. Based on these results, further optimizations are possible. In particular, query evaluation could be improved by pipelining on XPath.</dcterms:abstract>
    <dc:rights>Attribution-NonCommercial-NoDerivs 2.0 Generic</dc:rights>
    <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/5557"/>
    <dc:format>application/pdf</dc:format>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2011-03-24T15:56:24Z</dc:date>
    <dcterms:alternative>Distributing large amounts of data</dcterms:alternative>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
  </rdf:Description>
</rdf:RDF>

University bibliography

Yes