SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)
Abstract
Extracting titles from a PDF’s full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and computationally expensive machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule-based heuristic that considers style information (font size) to identify a PDF’s title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an ‘academic search engine’ scenario, as well as better run times (8:19 minutes vs. 57:26 minutes).
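The abstract does not spell out the rules of the heuristic, so the following is only a minimal sketch of the general idea (take the text set in the largest font on the first page), not the SciPlore Xtract implementation. The `TextLine` type, the `guess_title` function, and the `min_chars` threshold are hypothetical names introduced for illustration, and the input is assumed to be first-page lines already extracted from the PDF together with their font sizes.

```python
from typing import List, NamedTuple


class TextLine(NamedTuple):
    """One line of first-page text with its font size (hypothetical type)."""
    text: str
    font_size: float


def guess_title(lines: List[TextLine], min_chars: int = 4) -> str:
    """Return the first run of lines set in the largest font size.

    Assumption: `lines` are in reading order. Very short fragments
    (fewer than `min_chars` characters) are ignored.
    """
    candidates = [ln for ln in lines if len(ln.text.strip()) >= min_chars]
    if not candidates:
        return ""
    largest = max(ln.font_size for ln in candidates)

    # Collect consecutive lines in the largest size so that a title
    # wrapped over several lines is returned whole.
    title_parts: List[str] = []
    for ln in candidates:
        if ln.font_size == largest:
            title_parts.append(ln.text.strip())
        elif title_parts:
            break
    return " ".join(title_parts)


# Usage with made-up font sizes (illustrative values only):
first_page = [
    TextLine("ECDL 2010", 9.0),
    TextLine("SciPlore Xtract: Extracting Titles", 18.0),
    TextLine("from Scientific PDF Documents", 18.0),
    TextLine("Jöran Beel, Bela Gipp", 11.0),
]
print(guess_title(first_page))
```

A rule of this kind needs no training data and only one pass over the first page, which is consistent with the run-time advantage the abstract reports over the machine learning baseline.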
Cite
ISO 690
BEEL, Jöran, Bela GIPP, Ammar SHAKER, Nick FRIEDRICH, 2010. SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size). ECDL 2010. Glasgow, 6. Sep. 2010 - 10. Sep. 2010. In: LALMAS, Mounia, ed. and others. Research and advanced technology for digital libraries : 14th European Conference, ECDL 2010, Glasgow, UK, September 6 - 10, 2010; proceedings. Berlin [u.a.]: Springer, 2010, pp. 413-416. Lecture Notes in Computer Science. 6273. ISBN 978-3-642-15463-8. Available under: doi: 10.1007/978-3-642-15464-5_45
BibTeX
@inproceedings{Beel2010SciPl-30892,
  year={2010},
  doi={10.1007/978-3-642-15464-5_45},
  title={SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)},
  number={6273},
  isbn={978-3-642-15463-8},
  publisher={Springer},
  address={Berlin [u.a.]},
  series={Lecture Notes in Computer Science},
  booktitle={Research and advanced technology for digital libraries :14th European Conference, ECDL 2010, Glasgow, UK, September 6 - 10, 2010; proceedings},
  pages={413--416},
  editor={Mounia Lalmas},
  author={Beel, Jöran and Gipp, Bela and Shaker, Ammar and Friedrich, Nick}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/30892">
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dcterms:title>SciPlore Xtract : Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size)</dcterms:title>
    <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/30892/1/Beel_0-285747.pdf"/>
    <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/30892"/>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-05-06T09:18:48Z</dcterms:available>
    <dc:rights>terms-of-use</dc:rights>
    <dc:creator>Shaker, Ammar</dc:creator>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dc:contributor>Gipp, Bela</dc:contributor>
    <dc:contributor>Shaker, Ammar</dc:contributor>
    <dc:creator>Friedrich, Nick</dc:creator>
    <dc:language>eng</dc:language>
    <dc:creator>Gipp, Bela</dc:creator>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/30892/1/Beel_0-285747.pdf"/>
    <dcterms:abstract xml:lang="eng">Extracting titles from a PDF’s full text is an important task in information retrieval to identify PDFs. Existing approaches apply complicated and expensive (in terms of calculating power) machine learning algorithms such as Support Vector Machines and Conditional Random Fields. In this paper we present a simple rule based heuristic, which considers style information (font size) to identify a PDF’s title. In a first experiment we show that this heuristic delivers better results (77.9% accuracy) than a support vector machine by CiteSeer (69.4% accuracy) in an ‘academic search engine’ scenario and better run times (8:19 minutes vs. 57:26 minutes).</dcterms:abstract>
    <dc:creator>Beel, Jöran</dc:creator>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-05-06T09:18:48Z</dc:date>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dcterms:issued>2010</dcterms:issued>
    <dc:contributor>Beel, Jöran</dc:contributor>
    <dc:contributor>Friedrich, Nick</dc:contributor>
  </rdf:Description>
</rdf:RDF>