DBpedia Datasets for Natural Language Processing (NLP)


Every DBpedia dataset is potentially useful for Natural Language Processing (NLP) tasks. This page gives a few examples of how to use these datasets and describes a number of extended datasets that were generated during the creation of DBpedia Spotlight and other NLP-related projects.


In the context of this page, the word “resource” (as in DBpedia Resource) refers to an entity or concept identified by a DBpedia URI.



1. DBpedia Core Datasets


The core datasets from DBpedia include an ontology that models the information extracted from Wikipedia, general facts about the extracted resources, and inter-language links. More information is available on the Core Datasets page.

2. DBpedia NLP Datasets


The NLP Datasets were created by the DBpedia Spotlight team to support entity recognition and disambiguation tasks, among others. If you use the DBpedia NLP datasets in your research, please cite:


  • Pablo N. Mendes, Max Jakob and Christian Bizer. DBpedia for NLP: A Multilingual Cross-domain Knowledge Base. Proceedings of the International Conference on Language Resources and Evaluation, LREC 2012, 21–27 May 2012, Istanbul, Turkey.

2.1. DBpedia Lexicalizations Dataset


This dataset contains mappings between surface forms and URIs. A surface form is a term that has been used to refer to an entity in text. Names and nicknames of people are examples of surface forms. We store the number of times a surface form was used to refer to a DBpedia resource in Wikipedia, and we compute statistics from those counts.


Created by the DBpedia Spotlight team.
Authors: Pablo N. Mendes, Max Jakob


Download.


Has been used by: DBpedia Lookup, DBpedia Spotlight


Example Data:

dbpedia:Apple_Inc. lexvo:label “Apple computer”@en graph:Apple_Inc.---Apple_computer .
graph:Apple_Inc.---Apple_computer :pmi “9.867346749590263”^^xsd:double :score .
dbpedia:Apple_Inc. lexvo:label “Apple, Inc”@en graph:Apple_Inc.---Apple,_Inc .
graph:Apple_Inc.---Apple,_Inc :pmi “9.867346749590263”^^xsd:double :score .

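The data above describes the entity Apple_Inc. and two surface forms used to refer to it: “Apple computer” and “Apple, Inc”. Each pairing of entity and surface form is placed in its own named graph, which carries a :pmi score.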
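To illustrate how statistics can be derived from surface form counts, here is a minimal sketch that computes a conditional probability and a pointwise mutual information (PMI) value from link counts. The counts, the function names, and the base-2 logarithm are all assumptions made for illustration; the exact formula and normalization behind the :pmi values in the dataset are not stated on this page and may differ.

```python
import math
from collections import Counter

# Hypothetical toy counts: (surface form, resource) -> number of links.
# These numbers are invented for illustration, not taken from the dataset.
pair_counts = Counter({
    ("Apple", "Apple_Inc."): 60,
    ("Apple", "Apple"): 40,
    ("Apple computer", "Apple_Inc."): 10,
    ("apple sauce", "Apple_sauce"): 5,
})


def p_resource_given_surface_form(surface_form, resource, counts):
    """Estimate P(resource | surface form) from raw co-occurrence counts."""
    sf_total = sum(c for (sf, _), c in counts.items() if sf == surface_form)
    if sf_total == 0:
        return 0.0
    return counts.get((surface_form, resource), 0) / sf_total


def pmi(surface_form, resource, counts):
    """Pointwise mutual information, log2(P(sf, r) / (P(sf) * P(r)))."""
    total = sum(counts.values())
    joint = counts.get((surface_form, resource), 0)
    if joint == 0:
        return float("-inf")  # the pair was never observed together
    sf_total = sum(c for (sf, _), c in counts.items() if sf == surface_form)
    res_total = sum(c for (_, r), c in counts.items() if r == resource)
    return math.log2((joint / total) / ((sf_total / total) * (res_total / total)))


if __name__ == "__main__":
    print(p_resource_given_surface_form("Apple", "Apple_Inc.", pair_counts))
    print(pmi("Apple computer", "Apple_Inc.", pair_counts))
```

A high conditional probability says that a surface form usually refers to that resource, while a high PMI says that the surface form and the resource occur together more often than their individual frequencies would predict.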

2.2. DBpedia Topic Signatures


We tokenize all Wikipedia paragraphs linking to DBpedia resources and aggregate them in a Vector Space Model of terms weighted by their co-occurrence with the target resource. We use those vectors to select the strongest related terms and build topic signatures for those entities.


Download.


Created by the DBpedia Spotlight team.
Authors: Pablo N. Mendes


Example Data:


Apple_Inc. +"Apple Inc." computer from mac
Apple_sauce +"Apple sauce" pudding butter pie
Apple_Records +"Apple Records" beatles album released
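The general idea behind topic signatures (tokenize the paragraphs that link to a resource, weight the terms, keep the strongest ones) can be sketched as follows. This is only an illustration under stated assumptions: the mini-corpus is invented, the tokenizer is a simple regular expression, and the weighting is a TF-IDF-style stand-in rather than the co-occurrence weighting actually used to build the dataset.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus: (linked resource, paragraph text).
# Invented for illustration; the real input is Wikipedia paragraphs.
paragraphs = [
    ("Apple_Inc.", "Apple released a new Mac computer"),
    ("Apple_Inc.", "The Mac computer from Apple"),
    ("Apple_Records", "Apple Records released a Beatles album"),
    ("Apple_Records", "The Beatles album was released"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_signatures(paragraphs, top_k=3):
    """Return the top_k highest-weighted terms for each resource."""
    # Aggregate term counts over all paragraphs that link to each resource.
    term_counts = defaultdict(Counter)
    for resource, text in paragraphs:
        term_counts[resource].update(tokenize(text))

    # Count in how many resource vectors each term appears.
    resource_freq = Counter()
    for counts in term_counts.values():
        resource_freq.update(counts.keys())

    n_resources = len(term_counts)
    signatures = {}
    for resource, counts in term_counts.items():
        # Assumed weighting: term frequency times inverse resource frequency.
        weights = {
            term: tf * math.log(n_resources / resource_freq[term])
            for term, tf in counts.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        signatures[resource] = ranked[:top_k]
    return signatures


if __name__ == "__main__":
    for resource, terms in topic_signatures(paragraphs).items():
        print(resource, " ".join(terms))
```

Terms shared by every resource in the toy corpus (such as "apple" or "released") receive a weight of zero, so only the terms that distinguish a resource from the others end up in its signature.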

2.3. DBpedia Thematic Concepts


Thematic Concepts are DBpedia resources that are the main subject of a Wikipedia Category.


Created by the DBpedia Spotlight team.
Authors: Pablo N. Mendes, Max Jakob


Download.


Example Data:

dbpedia:Adolescence rdf:type skos:Concept
dbpedia:Adoption rdf:type skos:Concept
dbpedia:Biodiversity rdf:type skos:Concept

2.4. DBpedia People's Grammatical Genders


This dataset can be used for anaphora resolution and coreference resolution tasks.


Created by the DBpedia Spotlight team.
Authors: Pablo N. Mendes


Download.


Example Data:

3. Example Queries


* Select all people with grammatical gender “female” related to the topic of “Politics”
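The query above amounts to joining gender information with topic information and filtering on both. The sketch below shows that join over in-memory Python dictionaries. Everything in it is hypothetical: the placeholder person identifiers, the dictionary layout, and the way a person is linked to a topic are assumptions, since this page does not show the format of the gender dataset or the query itself.

```python
# Hypothetical stand-ins for the two data sources; the identifiers and
# layouts are placeholders, not the real dataset format.
gender_by_person = {
    "Person_A": "female",
    "Person_B": "male",
    "Person_C": "female",
}
topics_by_person = {
    "Person_A": {"Politics"},
    "Person_B": {"Politics"},
    "Person_C": {"Physics"},
}


def select_people(gender_by_person, topics_by_person, wanted_gender, wanted_topic):
    """Return people with the given gender who are linked to the given topic."""
    return sorted(
        person
        for person, gender in gender_by_person.items()
        if gender == wanted_gender
        and wanted_topic in topics_by_person.get(person, set())
    )


if __name__ == "__main__":
    print(select_people(gender_by_person, topics_by_person, "female", "Politics"))
```

Against the real data, the same filter would more naturally be written as a query over the loaded datasets, with the join expressed through the properties those datasets actually use.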


 

Information

Last Modification: 2013-11-22 22:43:34 by Ibu Radempa