DBpedia NIF Dataset

DBpedia primarily focuses on representing the factual knowledge contained in Wikipedia infoboxes. A vast amount of information, however, is contained in the unstructured Wikipedia article texts. To broaden and deepen the structured data in DBpedia, we are going a step further.
By representing wiki pages in the NLP Interchange Format (NIF), we provide all information directly extractable from the HTML source code, divided into three datasets:

  • nif-context: the full text of a page as context (including begin and end index)
  • nif-page-structure: the structure of the page in sections and paragraphs (titles, subsections etc.)
  • nif-text-links: all in-text links to other DBpedia resources as well as external references

These datasets will serve as the groundwork for further NLP fact extraction tasks to enrich the gathered knowledge of DBpedia.

Note: The first iteration of this extraction process covered only the abstract of every wiki page as a trial run. Starting with release 2016-10, the whole wiki page text is provided in the NIF format.

IRIs: Unlike the IRI scheme used for other DBpedia datasets, the IRIs in these datasets carry a query string that includes the version of DBpedia under which the instances were extracted, as the examples below show.
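
As a rough illustration of this IRI scheme, the following Python sketch builds and parses such an IRI. It assumes that dbr: expands to http://dbpedia.org/resource/ and that the query parameters are named dbpv and nif, as in the examples on this page. The helper names nif_iri and parse_nif_iri are hypothetical and not part of any DBpedia tooling, and resource names that need percent-encoding are not handled.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: namespace commonly bound to the dbr: prefix.
DBR = "http://dbpedia.org/resource/"


def nif_iri(resource, version, nif):
    """Build an IRI such as dbr:Anthropology?dbpv=2016-04&nif=context."""
    return DBR + resource + "?" + urlencode({"dbpv": version, "nif": nif})


def parse_nif_iri(iri):
    """Split an IRI into (resource name, DBpedia version, nif value)."""
    parts = urlsplit(iri)
    query = parse_qs(parts.query)
    resource = parts.path.rsplit("/", 1)[-1]
    return resource, query["dbpv"][0], query["nif"][0]


iri = nif_iri("Anthropology", "2016-04", "context")
print(iri)
print(parse_nif_iri(iri))
```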

If you find inconsistencies in these files, please contact the DBpedia mailing lists or the DBpedia Association directly. Thank you.

Downloads

A sample of the most recent files is listed in the table below. The full list of available languages can be found on the DBpedia Databus platform under nif-context, nif-page-structure, and nif-text-links.

Language nif-context nif-page-structure nif-text-links
de .ttl .ttl .ttl
en .ttl .ttl .ttl
es .ttl .ttl .ttl
fr .ttl .ttl .ttl
it .ttl .ttl .ttl
ja .ttl .ttl .ttl
ko .ttl .ttl .ttl
pl .ttl .ttl .ttl
pt .ttl .ttl .ttl

The Ontology

The following figure shows the main classes and properties of the NIF vocabulary.

NIF ontology

Libraries

You can integrate the NIF library into your project by:

  • Adding the NIF Maven library
  • Compiling the NIF-lib GitHub project yourself
  • Compiling the pyNIF-lib GitHub project

Documentation

For a deeper understanding of NIF, see the NIF documentation, which provides pointers to all the important resources for the NLP Interchange Format (NIF).


Example:

input text: "Anthropology is the study of humanity. Its main subdivisions are social anthropology and cultural anthropology, which describes the workings of societies around the world, linguistic anthropology, which investigates the influence of language in social life, and biological or physical anthropology, which concerns long-term development of the human organism. Archaeology, which studies past human cultures through investigation of physical evidence, is thought of as a branch of anthropology in the United States, although in Europe, it is viewed as a discipline in its own right, or grouped under related disciplines such as history."

The result is a set of .ttl files containing the context, page-structure and text-links information.

  • nif-context.ttl

The full text of a wiki page as the context for all subsequent information about this page.

dbr:Anthropology?dbpv=2016-04&nif=context     a     nif:Context .

dbr:Anthropology?dbpv=2016-04&nif=context    nif:isString    "Anthropology is the study of humanity. Its main subdivisions are social anthropology and cultural anthropology, which describes the workings of societies around the world, linguistic anthropology, which investigates the influence of language in social life, and biological or physical anthropology, which concerns long-term development of the human organism. Archaeology, which studies past human cultures through investigation of physical evidence, is thought of as a branch of anthropology in the United States, although in Europe, it is viewed as a discipline in its own right, or grouped under related disciplines such as history." .

dbr:Anthropology?dbpv=2016-04&nif=context    nif:beginIndex    "0"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=context    nif:endIndex      "634"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=context    nif:sourceUrl     <http://en.wikipedia.org/wiki/Anthropology> .
dbr:Anthropology?dbpv=2016-04&nif=context    nif:predLang     <http://lexvo.org/id/iso639-3/eng> .
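
To make the relationship between nif:isString, nif:beginIndex and nif:endIndex concrete, here is a minimal Python sketch that emits the core context statements for a given text as N-Triples lines. It assumes the NIF core namespace http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core# and that offsets count characters the way Python's len() does. The function names are hypothetical, and this is not the DBpedia extraction code.

```python
# Assumption: namespace of the NIF core ontology (the nif: prefix).
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_NNI = "http://www.w3.org/2001/XMLSchema#nonNegativeInteger"


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def context_triples(context_iri, text):
    """Return the type, isString, beginIndex and endIndex statements."""
    s = f"<{context_iri}>"
    return [
        f"{s} <{RDF_TYPE}> <{NIF}Context> .",
        f'{s} <{NIF}isString> "{escape_literal(text)}" .',
        f'{s} <{NIF}beginIndex> "0"^^<{XSD_NNI}> .',
        # The context spans the whole text, so endIndex is its length.
        f'{s} <{NIF}endIndex> "{len(text)}"^^<{XSD_NNI}> .',
    ]


lines = context_triples(
    "http://dbpedia.org/resource/Anthropology?dbpv=2016-04&nif=context",
    "Anthropology is the study of humanity.",
)
print("\n".join(lines))
```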

  • nif-page-structure.ttl

The structure of the wiki page as nif:Structure instances, such as Section, Paragraph and Title.

dbr:Anthropology?dbpv=2016-04&nif=context    nif:hasSection    dbr:Anthropology?dbpv=2016-04&nif=section_0_634    .

dbr:Anthropology?dbpv=2016-04&nif=section_0_634    a    nif:Section    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:beginIndex    "0"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:endIndex    "634"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:referenceContext    dbr:Anthropology?dbpv=2016-04&nif=context    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:hasParagraph    dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:hasParagraph    dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:firstParagraph    dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    .
dbr:Anthropology?dbpv=2016-04&nif=section_0_634    nif:lastParagraph    dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    .

dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    a    nif:Paragraph    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    nif:beginIndex    "0"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    nif:endIndex    "330"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    nif:referenceContext    dbr:Anthropology?dbpv=2016-04&nif=context    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330    nif:superString    dbr:Anthropology?dbpv=2016-04&nif=section_0_634    .

dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    a    nif:Paragraph    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    nif:beginIndex    "331"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    nif:endIndex    "634"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger>    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    nif:referenceContext    dbr:Anthropology?dbpv=2016-04&nif=context    .
dbr:Anthropology?dbpv=2016-04&nif=paragraph_331_634    nif:superString    dbr:Anthropology?dbpv=2016-04&nif=section_0_634    .
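
The section and paragraph statements above can be modelled with plain offsets. The following Python sketch, using a hypothetical Span class and helper, checks that every paragraph lies inside its section and derives which paragraphs would be referenced by nif:firstParagraph and nif:lastParagraph. It assumes the nif= value follows the kind_begin_end pattern seen in the example IRIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    kind: str   # "section" or "paragraph"
    begin: int  # value of nif:beginIndex
    end: int    # value of nif:endIndex

    def iri_suffix(self):
        """Value of the nif= query parameter, e.g. paragraph_0_330."""
        return f"{self.kind}_{self.begin}_{self.end}"

    def contains(self, other):
        return self.begin <= other.begin and other.end <= self.end


def first_and_last(section, paragraphs):
    """Return the first and last paragraph of a section by begin offset."""
    ordered = sorted(paragraphs, key=lambda p: p.begin)
    for paragraph in ordered:
        if not section.contains(paragraph):
            raise ValueError(f"{paragraph.iri_suffix()} lies outside the section")
    return ordered[0], ordered[-1]


section = Span("section", 0, 634)
paragraphs = [Span("paragraph", 331, 634), Span("paragraph", 0, 330)]

first, last = first_and_last(section, paragraphs)
print(first.iri_suffix(), last.iri_suffix())
```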

 

  • nif-text-links.ttl

All in-text links of a wiki page as nif:Word or nif:Phrase.

dbr:Anthropology?dbpv=2016-04&nif=word_29_37    a    nif:Word .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    nif:referenceContext    dbr:Anthropology?dbpv=2016-04&nif=context .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    nif:beginIndex    "29"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    nif:endIndex    "37"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    nif:superString    dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330 .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    <http://www.w3.org/2005/11/its/rdf#taIdentRef>    dbr:Human .
dbr:Anthropology?dbpv=2016-04&nif=word_29_37    nif:anchorOf    "humanity" .

dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    a    nif:Phrase    .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    nif:referenceContext    dbr:Anthropology?dbpv=2016-04&nif=context .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    nif:beginIndex    "65"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    nif:endIndex    "84"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    nif:superString    dbr:Anthropology?dbpv=2016-04&nif=paragraph_0_330 .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    <http://www.w3.org/2005/11/its/rdf#taIdentRef>    dbr:Social_anthropology .
dbr:Anthropology?dbpv=2016-04&nif=phrase_65_84    nif:anchorOf    "social anthropology" .
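
A quick way to sanity-check link records like these is to slice the context string with the given offsets and compare the result with nif:anchorOf. The Python sketch below does this for the two links shown, using only the beginning of the example text. It assumes zero-based offsets with an exclusive end index, which matches the values in this example. The dictionary layout and the anchor_matches helper are illustrative only.

```python
# Beginning of the example context; enough to cover both links.
context = (
    "Anthropology is the study of humanity. Its main subdivisions are "
    "social anthropology and cultural anthropology"
)

# Offsets, anchors and targets copied from the statements above.
links = [
    {"begin": 29, "end": 37, "anchor": "humanity", "ref": "dbr:Human"},
    {
        "begin": 65,
        "end": 84,
        "anchor": "social anthropology",
        "ref": "dbr:Social_anthropology",
    },
]


def anchor_matches(context, link):
    """True if the offsets select exactly the anchor text in the context."""
    return context[link["begin"]:link["end"]] == link["anchor"]


for link in links:
    print(link["ref"], anchor_matches(context, link))
```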

 

Publications