Latest Core Dataset Releases

Publication Year: 
2020

What data is extracted by DBpedia?

We follow an Extraction as a Platform (EaaP) approach. At regular intervals (normally each month), we automatically run the DBpedia extraction framework over the Wikipedia dumps (all languages) and the Wikidata dumps to extract around 5000 files, packaged in 50 artifacts and 4 high-level groups: Generic (using generic parsers and properties), Mappings (using editable ontology mappings from mappings.dbpedia.org), Text (abstract and article full-text extraction), and Wikidata (mapped and cleaned Wikidata data).

A small part of this data (approx. 100 of 4000 files or 2.5%) is then selected into the latest-core collection. Latest-core is the equivalent of the "core" folder in previous releases (http://downloads.dbpedia.org/2016-10/core/). This is the folder loaded into the main SPARQL endpoint. Going forward, we will fork latest-core at certain intervals into stable collections, which are then loaded into the endpoint.

A full description can be found in Hofer et al., The New DBpedia Release Cycle: Increasing Agility and Efficiency in Knowledge Extraction Workflows, SEMANTiCS 2020 (submitted).

Feedback and debugging

Running the extraction each month via Extraction as a Platform means that our community and consumers can help us debug and extend the extraction via GitHub or the DBpedia Forum. A preliminary guide on How to Improve DBpedia is available. Any changes that reach the master branch will be included in the next monthly release.

Please check the What is missing? section below!

About the Latest-Core Collection

Documentation is found via the collection link: https://databus.dbpedia.org/dbpedia/collections/latest-core

The collection updates automatically and always refers to the latest available files. If you would like to customize it, we advise creating your own Databus collection:

  1. Register or log in.
  2. Go to the collection and click "Action" -> "Edit Copy".

Download

How to retrieve the data manually

Go to https://databus.dbpedia.org/dbpedia/collections/latest-core and click on the individual download links.

 

How to retrieve data automatically

  1. Retrieve the data query
    • Visit the collection page and click on Actions > Copy Query to Clipboard 
    • or run curl https://databus.dbpedia.org/dbpedia/collections/latest-core -H "accept: text/sparql" > query.sparql
  2. Select one of the following options:
    • Run the query against https://databus.dbpedia.org/repo/sparql to get the list of downloadable files (make sure to use a POST request, since the request length exceeds the maximum length of a GET request)
      curl -X POST --data-urlencode query@query.sparql -d format=text/tab-separated-values  https://databus.dbpedia.org/repo/sparql
      The query will return a list of download links, which can be retrieved with wget
    • Give the query to the Databus Client. The Client provides additional options for compression and format conversion, so you don't need to do it manually.
    • The collection can be loaded into various docker images, e.g. the Dockerized DBpedia 
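
The first option above (fetch the query, POST it, download the results with wget) can be sketched as a small shell script. This is only an illustrative sketch, not an official DBpedia tool: the function names, the files.txt file name and the target directory are hypothetical, and it assumes that curl and wget are installed and that the endpoint answers with a single-column TSV whose values are wrapped in double quotes or angle brackets.

```shell
# Sketch only: function names, file names and the assumed single-column
# TSV layout are illustrative, not part of any DBpedia tooling.

COLLECTION_URL="https://databus.dbpedia.org/dbpedia/collections/latest-core"
SPARQL_ENDPOINT="https://databus.dbpedia.org/repo/sparql"

# Step 1: fetch the SPARQL query that defines the collection into file $1.
fetch_query() {
  curl -sf "$COLLECTION_URL" -H "accept: text/sparql" > "$1"
}

# Step 2: POST the query stored in file $1 and print the raw TSV result.
run_query() {
  curl -sf -X POST --data-urlencode "query@$1" \
       -d format=text/tab-separated-values "$SPARQL_ENDPOINT"
}

# Read TSV on stdin, skip the header row, strip surrounding quotes or
# angle brackets, and keep only lines that look like HTTP(S) URLs.
extract_urls() {
  tail -n +2 | tr -d '\r' \
    | sed -e 's/^[<"]//' -e 's/[>"]$//' \
    | grep -E '^https?://' || true
}

# Step 3: download every listed file into the target directory
# (default: ./latest-core, an assumed name).
download_latest_core() {
  local target_dir="${1:-latest-core}"
  mkdir -p "$target_dir"
  fetch_query "$target_dir/query.sparql"
  run_query "$target_dir/query.sparql" | extract_urls > "$target_dir/files.txt"
  wget --continue --directory-prefix="$target_dir" --input-file="$target_dir/files.txt"
}
```

After sourcing the script, calling download_latest_core ./latest-core would run all three steps. Keeping the URL list in files.txt makes it possible to inspect or filter the list before starting what may be a large download.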

What is missing?

Latest-core is our main development collection, where the newest additions are included first. Below is a list of data that is still missing. Please check back after a while.

Missing Documentation and Statistics

Current issues

  • sameAs links to other DBpedia Chapters, e.g. de.dbpedia.org (in progress)
  • rdfs:label, rdfs:comment and dbo:abstract are currently only available in English; previously this was English plus 19 languages, and it could be up to 140 languages (in progress)
  • ImageExtractor was malfunctioning and has been disabled, i.e. only images from infoboxes are extracted, without clean license information. (Will be fixed with https://databus.dbpedia.org/dbpedia/wikidata/images/)
  • sameAs links to external Linked Data sites are currently not updated (in progress; we are centralizing this with Global ID management)
  • sdtypes from Mannheim need to be checked
  • Umbel is in the store but not in the Databus collection; it is loaded from https://github.com/structureddynamics/UMBEL/blob/master/External%20Ontol...
  • Yago types are missing (in progress)

What the future holds

  • Fused data: We already created several tests for a fused dataset of dbo properties. This dataset enriches the English version with Wikidata and dbo properties from over 20 Wikipedia languages, resulting in a denser graph.
  • Community extensions such as caligraph.org or https://ner.vse.cz/datasets/linkedhypernyms/ can now be streamlined, contributed more easily via the Databus, and routed to the main endpoint and chapter knowledge graphs.
  • Links, Mappings, Ontologies: A special focus of DBpedia will be to take on the role of a custodian for links, mappings and ontologies on the Web of Data, making these easier to contribute and more centrally available.