DBpedia Live


DBpedia is considered the Semantic Web mirror of Wikipedia. Over time, Wikipedia articles are revised, which makes the data in DBpedia outdated.
The main objective of DBpedia-Live is to keep DBpedia continuously synchronized with Wikipedia.


1. Overview

The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They
are usually displayed in the top right corner of articles and contain factual information.
Apart from the infobox extraction, the framework currently has 19 extractors, which process the following types of Wikipedia content:

  • Labels.
  • Abstracts.
  • Interlanguage links.
  • Images.
  • Redirects.
  • Disambiguation.
  • External links.
  • Page links.
  • Homepages.
  • Geo-coordinates.
  • Person data.
  • PND.
  • SKOS categories.
  • Page ID.
  • Revision ID.
  • Category label.
  • Article categories.
  • Mappings.
  • Infobox.

2. DBpedia-Live System Architecture

http://live.dbpedia.org/DBpedia_Architecture_large.png
The main components of the DBpedia-Live system are as follows:

  • Local Wikipedia: We have installed a local Wikipedia that is kept in synchronization with Wikipedia. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) enables an application to get a continuous stream of updates from a wiki. OAI-PMH is also used to feed updates into the DBpedia-Live Extraction Manager.
  • Mapping Wiki: DBpedia mappings can be found at http://mappings.dbpedia.org, which is also a wiki. We can therefore use OAI-PMH to get a stream of updates to the DBpedia mappings as well. A change to a mapping typically affects several Wikipedia pages, which should then be reprocessed.
  • DBpedia-Live Extraction Manager: This component is the actual DBpedia-Live extraction framework. When a page needs to be processed, the framework applies the extractors to it. After processing a page, the newly extracted triples are inserted into the backend triple store (Virtuoso), overwriting the old triples. The newly extracted triples are also written to an N-Triples file and compressed. Other applications or DBpedia-Live mirrors that should always be in synchronization with our DBpedia-Live can download those files and feed them into their own triple stores. The extraction manager is discussed in more detail below.
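As a rough illustration of how an update stream can be requested over OAI-PMH, the sketch below builds a ListRecords request for everything changed since a given time. The verb and the metadataPrefix and from parameters come from the OAI-PMH specification; the endpoint URL is a hypothetical placeholder, and oai_dc is used only because every OAI-PMH repository must support it. This is not taken from the DBpedia-Live code.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical endpoint of the local wiki's OAI-PMH repository (assumption).
OAI_BASE_URL = "http://localhost/wiki/Special:OAIRepository"


def list_records_url(since, base_url=OAI_BASE_URL, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request for records changed since `since`."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # OAI-PMH expects UTC timestamps in ISO 8601 form.
        "from": since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url(datetime(2011, 8, 19, 14, 28, 43, tzinfo=timezone.utc)))
```

A harvester would fetch this URL repeatedly, following any resumptionToken the repository returns, and hand each changed page to the extraction manager.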

3. New Features

The old PHP-based framework is deployed on one of OpenLink's servers and currently has a SPARQL endpoint at http://dbpedia-live.openlinksw.com/sparql.
In addition to the migration to Java, the new DBpedia-Live framework has the following new features:

  1. Abstract extraction: The abstract of a Wikipedia article is the first few paragraphs of that article. The new framework can cleanly extract the abstract of an article.
  2. Mapping-affected pages: Upon a change in mapping, the pages affected by that mapping should be reprocessed and their triples should be updated to reflect that change.
  3. Updating unmodified pages: Sometimes a change in the system occurs, e.g. a change in the implementation of an extractor. This change can affect many pages even if they are not modified. In DBpedia Live, we use a low-priority queue for such changes, such that the updates will eventually appear in DBpedia Live, but recent Wikipedia updates are processed first.
  4. Publication of changesets: Upon modification, old triples are replaced with updated triples. The added and deleted triples are also written to N-Triples files and then compressed. Any client application or DBpedia-Live mirror can download those files and integrate them to update a local copy of DBpedia. This enables that application to always stay in synchronization with our DBpedia-Live.
  5. Development of synchronization tool: The synchronization tool enables a DBpedia-Live mirror to stay in synchronization with our live endpoint. It downloads the changeset files sequentially, decompresses them and integrates them into the mirror's triple store.
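The prioritization described in features 2 and 3 can be pictured as a single priority queue. The following is a minimal sketch, not the framework's actual implementation: the class name, the three priority levels, and the placement of mapping-affected pages between live updates and unmodified pages are all assumptions. The text above only states that updates to unmodified pages go to a low-priority queue.

```python
import heapq
import itertools

# Lower number = processed earlier. The exact levels are an assumption;
# the text only says that unmodified-page updates get low priority.
LIVE_UPDATE, MAPPING_CHANGE, UNMODIFIED_PAGE = 0, 1, 2


class PageQueue:
    """Priority queue of page titles waiting to be (re)processed."""

    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order within one level.
        self._counter = itertools.count()

    def push(self, page_title, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), page_title))

    def pop(self):
        _priority, _order, page_title = heapq.heappop(self._heap)
        return page_title

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = PageQueue()
    queue.push("Old_page", UNMODIFIED_PAGE)
    queue.push("Berlin", LIVE_UPDATE)
    queue.push("Mapped_page", MAPPING_CHANGE)
    while queue:
        print(queue.pop())
```

With this ordering, a page that was just edited on Wikipedia is always taken before a page that only needs re-extraction because an extractor changed.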

4. Important Pointers

5. FAQ

Q: Does DBpedia-Live automatically resume from the point where it stopped, or does it start from the current timestamp?
A: DBpedia-Live resumes from the point at which it last stopped.


Q: The live updates of DBpedia (changesets) have the structure year/month/day/hour/xxxx.nt.gz. What does it mean if there are gaps, e.g. if the folder for some hour is missing?
A: This means that the service was down at that time.
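A mirror can look for such gaps by comparing the hour folders it has seen against the hours it expected. The sketch below assumes zero-padded folder names such as 2011/08/19/14; the padding and the function names are assumptions, since the text only gives the year/month/day/hour layout.

```python
from datetime import datetime, timedelta


def hour_folder(ts):
    """Relative folder for the hour containing `ts`, e.g. '2011/08/19/14'."""
    return ts.strftime("%Y/%m/%d/%H")


def missing_hours(present_folders, start, end):
    """List the hour folders between `start` and `end` that are not present."""
    present = set(present_folders)
    current = start.replace(minute=0, second=0, microsecond=0)
    gaps = []
    while current <= end:
        folder = hour_folder(current)
        if folder not in present:
            gaps.append(folder)
        current += timedelta(hours=1)
    return gaps


if __name__ == "__main__":
    seen = ["2011/08/19/12", "2011/08/19/14"]
    print(missing_hours(seen, datetime(2011, 8, 19, 12), datetime(2011, 8, 19, 14, 30)))
```

Each folder reported this way corresponds to an hour in which the service published no changesets.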


Q: Can the speed of processing of DBpedia-Live cope with the speed of data-stream?
A: Yes. According to our statistics, 1.4 Wikipedia articles are modified per second, which amounts to 84 articles per minute.
DBpedia-Live can process about 105 pages per minute on average, so it keeps up with the update stream.
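The comparison can be spelled out with the two figures quoted above. The variable names are illustrative, and the headroom and utilisation values are simply derived from those two numbers.

```python
# Figures quoted in the answer above.
modifications_per_second = 1.4
processed_per_minute = 105

# 1.4 articles/second * 60 seconds = 84 articles/minute.
incoming_per_minute = round(modifications_per_second * 60)

# Spare capacity left for mapping changes and low-priority reprocessing.
headroom_per_minute = processed_per_minute - incoming_per_minute

# Share of the processing capacity used by the live update stream.
utilisation = incoming_per_minute / processed_per_minute

if __name__ == "__main__":
    print(incoming_per_minute, headroom_per_minute, utilisation)
```

On these numbers the live stream uses about 80% of the average processing capacity, leaving roughly 21 pages per minute for other work.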


Q: Does an article change in Wikipedia result in only 2 files per article (one for delete and one for added triples) or do you spread this over several files?
A: An article update results in two sets of triples: one for the added triples and one for the deleted triples.
In order not to have too many files in our updates folder available at http://live.dbpedia.org/liveupdates/, we combine the triples of several articles into one file.
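To make the added/deleted pairing concrete, here is a minimal sketch of how a client might apply one changeset to a local copy. It models the local store as a plain Python set of N-Triples lines, which is an assumption for illustration; a real mirror would issue the equivalent deletes and inserts against its triple store. The function names and the example triples are made up.

```python
import gzip


def parse_ntriples_gz(data):
    """Decode gzipped N-Triples bytes into a set of triple lines."""
    text = gzip.decompress(data).decode("utf-8")
    lines = (line.strip() for line in text.splitlines())
    # Skip blank lines and N-Triples comments.
    return {line for line in lines if line and not line.startswith("#")}


def apply_changeset(store, removed, added):
    """Apply one changeset to `store` in place: delete first, then add."""
    store.difference_update(removed)
    store.update(added)
    return store


if __name__ == "__main__":
    # Illustrative triples only; the values are invented for this example.
    old = '<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/populationTotal> "3400000" .'
    new = '<http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/populationTotal> "3500000" .'
    store = {old}
    removed = parse_ntriples_gz(gzip.compress((old + "\n").encode("utf-8")))
    added = parse_ntriples_gz(gzip.compress((new + "\n").encode("utf-8")))
    print(apply_changeset(store, removed, added))
```

Deleting before adding means a triple that appears in both sets of the same changeset ends up present in the store. Because one file may combine the triples of several articles, the same routine works unchanged regardless of how many articles a file covers.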


Q: Does DBpedia-Live also address the issue of a change in infobox mappings?
A: Yes.


Q: If I want to maintain a DBpedia-Live mirror, why do I need to download the latest DBpedia-Live dump from http://live.dbpedia.org/dumps/?
A: You don't have to download the latest dump, but doing so is faster: you start from a filled triple store, so you have to download fewer changeset files.


Q: Where can I find the synchronization tool for DBpedia-Live, i.e. the tool that synchronizes a DBpedia-Live mirror with ours?
A: You can download the DBpedia Integrator tool from https://sourceforge.net/projects/dbpintegrator/files/.


Q: If I find a bug in the extraction framework, how can I report that bug?
A: You can use the DBpedia bug tracker to post the bug.


 

Information

Last Modification: 2011-08-19 14:28:43 by Soeren Auer