DBpedia Live retrieves Wikipedia edits as they happen, extracts the structured information they contain, and loads it into an online SPARQL database for querying. The extraction framework can handle up to 1 million edits per day on an average server, while the Virtuoso database handles loading the extracted data for online querying.
DBpedia is Wikipedia content represented as a Semantic Web of Linked Data. The original DBpedia representation was generated from a static dump of Wikipedia content, in a process that took roughly 6 months from Wikipedia dump to DBpedia publication. To update DBpedia, new Wikipedia dumps have since been taken periodically (roughly every 6-12 months) and processed in the same way. DBpedia content has thus always been 6-18 months behind updates applied to Wikipedia content.
As the use of DBpedia, and the dynamism of Wikipedia content, have increased, the need for DBpedia to update continuously by processing the Wikipedia "firehose" changelog became apparent. DBpedia Live is the current fruit of that effort.
The core of DBpedia consists of an infobox extraction process. Infoboxes are templates contained in many Wikipedia articles. They are usually displayed in the top right corner of articles and contain semi-structured information.
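To make the infobox extraction idea concrete, here is a minimal, hypothetical sketch in Python. It parses the `| key = value` lines of a heavily simplified infobox and emits triple-like output; the article name, property prefix, and infobox text are illustrative assumptions, and the real extractors additionally handle nested templates, wiki links, units, and mapping-based typing.

```python
import re

def parse_infobox(wikitext):
    """Collect '| key = value' fields from a simplified infobox template."""
    fields = {}
    for line in wikitext.splitlines():
        m = re.match(r"\|\s*(\w+)\s*=\s*(.+)", line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields

# Hypothetical, simplified infobox wikitext for illustration only.
infobox = """{{Infobox settlement
| name = Leipzig
| population = 587857
}}"""

for prop, value in parse_infobox(infobox).items():
    # Emit one triple-like line per infobox field (prefix is an assumption).
    print(f'<http://dbpedia.org/resource/Leipzig> dbp:{prop} "{value}" .')
```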
Updates and Feedback
Note: parts of this page are outdated, detailed information can be found in this blog post.
- SPARQL Endpoint: The DBpedia-Live SPARQL Endpoint can be accessed at http://live.dbpedia.org/sparql.
- DBpedia-Live Statistics: Some simple extraction statistics are provided at http://live.dbpedia.org/live/.
- Updates: The N-Triples files containing the updates can be found at http://downloads.dbpedia.org/live/changesets.
- DBpedia-Live Sourcecode: https://github.com/dbpedia/extraction-framework/.
- Synchronization Tool: https://github.com/dbpedia/dbpedia-live-mirror/.
- Further Reading:
- DBpedia Live Extraction, 2009
- DBpedia and the Live Extraction of Structured Data from Wikipedia, 2012
- DBpedia Live Restart, 2019
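As a hedged example of using the SPARQL endpoint listed above, the sketch below builds an HTTP GET request URL for `http://live.dbpedia.org/sparql` using only the Python standard library. The example query and the choice of JSON result format are assumptions; the URL can then be fetched with `urllib.request.urlopen`.

```python
from urllib.parse import urlencode

ENDPOINT = "http://live.dbpedia.org/sparql"

def sparql_url(query, fmt="application/sparql-results+json"):
    """Build a GET request URL for the DBpedia-Live SPARQL endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

# Example query (the property and resource URIs are illustrative).
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

url = sparql_url(query)
# Fetch with e.g. urllib.request.urlopen(url) and parse the JSON body.
print(url)
```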
DBpedia Live System Architecture
The main components of the DBpedia-Live system are as follows:
- Local Wikipedia: We have installed a local copy of Wikipedia that is kept in synchronization with the main Wikipedia. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) enables an application to get a continuous stream of updates from a wiki. OAI-PMH is also used to feed updates into the DBpedia-Live Extraction Manager.
- MappingWiki: DBpedia mappings can be found at http://mappings.dbpedia.org, which is itself a wiki. We also use OAI-PMH to get a stream of updates to DBpedia mappings. A change to a mapping affects several Wikipedia pages, all of which must be reprocessed.
- DBpedia Live Extraction Manager: This component is the actual DBpedia-Live extraction framework. When there is a page that should be processed, the framework applies the extractors to it. After processing a page, the newly extracted RDF statements are inserted into the backend data store (the Quad Store functionality of the Virtuoso Universal Server), where they replace the old RDF statements. The newly extracted RDF is also written to a compressed N-Triples file. Mirrors of DBpedia-Live, as well as other applications that should always be in synchronization with our DBpedia-Live endpoint, can download those changeset files and feed them into their own RDF data stores. The extraction manager is discussed in more detail below.
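The replace-and-diff step performed by the extraction manager can be sketched as follows. This is a minimal in-memory model, not the Virtuoso-backed implementation: the store is a Python set of triples, pages are assumed (for simplicity) to be keyed by a single subject URI, and the returned added/deleted sets correspond to what would be written to the changeset files.

```python
def update_page(store, page_uri, new_triples):
    """Replace a page's triples in the store; return (added, deleted) sets.

    'store' is a set of (subject, predicate, object) tuples. Keying a page
    by one subject URI is a simplification for illustration.
    """
    old = {t for t in store if t[0] == page_uri}
    new = set(new_triples)
    added, deleted = new - old, old - new
    store -= deleted   # retract statements no longer produced by extraction
    store |= added     # assert the newly extracted statements
    return added, deleted  # these diffs would be published as changesets

# Usage: re-extracting page "A" replaces its old statements.
store = {("A", "p", "1"), ("B", "q", "3")}
added, deleted = update_page(store, "A", [("A", "p", "2")])
print(added, deleted)
```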
The new Java-based live-extraction framework is deployed on a server hosted by OpenLink Software. It has a SPARQL endpoint, also operated by OpenLink Software, at http://live.dbpedia.org/sparql, and its status can be viewed at http://live.dbpedia.org/live/. (To maintain functionality for applications developed with previous URIs, these remain accessible via http://dbpedia-live.openlinksw.com/sparql and http://dbpedia-live.openlinksw.com/live/, respectively, but the dbpedia.org-based URIs should be preferred going forward.)
In addition to the migration to Java, the new DBpedia Live framework has the following features:
- Abstract extraction: The abstract of a Wikipedia article contains the first few paragraphs of that article. The new framework has the ability to cleanly extract the abstract of an article.
- Mapping-affected pages: Upon a change in mapping, the pages affected by that mapping should be reprocessed and their RDF descriptions should be updated to reflect that change.
- Updating unmodified pages: Sometimes a change in the system occurs, e.g. a change in the implementation of an extractor. This change can affect many pages even if they are not modified. In DBpedia-Live, we use a low-priority queue for such changes, such that the updates will eventually appear in DBpedia-Live, but recent Wikipedia updates are processed first.
- Publication of changesets: Upon modification, old RDF statements are replaced with updated statements. The added and/or deleted statements are also written to N-Triples files and then compressed. Any client application or DBpedia-Live mirror can download the files and integrate (and, hence, update) a local copy of DBpedia. This enables that application to stay in synchronization with our version of DBpedia-Live.
- Development of synchronization tool: The synchronization tool enables a DBpedia-Live mirror to stay in synchronization with our live endpoint. It downloads the changeset files sequentially, decompresses them, and integrates them with another DBpedia-Live mirror.
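A mirror's synchronization loop can be sketched as below, under stated assumptions: the `<seq>.added.nt.gz` / `<seq>.removed.nt.gz` file naming is inferred from the published `year/month/day/hour/xxxx.nt.gz` layout and may differ in detail, and the local "store" is modeled as a set of N-Triples lines rather than a real triple store. The actual tool is dbpedia-live-mirror (linked above).

```python
BASE = "http://downloads.dbpedia.org/live/changesets/"

def changeset_urls(year, month, day, hour, seq):
    """Build the (assumed) URLs of one changeset's added/removed files."""
    prefix = f"{BASE}{year}/{month:02d}/{day:02d}/{hour:02d}/{seq:04d}"
    return prefix + ".added.nt.gz", prefix + ".removed.nt.gz"

def apply_changeset(store, added_lines, removed_lines):
    """Apply one decompressed changeset to a local set of N-Triples lines."""
    store.difference_update(removed_lines)  # retract deleted statements first
    store.update(added_lines)               # then assert the added statements

# Each file is gzip-compressed N-Triples; after downloading, decompress it
# (e.g. with gzip.open) and pass the lines to apply_changeset in sequence.
print(changeset_urls(2021, 3, 5, 7, 42))
```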
In addition to the infobox extraction process, the framework currently has 19 extractors, which process the following types of Wikipedia content:
- Interlanguage links
- External links
- Page links
- Person data
- SKOS categories
- Page ID
- Revision ID
- Category label
- Article categories
Q: Does DBpedia-Live automatically resume from the point where it has stopped, or start from the current timestamp?
A: DBpedia-Live will resume at the last point at which it stopped.
Q: The live-updates of DBpedia (changesets) have the structure year/month/day/hour/xxxx.nt.gz. What does it mean if there are some gaps in between, e.g., a folder for some hour is missing?
A: This means that the service was down at that time.
Q: Are the DBpedia-Live services available via IPv6?
Q: Can the speed of processing of DBpedia-Live cope with the speed of data-stream?
A: According to our statistics, roughly 1.4 Wikipedia articles are modified per second, which works out to about 84 articles per minute. With current resources, DBpedia-Live can process about 105 pages per minute, on average.
Q: Does an article change in Wikipedia result in only 2 files per article (one for delete and one for added triples) or do you spread this over several files?
A: An article update results in two sets of RDF statements: one for the added statements, and one for the deleted statements. To lower the number of files in our updates folder, we combine the statements about several articles into one file.
Q: Does DBpedia-Live also address the issue of a change in infobox mappings?
A: Yes. As described under "Mapping-affected pages" above, when a mapping changes, all Wikipedia pages affected by that mapping are reprocessed and their RDF descriptions are updated to reflect the change.
Q: If I want to maintain a DBpedia-Live mirror, why do I need to download the latest DBpedia-Live dump from http://live.dbpedia.org/dumps/?
A: The DBpedia-Live dumps are not currently available. They may be restored in future. You may start from any DBpedia-Live dump, but the more recent the dump, the fewer changeset files will need to be downloaded and applied to the triple store at initial launch. Thus, the fastest strategy is to start from the latest dump.
Q: Where can I find the synchronization tool for DBpedia-Live, i.e., the tool that synchronizes a DBpedia-Live mirror with ours?
A: You can download the DBpedia Integrator tool from https://github.com/dbpedia/dbpedia-live-mirror.
Q: If I find a bug in the extraction framework, how can I report that bug?
A: You can use the DBpedia bug tracker on GitHub to report the bug.