The DBpedia Dataset
The DBpedia dataset is a large multi-domain ontology which has been derived from Wikipedia. The DBpedia dataset currently describes 2.18 million things with 218 million facts (February 2008).
1. Background
Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors.
Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates and links to external Web pages.
For instance, the figure below shows the source code and the visualisation of an infobox template containing structured information about the town of Innsbruck.

This structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content.
The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.
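To make the RDF data model more concrete, the following minimal Python sketch writes a single statement (subject, predicate, object) in N-Triples syntax. The function names are hypothetical, and the use of rdfs:label for the label of the Innsbruck resource is an assumption made for illustration, not something stated above.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"  # assumed label property


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(subject: str, predicate: str, obj: str,
                is_literal: bool = False, lang: str = "") -> str:
    """Serialise one RDF statement as a single N-Triples line."""
    if is_literal:
        obj_term = f'"{escape_literal(obj)}"' + (f"@{lang}" if lang else "")
    else:
        # Non-literal objects are written as URI references.
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


line = to_ntriples(
    "http://dbpedia.org/resource/Innsbruck",
    RDFS_LABEL,
    "Innsbruck",
    is_literal=True,
    lang="en",
)
print(line)
```

Each extracted fact becomes one such statement, which is why the size of the dataset is given as a count of triples.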
2. Content of the DBpedia Dataset
The DBpedia dataset currently consists of around 218 million RDF triples, which have been extracted from the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian versions of Wikipedia.
The DBpedia dataset describes 2,180,000 things, including at least 80,000 persons, 293,000 places, 62,000 music albums and 36,000 films. It contains 489,000 links to images, 2,700,000 links to relevant external web pages, 2,101,000 external links into other RDF datasets, 207,000 Wikipedia categories and 75,000 YAGO categories.
The table below contains links to some example things from the dataset:
| Class | Examples |
| City | |
| Country | |
| Politician | |
| Musician | |
| Music album | |
| Director | |
| Film | |
| Book | |
| Computer Game | |
| Technical Standard | |
You can also use Richard Cyganiak's PHP script to view random things from the DBpedia dataset.
3. Identifying things
Each of the 2.18 million resources described in the DBpedia dataset is identified by a URI reference of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly to an English-language Wikipedia article.
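The naming rule can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that Name is copied unchanged from the article URL (underscores and percent-escapes included), which the text above implies but does not spell out.

```python
from urllib.parse import urlsplit

WIKI_PATH_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_BASE = "http://dbpedia.org/resource/"


def wikipedia_url_to_dbpedia_uri(article_url: str) -> str:
    """Map an English Wikipedia article URL to a DBpedia resource URI."""
    parts = urlsplit(article_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(WIKI_PATH_PREFIX):
        raise ValueError(f"not an English Wikipedia article URL: {article_url}")
    # Assumption: the name is reused exactly as it appears in the URL.
    name = parts.path[len(WIKI_PATH_PREFIX):]
    if not name:
        raise ValueError("article URL has no name component")
    return DBPEDIA_RESOURCE_BASE + name


print(wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Innsbruck"))
# prints http://dbpedia.org/resource/Innsbruck
```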
4. Describing things
Each DBpedia resource is described by various properties. Below, we give an overview of the most important types of properties.
4.1. Basic Information
Every DBpedia resource is described by a label, a short and a long English abstract, a link to the corresponding Wikipedia page and a link to an image depicting the thing (if available).
If a thing exists in multiple language versions of Wikipedia, then short and long abstracts in these languages and links to the different language Wikipedia pages are added to the description. The DBpedia dataset contains the following number of abstracts per language:
| Language | Number of Abstracts |
| English | 2,180,000 |
| German | 329,000 |
| French | 289,000 |
| Dutch | 220,000 |
| Polish | 151,000 |
| Italian | 188,000 |
| Spanish | 169,000 |
| Japanese | 161,000 |
| Portuguese | 176,000 |
| Swedish | 133,000 |
| Chinese | 82,000 |
4.2. Classifications
DBpedia provides three different classification schemata for things.
- Wikipedia Categories, which are represented using the SKOS vocabulary.
- The YAGO Classification, which is derived from the Wikipedia category system using WordNet. Please refer to Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia for more details.
- WordNet Synset Links: These links were generated by manually relating Wikipedia infobox templates to WordNet synsets and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.
Using the classifications within SPARQL queries allows you to select things of a certain type.
Wikipedia Categories:
- NBA Teams (Does not work with Internet Explorer)
- Car manufacturers
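A category-based selection of this kind might be built as in the Python sketch below. It rests on assumptions that are not confirmed by this page: that things are attached to categories with the skos:subject property, and that category URIs have the form http://dbpedia.org/resource/Category:Name. The function name is hypothetical.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
CATEGORY_BASE = "http://dbpedia.org/resource/Category:"  # assumed URI pattern


def things_in_category_query(category_name: str, limit: int = 20) -> str:
    """Build a SPARQL query listing things filed under one Wikipedia category."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if any(ch in category_name for ch in '<>"{}'):
        raise ValueError("category name contains characters not allowed in a URI")
    category_uri = CATEGORY_BASE + category_name.replace(" ", "_")
    return (
        f"PREFIX skos: <{SKOS}>\n"
        "SELECT ?thing WHERE {\n"
        # Assumption: skos:subject links a thing to its category.
        f"  ?thing skos:subject <{category_uri}> .\n"
        "}\n"
        f"LIMIT {limit}"
    )


print(things_in_category_query("Car manufacturers", limit=5))
```

The resulting string can be sent to any SPARQL endpoint that serves the dataset. The sketch only builds the query and makes no network request.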
YAGO Classes:
WordNet:
4.3. Infobox Data
The DBpedia dataset contains 22.8 million pieces of information that have been extracted from infoboxes within the English version of Wikipedia. The types of the infobox properties depend on the type of the infobox and there are approximately 8000 different property types. Many infobox property values are typed using XML datatypes.
The infobox data enables sophisticated, fine-grained queries over the dataset. Some example queries are shown below:
- Abstract of movies starring Tom Cruise, released before 1999
- The official website of companies with more than 50,000 employees
- Cities with more than 2 million inhabitants
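Because many infobox values carry a datatype, a condition such as "more than 2 million" can be evaluated numerically rather than as a string comparison. The Python sketch below parses typed literals written in N-Triples style and filters on them. It assumes the datatypes are XML Schema types, and the city names and population figures are made-up placeholders, not values from the dataset.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
# Matches an N-Triples style typed literal: "lexical"^^<datatype-uri>
TYPED_LITERAL = re.compile(r'^"(?P<lexical>[^"]*)"\^\^<(?P<datatype>[^>]+)>$')

CONVERTERS = {
    XSD + "integer": int,
    XSD + "double": float,
    XSD + "float": float,
}


def parse_typed_literal(literal: str):
    """Convert a typed literal to a Python value; unknown types stay strings."""
    match = TYPED_LITERAL.match(literal)
    if match is None:
        raise ValueError(f"not a typed literal: {literal!r}")
    convert = CONVERTERS.get(match.group("datatype"), str)
    return convert(match.group("lexical"))


# Made-up placeholder rows, not values taken from the dataset.
rows = [
    ("ExampleCityA", f'"3000000"^^<{XSD}integer>'),
    ("ExampleCityB", f'"150000"^^<{XSD}integer>'),
]
large = [name for name, population in rows
         if parse_typed_literal(population) > 2_000_000]
print(large)  # prints ['ExampleCityA']
```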
4.4. External Links
The DBpedia dataset contains HTML links to external webpages as well as RDF links into external data sources.
There are two types of links to HTML pages: dbpedia:reference links point to web pages about a thing. In addition, some things have foaf:homepage links that point to the web page that can be considered the official homepage of the thing.
RDF links are represented using the owl:sameAs property. Please refer to Interlinking for more information about RDF links and the interlinked datasets.
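As a rough illustration of how owl:sameAs links can be consumed, the Python sketch below collects every URI that is linked to a resource by owl:sameAs in either direction, since the property is symmetric. The function name is hypothetical and the example.org URIs are placeholders, not actual links from the dataset.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_HOMEPAGE = "http://xmlns.com/foaf/0.1/homepage"


def same_as_links(triples, resource):
    """Return every URI linked to `resource` by owl:sameAs, in either direction."""
    linked = set()
    for subject, predicate, obj in triples:
        if predicate != OWL_SAME_AS:
            continue  # other link types, e.g. foaf:homepage, are ignored
        if subject == resource:
            linked.add(obj)
        elif obj == resource:
            linked.add(subject)
    return sorted(linked)


# Hypothetical triples: the example.org URIs are placeholders.
INNSBRUCK = "http://dbpedia.org/resource/Innsbruck"
triples = [
    (INNSBRUCK, OWL_SAME_AS, "http://example.org/geo/innsbruck"),
    ("http://example.org/stats/innsbruck", OWL_SAME_AS, INNSBRUCK),
    (INNSBRUCK, FOAF_HOMEPAGE, "http://example.org/home"),
]
print(same_as_links(triples, INNSBRUCK))
# prints ['http://example.org/geo/innsbruck', 'http://example.org/stats/innsbruck']
```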
FOAF Homepage:
owl:sameAs Links:
- Geographical (to geonames.org, eurostat data and the RDF version of the CIA Factbook, both served at the FU Berlin):
- Authors / Books (to quotationsbook.com and Project Gutenberg RDF, served at the FU Berlin. Links to the RDF Book Mashup will follow soon):
- U.S. Census Statistical Data (rdfabout.com, RDF version by Joshua Tauberer):
4.5. Geo-Coordinates
The DBpedia dataset contains geo-coordinates for 293,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary.
Besides simple listings of geo-coordinates (e.g. German soccer stadiums), the new geo-coordinates allow sophisticated queries, like show me all things next to the:
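A "things next to" query of this kind can be approximated on the client side once latitude and longitude values (geo:lat and geo:long in the W3C Basic Geo Vocabulary) have been retrieved. The Python sketch below uses the haversine great-circle distance. This is one possible technique, not necessarily the one behind the example queries, and the point names and coordinates are placeholders.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp against rounding so asin never sees a value above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def things_near(points, lat, lon, radius_km):
    """Return the names of all points within radius_km of (lat, lon)."""
    return [name for name, (plat, plon) in points.items()
            if haversine_km(lat, lon, plat, plon) <= radius_km]


# Placeholder points as (latitude, longitude) pairs, not dataset values.
points = {
    "PointA": (0.0, 0.0),
    "PointB": (0.0, 1.0),
    "PointC": (10.0, 10.0),
}
print(things_near(points, 0.0, 0.0, 150.0))  # prints ['PointA', 'PointB']
```

For large result sets, a bounding-box filter in the query itself would reduce the number of points that need an exact distance check.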
5. License
The DBpedia dataset is licensed under the terms of the GNU Free Documentation License.
This material is Open Knowledge.
Last Modification: 2008-02-27 16:06:18 by Brendan Wyse
