The DBpedia Data Set
The DBpedia data set is a large multi-domain ontology which has been derived from Wikipedia. The DBpedia data set currently describes 2.9 million things with 479 million facts (November 2009).
1. Background
Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates, and links to external Web pages. For instance, the figure below shows the source code and the visualisation of an infobox template containing structured information about the town of Innsbruck.

The DBpedia project extracts various kinds of structured information from Wikipedia editions in 14 languages and combines this information into a huge, cross-domain knowledge base.
DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.
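As a rough illustration of the RDF data model, the following Python sketch represents one statement as a subject, predicate and object and serialises it in the N-Triples line format. The `Triple` type and the helper names are assumptions made for this example, and the label statement is illustrative rather than copied from the data set.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement. Subject and predicate are URIs; the object is a URI or a literal."""
    subject: str
    predicate: str
    obj: str
    obj_is_uri: bool = True
    lang: str = ""  # optional language tag, only used for literal objects


def escape_literal(text: str) -> str:
    # Escape backslashes first, then quotes and line breaks.
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def to_ntriples(t: Triple) -> str:
    """Serialise a triple as one N-Triples line (no escaping of non-ASCII characters)."""
    if t.obj_is_uri:
        obj = f"<{t.obj}>"
    else:
        obj = f'"{escape_literal(t.obj)}"'
        if t.lang:
            obj += f"@{t.lang}"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Illustrative statement: an English label for the Innsbruck resource.
label = Triple(
    "http://dbpedia.org/resource/Innsbruck",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "Innsbruck",
    obj_is_uri=False,
    lang="en",
)
print(to_ntriples(label))
```

For real processing, one of the toolkits mentioned above is a better choice than hand-written serialisation.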
2. Content of the DBpedia Data Set
The DBpedia data set currently consists of around 479 million RDF triples, which have been extracted from the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian versions of Wikipedia.
The DBpedia knowledge base currently describes more than 2.9 million things, including at least 282,000 persons, 339,000 places (including 241,000 populated places), 88,000 music albums, 44,000 films, 15,000 video games, 119,000 organizations (including 20,000 companies and 29,000 educational institutions), 130,000 species and 4,400 diseases. The DBpedia knowledge base features labels and abstracts for these things in 91 different languages; 807,000 links to images and 3,840,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories. The knowledge base consists of 479 million pieces of information (RDF triples) out of which 190 million were extracted from the English edition of Wikipedia and 289 million were extracted from other language editions.
The table below contains links to some example things from the data set:
| Class | Examples |
| City | |
| Country | |
| Politician | |
| Musician | |
| Music album | |
| Director | |
| Film | |
| Book | |
| Computer Game | |
| Technical Standard | |
You can also use Richard Cyganiak's PHP script to view random things from the DBpedia data set.
3. Identifying things
Each thing in the DBpedia data set is identified by a URI reference of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly to an English-language Wikipedia article.
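The mapping between the two URI forms can be sketched in a few lines of Python. The function names are assumptions for this example, and the sketch only handles the plain prefix swap described above (it does not deal with redirects or character-encoding differences).

```python
WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def wikipedia_to_dbpedia(url: str) -> str:
    """Turn an English Wikipedia article URL into the matching DBpedia resource URI."""
    if not url.startswith(WIKIPEDIA_PREFIX):
        raise ValueError(f"not an English Wikipedia article URL: {url}")
    # Keep the article name, dropping any #fragment that points into the page.
    name = url[len(WIKIPEDIA_PREFIX):].split("#", 1)[0]
    if not name:
        raise ValueError("URL has no article name")
    return DBPEDIA_PREFIX + name


def dbpedia_to_wikipedia(uri: str) -> str:
    """Inverse direction: find the source article for a DBpedia resource URI."""
    if not uri.startswith(DBPEDIA_PREFIX):
        raise ValueError(f"not a DBpedia resource URI: {uri}")
    return WIKIPEDIA_PREFIX + uri[len(DBPEDIA_PREFIX):]


print(wikipedia_to_dbpedia("http://en.wikipedia.org/wiki/Innsbruck"))
```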
4. Describing things
Each DBpedia resource is described by various properties. Below, we give an overview of the most important types of properties.
4.1. Basic Information
Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).
If a thing exists in multiple language versions of Wikipedia, then short and long abstracts in these languages and links to the corresponding Wikipedia pages in each language are added to the description. The DBpedia data set contains the following numbers of abstracts per language (November 2009):
| Language | Number of Abstracts |
| English | 2,943,000 |
| German | 460,000 |
| French | 495,000 |
| Dutch | 363,000 |
| Polish | 379,000 |
| Italian | 348,000 |
| Spanish | 295,000 |
| Japanese | 199,000 |
| Portuguese | 329,000 |
| Swedish | 195,000 |
| Chinese | 141,000 |
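To fetch the label and abstract of a thing in one particular language, a SPARQL query can filter on the language tag of the literals. The Python sketch below only builds the query text. The abstract property URI used as the default is an assumption (it is not given on this page and may differ between releases), and the function name is hypothetical.

```python
import re

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Assumption: the abstract property URI is not stated on this page; adjust as needed.
ASSUMED_ABSTRACT_PROPERTY = "http://dbpedia.org/ontology/abstract"


def abstract_query(resource_uri: str, lang: str,
                   abstract_property: str = ASSUMED_ABSTRACT_PROPERTY) -> str:
    """Build a SPARQL query for the label and abstract of one resource in one language."""
    # Accept only simple two- or three-letter language codes such as "en" or "de".
    if not re.fullmatch(r"[A-Za-z]{2,3}", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return (
        "SELECT ?label ?abstract WHERE {\n"
        f"  <{resource_uri}> <{RDFS_LABEL}> ?label .\n"
        f"  <{resource_uri}> <{abstract_property}> ?abstract .\n"
        f'  FILTER (lang(?label) = "{lang}" && lang(?abstract) = "{lang}")\n'
        "}"
    )


print(abstract_query("http://dbpedia.org/resource/Innsbruck", "de"))
```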
4.2. Classifications
DBpedia provides three different classification schemata for things.
- Wikipedia Categories are represented using the SKOS vocabulary.
- The YAGO Classification is derived from the Wikipedia category system using WordNet. Please refer to Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia for more details.
- WordNet Synset Links were generated by manually relating Wikipedia infobox templates and WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.
Using these classifications within SPARQL queries allows you to select things of a certain type.
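A minimal sketch of such type-based selection, written as Python functions that build SPARQL query text. The class and category URIs in the usage lines are placeholders, and the choice of `skos:subject` as the property linking a thing to its Wikipedia category is an assumption that should be checked against the release you are querying.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumption: category membership is expressed with skos:subject.
ASSUMED_CATEGORY_PROPERTY = "http://www.w3.org/2004/02/skos/core#subject"


def things_of_type(class_uri: str, limit: int = 10) -> str:
    """Select things that are instances of the given class (for example a YAGO class)."""
    return (
        "SELECT ?thing WHERE {\n"
        f"  ?thing <{RDF_TYPE}> <{class_uri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def things_in_category(category_uri: str, limit: int = 10,
                       membership_property: str = ASSUMED_CATEGORY_PROPERTY) -> str:
    """Select things that are filed under the given Wikipedia category."""
    return (
        "SELECT ?thing WHERE {\n"
        f"  ?thing <{membership_property}> <{category_uri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Placeholder URIs, for illustration only.
print(things_of_type("http://example.org/class/ExampleClass"))
print(things_in_category("http://example.org/category/ExampleCategory", limit=5))
```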
4.2.1. Wikipedia Categories
- NBA Teams (Does not work with Internet Explorer)
- Car manufacturers
4.2.2. YAGO Classes
4.2.3. WordNet
4.3. Infobox Data
Wikipedia infoboxes contain very specific information about things and are thus a very valuable source of structured information that can be used to ask expressive queries against Wikipedia. The DBpedia project currently extracts three different datasets from the Wikipedia infoboxes.
- The Infobox Dataset is created using our initial, now two-year-old infobox parsing approach. This extractor extracts all properties from all infoboxes and templates within all Wikipedia articles. Extracted information is represented using properties in the http://dbpedia.org/property/ namespace. The names of these properties directly reflect the names of the Wikipedia infobox properties. Property names are not cleaned or merged. Property types are not part of a subsumption hierarchy and there is no consistent ontology for the infobox dataset. Currently, there are approximately 8000 different property types. The infobox extractor performs only a minimal amount of property value clean-up, e.g. by converting a value like June 2009 to the XML Schema format 2009-06. You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
- The Infobox Ontology. With the DBpedia 3.2 release, we introduced a new infobox extraction method which is based on hand-generated mappings of Wikipedia infoboxes/templates to a newly created DBpedia ontology. The ontology consists of 205 classes which form a subsumption hierarchy and have altogether 1200 properties. The mappings adjust for weaknesses in the Wikipedia infobox system, such as having different infoboxes for the same class or using different property names for the same property. Therefore, the instance data within the infobox ontology is much cleaner and better structured than the Infobox Dataset, but currently doesn't cover all infobox types and infobox properties within Wikipedia. Starting with DBpedia release 3.4, we provide two different Infobox Ontology data sets:
- The Loose Infobox Ontology uses ontology properties (e.g. 'volume') that may be applied to different things (e.g. the volume of a lake and the volume of a planet). This restricts the number of different properties to a minimum, but has the drawback that it is not possible to automatically infer the class of an entity based on a property. For instance, an application that discovers an entity described using the volume property cannot infer that the entity is a lake and then, for example, use a map to visualize the entity. Loose Infobox data is represented using properties following the http://dbpedia.org/ontology/{propertyname} naming schema. Property values directly use the units of measurement given in the Wikipedia article. Different units may be used for the same property (e.g. sometimes cubic metres, sometimes cubic inches). When the property value cannot be parsed, the original string value of the Wikipedia template property is used. You should therefore use the Loose Infobox Ontology if your application requires a minimal number of different properties, you don't need class reasoning and you are prepared to accept unnormalized units of measurement.
- The Strict Infobox Ontology uses different ontology properties to represent Wikipedia properties with the same name that are used to describe different things (e.g. 'Lake/volume' and 'Planet/volume'). Strict Infobox properties follow the http://dbpedia.org/ontology/{Class}/{property} naming schema. The properties have a single class as rdfs:domain and rdfs:range and can therefore be used for classification reasoning. The units of measurement are normalized, meaning that different units used in the Wikipedia templates are converted to the target units used in the ontology (for instance kilometer, meter and centimeter are all converted into meter). This makes it easier to express queries against the data, e.g. finding all lakes whose volume is in a certain range.
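The two kinds of value handling mentioned in the list above, date clean-up and unit normalization, can be sketched as follows. This is not the DBpedia extractor's code; the function names, the month table and the conversion table are assumptions for illustration, covering only the examples named in the text.

```python
from decimal import Decimal

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

# Conversion factors to the target unit (meter) for the units named in the text.
METERS_PER_UNIT = {
    "kilometer": Decimal("1000"),
    "meter": Decimal("1"),
    "centimeter": Decimal("0.01"),
}


def month_year_to_gyearmonth(text: str) -> str:
    """Convert a value like 'June 2009' to the XML Schema year-month form '2009-06'."""
    parts = text.split()
    if len(parts) != 2 or parts[0].lower() not in MONTHS or not parts[1].isdigit():
        raise ValueError(f"cannot parse month and year from {text!r}")
    month = MONTHS.index(parts[0].lower()) + 1
    return f"{int(parts[1]):04d}-{month:02d}"


def to_meters(value: str, unit: str) -> Decimal:
    """Normalize a length given in kilometer, meter or centimeter to meter."""
    factor = METERS_PER_UNIT.get(unit.lower())
    if factor is None:
        raise ValueError(f"unsupported unit: {unit}")
    return Decimal(value) * factor


print(month_year_to_gyearmonth("June 2009"))
print(to_meters("1.5", "kilometer"), to_meters("250", "centimeter"))
```

Once all values share one unit, a range condition becomes a plain numeric comparison, which is what makes range queries over the strict data straightforward.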
All three data sets are available for download as well as being available for queries via the DBpedia SPARQL endpoint.
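The three property naming schemas can be summarised in a short Python sketch. The function names are hypothetical; only the URI patterns and the 'volume' example come from the descriptions above.

```python
INFOBOX_NAMESPACE = "http://dbpedia.org/property/"
ONTOLOGY_NAMESPACE = "http://dbpedia.org/ontology/"


def infobox_property(name: str) -> str:
    """Infobox Dataset: raw infobox property name, not cleaned or merged."""
    return INFOBOX_NAMESPACE + name


def loose_property(name: str) -> str:
    """Loose Infobox Ontology: one property shared by all classes that use it."""
    return ONTOLOGY_NAMESPACE + name


def strict_property(class_name: str, name: str) -> str:
    """Strict Infobox Ontology: property scoped to a single class."""
    return f"{ONTOLOGY_NAMESPACE}{class_name}/{name}"


print(infobox_property("volume"))
print(loose_property("volume"))
print(strict_property("Lake", "volume"))
print(strict_property("Planet", "volume"))
```

Because the strict form carries the class in the URI, an application that sees the Lake-scoped volume property can conclude that the entity is a lake, which the loose form does not allow.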
The infobox data enables sophisticated, fine-grained queries over the data set. Some example queries are shown below:
4.3.1. Querying the Infobox Dataset
- Abstracts of movies starring Tom Cruise, released before 1999
- The official websites of companies with more than 50000 employees
- Cities with more than 2 million inhabitants
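As a hedged sketch of how a query of this kind might be sent to a SPARQL endpoint over HTTP, the Python code below builds the request URL and defines (but does not call) a function that would run it. The endpoint address http://dbpedia.org/sparql and the property name `population` are assumptions not stated on this page; infobox property names are uncleaned, so the real name has to be looked up in the data.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: public endpoint address; not given in the text above.
ASSUMED_ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query in the spirit of "more than 2 million inhabitants".
# The property name "population" is a guess, and the query does not restrict
# the results to cities.
POPULATION_QUERY = """\
PREFIX p: <http://dbpedia.org/property/>
SELECT ?thing ?population WHERE {
  ?thing p:population ?population .
  FILTER (?population > 2000000)
}
"""


def build_request_url(query: str, endpoint: str = ASSUMED_ENDPOINT) -> str:
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(query: str, endpoint: str = ASSUMED_ENDPOINT) -> dict:
    """Send the query and decode a JSON result set (requires network access)."""
    request = Request(
        build_request_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


print(build_request_url(POPULATION_QUERY))
```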
4.3.2. Querying the Infobox Ontology
List all episodes of the HBO television series The Sopranos ordered by their air-date:
Software developed by an organisation founded in California:
4.4. External Links
The DBpedia data set contains HTML links to external web pages as well as RDF links into external data sources.
There are two types of links to HTML pages: dbpedia:reference links point to several web pages about a thing. In addition, some things also have foaf:homepage links that point to web pages that can be considered the official homepage of a thing.
RDF links are represented using the owl:sameAs property. Please refer to Interlinking for more information about RDF links and the interlinked data sets.
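When working with a downloaded dump, the owl:sameAs links of each resource can be collected with a simple line scan. The sketch below assumes the data is in N-Triples form with one statement per line; the sample lines use example.org placeholders and are not real DBpedia links.

```python
import re
from typing import Dict, Iterable, List

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Matches statements whose subject, predicate and object are all URIs.
URI_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$")


def same_as_links(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each subject URI to the URIs it is linked to via owl:sameAs."""
    links: Dict[str, List[str]] = {}
    for line in lines:
        match = URI_TRIPLE.match(line.strip())
        if match and match.group(2) == OWL_SAME_AS:
            links.setdefault(match.group(1), []).append(match.group(3))
    return links


# Placeholder data for illustration only.
sample = [
    "<http://dbpedia.org/resource/Innsbruck> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/places/innsbruck> .",
    "<http://dbpedia.org/resource/Innsbruck> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/home> .",
    "# comment lines and literal-valued statements are skipped",
]
print(same_as_links(sample))
```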
4.4.1. FOAF Homepage
4.4.2. owl:sameAs Links
- Geographical (to geonames.org, eurostat data and the RDF version of the CIA Factbook, both served at the FU Berlin):
- Authors / Books (to quotationsbook.com and Project Gutenberg RDF, served at the FU Berlin. Links to the RDF Book Mashup will follow soon):
- Computer Scientist publications (to DBLP, served at the FU Berlin):
- U.S. Census Statistical Data (to rdfabout.com, RDF version by Joshua Tauberer):
4.5. Geo-Coordinates
The DBpedia data set contains geo-coordinates for 392,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary.
Besides simple listings of geo-coordinates (e.g., German soccer stadiums), the new geo-coordinates allow sophisticated queries, like show me all things next to the:
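One simple way to approximate a proximity query is a bounding-box filter on the latitude and longitude properties of the W3C Basic Geo Vocabulary. The Python sketch below builds such a query; the function name, the box size in degrees and the centre point are arbitrary assumptions, and a box in degrees is only a rough stand-in for true distance.

```python
GEO_NS = "http://www.w3.org/2003/01/geo/wgs84_pos#"


def nearby_query(lat: float, lon: float, delta_deg: float, limit: int = 20) -> str:
    """Select things whose coordinates fall inside a box around (lat, lon)."""
    if delta_deg <= 0:
        raise ValueError("delta_deg must be positive")
    return (
        f"PREFIX geo: <{GEO_NS}>\n"
        "SELECT ?thing ?lat ?long WHERE {\n"
        "  ?thing geo:lat ?lat ;\n"
        "         geo:long ?long .\n"
        f"  FILTER (?lat > {lat - delta_deg:.6f} && ?lat < {lat + delta_deg:.6f} &&\n"
        f"          ?long > {lon - delta_deg:.6f} && ?long < {lon + delta_deg:.6f})\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Arbitrary illustrative centre point, half a degree in each direction.
print(nearby_query(47.0, 11.0, 0.5))
```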
5. License
DBpedia is derived from Wikipedia and is distributed under the same licensing terms as Wikipedia itself. As Wikipedia has moved to dual-licensing, we also dual-license DBpedia starting with release 3.4.
DBpedia 3.4 data is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License. All DBpedia releases up to release 3.3 are licensed under the terms of the GNU Free Documentation License only.
This material is Open Knowledge.
Last Modification: 2009-11-20 13:59:18 by Anja Jentzsch
