The DBpedia Data Set

The DBpedia data set is a large multi-domain ontology which has been derived from Wikipedia. The DBpedia data set currently describes 2.6 million “things” with 274 million “facts” (November 2008).


1. Background

Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates, and links to external Web pages. For instance, the figure below shows the source code and the visualisation of an infobox template containing structured information about the town of Innsbruck.



The DBpedia project extracts various kinds of structured information from Wikipedia editions in 14 languages and combines this information into a huge, cross-domain knowledge base.


DBpedia uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data. Please refer to the Developers Guide to Semantic Web Toolkits to find a development toolkit in your preferred programming language to process DBpedia data.

2. Content of the DBpedia Data Set

The DBpedia data set currently consists of around 274 million RDF triples, which have been extracted from the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, Chinese, Russian, Finnish and Norwegian versions of Wikipedia.


The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. The knowledge base consists of 274 million pieces of information (RDF triples). It features labels and short abstracts for these things in 14 different languages; 609,000 links to images and 3,150,000 links to external web pages; 4,878,100 external links into other RDF datasets; 415,000 Wikipedia categories; and 75,000 YAGO categories.


The table below contains links to some example “things” from the data set:


Class               Examples
City                Cambridge, Berlin, Manchester
Country             Spain, Iceland, South Korea
Politician          George W. Bush, Nicolas Sarkozy, Angela Merkel
Musician            AC/DC, Diana Ross, Röyksopp
Music album         Led Zeppelin III, Like a Virgin, Thriller
Director            Woody Allen, Oliver Stone, Takashi Miike
Film                Pulp Fiction, Hysterical Blindness, Breakfast at Tiffany's
Book                The Lord of the Rings, The Adventures of Tom Sawyer, The Holy Bible
Computer Game       Tetris, World of Warcraft, Sam & Max Hit the Road
Technical Standard  HTML, RDF, URI

You can also use Richard Cyganiak's PHP script to view random things from the DBpedia data set.

3. Identifying “things”

Each thing in the DBpedia data set is identified by a URI reference of the form http://dbpedia.org/resource/Name, where Name is taken from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each resource is tied directly to an English-language Wikipedia article.
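
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the mapping as described on this page; the function name and the error handling are assumptions, not part of DBpedia itself.

```python
from urllib.parse import urlsplit

WIKIPEDIA_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_BASE = "http://dbpedia.org/resource/"


def wikipedia_url_to_dbpedia_uri(wikipedia_url: str) -> str:
    """Map an English Wikipedia article URL to the matching DBpedia URI."""
    parts = urlsplit(wikipedia_url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith(WIKIPEDIA_PREFIX):
        raise ValueError(f"not an English Wikipedia article URL: {wikipedia_url}")
    # The article name is reused unchanged, including underscores and any
    # percent-encoding already present in the Wikipedia URL.
    name = parts.path[len(WIKIPEDIA_PREFIX):]
    if not name:
        raise ValueError("URL does not name an article")
    return DBPEDIA_RESOURCE_BASE + name


print(wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Innsbruck"))
# prints http://dbpedia.org/resource/Innsbruck
```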

4. Describing “things”

Each DBpedia resource is described by various properties. Below, we give an overview of the most important types of properties.

4.1. Basic Information

Every DBpedia resource is described by a label, a short and long English abstract, a link to the corresponding Wikipedia page, and a link to an image depicting the thing (if available).


If a thing exists in multiple language versions of Wikipedia, then short and long abstracts in these languages and links to the corresponding Wikipedia pages in those languages are added to the description. The DBpedia data set contains the following numbers of abstracts per language:


Language    Number of Abstracts
English     2,490,000
German      391,000
French      383,000
Dutch       284,000
Polish      256,000
Italian     286,000
Spanish     226,000
Japanese    199,000
Portuguese  246,000
Swedish     144,000
Chinese     101,000
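
Multilingual literals such as the labels described above can be selected with a SPARQL language filter. The following is a minimal sketch, assuming that labels are published as rdfs:label literals carrying language tags such as "en" or "de"; the function name is illustrative, and the property used for abstracts is not named on this page, so the example sticks to labels.

```python
def labels_query(resource_uri: str, language: str) -> str:
    """Build a SPARQL query for the labels of one resource in one language."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        f"  <{resource_uri}> rdfs:label ?label .\n"
        # Assumption: literals carry a language tag such as "en" or "de".
        f'  FILTER (lang(?label) = "{language}")\n'
        "}"
    )


print(labels_query("http://dbpedia.org/resource/Berlin", "de"))
```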

4.2. Classifications

DBpedia provides three different classification schemata for things.


  1. Wikipedia Categories are represented using the SKOS vocabulary.
  2. The YAGO Classification is derived from the Wikipedia category system using WordNet. Please refer to the paper “Yago: A Core of Semantic Knowledge – Unifying WordNet and Wikipedia” for more details.
  3. WordNet Synset Links were generated by manually relating Wikipedia infobox templates to WordNet synsets, and adding a corresponding link to each thing that uses a specific template. In theory, this classification should be more precise than the Wikipedia category system.

Using these classifications within SPARQL queries allows you to select things of a certain type.
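
As a hedged illustration of selecting things by type, the sketch below builds a SPARQL query that lists resources typed with a given class. It assumes the class is attached to things through rdf:type; Wikipedia categories, which use SKOS, may be attached through a different property. The class URI in the usage line is a placeholder, not a real DBpedia class.

```python
def things_of_type_query(class_uri: str, limit: int = 10) -> str:
    """Build a SPARQL query listing things that are instances of a class."""
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "SELECT ?thing WHERE {\n"
        # Assumption: class membership is expressed with rdf:type.
        f"  ?thing rdf:type <{class_uri}> .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


# Placeholder class URI, used only for illustration.
print(things_of_type_query("http://example.org/class/City", limit=5))
```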

4.2.1. Wikipedia Categories

4.2.2. YAGO Classes

4.2.3. WordNet


4.3. Infobox Data

Wikipedia infoboxes contain very specific information about things and are thus a very valuable source of structured information that can be used to ask expressive queries against Wikipedia. The DBpedia project currently extracts two different datasets from the Wikipedia infoboxes.


  1. The Infobox Dataset is created using our initial, now one-year-old infobox parsing approach. It extracts information from all infoboxes within all articles. This data set contains 22.8 million pieces of information extracted from infoboxes within the English version of Wikipedia. The types of the infobox properties depend on the type of infobox, and there are approximately 8,000 different property types. There is no formal ontology in this data set.
  2. The Infobox Ontology. With the DBpedia 3.2 release, we introduced a new infobox extraction method which is based on hand-generated mappings of Wikipedia infoboxes to a newly created DBpedia ontology. The ontology consists of 170 classes which form a subsumption hierarchy and have 900 properties altogether. The mappings compensate for weaknesses in the Wikipedia infobox system, such as having different infoboxes for the same class of thing or using different property names for the same property. Therefore, the instance data within the infobox ontology is much cleaner and better structured than the Infobox Dataset, but it currently doesn't cover the whole range of infoboxes and infobox properties within Wikipedia.

Both data sets are available for download and can be queried via the DBpedia SPARQL endpoint.
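
A query can be sent to a SPARQL endpoint as an HTTP GET request that carries the query text in a `query` parameter, as defined by the SPARQL protocol. The sketch below only builds the request URL and does not contact any server; the endpoint address is an assumption, since this page does not state it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint address; this page does not give the URL.
SPARQL_ENDPOINT = "http://dbpedia.org/sparql"


def sparql_get_url(query: str, endpoint: str = SPARQL_ENDPOINT) -> str:
    """Return the GET URL that submits `query` to a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})


query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"
url = sparql_get_url(query)
# Decoding the URL again must give back the original query text.
assert parse_qs(urlsplit(url).query)["query"] == [query]
print(url)
```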


The infobox data enables sophisticated, fine-grained queries over the data set. Some example queries are shown below:

4.3.1. Querying the Infobox Dataset


4.3.2. Querying the Infobox Ontology

List all episodes of the HBO television series The Sopranos ordered by their air-date:
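
A minimal sketch of such a query, held in a Python string so that it can be passed to any SPARQL client. The `dbo:` prefix label and the property names `series` and `airdate` are assumptions made for illustration and should be checked against the ontology; only the resource URI follows directly from the naming rule in section 3.

```python
# Hypothetical property names (dbo:series, dbo:airdate); verify them
# against the DBpedia ontology before relying on this query.
SOPRANOS_EPISODES_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?episode ?airdate WHERE {
  ?episode dbo:series <http://dbpedia.org/resource/The_Sopranos> .
  ?episode dbo:airdate ?airdate .
}
ORDER BY ?airdate
""".strip()

print(SOPRANOS_EPISODES_QUERY)
```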



SPARQL Result


Software developed by an organisation founded in California:
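
A request of this kind joins two patterns: one that finds organisations founded in California and one that finds software developed by them. The sketch below is an illustration only; the class name `Software` and the property names `developer` and `foundationPlace` are assumed, not taken from this page.

```python
# Hypothetical class and property names (dbo:Software, dbo:developer,
# dbo:foundationPlace); verify them against the DBpedia ontology.
CALIFORNIA_SOFTWARE_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?software ?organisation WHERE {
  ?software rdf:type dbo:Software ;
            dbo:developer ?organisation .
  ?organisation dbo:foundationPlace <http://dbpedia.org/resource/California> .
}
""".strip()

print(CALIFORNIA_SOFTWARE_QUERY)
```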



SPARQL Result


4.4. External Links

The DBpedia data set contains HTML links to external web pages as well as RDF links into external data sources.


There are two types of links to HTML pages: dbpedia:reference links point to several web pages about a thing. In addition, some things also have foaf:homepage links that point to web pages that can be considered the “official homepage” of a thing.


RDF links are represented using the owl:sameAs property. Please refer to Interlinking for more information about RDF links and the interlinked data sets.
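
The homepage and RDF links of a single thing can be retrieved together. The following sketch builds a SPARQL query that returns every foaf:homepage and owl:sameAs value of one resource; the function name is illustrative, and the query assumes both properties are attached directly to the resource URI.

```python
def external_links_query(resource_uri: str) -> str:
    """Build a SPARQL query for the homepage and owl:sameAs links of a thing."""
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT ?property ?target WHERE {\n"
        f"  <{resource_uri}> ?property ?target .\n"
        # Keep only the two link properties discussed in this section.
        "  FILTER (?property = foaf:homepage || ?property = owl:sameAs)\n"
        "}"
    )


print(external_links_query("http://dbpedia.org/resource/Berlin"))
```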

4.4.1. FOAF Homepage

4.4.2. Owl:sameAs Links

4.5. Geo-Coordinates

The DBpedia data set contains geo-coordinates for 392,000 geographic locations. Geo-coordinates are expressed using the W3C Basic Geo Vocabulary.


Besides simple listings of geo-coordinates (e.g., German soccer stadiums), the new geo-coordinates allow sophisticated queries, like “show me all things next to the”:
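
A proximity query can be approximated with a bounding box over the geo:lat and geo:long properties of the W3C Basic Geo Vocabulary. The sketch below is a rough illustration, not the query used by DBpedia: it assumes the coordinates are stored as numeric literals, measures the box in degrees rather than kilometres, and ignores wrap-around at the 180° meridian. The function name and the default box size are arbitrary.

```python
def nearby_things_query(lat: float, lon: float, delta: float = 0.1) -> str:
    """Build a SPARQL query for things inside a box around (lat, lon)."""
    south, north = lat - delta, lat + delta
    west, east = lon - delta, lon + delta
    return (
        "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n"
        "SELECT ?thing ?lat ?long WHERE {\n"
        "  ?thing geo:lat ?lat ;\n"
        "         geo:long ?long .\n"
        # Assumption: ?lat and ?long are numeric literals, so the
        # comparisons below are evaluated numerically.
        f"  FILTER (?lat > {south:.4f} && ?lat < {north:.4f} &&\n"
        f"          ?long > {west:.4f} && ?long < {east:.4f})\n"
        "}"
    )


# Illustrative coordinates and box size.
print(nearby_things_query(47.0, 11.0, delta=0.5))
```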

5. License

The DBpedia data set is licensed under the terms of the GNU Free Documentation License.


This material is Open Knowledge.



Information

Last Modification: 2008-11-17 21:44:32 by Ted Thibodeau Jr