The DBpedia Information Extraction Framework
The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written in Scala 2.8. The framework is available from the DBpedia Mercurial repository under the GNU GPL. See also the change log for recent developments.
This page is the entry point to the developer documentation of the DBpedia framework. For historical reasons, the documentation of the old, superseded PHP-based DBpedia extraction framework is still available here.
1. Getting started
Before you can start developing you need to take care of some prerequisites:
- DBpedia Extraction Framework: Get the most recent revision from the Mercurial repository (read-only).
- Java Development Kit: The DBpedia extraction framework runs on top of the JVM. Get the most recent JDK from http://java.sun.com/.
- Maven: Used for project management and build automation. Get it from http://maven.apache.org/.
This is enough to compile and run the DBpedia extraction framework.
If you'd like to use an IDE for coding, there are a number of options:
- IntelliJ IDEA: Currently the most stable IDE for developing with Scala. To get the most recent Scala plugin, get the current early access version and install the Scala plugin from the official repository.
- Eclipse: Please follow the DBpedia & Eclipse Quick Start Guide.
- NetBeans: Also offers a Scala plugin.
2. Overview
The DBpedia extraction framework is structured into different modules:
- Core Module: Contains the core components of the framework.
- Dump extraction Module: Contains the DBpedia dump extraction application.
- Server Module: Contains the extraction server, which is intended for testing the framework.
3. Core Module

Components
- Source: The Source package provides an abstraction over a source of MediaWiki pages.
- WikiParser: The WikiParser package specifies a parser, which transforms a MediaWiki page source into an Abstract Syntax Tree (AST).
- Extractor: An extractor is a mapping from a page node to a graph of statements about it.
- Destination: The Destination package provides an abstraction over a destination of RDF statements.
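To make the data flow between the four components concrete, here is a minimal sketch in Scala. All types and names in it (WikiPage, PageNode, Quad, LinkParser, and so on) are hypothetical simplifications made up for illustration; they are not the framework's actual classes or signatures, for which the scaladoc is the authority. The toy parser only recognises [[...]] links, whereas the real parser builds a full AST.

```scala
// Illustrative sketch only: every type below is a hypothetical
// simplification, not the DBpedia framework's real API.
object PipelineSketch {
  // A raw wiki page as delivered by a source.
  case class WikiPage(title: String, source: String)

  // Root of a heavily simplified "AST": just the link targets of a page.
  case class PageNode(title: String, links: List[String])

  // One statement (subject, predicate, object).
  case class Quad(subject: String, predicate: String, obj: String)

  // Source: abstraction over where pages come from.
  trait Source { def pages: Iterator[WikiPage] }

  // WikiParser: transforms a page source into a page node.
  trait WikiParser { def parse(page: WikiPage): PageNode }

  // Extractor: maps a page node to statements about it.
  trait Extractor { def extract(node: PageNode): Seq[Quad] }

  // Destination: abstraction over where statements are written.
  trait Destination { def write(quads: Seq[Quad]): Unit }

  // In-memory source, handy for experiments.
  class MemorySource(ps: WikiPage*) extends Source {
    def pages: Iterator[WikiPage] = ps.iterator
  }

  // Toy parser: collects the targets of [[Target]] and [[Target|label]] links.
  object LinkParser extends WikiParser {
    private val Link = """\[\[([^\]|]+)(?:\|[^\]]*)?\]\]""".r
    def parse(page: WikiPage): PageNode =
      PageNode(page.title, Link.findAllMatchIn(page.source).map(_.group(1)).toList)
  }

  // Toy extractor: one "linksTo" statement per link.
  object LinkExtractor extends Extractor {
    def extract(node: PageNode): Seq[Quad] =
      node.links.map(target => Quad(node.title, "linksTo", target))
  }

  // Destination that simply buffers everything in memory.
  class BufferDestination extends Destination {
    val buffer = scala.collection.mutable.ListBuffer[Quad]()
    def write(quads: Seq[Quad]): Unit = buffer ++= quads
  }

  // Source -> WikiParser -> Extractor -> Destination.
  def run(source: Source, parser: WikiParser,
          extractor: Extractor, destination: Destination): Unit =
    for (page <- source.pages)
      destination.write(extractor.extract(parser.parse(page)))
}
```

Keeping each stage behind its own trait means, in this sketch, that a dump-backed source or a file-writing destination could be swapped in without touching the extractor.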
In addition to the core components, a number of utility packages offer essential functionality to be used by the extraction code:
- Ontology: Classes used to represent an ontology. Methods for both reading and writing ontologies are provided. All classes are located in the namespace org.dbpedia.extraction.ontology.
- DataParser: Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace org.dbpedia.extraction.dataparser.
- Util: Various utility classes. All classes are located in the namespace org.dbpedia.extraction.util.
For details about a package, follow the links.
You can find the complete scaladoc here.
4. Dump extraction Module
4.1. Prerequisites
All configuration is read from a Java properties file named config.properties. The following properties are available:
- dumpDir: The directory where the dumps are located.
- updateDumps: If true, the extraction framework downloads every dump that is either missing or out of date. If you want to use your own dumps or don't want the framework to update the dumps, set it to false.
- outputDir: The output directory.
- languages: The languages of the Wikipedia dumps to be extracted.
- extractors: The extractor classes to be used for the extraction. See Available Extractors. Language-specific extractors can be configured using a property of the format extractors.{wikiCode}, e.g. extractors.en.
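As an illustration of how such a properties file might be read, the following Scala sketch loads a subset of the properties above with java.util.Properties. Several details are assumptions rather than documented behaviour: the example paths, the extractor class names under org.example (hypothetical), the use of commas or whitespace as list separators, and the rule that a language's extractors are the common extractors plus its extractors.{wikiCode} entries. Check the framework's own configuration loader for the real semantics.

```scala
import java.io.StringReader
import java.util.Properties

// Illustrative sketch only: not the framework's configuration loader.
object ConfigSketch {
  // Example file content; all values are made up for illustration.
  val example: String =
    """dumpDir=/data/wikipedia/dumps
      |updateDumps=false
      |languages=en,de
      |extractors=org.example.LabelExtractor
      |extractors.en=org.example.EnglishOnlyExtractor
      |""".stripMargin

  case class Config(dumpDir: String,
                    updateDumps: Boolean,
                    languages: List[String],
                    extractors: Map[String, List[String]])

  // Assumption: list values are separated by commas and/or whitespace.
  private def split(value: String): List[String] =
    value.split("[,\\s]+").filter(_.nonEmpty).toList

  def load(text: String): Config = {
    val props = new Properties()
    props.load(new StringReader(text))

    val languages = split(props.getProperty("languages", ""))
    val common = split(props.getProperty("extractors", ""))

    // Assumption: extractors.{wikiCode} adds to the common extractors.
    val perLanguage = languages.map { lang =>
      lang -> (common ++ split(props.getProperty("extractors." + lang, "")))
    }.toMap

    Config(
      props.getProperty("dumpDir"),
      props.getProperty("updateDumps", "false").toBoolean,
      languages,
      perLanguage)
  }
}
```

Calling ConfigSketch.load(ConfigSketch.example) yields a Config whose extractors map has one entry per configured language. To read a real file, pass a java.io.FileReader for config.properties to Properties.load instead of the StringReader.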
4.2. Running the dump extraction
Before you can start the extraction, you need to install the framework into your local Maven repository by running mvn install from the extraction directory.
The dump extraction is started by running mvn scala:run from the directory extraction/dump.
5. Server Module
This module is intended for testing the framework.
5.1. Prerequisites
There are two Scala classes that configure the parameters of the server:
- In org.dbpedia.extraction.server.Configuration, you can configure the available languages and the URL of the mappings wiki API.
- In the function loadExtractor of org.dbpedia.extraction.server.ExtractionManager, you can configure the extractors that should be used by the extraction server. See Available Extractors.
5.2. Running the extraction server
Before you can start the server, you need to install the framework into your local Maven repository by running mvn install from the extraction directory.
The extraction server is started by running mvn scala:run from the directory extraction/server. The standard port is 9999.
A browser window should open in which you can specify the language and the URI that you would like to extract.
Information
Last Modification: 2011-03-07 16:00:37 by Max Jakob