The DBpedia Information Extraction Framework


The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written in Scala 2.8 and is available from the DBpedia Mercurial repository (GNU GPL license). See also the change log for recent developments.
This page is the entry point to the developer documentation of the DBpedia framework. For historical reasons, the documentation of the old, superseded PHP-based DBpedia extraction framework is still available here.


1. Getting started

Before you can start developing you need to take care of some prerequisites:

  • DBpedia Extraction Framework: Get the most recent revision from the Mercurial repository (read-only).
  • Java Development Kit: The DBpedia extraction framework runs on top of the JVM. Get the most recent JDK from http://java.sun.com/.
  • Maven: Used for project management and build automation. Get it from http://maven.apache.org/.

This is enough to compile and run the DBpedia extraction framework.


If you'd like to use an IDE for coding, there are a number of options.

2. Overview

The DBpedia extraction framework is structured into different modules:

  • Core Module: Contains the core components of the framework.
  • Dump extraction Module: Contains the DBpedia dump extraction application.

3. Core Module

http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png


Components

  • Source: The Source package provides an abstraction over a source of MediaWiki pages.
  • WikiParser: The WikiParser package specifies a parser that transforms the source of a MediaWiki page into an Abstract Syntax Tree (AST).
  • Extractor: An extractor is a mapping from a page node to a graph of statements about it.
  • Destination: The Destination package provides an abstraction over a destination of RDF statements.
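
The data flow through these four components can be illustrated with a minimal sketch. Only the component names (Source, WikiParser, Extractor, Destination) come from the description above; every type, method signature, and the "linksTo" predicate below are hypothetical simplifications and not the framework's actual API.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical, heavily simplified data types (not the framework's classes).
final case class WikiPage(title: String, source: String)
final case class PageNode(title: String, links: List[String])
final case class Statement(subject: String, predicate: String, obj: String)

// Source: an abstraction over a source of MediaWiki pages.
trait Source { def pages: Iterator[WikiPage] }

// WikiParser: turns page source into a (here very reduced) syntax tree.
trait WikiParser { def parse(page: WikiPage): PageNode }

// Extractor: maps a page node to statements about it.
trait Extractor { def extract(node: PageNode): Seq[Statement] }

// Destination: an abstraction over a sink for statements.
trait Destination { def write(statements: Seq[Statement]): Unit }

// Toy parser that only collects the targets of [[internal links]].
object SimpleLinkParser extends WikiParser {
  private val Link = """\[\[([^\]|]+)""".r
  def parse(page: WikiPage): PageNode =
    PageNode(page.title, Link.findAllMatchIn(page.source).map(_.group(1)).toList)
}

// Toy extractor that emits one statement per link found on the page.
object LinkExtractor extends Extractor {
  def extract(node: PageNode): Seq[Statement] =
    node.links.map(target => Statement(node.title, "linksTo", target))
}

// Toy destination that keeps all statements in memory.
final class InMemoryDestination extends Destination {
  private val buffer = ListBuffer.empty[Statement]
  def write(statements: Seq[Statement]): Unit = { buffer ++= statements; () }
  def result: List[Statement] = buffer.toList
}

object Pipeline {
  // Source -> WikiParser -> Extractor(s) -> Destination
  def run(source: Source, parser: WikiParser,
          extractors: Seq[Extractor], destination: Destination): Unit =
    for (page <- source.pages) {
      val node = parser.parse(page)
      destination.write(extractors.flatMap(_.extract(node)))
    }
}

object Demo {
  def run(): List[Statement] = {
    val source = new Source {
      def pages: Iterator[WikiPage] =
        Iterator(WikiPage("Berlin", "Capital of [[Germany]]."))
    }
    val destination = new InMemoryDestination
    Pipeline.run(source, SimpleLinkParser, Seq(LinkExtractor), destination)
    destination.result
  }
}
```

Calling Demo.run() pushes a single in-memory page through all four stages. Keeping each stage behind its own trait mirrors the separation described above: a different page source or output sink can be swapped in without touching the extractors.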

In addition to the core components, a number of utility packages offer essential functionality to be used by the extraction code.

For details about a package, follow the links.
You can find the complete scaladoc here.

4. Dump extraction Module

4.1. Prerequisites

All configuration is read from a Java properties file named config.properties. The following properties are available:

  • dumpDir: The directory where the dumps are located.
  • updateDumps: If true, the extraction framework downloads every dump that is missing or out of date. Set it to false if you want to use your own dumps or don't want the framework to update them.
  • outputDir: The output directory.
  • languages: The languages of the Wikipedia dumps to be extracted.
  • extractors: The extractor classes to be used for the extraction. See Available Extractors. Language-specific extractors can be configured using a property of the form extractors.{wikiCode}, e.g. extractors.en.
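
As an illustration of how such a file might be read, here is a minimal sketch using java.util.Properties. The property names are taken from the list above (only a subset is shown); the example values, the org.example class names, the comma-separated list syntax, and the choice to concatenate general and language-specific extractors are all assumptions made for this sketch, not documented behaviour of the framework.

```scala
import java.io.StringReader
import java.util.Properties

object ConfigSketch {
  // Placeholder values; paths and class names are invented for illustration.
  val sample: String =
    """dumpDir=/data/wikipedia-dumps
      |updateDumps=false
      |languages=en,de
      |extractors=org.example.LabelExtractor
      |extractors.en=org.example.InfoboxExtractor
      |""".stripMargin

  // Parse properties-file text into a Properties object.
  def load(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  // Read a property as a comma-separated list (assumed syntax); empty if the key is missing.
  private def list(props: Properties, key: String): List[String] =
    Option(props.getProperty(key)).toList
      .flatMap(_.split(",").toList)
      .map(_.trim)
      .filter(_.nonEmpty)

  def languages(props: Properties): List[String] = list(props, "languages")

  // General extractors followed by those configured under extractors.{wikiCode}.
  // Whether the framework merges or overrides these lists is not stated here.
  def extractorsFor(props: Properties, wikiCode: String): List[String] =
    list(props, "extractors") ++ list(props, "extractors." + wikiCode)
}
```

With the sample above, ConfigSketch.extractorsFor(props, "en") yields both the general and the English-specific entry, while a language without its own extractors.{wikiCode} key falls back to the general list only.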

4.2. Running the dump extraction

Before you can start the extraction, you need to install the framework into your local Maven repository by running mvn install from the extraction directory.
The dump extraction is then started by running mvn scala:run from the directory extraction/dump.

5. Server Module

This module is intended for testing the framework.

5.1. Prerequisites

There are two Scala classes that configure the parameters of the server:

  • In org.dbpedia.extraction.server.Configuration, you can configure the available languages and the URL of the mappings wiki API.
  • In org.dbpedia.extraction.server.ExtractionManager in the function loadExtractor, you can configure the extractors that should be used by the extraction server. See Available Extractors.

5.2. Running the extraction server

Before you can start the server, you need to install the framework into your local Maven repository by running mvn install from the extraction directory.
The extraction server is then started by running mvn scala:run from the directory extraction/server. The default port is 9999.
A browser window should open in which you can specify the language and the URI that you would like to extract.



 

Information

Last Modification: 2011-03-07 16:00:37 by Max Jakob