Former PHP-based Information Extraction Framework

Until March 2010, the DBpedia project used a PHP-based extraction framework to extract different kinds of structured information from Wikipedia. This framework has been superseded by the new Scala-based extraction framework, and the old PHP framework is no longer maintained.

This page contains the documentation of the old PHP framework, which remains in the DBpedia wiki for archival and historical reasons.

The superseded PHP-based DBpedia information extraction framework is written in PHP 5. It is available from the DBpedia SVN under the GNU GPL.

This page describes the DBpedia information extraction framework. The framework consists of the interfaces Destination, Extractor, PageCollection, and RDFnode, plus the essential classes ExtractionGroup, ExtractionJob, ExtractionManager, ExtractionResult, and RDFtriple.

1. Getting started

To get the framework running on your local machine, start with the pre-configured extract_*.php files, which you will find in the DBpedia /extraction folder. The extraction code is available via the SourceForge SVN repository; do not download the release version from SourceForge.

If you want to create your own dumps, or work within your IDE or on the console, use extract_*.php. If you first want to learn how DBpedia extraction works, or to test and debug new extractors, webStart.php may suit you better, as it provides a convenient web debug interface. To use the web interface, first download RAP – RDF API for PHP.

2. Functional overview

The extraction process is triggered via the ExtractionManager, which starts one or more ExtractionJobs. An ExtractionJob combines one or more ExtractionGroups with a PageCollection. The PageCollection is the data source, e.g., all articles from the Wikipedia SQL dump. An ExtractionGroup consists of a Destination and one or more Extractors. Possible Destinations include the console, NTriple files, and the web interface. You are free to write your own Destinations (e.g., for databases or RDF/XML files).

Extractors are each designed for a single specific purpose: the InfoboxExtractor, for example, reads information from Wikipedia infoboxes, and the ShortAbstractExtractor takes the first paragraph of an article. DBpedia already ships with extractors for many purposes, and you are invited to add your own.
Extractors are the core of the data extraction process, as they parse and convert the Wikipedia pages. For each page, the extracted data is stored in an instance of ExtractionResult.

ExtractionGroups connect Extractors with Destinations. If you want to store the ExtractionResults from all Extractors in a single file, one ExtractionGroup is sufficient: create a new ExtractionGroup with a Destination and add the Extractors you need.
If you intend to produce a separate output file for each Extractor, you will need a separate ExtractionGroup for each Extractor (as is done in the standard settings in extract_full.php). Finally, you will need to run your ExtractionJob through an instance of ExtractionManager.

3. The Interfaces

3.1. Interface Destination

Destinations store extraction results. Included Destinations are NTriple files (NTripleDumpDestination), the console (SimpleDumpDestination) and a web interface (WebDebugDestination).

A Destination must include the following methods:

  • start(): Initializes the Destination (e.g., creates a new NTriple file). Is called once at the beginning of an ExtractionJob.
  • accept($extractionResult, $revisionID): Requires an ExtractionResult and a page revision. Reads out each triple from the ExtractionResult and prints it out or stores it in a file. Is called for each page.
  • finish(): Closes the destination. Is called once at the end of an ExtractionJob.

interface Destination {
    public function start();
    public function accept($extractionResult, $revisionID);
    public function finish();
}
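
As an illustration of the contract above, here is a minimal sketch of a custom Destination that collects triples in memory. The interface is repeated so the snippet runs on its own. ArrayDestination and StubExtractionResult are hypothetical names that are not part of the framework; the stub only imitates the getTriples() method described in section 4.4, and the triples are plain strings here rather than RDFtriple objects.

```php
<?php
// Interface as documented above, repeated so the sketch is self-contained.
interface Destination {
    public function start();
    public function accept($extractionResult, $revisionID);
    public function finish();
}

// Hypothetical stand-in for the framework's ExtractionResult (section 4.4).
class StubExtractionResult {
    private $triples;
    public function __construct(array $triples) { $this->triples = $triples; }
    public function getTriples() { return $this->triples; }
}

// Hypothetical Destination that keeps every triple in an array.
class ArrayDestination implements Destination {
    public $lines = array();
    public $open = false;

    public function start() {
        // Called once per job: reset the state.
        $this->lines = array();
        $this->open = true;
    }

    public function accept($extractionResult, $revisionID) {
        // Called once per page: copy each triple of the result.
        foreach ($extractionResult->getTriples() as $triple) {
            $this->lines[] = (string) $triple;
        }
    }

    public function finish() {
        // Called once per job: an array needs no closing, so only flag it.
        $this->open = false;
    }
}

$destination = new ArrayDestination();
$destination->start();
$destination->accept(new StubExtractionResult(array('<s> <p> "o" .')), 12345);
$destination->finish();
```

A file- or database-backed Destination would follow the same shape: open the resource in start(), write in accept(), and release it in finish().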

3.2. Interface Extractor

Extractors contain the actual data extraction and parsing functions. An Extractor should be written for a single specific purpose.

An Extractor must have the methods:

  • getExtractorID(): Returns the URI that identifies the extractor.
  • start($language): Initializes the extractor and sets the language. Is called once at the beginning of an ExtractionJob.
  • extractPage($pageID, $pageTitle, $pageSource): Performs the actual extraction. Constructs a new ExtractionResult, extracts data from the source page and stores the extracted data in the ExtractionResult. Is called for each page and must return an ExtractionResult.
  • finish(): Closes the extractor. Is called once at the end of an ExtractionJob.

interface Extractor {
    /** @return uri */
    public function getExtractorID();
    public function start($language);
    /** @return ExtractionResult */
    public function extractPage($pageID, $pageTitle, $pageSource);
    /** @return ExtractionResult */
    public function finish();
}
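
To make the Extractor contract more concrete, the following sketch implements a toy extractor that emits one label triple per page, built from the page title. TitleLabelExtractor, its extractor ID, and StubExtractionResult are hypothetical and exist only for this illustration; the stub stores triples as plain arrays rather than RDFtriple objects. Returning null from finish() is an assumption made for this sketch, not documented behaviour.

```php
<?php
// Interface as documented above, repeated so the sketch is self-contained.
interface Extractor {
    public function getExtractorID();
    public function start($language);
    public function extractPage($pageID, $pageTitle, $pageSource);
    public function finish();
}

// Hypothetical stand-in for ExtractionResult (section 4.4).
class StubExtractionResult {
    private $triples = array();
    public function addTriple($s, $p, $o) { $this->triples[] = array($s, $p, $o); }
    public function getTriples() { return $this->triples; }
}

// Hypothetical extractor: one rdfs:label triple per page.
class TitleLabelExtractor implements Extractor {
    private $language;

    public function getExtractorID() {
        // Made-up identifier, not a real DBpedia extractor ID.
        return 'http://example.org/extractors/TitleLabelExtractor';
    }

    public function start($language) {
        // Called once per job: remember the language for later literals.
        $this->language = $language;
    }

    public function extractPage($pageID, $pageTitle, $pageSource) {
        // Called once per page: build a fresh result for this page only.
        $result = new StubExtractionResult();
        $label = str_replace('_', ' ', $pageTitle);
        $result->addTriple(
            $pageID,
            'http://www.w3.org/2000/01/rdf-schema#label',
            array($label, $this->language) // literal text plus language tag
        );
        return $result;
    }

    public function finish() {
        // Assumption: this extractor has nothing to emit at the end of a job.
        return null;
    }
}

$extractor = new TitleLabelExtractor();
$extractor->start('en');
$result = $extractor->extractPage('Berlin_Wall', 'Berlin_Wall', 'wiki source ...');
$triples = $result->getTriples();
$extractor->finish();
```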

3.3. Interface PageCollection

PageCollections are the data sources for extraction. A PageCollection loads the page source code for a specific language and one or more pages. The included implementations are LiveWikipedia and DatabaseWikipedia.

A PageCollection must have the methods:

  • getLanguage(): returns the language
  • getSource($pageTitle): returns the Wikipedia source code for the page $pageTitle
  • getRevision($pageTitle): returns the page revision

interface PageCollection {
    public function getLanguage();
    public function getSource($pageTitle);
    public function getRevision($pageTitle);
}
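
For testing an extractor without a database or network access, a PageCollection backed by a PHP array can be enough. The sketch below is one possible shape; ArrayPageCollection is a hypothetical class, and returning null for unknown titles is an assumption rather than documented behaviour.

```php
<?php
// Interface as documented above, repeated so the sketch is self-contained.
interface PageCollection {
    public function getLanguage();
    public function getSource($pageTitle);
    public function getRevision($pageTitle);
}

// Hypothetical in-memory data source, keyed by page title.
class ArrayPageCollection implements PageCollection {
    private $language;
    private $pages;

    public function __construct($language, array $pages) {
        $this->language = $language;
        $this->pages = $pages;
    }

    public function getLanguage() {
        return $this->language;
    }

    public function getSource($pageTitle) {
        // Assumption: unknown titles yield null.
        return isset($this->pages[$pageTitle]) ? $this->pages[$pageTitle]['source'] : null;
    }

    public function getRevision($pageTitle) {
        return isset($this->pages[$pageTitle]) ? $this->pages[$pageTitle]['revision'] : null;
    }
}

$pages = new ArrayPageCollection('en', array(
    'Berlin' => array('source' => "'''Berlin''' is the capital of Germany.", 'revision' => 42),
));

// An ExtractionJob also expects an iterator over page titles (section 4.2);
// PHP's built-in ArrayIterator is one way to provide it.
$titles = new ArrayIterator(array('Berlin'));
```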

3.4. Interface RDFnode

RDFnodes take care of proper RDF representation of data.

URI, RDFliteral, and RDFblankNode are implementations of RDFnodes.

The most important method is toNTriples(), which returns a string containing the NTriples representation of the RDFnode. In addition, information such as datatype, language, and lexical form of literals can be read out from an RDFnode of class RDFliteral.

An RDFnode must include the methods:

  • isURI(): Returns true if the node is a URI, false otherwise.
  • isBlank(): Returns true if the node is a blank node, false otherwise.
  • isLiteral(): Returns true if the node is a literal, false otherwise.
  • getURI(): Returns the URI if the node is a URI, null otherwise.
  • getBlankNodeLabel(): Returns the blank node label if the node is a blank node, null otherwise.
  • getLexicalForm(): Returns the literal text if the node is a literal, null otherwise.
  • getLanguage(): Returns the language if the node is a literal, null otherwise.
  • getDatatype(): Returns the datatype if the node is a literal, null otherwise.
  • toNTriples(): Returns the NTriples representation of the RDFnode.

interface RDFnode {
    public function isURI();
    public function isBlank();
    public function isLiteral();
    public function getURI();
    public function getBlankNodeLabel();
    public function getLexicalForm();
    public function getLanguage();
    public function getDatatype();
    public function toNTriples();
}
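
The following sketch shows how a literal node could satisfy this interface. SimpleLiteral is a hypothetical class, not the framework's RDFliteral. Its toNTriples() escapes only backslashes, double quotes, and the common control characters; non-ASCII characters are passed through unchanged, which is a simplification.

```php
<?php
// Interface as documented above, repeated so the sketch is self-contained.
interface RDFnode {
    public function isURI();
    public function isBlank();
    public function isLiteral();
    public function getURI();
    public function getBlankNodeLabel();
    public function getLexicalForm();
    public function getLanguage();
    public function getDatatype();
    public function toNTriples();
}

// Hypothetical literal node.
class SimpleLiteral implements RDFnode {
    private $value;
    private $datatype;
    private $language;

    public function __construct($value, $datatype = null, $language = null) {
        $this->value = $value;
        $this->datatype = $datatype;
        $this->language = $language;
    }

    public function isURI() { return false; }
    public function isBlank() { return false; }
    public function isLiteral() { return true; }
    public function getURI() { return null; }
    public function getBlankNodeLabel() { return null; }
    public function getLexicalForm() { return $this->value; }
    public function getLanguage() { return $this->language; }
    public function getDatatype() { return $this->datatype; }

    public function toNTriples() {
        // strtr() with an array never re-scans text it has already replaced,
        // so the backslashes it inserts are not escaped a second time.
        $escaped = strtr($this->value, array(
            '\\' => '\\\\',
            '"'  => '\\"',
            "\n" => '\\n',
            "\r" => '\\r',
            "\t" => '\\t',
        ));
        $out = '"' . $escaped . '"';
        if ($this->language !== null) {
            $out .= '@' . $this->language;          // language-tagged literal
        } elseif ($this->datatype !== null) {
            $out .= '^^<' . $this->datatype . '>';  // typed literal
        }
        return $out;
    }
}

$tagged = new SimpleLiteral('The "Wall"', null, 'de');
$typed = new SimpleLiteral('42', 'http://www.w3.org/2001/XMLSchema#integer');
print $tagged->toNTriples() . "\n";
print $typed->toNTriples() . "\n";
```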

4. Essential Classes

4.1. Class ExtractionGroup

ExtractionGroups combine one or more Extractors with one Destination. If you want to create a separate dump for each Extractor in use, you need a new ExtractionGroup for every Extractor, because an ExtractionGroup can hold only a single Destination.

The most important methods are:

  • __construct($destination, $metadestination = NULL): $destination is an object of a class that implements the interface Destination. $metadestination is an optional Destination in which meta information can be stored. It is mainly used by the InfoboxExtractor, which collects all predicate names in a metadestination.
  • addExtractor($extractor): Adds a new Extractor to the group.

4.2. Class ExtractionJob

An ExtractionJob combines one or more ExtractionGroups (Extractors + Destination) with one PageCollection (data source). ExtractionJobs are executed by the ExtractionManager.

The most important methods are:

  • __construct($pageCollection, $pageTitleIterator): Requires a PageCollection and an iterator. The iterator is required in order to cycle properly over all pages of a PageCollection.
  • addExtractionGroup($group): Adds a new ExtractionGroup to the job.

4.3. Class ExtractionManager

The ExtractionManager executes ExtractionJobs. Cycling over all ExtractionGroups, it first initializes the Extractors and the Destination (via their start() methods).

Next, it iterates over all pages from a PageCollection and passes the page source to each Extractor, triggering its extractPage() method.

Finally, it reads out the ExtractionResults from every Extractor and passes them to the respective Destination. The finish() methods of the Extractors and the Destination are then called in order to close them properly.

The most important method is:

  • execute($job): Starts the job as described above. Requires an object of class ExtractionJob.
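
The three phases described above can be sketched roughly as follows. This is not the framework's code: SketchGroup, SketchJob, SketchManager, and the logging test doubles are hypothetical stand-ins, the public properties replace whatever accessors the real classes use, and passing the page title as the page ID is an assumption. Only the method names start(), extractPage(), accept(), finish(), addExtractor(), addExtractionGroup(), and execute() come from this page.

```php
<?php
// Hypothetical stand-ins for ExtractionGroup, ExtractionJob, ExtractionManager.
class SketchGroup {
    public $destination;
    public $extractors = array();
    public function __construct($destination) { $this->destination = $destination; }
    public function addExtractor($extractor) { $this->extractors[] = $extractor; }
}

class SketchJob {
    public $pages;
    public $titles;
    public $groups = array();
    public function __construct($pageCollection, $pageTitleIterator) {
        $this->pages = $pageCollection;
        $this->titles = $pageTitleIterator;
    }
    public function addExtractionGroup($group) { $this->groups[] = $group; }
}

class SketchManager {
    public function execute($job) {
        // Phase 1: initialize every Extractor and Destination.
        foreach ($job->groups as $group) {
            foreach ($group->extractors as $extractor) {
                $extractor->start($job->pages->getLanguage());
            }
            $group->destination->start();
        }
        // Phase 2: feed each page to each Extractor and hand over the result.
        foreach ($job->titles as $title) {
            $source = $job->pages->getSource($title);
            $revision = $job->pages->getRevision($title);
            foreach ($job->groups as $group) {
                foreach ($group->extractors as $extractor) {
                    // Assumption: the page title doubles as the page ID.
                    $result = $extractor->extractPage($title, $title, $source);
                    $group->destination->accept($result, $revision);
                }
            }
        }
        // Phase 3: close Extractors and Destinations.
        foreach ($job->groups as $group) {
            foreach ($group->extractors as $extractor) {
                $extractor->finish();
            }
            $group->destination->finish();
        }
    }
}

// Test doubles that record the order of calls in a shared log object.
class CallLog { public $entries = array(); }

class LoggingExtractor {
    private $log;
    public function __construct($log) { $this->log = $log; }
    public function start($language) { $this->log->entries[] = "extractor.start($language)"; }
    public function extractPage($pageID, $pageTitle, $pageSource) {
        $this->log->entries[] = "extractPage($pageTitle)";
        return array(); // placeholder for an ExtractionResult
    }
    public function finish() { $this->log->entries[] = 'extractor.finish'; }
}

class LoggingDestination {
    private $log;
    public function __construct($log) { $this->log = $log; }
    public function start() { $this->log->entries[] = 'destination.start'; }
    public function accept($extractionResult, $revisionID) { $this->log->entries[] = "accept($revisionID)"; }
    public function finish() { $this->log->entries[] = 'destination.finish'; }
}

class OnePageCollection {
    public function getLanguage() { return 'en'; }
    public function getSource($pageTitle) { return 'source of ' . $pageTitle; }
    public function getRevision($pageTitle) { return 1; }
}

$log = new CallLog();
$group = new SketchGroup(new LoggingDestination($log));
$group->addExtractor(new LoggingExtractor($log));
$job = new SketchJob(new OnePageCollection(), new ArrayIterator(array('Berlin')));
$job->addExtractionGroup($group);
$manager = new SketchManager();
$manager->execute($job);
print implode("\n", $log->entries) . "\n";
```

Running the sketch prints the calls in the order in which the manager issues them, which makes the start, per-page, and finish phases visible. To write one output per Extractor, the same wiring would use one group per Extractor, each with its own Destination.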

4.4. Class ExtractionResult

Collects RDFtriples while extraction is in progress. Each wiki page needs its own instance of ExtractionResult.

The most important methods are:

  • __construct($pageID, $language, $extractorID): $pageID is a string containing the English Wikipedia page title, $language a string containing the language, and $extractorID a string containing the extractor ID.
  • addTriple($s, $p, $o): Adds a new triple to the result. $s is a string containing the subject, $p the predicate and $o the object.
  • getTriples(): Returns an array of RDFtriples.

4.5. Class RDFtriple

RDFtriple combines a subject, a predicate, and an object. You will usually create RDFnodes via RDFtriple, e.g., $subject = RDFtriple::URI("ResourceName");, as RDFtriple can add a common URI prefix for each triple.

The most important methods are:

  • __construct($subject, $predicate, $object): Requires a string containing the subject, the predicate and the object.
  • toString(): Returns the RDFtriple in NTriples format.
  • static function blank($label): Returns an RDFblankNode. $label must be a string containing the blank node label.
  • static function URI($uri): Returns a URI. $uri is a string containing the URI.
  • static function literal($value, $datatype = null, $lang = null): Returns an RDFliteral. $value must be a string; otherwise an error is raised, because $value is serialized for NTriples and this only works with string variables. $datatype and $lang are optional parameters. $datatype is a string containing the (RDF) datatype of $value, $lang the language.
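
As a rough illustration of how the static helpers and toString() fit together, the sketch below builds one triple and serializes it as an NTriples line. SketchNode and SketchTriple are hypothetical; the constructor takes node objects here, the common URI prefix mentioned above is left out, and throwing an InvalidArgumentException for non-string values is an assumption about how the error could be signalled.

```php
<?php
// Hypothetical node that simply carries its NTriples form.
class SketchNode {
    private $ntriples;
    public function __construct($ntriples) { $this->ntriples = $ntriples; }
    public function toNTriples() { return $this->ntriples; }
}

// Hypothetical stand-in for RDFtriple.
class SketchTriple {
    private $subject;
    private $predicate;
    private $object;

    public function __construct($subject, $predicate, $object) {
        $this->subject = $subject;
        $this->predicate = $predicate;
        $this->object = $object;
    }

    public static function URI($uri) {
        return new SketchNode('<' . $uri . '>');
    }

    public static function literal($value, $datatype = null, $lang = null) {
        if (!is_string($value)) {
            // Assumption: non-string values are rejected with an exception.
            throw new InvalidArgumentException('Literal value must be a string');
        }
        $escaped = strtr($value, array('\\' => '\\\\', '"' => '\\"', "\n" => '\\n', "\r" => '\\r'));
        $out = '"' . $escaped . '"';
        if ($lang !== null) {
            $out .= '@' . $lang;
        } elseif ($datatype !== null) {
            $out .= '^^<' . $datatype . '>';
        }
        return new SketchNode($out);
    }

    public function toString() {
        // An NTriples line is "subject predicate object ." separated by spaces.
        return $this->subject->toNTriples() . ' '
            . $this->predicate->toNTriples() . ' '
            . $this->object->toNTriples() . ' .';
    }
}

$triple = new SketchTriple(
    SketchTriple::URI('http://dbpedia.org/resource/Berlin'),
    SketchTriple::URI('http://www.w3.org/2000/01/rdf-schema#label'),
    SketchTriple::literal('Berlin', null, 'en')
);
$line = $triple->toString();
print $line . "\n";
```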