Source
The Source package provides an abstraction over a source of MediaWiki pages. All classes are located in the namespace org.dbpedia.extraction.sources.
1. Overview

The Source class inherits from scala.collection.Traversable, which is Scala's abstraction over a collection whose elements can be enumerated. A source does not necessarily have a definite size (e.g. a streaming source). If hasDefiniteSize returns true, the collection is certainly finite.
Pages are represented by the WikiPage class. It contains the following attributes:
- title : WikiTitle : The title of the page.
- id : Int : The MediaWiki page ID.
- revision : Long : The revision of the page.
- source : String : The WikiText source of the page.
2. Available Sources
2.1. XMLSource
Reads pages from a MediaWiki XML dump. It accepts the official MediaWiki export format http://www.mediawiki.org/xml/export-0.4/. An optional filter may be provided to skip specific pages.
It accepts two arguments:
- file : java.io.File : The location of the dump file in the filesystem.
- filter : (WikiTitle => Boolean) (optional): A function that filters pages by their title. Pages for which this function returns false are not yielded by the source. If no filter is provided, all pages are returned.
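A title filter is an ordinary function value. The following is a minimal sketch that only illustrates the shape of such a predicate. The WikiTitle used here is a simplified stand-in, and its field names (text, namespace) are assumptions made for illustration, not the framework's actual class.

```scala
object TitleFilterSketch {
  // Simplified stand-in for the framework's WikiTitle; the field names are assumptions.
  final case class WikiTitle(text: String, namespace: String)

  // Keep only main-namespace pages that are not disambiguation pages.
  val filter: WikiTitle => Boolean =
    title => title.namespace == "Main" && !title.text.endsWith("(disambiguation)")

  // Stand-in for the titles encountered while reading a dump.
  val titles: List[WikiTitle] = List(
    WikiTitle("Berlin", "Main"),
    WikiTitle("Mercury (disambiguation)", "Main"),
    WikiTitle("Infobox city", "Template")
  )

  // Titles for which the filter returns false are skipped.
  val kept: List[WikiTitle] = titles.filter(filter)
}
```

A function of this type would be passed as the filter argument described above.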
2.2. WikiSource
A WikiSource fetches the pages directly from a MediaWiki. For this purpose, it uses the MediaWiki API.
It accepts three arguments:
- url : URL (optional): The URL of the MediaWiki API. The default is http://en.wikipedia.org/w/api.php, which retrieves pages from the English Wikipedia.
- language : Language (optional): The language of the MediaWiki. Note that this is used not only for the language of the text, but also to handle various MediaWiki features such as namespaces (which may be represented by different prefixes in different languages).
- namespaces : Set[Namespace] (optional): A set of namespaces to be retrieved. The WikiSource only yields pages that belong to one of the provided namespaces.
2.3. FileSource
Reads wiki pages from text files in the file system. Given a specific directory, this source iterates through all files in the directory itself and in all of its subdirectories. Page titles are generated from the file name relative to the base directory. An optional filter may be provided to skip specific files.
It accepts three arguments:
- file : java.io.File : The directory that contains the source files.
- filter : (String => Boolean) (optional): A function that excludes specific file names. Files whose names make this function return false are not read by the source. If no filter is provided, all files in the directory and its subdirectories whose names do not start with a dot are read.
- language : Language (optional): The language used in the sources. See the Language class in the utilities.
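The traversal described above can be sketched with plain java.io.File. This is an illustration under stated assumptions, not the framework's implementation: the names listFiles and titleOf are hypothetical, applying the filter to directory names as well as file names is an assumption, and the exact title format that FileSource produces is not specified here.

```scala
import java.io.File

object FileSourceSketch {
  // Default filter as described above: skip names that start with a dot.
  val defaultFilter: String => Boolean = name => !name.startsWith(".")

  // Recursively collects the files below `dir` whose names pass the filter.
  // Assumption: the filter is applied to directory names too.
  def listFiles(dir: File, filter: String => Boolean = defaultFilter): List[File] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val entries = Option(dir.listFiles()).map(_.toList).getOrElse(Nil)
      .filter(f => filter(f.getName))
      .sortBy(_.getName)
    entries.flatMap { f =>
      if (f.isDirectory) listFiles(f, filter) else List(f)
    }
  }

  // Derives a title from the file's path relative to the base directory,
  // using '/' as separator on every platform.
  def titleOf(base: File, file: File): String =
    base.toPath.relativize(file.toPath).toString.replace(File.separatorChar, '/')
}
```

Entries are sorted by name here only to make the iteration order deterministic; the order in which the real source visits files is not stated in this document.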
2.4. MemorySource
A source which yields pages from a user-defined container.
Its only argument is:
- pages : Traversable[WikiPage] : A user-defined container holding the pages.
2.5. CompositeSource
A source which is composed of multiple child sources. Iterates through all pages of all child sources.
Its only argument is:
- sources : The child sources that this source is composed of.
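The relationship between a MemorySource and a CompositeSource can be sketched as follows. All types are simplified stand-ins written for this example: the real WikiPage carries a WikiTitle rather than a String, and the real classes inherit from Traversable instead of the small Source trait assumed here.

```scala
object CompositeSketch {
  // Simplified stand-in for the framework's WikiPage (title as a plain String).
  final case class WikiPage(title: String, id: Int, revision: Long, source: String)

  // Stand-in for Source, reduced to the two members this sketch needs.
  trait Source {
    def foreach[U](f: WikiPage => U): Unit
    def hasDefiniteSize: Boolean
  }

  // Yields pages from a user-defined container.
  final class MemorySource(pages: Iterable[WikiPage]) extends Source {
    def foreach[U](f: WikiPage => U): Unit = pages.foreach(f)
    // Assumption: the in-memory container is finite.
    def hasDefiniteSize: Boolean = true
  }

  // Iterates through all pages of all child sources, one child after another.
  final class CompositeSource(children: Seq[Source]) extends Source {
    def foreach[U](f: WikiPage => U): Unit = children.foreach(_.foreach(f))
    // The composite is finite only if every child is finite.
    def hasDefiniteSize: Boolean = children.forall(_.hasDefiniteSize)
  }
}
```

Deriving hasDefiniteSize from the children is a design choice of this sketch; a single unbounded child makes the whole composite unbounded.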
3. Using a Source
As Source inherits from Traversable, it can be used in the same way as any other Scala collection.
All pages can be enumerated using a for-comprehension:
for(page <- source) process(page)
To extract all pages of a source, you can use map-reduce:
source.map(parser).map(extractor).reduceLeft(_ merge _)
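The two snippets above can be tried out with ordinary collections. In the sketch below a List of pages stands in for a Source, and parser, extractor, ParsedPage and Graph are hypothetical stand-ins invented for this example; the framework's own parser, extractor and result types are different.

```scala
object UsageSketch {
  // Hypothetical stand-ins for the framework's page, parse result and extraction result.
  final case class WikiPage(title: String, source: String)
  final case class ParsedPage(title: String, words: List[String])
  final case class Graph(facts: Set[String]) {
    def merge(other: Graph): Graph = Graph(facts ++ other.facts)
  }

  // Toy parser: splits the page text into words.
  def parser(page: WikiPage): ParsedPage =
    ParsedPage(page.title, page.source.split("\\s+").toList)

  // Toy extractor: produces one fact per page.
  def extractor(page: ParsedPage): Graph =
    Graph(Set(page.title + " wordCount " + page.words.size))

  // Any Scala collection of pages can stand in for a Source here.
  val source: List[WikiPage] = List(
    WikiPage("Berlin", "capital of Germany"),
    WikiPage("Paris", "capital of France")
  )

  // Enumerate all pages with a for-comprehension.
  def printTitles(): Unit = for (page <- source) println(page.title)

  // Map every page through parser and extractor, then merge the partial results.
  val result: Graph = source.map(parser).map(extractor).reduceLeft(_ merge _)
}
```

Note that reduceLeft throws an exception on an empty collection, so reduceLeftOption or foldLeft with an empty starting value is the safer choice when a source may yield no pages.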
4. Adding new Sources
In order to create a new source, two methods should be overridden:
- foreach[U](f : scala.Function1[WikiPage, U]) : Unit : Iterates through all pages and calls the provided function on each page.
- hasDefiniteSize : scala.Boolean : Returns true if the collection is finite, and false otherwise.
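A new source along these lines might look like the sketch below. It is written against a small stand-in Source trait rather than scala.collection.Traversable, because in Scala 2.13 and later Traversable is only a deprecated alias of Iterable and no longer has foreach as its abstract method. IteratorSource and the simplified WikiPage are hypothetical names introduced for this example.

```scala
object CustomSourceSketch {
  // Simplified stand-in for the framework's WikiPage (title as a plain String).
  final case class WikiPage(title: String, id: Int, revision: Long, source: String)

  // Stand-in for Source, reduced to the two methods named above.
  trait Source {
    def foreach[U](f: WikiPage => U): Unit
    def hasDefiniteSize: Boolean
  }

  // Hypothetical source that pulls pages from a fresh iterator on every traversal.
  // `finite` is supplied by the caller, since an iterator may be unbounded.
  final class IteratorSource(newIterator: () => Iterator[WikiPage], finite: Boolean)
      extends Source {
    def foreach[U](f: WikiPage => U): Unit = newIterator().foreach(f)
    def hasDefiniteSize: Boolean = finite
  }

  // A for-comprehension without yield only requires foreach.
  def collectTitles(source: Source): List[String] = {
    val titles = List.newBuilder[String]
    for (page <- source) titles += page.title
    titles.result()
  }
}
```

Creating the iterator inside foreach lets the source be traversed more than once, which a single shared Iterator would not allow.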
Information
Last Modification:
2010-03-11 14:07:51 by Robert Isele