Extractor
An extractor is a mapping from a page node to a graph of statements about it. All relevant classes are located in the
org.dbpedia.extraction.extractors package.
1. Overview

2. Available Extractors
2.1. Label Extractor
Extracts labels for articles based on their title.
Supported languages: All languages
2.2. Mapping Extractor
Extracts structured data based on hand-generated mappings of Wikipedia infoboxes to the DBpedia ontology. Mappings can be edited via the
Mappings Wiki.
Supported languages: All languages for which mappings are available.
2.3. Infobox Extractor
Extracts all properties from all infoboxes. Extracted information is represented using properties in the
http://dbpedia.org/property/ namespace. The names of these properties directly reflect the names of the Wikipedia infobox properties. Property names are not cleaned or merged. Property types are not part of a subsumption hierarchy, and there is no consistent ontology for the infobox dataset. The infobox extractor performs only a minimal amount of property value clean-up, e.g., by converting a value like June 2009 to the XML Schema format 2009-06. You should therefore use the infobox dataset only if your application requires complete coverage of all Wikipedia properties and you are prepared to accept relatively noisy data.
Supported languages: All languages
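The kind of value clean-up described above can be sketched as follows. This is only an illustration of turning a month-and-year string into the XML Schema gYearMonth form, written against the standard java.time API; the object and method names are hypothetical, and the framework's own parsers may work quite differently.

```scala
import java.time.YearMonth
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical helper, not part of the DBpedia framework.
object DateCleanupSketch {
  // Matches English values such as "June 2009" (full month name, four-digit year).
  private val monthYear = DateTimeFormatter.ofPattern("MMMM uuuu", Locale.ENGLISH)

  // Returns the value in xsd:gYearMonth form (yyyy-MM),
  // or None if the raw value does not match the expected pattern.
  def toGYearMonth(raw: String): Option[String] =
    Try(YearMonth.parse(raw.trim, monthYear).toString).toOption
}
```

Under these assumptions, DateCleanupSketch.toGYearMonth("June 2009") yields Some("2009-06"), while a value that does not match the pattern yields None and would be left as it is.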
2.4. Wiki Page Extractor
Extracts links to the corresponding articles in Wikipedia.
Supported languages: All languages
2.5. Page Links Extractor
Extracts internal links between DBpedia instances from the internal page links between Wikipedia articles. The page links can be useful for structural analysis, data mining, or ranking DBpedia instances using PageRank or similar algorithms.
Supported languages: All languages
2.6. Geo Extractor
Extracts geographic coordinates.
Supported languages: All languages
2.7. Article Categories Extractor
Extracts links from concepts to categories using the SKOS vocabulary.
Supported languages: en
2.8. Category Label Extractor
Extracts labels for Categories.
Supported languages: en
2.9. Image Extractor
Extracts the first image of a Wikipedia page and provides both a thumbnail constructed from it and the full-size image.
Supported languages: en
2.10. External Links Extractor
Extracts links to external web pages.
Supported languages: All languages
2.11. Homepage Extractor
Extracts links to the official homepage of an instance.
Supported languages: en, de, fr
2.12. Disambiguation Extractor
Extracts disambiguation links.
Supported languages: All languages
2.13. Persondata Extractor
Extracts information about persons (date and place of birth etc.) from the English and German Wikipedia, represented using the FOAF vocabulary.
Supported languages: en, de
2.14. Pnd Extractor
Extracts PND (Personennamendatei) data about a person. PND is published by the German National Library. For each person there is a record containing their name, birth details and occupation, linked to a unique identifier, the PND number.
Supported languages: en, de
2.15. Skos Categories Extractor
Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary.
Supported languages: en
2.16. Redirect Extractor
Extracts redirect links between articles in Wikipedia.
Supported languages: All languages
3. Using an Extractor
Because an Extractor is a first-class function, using one only requires calling it with the page node.
Because all extractors are thread-safe, they can be called from multiple threads without further synchronization.
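A minimal sketch of what calling an extractor looks like, assuming it behaves like a plain function from a page node to a graph. The PageNode, Statement and Graph types below are simplified stand-ins written for this example, not the framework's real classes, and labelExtractor is a hypothetical extractor.

```scala
// Simplified stand-ins for illustration; NOT the framework's real classes.
final case class PageNode(title: String)
final case class Statement(subject: String, predicate: String, value: String)
final case class Graph(statements: List[Statement])

object UsageSketch {
  // Assumption: an extractor can be treated as a function PageNode => Graph.
  type Extractor = PageNode => Graph

  // Hypothetical extractor that emits one rdfs:label statement per page.
  val labelExtractor: Extractor = page =>
    Graph(List(Statement(
      "http://dbpedia.org/resource/" + page.title.replace(' ', '_'),
      "http://www.w3.org/2000/01/rdf-schema#label",
      page.title)))

  // Using an extractor is an ordinary function call ...
  def extractOne(page: PageNode): Graph = labelExtractor(page)

  // ... so it can be passed to higher-order functions such as map.
  def extractAll(pages: List[PageNode]): List[Graph] = pages.map(labelExtractor)
}
```

Since the extractor in this sketch holds no mutable state, the same value could be shared by several worker threads, which matches the thread-safety guarantee stated above.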
4. Implementing new Extractors
To implement a new extractor, inherit from the Extractor class and implement the extract method, which takes three arguments:
- page : PageNode : The page node represents the root of the Abstract Syntax Tree (AST) that represents the current MediaWiki page.
- subjectUri : String : This is the URI of the instance that is currently being extracted.
- context : PageContext : The page context holds the mutable state of the current page extraction. Among other things, it can be used to generate URIs.
The extracted statements are returned as a Graph from the extract method.
Note that each Extractor must be thread-safe.
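The following sketch illustrates the steps above under stated assumptions. The Extractor trait shown here is a simplified reconstruction based only on the description in this section (a function on page nodes with a three-argument extract method); PageNode, PageContext, Quad and Graph are stand-ins, and TitleLengthExtractor and its property URI are made up for the example.

```scala
// Simplified stand-ins for illustration; the framework's real classes differ.
final case class PageNode(title: String)
final case class Quad(subject: String, predicate: String, value: String)
final case class Graph(quads: List[Quad])

// Stand-in for the per-page mutable state described above (left empty here).
final class PageContext

// Assumed shape of the base class: a function on page nodes that delegates to extract.
trait Extractor extends (PageNode => Graph) {
  def extract(page: PageNode, subjectUri: String, context: PageContext): Graph

  // A fresh PageContext per call keeps mutable state out of the extractor itself.
  def apply(page: PageNode): Graph =
    extract(page, "http://dbpedia.org/resource/" + page.title.replace(' ', '_'), new PageContext)
}

// Hypothetical extractor: it has no mutable fields, so sharing it across threads is safe.
object TitleLengthExtractor extends Extractor {
  private val predicate = "http://example.org/titleLength" // made-up property URI

  def extract(page: PageNode, subjectUri: String, context: PageContext): Graph =
    Graph(List(Quad(subjectUri, predicate, page.title.length.toString)))
}
```

Keeping all per-page state in the PageContext argument rather than in fields of the extractor is one straightforward way to satisfy the thread-safety requirement.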
5. Related Extractors
Other projects may reuse and/or extend extractors from DBpedia. For example, the DBpedia Spotlight extraction pipeline contains extractors for mentions of DBpedia resources within Wikipedia paragraphs. For more information, see the Data Generation page for the project.
Information
Last Modification:
2011-11-08 17:17:25 by Pablo Mendes