DBpedia Spotlight – Developers Documentation
The Spotlight annotation task starts with a string that may contain mentions of DBpedia concepts. The general procedure can be broken down into four stages.
1. Spotting: identification of surface forms (substrings of the original input) that may be entity mentions.
2. Candidate Selection: selecting a set of surface forms from step 1 along with the DBpedia resources that are candidate meanings for those surface forms.
3. Disambiguation: deciding on the most likely candidate resource for each selected surface form.
4. Filtering: adjusting the annotations to task-specific requirements according to user-provided configuration.
The output, after the whole procedure is completed, is a set of annotations associating the input string with DBpedia resources that were detected in that string.
Each of the four steps potentially uses some data from DBpedia. We first describe each step in more detail (Execution Workflow) and then explain how each dataset can be generated (Data Generation Workflow).
Execution Workflow
Spotting
Spotting identifies surface forms of interest in the text. We have three implementations for this:
- LingPipeSpotter: uses the LingPipe ExactDictionaryChunker, which is based on the Aho-Corasick algorithm. It is memory-intensive (approx. 8 GB) because it holds the list of known surface forms in memory. It is the most reliable implementation we have so far.
- TrieSpotter: based on a Ternary Interval Search Tree. It is not as memory-intensive, but the code is not as mature, e.g. it finds overlapping surface forms.
- LingPipeChunkSpotter: uses the HmmDecoder of LingPipe to apply part-of-speech tagging to input sentences. Some hand-crafted rules then build noun-phrase chunks. Each chunk is looked up in the index to check whether it exists as a surface form (the spotter itself does not have a dictionary of surface forms). If a chunk does not exist in the index as a surface form, the spotter truncates the chunk from the beginning and looks it up again. This method has limited coverage: it aims at very specific concepts and tries to avoid over-annotation.
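The truncation fallback of the chunk-based spotter can be sketched as follows. This is only an illustration and not the project's code: the function name `lookup_chunk`, the token-list input and the in-memory set that stands in for the index are all assumptions.

```python
from typing import Optional, Sequence, Set


def lookup_chunk(tokens: Sequence[str], known_surface_forms: Set[str]) -> Optional[str]:
    """Return the longest suffix of the chunk that is a known surface form.

    The chunk is truncated from the beginning, one token at a time,
    until a lookup succeeds or no tokens are left.
    """
    for start in range(len(tokens)):
        candidate = " ".join(tokens[start:])
        if candidate in known_surface_forms:
            return candidate
    return None


# Hypothetical stand-in for the surface forms stored in the index.
surface_forms = {"search engine", "engine"}

found = lookup_chunk(["popular", "web", "search", "engine"], surface_forms)
missing = lookup_chunk(["blue", "sky"], surface_forms)
```

Because truncation starts at the beginning of the chunk, the longest matching suffix wins, so "search engine" is preferred over "engine" in this example.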
Candidate Selection
The main objective of the Candidate Selection step is to find candidate DBpedia resources given a surface form. It is also a chance to perform some pre-filtering to speed up the Disambiguation step. The Occurrence-Centric Disambiguators (interface Disambiguator) perform Candidate Selection and Disambiguation in one merged step. The Document-Centric Disambiguators (interface ParagraphDisambiguator) perform candidate selection and use those candidates as a filter for the context scoring step in Disambiguation. With a candidate map loaded in main memory we were able to speed up disambiguation by up to 200x in preliminary tests.
Disambiguating
We use a Lucene index and extend Lucene's Similarity class to compare the context of a surface form with the context of all candidate resources for this surface form. The index is trained with different context information (see below). Each document in the index represents all disambiguation information (plus type information) gathered for a DBpedia resource. We merge all context that we can collect for a specific resource into one Lucene document. The disambiguation step queries the Lucene index, asking for the most similar context given a surface form. From the ranking that is returned, we extract the resource URI of the highest ranked document. This resource is the disambiguated concept.
The Indexing Context section contains information on how to create an index.
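The ranking idea behind the disambiguation step can be illustrated without Lucene. The sketch below swaps in plain cosine similarity over term counts, which is not Lucene's scoring and not our Similarity extension; the function names and the toy contexts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context: str, candidate_contexts: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score every candidate resource against the context, best first."""
    query = bag_of_words(context)
    scored = [(uri, cosine(query, bag_of_words(text)))
              for uri, text in candidate_contexts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical merged contexts, one "document" per candidate resource.
candidates = {
    "Apple_Inc.": "computer company iphone mac software hardware",
    "Apple": "fruit tree orchard sweet red green",
}
ranking = rank_candidates("the company released a new mac computer", candidates)
best_uri = ranking[0][0]  # URI of the highest-ranked candidate
```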
Filtering
In a last optional step, the returned disambiguations can be filtered in order to adjust the output for specific application needs. While some applications require higher precision, others can tolerate a few errors in order to increase recall. The Confidence configuration applies two checks:
1. The similarity score of the first ranked entity must be greater than a threshold.
2. The gap between the similarity scores of the first and second ranked entities must be greater than a relative threshold.
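The two checks can be sketched as a single predicate. The function and parameter names are hypothetical, and the sketch assumes that the gap is measured relative to the score of the first ranked entity; the actual thresholds and their scaling may differ.

```python
def passes_confidence(first_score: float, second_score: float,
                      min_score: float, min_relative_gap: float) -> bool:
    """Apply the two confidence checks to a ranked disambiguation result."""
    # Check 1: the top score must exceed the absolute threshold.
    if first_score <= min_score:
        return False
    # Check 2 (assumption): the gap, relative to the top score, must exceed
    # the relative threshold. Written without division to avoid dividing by zero.
    return (first_score - second_score) > min_relative_gap * first_score


clear_winner = passes_confidence(0.8, 0.2, min_score=0.5, min_relative_gap=0.5)
close_call = passes_confidence(0.8, 0.7, min_score=0.5, min_relative_gap=0.5)
weak_match = passes_confidence(0.3, 0.0, min_score=0.5, min_relative_gap=0.5)
```

Raising either threshold trades recall for precision, which matches the application-specific tuning described above.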
It is also possible to constrain the annotations to a given subdomain of interest. Users can filter out all DBpedia resources of a given type (blacklisting), or allow only certain types to be annotated (whitelisting). The definition of a subdomain of interest need not be only via types, as arbitrary SPARQL queries are supported by the system.
The class AnnotationFilter is responsible for performing the filtering tasks. SparqlQueryExecuter retrieves query results from SPARQL endpoints via the SPARQL Protocol.
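Type-based whitelisting and blacklisting can be sketched as below. This is not the AnnotationFilter implementation: the dictionary layout of an annotation and the function name `filter_by_types` are assumptions made for the example.

```python
from typing import Dict, List, Set


def filter_by_types(annotations: List[Dict], types: Set[str], whitelist: bool) -> List[Dict]:
    """Keep (whitelist) or drop (blacklist) annotations that have any of the given types."""
    kept = []
    for annotation in annotations:
        matches = bool(types & set(annotation["types"]))
        # Whitelist keeps matching annotations; blacklist keeps the non-matching ones.
        if matches == whitelist:
            kept.append(annotation)
    return kept


# Hypothetical annotations with DBpedia Ontology type names.
annotations = [
    {"uri": "IBM", "types": ["Organisation", "Company"]},
    {"uri": "Steve_Jobs", "types": ["Person"]},
]
only_people = filter_by_types(annotations, {"Person"}, whitelist=True)
no_people = filter_by_types(annotations, {"Person"}, whitelist=False)
```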
See also: User's Manual: Configuration.
Data Generation Workflow
In order to enable the execution of the tasks above, we need to obtain data from DBpedia and text from Wikipedia and perform a series of steps to prepare data for DBpedia Spotlight.
Original Files
Our pipeline starts from the following files:
- DBpedia Labels: Used to have a complete set of all DBpedia URIs.
http://downloads.dbpedia.org/3.5.1/en/labels_en.nt.bz2
- DBpedia Redirects: Used to generate alternative/preferred URIs, as well as surface forms (surrogate mapping)
http://downloads.dbpedia.org/3.5.1/en/redirects_en.nt.bz2
- DBpedia Disambiguation: Used to generate bad URIs as well as surface forms (surrogate mapping)
http://downloads.dbpedia.org/3.5.1/en/disambiguations_en.nt.bz2
- DBpedia Ontology Infobox Types: Used to create type-specific datasets (e.g. paragraphs containing mentions of People)
http://downloads.dbpedia.org/3.5.1/en/instance_types_en.nt.bz2
- DBpedia Mapping Based Properties (commonly referred to as the DBpedia Graph): Used for the graph methods and for understanding our mistakes during evaluation
http://downloads.dbpedia.org/3.5.1/en/mappingbased_properties_en.nt.bz2
- Wikipedia XML Dump: Used as the main source of occurrences (paragraphs, disambiguation sentences and definition pages) – enwiki-20100312-pages-articles.xml
- The Web (via Yahoo! BOSS): Used as an alternative source of occurrences (paragraphs mentioning DBpedia resources, i.e. links to Wikipedia pages that are resources)
Generated Datasets
Concept URIs, Redirects and Surface Forms
The class Import DBpedia Data was used to process the DBpedia files (redirects and disambiguation) to detect chains of redirects and URLs that are not valid URIs (e.g. disambiguation pages).
The following files are generated. All necessary functions are in util.SurrogatesUtil.
- Concept URIs (a list): 1. Start with labels of all pages. 2. Subtract all redirect URIs. 3. Subtract all disambiguation URIs. This results in 3,045,254 Concept URIs.
- Redirect Map: We compute the transitive closure of redirects, collecting every URI that does not have content, but rather points to another DBpedia URI. This file stores a map from redirect URI to concept URI (saved as TSV file).
- Candidate Map: a map from surface form to candidate Concept URI (saved as TSV file; exportable to NT format). This results in 7,305,457 associations. That is a mean of 7,305,457/3,045,254=2.4 surface forms per URI, and a mean ambiguity of XXXXX candidate URIs per Surface Form.
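The set arithmetic for Concept URIs and the transitive closure of redirects can be sketched as follows. The function names and the tiny example data are hypothetical and do not reflect util.SurrogatesUtil; dropping redirect cycles is also an assumption of this sketch.

```python
from typing import Dict, Set


def concept_uris(labels: Set[str], redirects: Dict[str, str], disambiguations: Set[str]) -> Set[str]:
    """All labelled URIs minus redirect URIs minus disambiguation URIs."""
    return labels - set(redirects) - disambiguations


def resolve_redirects(redirects: Dict[str, str]) -> Dict[str, str]:
    """Map every redirect URI to the URI at the end of its redirect chain."""
    resolved = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until it leaves the redirect table or loops.
        while target in redirects and target not in seen:
            seen.add(target)
            target = redirects[target]
        if target not in redirects:  # chains that end in a cycle are dropped
            resolved[source] = target
    return resolved


labels = {"Apple_Inc.", "Apple_Computer", "Apple_Computer_Inc", "Apple_(disambiguation)"}
redirects = {"Apple_Computer_Inc": "Apple_Computer", "Apple_Computer": "Apple_Inc."}
disambiguations = {"Apple_(disambiguation)"}

concepts = concept_uris(labels, redirects, disambiguations)
redirect_map = resolve_redirects(redirects)
```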
Wikipedia Paragraph Occurrences
The class used to extract DBpediaResourceOccurrences from mentions of resources in Wikipedia is WikiOccurrenceSource. We take the paragraph in which a wiki link occurs as context, the anchor text of the link as surface form and the target Wikipedia page of the link as resource URI.
There is the option of extracting clean occurrences by applying an OccurrenceFilter, which leverages some of the files above to keep only concept URIs, to normalize redirect URIs to concept URIs and to allow only surface forms that are in the surface form dictionary for a given URI (this avoids littering the surface form space with "here" or with "USA" for 300 URIs). This is possible for all OccurrenceSources.
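The three cleaning rules can be sketched as one function. The name `clean_occurrence`, its arguments and the example data are hypothetical; the sketch only mirrors the rules stated above and is not the project's filter class.

```python
from typing import Dict, Optional, Set, Tuple


def clean_occurrence(uri: str, surface_form: str,
                     concepts: Set[str],
                     redirect_map: Dict[str, str],
                     surface_forms_by_uri: Dict[str, Set[str]]) -> Optional[Tuple[str, str]]:
    """Return the normalized (uri, surface form) pair, or None if it is filtered out."""
    # Normalize redirect URIs to their concept URI.
    uri = redirect_map.get(uri, uri)
    # Keep only concept URIs.
    if uri not in concepts:
        return None
    # Allow only surface forms listed in the dictionary for this URI.
    if surface_form not in surface_forms_by_uri.get(uri, set()):
        return None
    return uri, surface_form


concepts = {"Apple_Inc."}
redirect_map = {"Apple_Computer": "Apple_Inc."}
surface_forms_by_uri = {"Apple_Inc.": {"Apple", "Apple Inc."}}

kept = clean_occurrence("Apple_Computer", "Apple", concepts, redirect_map, surface_forms_by_uri)
noisy = clean_occurrence("Apple_Inc.", "here", concepts, redirect_map, surface_forms_by_uri)
not_concept = clean_occurrence("Apple_(disambiguation)", "Apple", concepts, redirect_map, surface_forms_by_uri)
```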
The occurrence ID consists of the encoded Wikipedia page title, the paragraph number and the wiki page link number in this paragraph.
For example, for the eighth paragraph, wiki link number three on the Wikipedia page of Apple Inc, the occurrence looks like this:
- id: Apple_Inc.-p8l3
- resource: IBM
- surface form: IBM
- context: By the end of the 1970s, Apple had a staff of computer designers and a production line. The company introduced the ill-fated Apple III in May 1980 in an attempt to compete with IBM and Microsoft in the business and corporate computing market.
- textOffset: 179
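Building and parsing such identifiers can be sketched as below. The sketch assumes the layout title, "-p", paragraph number, "l", link number with no spaces in between, and assumes that encoding the title only means replacing spaces with underscores; both function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed identifier layout: <encoded title>-p<paragraph number>l<link number>
_ID_PATTERN = re.compile(r"^(?P<title>.+)-p(?P<paragraph>\d+)l(?P<link>\d+)$")


def make_occurrence_id(page_title: str, paragraph: int, link: int) -> str:
    """Build an occurrence ID from a page title and two counters."""
    encoded_title = page_title.replace(" ", "_")  # simplified title encoding
    return f"{encoded_title}-p{paragraph}l{link}"


def parse_occurrence_id(occurrence_id: str) -> Tuple[str, int, int]:
    """Split an occurrence ID back into (encoded title, paragraph, link)."""
    match = _ID_PATTERN.match(occurrence_id)
    if match is None:
        raise ValueError(f"not an occurrence ID: {occurrence_id!r}")
    return match.group("title"), int(match.group("paragraph")), int(match.group("link"))


occurrence_id = make_occurrence_id("Apple Inc.", 8, 3)
parts = parse_occurrence_id(occurrence_id)
```

Being able to parse the identifier is what allows indexing or evaluation code to record the last processed occurrence and resume from there.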
Wikipedia Definition Pages
The class used to extract DBpediaResourceOccurrences from the Wikipedia definition pages is WikiPageContextSource. We take the complete page text as context, the label of the page (minus brackets) as surface form and the URI of the page as resource.
Occurrence Filtering does not make a lot of sense here.
Wikipedia Disambiguation Sentences
The class used to extract DBpediaResourceOccurrences from the Wikipedia disambiguation pages is DisambiguationContextSource. We check all sentences that are preceded by a bullet point. If a disambiguation sentence contains a link whose anchor text contains the disambiguation page title (or the title is an acronym of the anchor text), we take the disambiguation sentence as context, the anchor text of the link as surface form and the target Wikipedia page as resource.
Occurrence Filtering as explained for Wikipedia Paragraph Occurrences can be applied too.
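The acceptance rule for links in disambiguation sentences can be sketched as follows. The function names are hypothetical, and the sketch assumes case-insensitive matching and an acronym built from the first letter of each whitespace-separated word; the actual class may tokenize differently.

```python
def is_acronym_of(title: str, anchor_text: str) -> bool:
    """True if the title equals the initial letters of the anchor text words."""
    initials = "".join(word[0] for word in anchor_text.split())
    return bool(title) and title.lower() == initials.lower()


def accepts_link(page_title: str, anchor_text: str) -> bool:
    """Accept a link if its anchor text contains the title or the title is its acronym."""
    if not page_title:
        return False
    contains_title = page_title.lower() in anchor_text.lower()
    return contains_title or is_acronym_of(page_title, anchor_text)


by_containment = accepts_link("Mercury", "Mercury (planet)")
by_acronym = accepts_link("IBM", "International Business Machines")
rejected = accepts_link("Mercury", "Hermes")
```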
Web Paragraphs
Starting from the Concept URIs that we have collected, we use Yahoo! BOSS to search the web for pages that contain links to the Wikipedia articles of these concepts. We then extract the paragraphs that contain the links as context, take the anchor text as surface form and the target URI as resource. The surface forms are littered with "here", "Wikipedia article" and the like, so they should be filtered in another step.
Model
We use Occurrence objects to model mentions of resources with the context in which they have been mentioned. There are two types of occurrences: DBpediaResourceOccurrences and SurfaceFormOccurrences. The difference is that the latter is not disambiguated and therefore lacks some information that the former has.
- SurfaceFormOccurrence
  - surfaceForm: SurfaceForm object that holds the surface form string as its name
  - context: Text object that holds the context string as its text
  - textOffset: integer that describes the position of the surface form in the text
  - (provenance, but it is not utilized anywhere)
SurfaceFormOccurrence objects are returned from the spotter or from ParseSurfaceFormText (when spotting is omitted). They are not assigned a resource.
- DBpediaResourceOccurrence
  - id: a string that holds a unique identifier for occurrences from Wikipedia or the Web (see Wikipedia Paragraph Occurrences). Adding an identifier to each occurrence allows us to track them through the pipeline. For example, indexing or evaluation code can stop, record the last id and restart from there.
  - resource: the DBpediaResource object that the occurrence refers to
    - uri: the string of the URI
    - support: integer that holds the number of Wikipedia inlinks
    - types: list of DBpediaType objects. Each of these objects holds a string with a type from the DBpedia Ontology (http://wiki.dbpedia.org/Ontology). The list is sorted from most general (without OWL#Thing) to most specific.
  - surface form: SurfaceForm object that holds the surface form string as its name
  - context: Text object that holds the context string as its text
  - textOffset: integer that describes the position of the surface form in the text
  - (provenance, but it is not utilized anywhere)
  - similarity score: double that holds the similarity score returned by Lucene
  - percentage of second ranked resource: double that holds (sim_score_second_rank / sim_score_first_rank)
For training data, i.e. the data from Wikipedia that is used to build our disambiguation index, the last two attributes are not relevant. After the disambiguation step, however, we can assign the most probable DBpedia resource, a similarity score for that assignment and a second score that captures the gap between the similarity scores of the assigned resource and the second-ranked resource.
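The two occurrence types can be mirrored with a few data classes. This is a simplified sketch rather than the project's model: surface form and context are plain strings instead of wrapper objects, provenance is left out, the use of inheritance is an assumption, and `assign_resource` is a hypothetical helper that shows how the two score attributes are filled after disambiguation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DBpediaResource:
    uri: str
    support: int = 0                                 # number of Wikipedia inlinks
    types: List[str] = field(default_factory=list)   # most general to most specific


@dataclass
class SurfaceFormOccurrence:
    surface_form: str
    context: str
    text_offset: int


@dataclass
class DBpediaResourceOccurrence(SurfaceFormOccurrence):
    occurrence_id: str = ""
    resource: Optional[DBpediaResource] = None
    similarity_score: Optional[float] = None
    percentage_of_second_rank: Optional[float] = None


def assign_resource(spotted: SurfaceFormOccurrence,
                    ranking: List[Tuple[str, float]]) -> DBpediaResourceOccurrence:
    """Attach the top-ranked resource; `ranking` holds (uri, score) pairs, best first."""
    top_uri, top_score = ranking[0]
    second_score = ranking[1][1] if len(ranking) > 1 else 0.0
    ratio = second_score / top_score if top_score > 0 else 0.0
    return DBpediaResourceOccurrence(
        surface_form=spotted.surface_form,
        context=spotted.context,
        text_offset=spotted.text_offset,
        resource=DBpediaResource(uri=top_uri),
        similarity_score=top_score,
        percentage_of_second_rank=ratio,
    )


context = "an attempt to compete with IBM and Microsoft"
spotted = SurfaceFormOccurrence("IBM", context, context.index("IBM"))
disambiguated = assign_resource(spotted, [("IBM", 0.8), ("IBM_(band)", 0.2)])
```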
Indexing Context
The Wikipedia Paragraph Occurrences section describes how to get context for disambiguation and how to optionally filter this context to reduce noise. We save the collected context to a TSV file. Then we sort this TSV file by resource URI, so that all DBpediaResourceOccurrences of the same URI appear in consecutive rows. This is important for indexing, as it is a lot faster to merge all occurrences of a resource into one Lucene document in memory and then write it to disk than it is to merge documents on disk with other occurrences. Therefore we suggest the use of MergedOccurrencesContextIndexer for efficiency.
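Why sorting matters can be seen in a small sketch that merges contexts in a single pass. It assumes a hypothetical two-column layout (resource URI, context) instead of the real TSV columns, and it only joins strings where the real indexer would build a Lucene document.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, Tuple


def merged_documents(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield one (uri, merged context) pair per resource.

    groupby only merges adjacent rows, which is why the input
    has to be sorted by URI beforehand.
    """
    for uri, group in groupby(rows, key=itemgetter(0)):
        yield uri, " ".join(context for _, context in group)


# Hypothetical occurrence rows: (resource URI, context).
rows = [
    ("IBM", "compete with IBM"),
    ("Apple_Inc.", "Apple had a staff"),
    ("IBM", "IBM mainframes"),
]
documents = list(merged_documents(sorted(rows, key=itemgetter(0))))
unsorted_documents = list(merged_documents(rows))
```

With sorted input each resource yields exactly one merged document; with unsorted input the same URI shows up in several partial documents that would have to be merged again later.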
Once the disambiguation context has been indexed, we advise adding additional surface forms and DBpedia types to the index. Indexing the surface forms again ensures that a resource can be a candidate for all of its possible surface forms, even if we never saw a valid surface form in the data. Indexing DBpedia types makes it possible to retrieve only resources of certain types. The class IndexEnricher can add this additional data.
Take a look at the file bin/index.sh in our distribution for more practical advice on how to run commands.
Indexing Spotter Dictionary
The class used for creating the spotter dictionary is IndexLingPipeSpotter. The input for indexing the spotter is the surface forms TSV file from the section on Concept URIs, Redirects and Surface Forms. The input should be sorted by URI. There is also the option to build the spotter dictionary from the surface forms stored in the index.
Information
Last Modification:
2011-09-28 15:43:40 by Pablo Mendes