DBpedia Spotlight – User's Manual
DBpedia Spotlight is a tool for annotating mentions of DBpedia concepts in plain text.
We offer three basic functions: Annotate, Disambiguate and Candidates (Best K). They can be accessed through a Scala / Java API, a REST Web Service and a Web user interface (HTML/Javascript). The Scala / Java API has a number of configuration parameters that control the annotation and disambiguation functions. The classes DefaultAnnotator, DefaultDisambiguator and DefaultParagraphDisambiguator provide the configuration that we found to give the best results. The configuration interface offers ways to control the quality of the output of the annotation and disambiguation tasks.
Architecture
The DBpedia Spotlight architecture is composed of the following modules:
- Web application, a demonstration client (HTML/Javascript interface) that allows users to enter/paste text into a Web browser and visualize the resulting annotated text.
- Web Service, a RESTful/SOAP? Web API that exposes the functionality of annotating and/or disambiguating entities in text.
- Annotation Java / Scala API, exposing the underlying logic that performs the annotation/disambiguation.
- Indexing Java / Scala API, executing the data processing necessary to enable the annotation/disambiguation algorithms used.
- Evaluation module, where we test disambiguators, log results and use those to train our system to perform better.
External dependencies:
- DBpedia Extraction Framework (only for the index module), extracting the necessary data from the Wikipedia dumps.
- Lucene 2.9.3, providing the low level indexing framework used by DBpedia Spotlight.
- LingPipe 4.0.0, providing the string matching implementation used for the Spotter module.
System Requirements
- Java 1.6+
- Scala 2.8+
- Spotlight JAR
- Spotlight Library JARs
- Lucene disambiguation index
- Spotter dictionary
- Enough RAM to give the JVM a heap large enough for the Spotter (approx. 8 GB)
- Maven 2 for the automagic installation of dependencies.
Programmatic usage
If you want to use DBpedia Spotlight in your Java / Scala code, take a look at core/SpotlightFactory to see how you can create your objects, and then look at rest/Candidates.java to see how you can wire them together.
Online Usage
Web Application
The Web Application is located at
http://spotlight.dbpedia.org/demo/index.xhtml .
Web Service
The Web Service is located at
http://spotlight.dbpedia.org/rest/annotate and
http://spotlight.dbpedia.org/rest/disambiguate.
It is also possible to query it with the filtering parameters specified above. Example calls are provided below.
Content Negotiation
You can request different types of output by setting the Accept
request header.
For example, in order to request JSON output, you can add "Accept:application/json" to the request headers.
One example using cURL:
The content types we currently support are:
- text/html
- application/xhtml+xml
- text/xml
- application/json
The application/xhtml+xml output comes with embedded RDFa, which you can pass to the
RDFa Distiller to get RDF triples in Turtle, RDF+XML, etc. as output.
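The content negotiation described above can be sketched in Python as follows. This is only an illustration: the helper name build_annotate_request is hypothetical, and the query parameters text, confidence and support are assumed to be accepted on GET because they appear in the POST example and in the XML output of this manual. The sketch builds the request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request

ANNOTATE_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_annotate_request(text, accept="application/json", confidence=0.2, support=20):
    """Build (but do not send) a GET request that asks for a given output type.

    Hypothetical helper; the parameter names are assumptions taken from the
    examples in this manual.
    """
    query = urlencode({"text": text, "confidence": confidence, "support": support})
    # The Accept header selects the output format (JSON, XML, HTML, XHTML+RDFa).
    return Request(ANNOTATE_URL + "?" + query, headers={"Accept": accept})


request = build_annotate_request("President Obama called Wednesday on Congress")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
# To actually call the service: urllib.request.urlopen(request).read()
```

Passing accept="text/xml" or accept="application/xhtml+xml" to the same helper requests one of the other supported content types.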
If your input text is long, you may prefer using POST instead of GET.
-H "content-type:application/x-www-form-urlencoded" \
-d "disambiguator=Document&confidence=-1&support=-1&text=President%20Obama%20called%20Wednesday%20on%20Congress%20to%20extend%20a%20tax%20break%20for%20students%20included%20in%20last%20year%27s%20economic%20stimulus%20package" \
Please note that you *must* use content-type application/x-www-form-urlencoded for POST requests.
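As a rough Python equivalent of the POST request above, the following sketch form-encodes the parameters and sets the required content type. The helper name build_post_request is hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

ANNOTATE_URL = "http://spotlight.dbpedia.org/rest/annotate"


def build_post_request(text, accept="text/xml", **params):
    """Build (but do not send) a form-encoded POST request (hypothetical helper)."""
    fields = dict(params, text=text)
    # urlencode escapes the text, so long inputs travel safely in the body.
    body = urlencode(fields).encode("ascii")
    headers = {
        "Accept": accept,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(ANNOTATE_URL, data=body, headers=headers, method="POST")


post_request = build_post_request(
    "President Obama called Wednesday on Congress to extend a tax break",
    disambiguator="Document",
    confidence=-1,
    support=-1,
)
# Decode the body again to show what would be sent.
print(parse_qs(post_request.data.decode("ascii")))
# To actually call the service: urllib.request.urlopen(post_request).read()
```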
Example 1: without type restriction
returns the XML
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20">
similarityScore="0.31504717469215393" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/United_States_Congress"
similarityScore="0.2348192036151886" percentageOfSecondRank="0.8635579006818564"/>
<Resource URI="http://dbpedia.org/resource/Tax_break"
similarityScore="0.35041093826293945" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/Student"
support="1701" types= surfaceForm="students" offset="71"
similarityScore="0.32534149289131165" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/Policy"
similarityScore="0.3228176236152649" percentageOfSecondRank="-1.0"/>
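A client would typically read the Resource elements of such a response and map them back onto the input text. The Python sketch below does this for the "students" annotation shown above. The wrapping element name Annotation and the empty types value are assumptions, because the sample output in this manual is truncated; the helper name read_annotations is hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT = "President Obama called Wednesday on Congress to extend a tax break for students"

# One Resource element from the example above; the root element name
# "Annotation" and types="" are assumptions made for this sketch.
SAMPLE = (
    '<Annotation text="' + TEXT + '" confidence="0.2" support="20">'
    '<Resource URI="http://dbpedia.org/resource/Student" support="1701" types=""'
    ' surfaceForm="students" offset="71"'
    ' similarityScore="0.32534149289131165" percentageOfSecondRank="-1.0"/>'
    "</Annotation>"
)


def read_annotations(xml_text):
    """Return (surface form, offset, URI, similarity score) for each Resource."""
    rows = []
    for resource in ET.fromstring(xml_text).iter("Resource"):
        rows.append((
            resource.get("surfaceForm"),
            int(resource.get("offset")),
            resource.get("URI"),
            float(resource.get("similarityScore")),
        ))
    return rows


annotations = read_annotations(SAMPLE)
for surface_form, offset, uri, score in annotations:
    # In this sample the offset is the character position of the surface form.
    matches = TEXT[offset:offset + len(surface_form)] == surface_form
    print(surface_form, offset, uri, score, matches)
```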
Example 2: with type restriction
returns the XML
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20" types="Person,Organisation">
similarityScore="0.31504717469215393" percentageOfSecondRank="-1.0"/>
<Resource URI="http://dbpedia.org/resource/United_States_Congress"
similarityScore="0.2348192036151886" percentageOfSecondRank="0.8635579006818564"/>
Example 3: with SPARQL restriction
returns the XML
for students included in last year's economic stimulus package, arguing that the policy
provides more generous assistance."
confidence="0.2" support="20"
sparql="SELECT DISTINCT ?x WHERE { ?x a <http://dbpedia.org/ontology/OfficeHolder> .
?x ?related <http://dbpedia.org/resource/Chicago> . }"
policy="whitelist">
similarityScore="0.2730408310890198" percentageOfSecondRank="-1.0"/>
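Examples 1 to 3 differ only in the restriction applied to the results. The Python sketch below builds the three corresponding query strings. It assumes that the request parameters carry the same names as the attributes echoed in the XML output (confidence, support, types, sparql, policy); the helper name build_filter_query is hypothetical.

```python
from urllib.parse import parse_qs, urlencode

SPARQL = (
    "SELECT DISTINCT ?x WHERE { "
    "?x a <http://dbpedia.org/ontology/OfficeHolder> . "
    "?x ?related <http://dbpedia.org/resource/Chicago> . }"
)


def build_filter_query(text, confidence=0.2, support=20,
                       types=None, sparql=None, policy=None):
    """Build a URL-encoded query string with optional result restrictions."""
    params = {"text": text, "confidence": confidence, "support": support}
    if types:
        # Several types are sent as one comma-separated value.
        params["types"] = ",".join(types)
    if sparql:
        params["sparql"] = sparql
    if policy:
        params["policy"] = policy
    return urlencode(params)


text = "President Obama called Wednesday on Congress"
unrestricted = build_filter_query(text)                                    # Example 1
typed = build_filter_query(text, types=["Person", "Organisation"])         # Example 2
office_holders = build_filter_query(text, sparql=SPARQL, policy="whitelist")  # Example 3

for query in (unrestricted, typed, office_holders):
    print(sorted(parse_qs(query)))
```

Each query string can be appended to the /rest/annotate URL for a GET request or sent as the body of a form-encoded POST request.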
Example 4: Candidates Interface
Input:
The parameters are the same as in /annotate, but you will send your request to
http://spotlight.dbpedia.org/rest/candidates
Output example:
<surfaceForm name="individuals" offset="67">
<resource label="The Individuals (New Jersey band)" uri="The_Individuals_%28New_Jersey_band%29" contextualScore="0.011762913316488266" percentageOfSecondRank="-1.0" support="17" priorScore="0.0" finalScore="0.011762913316488266"/>
<resource label="The Individuals (Chicago band)" uri="The_Individuals_%28Chicago_band%29" contextualScore="0.0" percentageOfSecondRank="-1.0" support="0" priorScore="0.0" finalScore="0.0"/>
</surfaceForm>
<surfaceForm name="officials" offset="233">
<resource label="Rugby league match officials" uri="Rugby_league_match_officials" contextualScore="0.04376954212784767" percentageOfSecondRank="-1.0" support="9" priorScore="0.0" finalScore="0.04376954212784767"/>
</surfaceForm>
<surfaceForm name="President Obama" offset="0">
</surfaceForm>
<surfaceForm name="1 million" offset="97">
</surfaceForm>
<surfaceForm name="percentage" offset="156">
</surfaceForm>
<surfaceForm name="earnings" offset="176">
</surfaceForm>
<surfaceForm name="taxpayers" offset="194">
<resource label="TaxPayers' Alliance" uri="TaxPayers%27_Alliance" contextualScore="0.12765906751155853" percentageOfSecondRank="-1.0" support="15" priorScore="0.0" finalScore="0.12765906751155853"/>
<resource label="The Taxpayer (Luxembourg)" uri="The_Taxpayer_%28Luxembourg%29" contextualScore="0.024930020794272423" percentageOfSecondRank="-1.0" support="3" priorScore="0.0" finalScore="0.024930020794272423"/>
<resource label="The Taxpayers" uri="The_Taxpayers" contextualScore="0.0" percentageOfSecondRank="-1.0" support="0" priorScore="0.0" finalScore="0.0"/>
</surfaceForm>
</annotation>
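The candidates output can be processed with any XML parser. The Python sketch below picks the candidate with the highest finalScore for each surface form, using two surface forms from the output above. The opening annotation tag is an assumption, since the sample shows only the closing tag, and the helper name best_candidates is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two surface forms copied from the output example; the opening
# <annotation> tag is assumed.
SAMPLE = """<annotation>
<surfaceForm name="individuals" offset="67">
<resource label="The Individuals (New Jersey band)" uri="The_Individuals_%28New_Jersey_band%29" contextualScore="0.011762913316488266" percentageOfSecondRank="-1.0" support="17" priorScore="0.0" finalScore="0.011762913316488266"/>
<resource label="The Individuals (Chicago band)" uri="The_Individuals_%28Chicago_band%29" contextualScore="0.0" percentageOfSecondRank="-1.0" support="0" priorScore="0.0" finalScore="0.0"/>
</surfaceForm>
<surfaceForm name="President Obama" offset="0">
</surfaceForm>
</annotation>"""


def best_candidates(xml_text):
    """Map each surface form to its highest-scoring candidate URI, or None."""
    best = {}
    for surface_form in ET.fromstring(xml_text).iter("surfaceForm"):
        candidates = surface_form.findall("resource")
        top = max(candidates, key=lambda r: float(r.get("finalScore")), default=None)
        # Surface forms without any candidate are kept with a None value.
        best[surface_form.get("name")] = top.get("uri") if top is not None else None
    return best


result = best_candidates(SAMPLE)
print(result)
```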
Information
Last Modification:
2011-09-29 18:28:46 by Pablo Mendes