TextExt - DBpedia Open Extraction Challenge

Disclaimer: This call is under constant development; please refer to the news section. We acknowledge the initial engineering effort required, so for the first submissions we will be lenient on technical requirements, focus the evaluation on the extracted triples, and allow late submissions if they are coordinated with us.

 

Background

DBpedia and Wikidata currently focus primarily on representing factual knowledge as contained in Wikipedia infoboxes. A vast amount of information, however, is contained in the unstructured Wikipedia article texts. With the DBpedia Open Text Extraction Challenge, we aim to spur knowledge extraction from Wikipedia article texts in order to dramatically broaden and deepen the amount of structured DBpedia/Wikipedia data and provide a platform for benchmarking various extraction tools.

Mission

Wikipedia has become the world's ubiquitous source of knowledge, enabling humans to look up definitions, quickly become familiar with new topics, read up on background information for news events and much more - even settling coffee-house arguments via quick mobile research. The mission of DBpedia is to harvest Wikipedia's knowledge, refine and structure it, and then disseminate it on the web - in a free and open manner - for IT users and businesses.

News and next events

Twitter: , Hashtag: #dbpedianlp

Coming soon:

Methodology

The DBpedia Open Text Extraction Challenge differs significantly from other challenges in language technology and other areas in that it is not a one-time call, but a continuously growing and expanding challenge, with a focus on sustainably advancing the state of the art and transcending boundaries in a systematic way. The DBpedia Association and the people behind this challenge are committed to providing the necessary infrastructure and driving the challenge for an indefinite time, as well as potentially extending the challenge beyond Wikipedia.

We provide the extracted and cleaned full text for all Wikipedia articles from nine languages in regular intervals for download and as a Docker image in the machine-readable NIF-RDF format (example for Barack Obama in English). Challenge participants are asked to wrap their NLP and extraction engines in Docker images and submit them to us. We will run participants' tools in regular intervals in order to extract:

  1. Facts, relations, events, terminology, ontologies as RDF triples (Triple track)

  2. Useful NLP annotations such as POS tags, dependencies, and co-references (Annotation track)

We allow submissions two months prior to selected conferences (currently http://ldk2017.org/ and http://2017.semantics.cc/ ). Participants that fulfil the technical requirements and provide a sufficient description will be able to present at the conference and be included in the yearly proceedings. At each conference, the challenge committee will select a winner among the participants, who will receive 1000€.

Results

Every December, we will publish a summary article and proceedings of participants’ submissions at http://ceur-ws.org/ . The first proceedings are planned to be published in Dec 2017. We will try to briefly summarize any intermediate progress online in this section.

 

Acknowledgements

We would like to thank the Computer Center of Leipzig University for giving us access to their 6 TB RAM server Sirius to run all extraction tools.

The project was created with the support of the H2020 EU projects HOBBIT (GA-688227) and ALIGNED (GA-644055) as well as the BMWi project Smart Data Web (GA-01MD15010B).

 

Challenge Committee

  • Sebastian Hellmann, AKSW, DBpedia Association, KILT Competence Center, InfAI, Leipzig

  • Sören Auer, Fraunhofer IAIS, University of Bonn

  • Ricardo Usbeck, AKSW, Simba Competence Center, Leipzig University

  • Dimitris Kontokostas, AKSW, DBpedia Association, KILT Competence Center, InfAI, Leipzig

  • Sandro Coelho, AKSW, DBpedia Association, KILT Competence Center, InfAI, Leipzig

 

Contact Email: dbpedia-textext-challenge@infai.org

 

For Organisations

Why support this challenge?

Wikipedia is a rich textual source of knowledge. By running this challenge, we will drive innovation in knowledge extraction engines to obtain more and better data in a multitude of domains such as medical data, finance, transportation, points of interest, culture and events, to name only a few. This data will be published for download under an open licence (CC-BY) and will be available for your organisation for any use, under the sole condition that you acknowledge DBpedia and the individual contributors. Eventually, the challenge will also develop more fine-grained quality measures and criteria, which will further help you judge whether you can trust the data.

Here is how you can support the challenge:

Basic support

Become a member of the DBpedia Association, which provides the infrastructure for DBpedia as a whole and this challenge in particular.

Domain-specific support

You can sponsor prize money for individual subtracks and tasks over Wikipedia content. Such a task could be extracting interdependencies of drugs, relations between politicians, historical events or even data that answers a concrete question. You name it!

We will highlight such domain-specific challenges in our dissemination; successful data extraction depends, of course, on the difficulty of the task and the participants. The cost is the prize money, starting from 500€ (you choose the amount and duration). Furthermore, your organisation is required to be (or become) a member of the DBpedia Association.

Supply your own text

If your organisation is interested in fact extraction and annotations on textual sources other than Wikipedia, we can support you in including any textual content in the challenge. Such texts could be medical sources such as PubMed, legal texts (in diverse languages) or your own content. The cost will be a one-time fee starting from 5000€, which covers the integration of the text into the infrastructure, where it will be available to all participants for all future extraction runs.

Please contact the Challenge Committee for details: dbpedia-textext-challenge@infai.org

 

For Participants

Submission deadlines

For each conference, we have different deadlines:

Tool execution Phases I and II: we have access to the server in two separate weeks before the conference. Tool execution starts 3 months before each conference.

Bug fixing Phase: in parallel and up to 1 week after Tool execution I

Final submission deadline: 1 week after Tool execution II

The deadlines for each iteration of the challenge are scheduled in these intervals before the conferences listed below:

Notification: 2 weeks after final submission

Conference: Upon acceptance, we expect that you will present your approach and results at the conference (eligible for early bird ticket)

 

Notes and FAQs:

  • Yes, if you miss the first deadline you can still submit your approach for Tool Execution Phase II and skip bug fixing.

  • Since this is the first year, participants will be allowed to run the extraction themselves and submit data and paper only (Final submission: Mon 24 April, midnight, Hawaii time). A Docker container will be required for the final proceedings in December and for repeatability.

  • The camera-ready deadline is once a year, on the 1st of December

 

Language, Data and Knowledge (LDK), 19-20 June 2017 in Galway, Ireland

  • Submission of tool in Docker: Sun 19 March, midnight, Hawaii time

  • Tool Execution I: Mon 20 March - Sun 26 March

  • Bug fixing: contact us as soon as results are available, up until Sun 2 April, midnight, Hawaii time

  • Tool Execution II: Mon 3 April - Sun 9 April, results are available to participants

  • Final submission: Mon 24 April, midnight, Hawaii time

  • Notification: Mon 1 May

  • Conference: 19-20 June 2017 (exact date of session will be updated soon)

 

SEMANTiCS, 11-14 September 2017 in Amsterdam, Netherlands

  • Submission of tool in Docker:  Sun 11 June, midnight, Hawaii time

  • Tool Execution I: Mon 12 June - Sun 18 June

  • Bug fixing: contact us as soon as results are available, up until Sun 25 June, midnight, Hawaii time

  • Tool Execution II: Mon 26 June - Sun 2 July, results are available to participants

  • Final submission: Mon 17 July, midnight, Hawaii time

  • Notification: Mon 31 July

  • Conference: 11-14 September 2017 (exact date of session will be updated soon)

Tracks

Multi-track submissions are allowed and recommended

Triples Track (Knowledge extraction)

The main goal of this track is the submission of one or more NTriples files containing the facts extracted from the Wikipedia article texts. We will evaluate the triples under the following criteria:

  • Quantity of extracted data

  • Quality of extracted data

    • Correctness: we will ask each participant to evaluate a certain number of triples from other participants. The triples are mixed, so participants do not know whether they are evaluating their own or other participants' triples

    • Fitness for use: we ask you to make a value proposition in the accompanying paper to persuade us why you should be the winner. The main criterion is that you extracted data that is suitable for a certain use case and its requirements: the better your data fulfils the requirements, the better it will be evaluated

    • Consistency and conciseness, i.e. no conflicts and less heterogeneity

  • Type of extraction: Besides facts, we are also looking for terminology and dictionaries, ontological knowledge (new types, taxonomies, axioms, domain/range)

  • Language diversity: extracting from more than one language

  • Ability to keep proper provenance in the NIF format given below.

 

Please also note the technical requirements below.

Annotation Track

The main goal of the annotation track is NLP annotations over the article text. We are looking for any kind of useful annotation that helps unlock the text. Such annotations can come from any level of linguistic annotation, e.g. lemmata (morphology), POS tags, dependencies, co-references, parse trees, NER, NEL, spell checks, or language statistics. All submitted annotations are required to use the NIF-RDF format as the basis for anchoring annotations to text.

The annotation track is an open submission and we will rely on participants to describe how they evaluated the quality of their submission. Criteria from the Triple track apply. As there will only be one prize per conference, participants might consider focusing on the triple track primarily and submit any extra annotations along with the facts to strengthen their overall chances.

 

Architecture

Participants of the challenge are required to wrap their extraction software into a Docker image. We will run all Docker images in regular intervals (Tool execution phases I and II) and publish the resulting data under an open license on our 25 TB DBpedia download server. For running your Docker images, the computer center of Leipzig University has given us access to one of their servers with the following specs: CPU: 8x 16-core, RAM: 6 TB, HDD: 2x 600 GB SAS (RAID 1) + 5x 2 TB SAS (RAID 6); disk and RAM are dimensioned so that all processes can run in memory.

Paper guidelines

Each submission should be accompanied by an article of 4-10 pages. We are aware that the extraction/annotation system was likely covered in previous publications. We therefore do not expect the approach to be original and would like authors to focus on a concise, self-contained description that contains all information necessary to reproduce the results, including the description of parameters and tuning as well as licenses. Although we strongly encourage that the described tools or systems be free, open, and accessible on the web, it is not a requirement. Closed, commercial tools must still be wrapped in Docker before the camera-ready deadline in December. Submissions will be reviewed along the following dimensions: (1) quality, importance, and impact of the described tool or system; convincing evidence must be found in the resulting triples/annotations. (2) Clarity, illustration, and readability of the paper, which shall convey to the reader both the capabilities and the limitations of the tool.

In particular, submissions should describe:

  • The extraction component

  • All additional data sources ingested by the extraction component

  • The experiment, including the result files.

  • Arguments in favor of the results including but not limited to:

    • showcasing interesting queries

    • data quality measurements

    • description of novel facts complementary to existing DBpedia data

Submission URL: https://easychair.org/conferences/?conf=textext2017

Details about paper layout will be posted soon.

 

Technical Requirements

Please follow the technical requirements.

 

Input data

  • Wikipedia article texts are provided in RDF (NIF Format) for nine languages

  • Participants are allowed to use any available additional data. Data sources have to be named explicitly in the accompanying paper.

 

Format of extracted triples

IRIs & ontologies

The extracted triples should use existing DBpedia IRIs, ontology properties and ontology classes as much as possible. When this is not possible, the authors are encouraged to use existing IRIs from the LOD cloud or ontologies from LOV. When that is not possible either, the authors should mint their own IRIs under their own domain or use a DBpedia subdomain like http://projectID.dbpedia.org/[resource|ontology] .

Triples

Example reusing DBpedia identifiers and ontology

# the NIF input data can be found here: https://github.com/NLP2RDF/DBpediaOpenDBpediaTextExtractionChallenge/blo...

# "Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review." Source: http://dbpedia.org/resource/Barack_Obama

 

# NOTE: example is in turtle, but we require NTriples, compressed with bz2

@prefix dbo: <http://dbpedia.org/ontology/> .

@prefix dbr: <http://dbpedia.org/resource/> .

 

dbr:Barack_Obama a dbo:Person .

dbr:Barack_Obama  rdfs:label "Barack Obama"@en .

dbr:Barack_Obama  dbo:alumni  dbr:Columbia_University .

dbr:Columbia_University a dbo:EducationalInstitution .

dbr:Columbia_University rdfs:label "Columbia University"@en .
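
As noted above, the submission format is N-Triples compressed with bz2. The following Python sketch (purely illustrative, not part of the challenge infrastructure; the triple values are taken from the example above) shows how extracted facts could be serialized into that format using only the standard library:

```python
import bz2

# Hypothetical extracted facts as (subject, predicate, object) IRI triples.
triples = [
    ("http://dbpedia.org/resource/Barack_Obama",
     "http://dbpedia.org/ontology/alumni",
     "http://dbpedia.org/resource/Columbia_University"),
]

# N-Triples: one "<s> <p> <o> ." statement per line.
nt = "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)

# Compress with bz2, as required for submission files.
payload = bz2.compress(nt.encode("utf-8"))
with open("result.nt.bz2", "wb") as f:
    f.write(payload)
```

For anything beyond plain IRI triples (literals, language tags, datatypes), a proper RDF library that emits N-Triples is preferable to hand-rolled serialization, since literal escaping is easy to get wrong.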

Example with own identifiers

# NOTE: example is in turtle, but we require NTriples, compressed with bz2

@prefix myo: <http://triplr.dbpedia.org/ontology/> .

@prefix my: <http://triplr.dbpedia.org/resource/> .

 

my:e1278432 a myo:Person .

my:e1278432  rdfs:label "Barack Obama"@en .

my:e1278432  myo:graduate-of  my:e7363256 .

my:e7363256 a myo:University .

my:e7363256 rdfs:label "Columbia University"@en .

Textposition, provenance and confidence for triples

@prefix ann: <http://triplr.dbpedia.org/resource/> .

# NIF Context URL and prefix must be reused from the input data

<http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=context>

   nif:beginIndex "0"^^xsd:nonNegativeInteger ;

   nif:endIndex "70213"^^xsd:nonNegativeInteger ;

   nif:isString """... Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he was president of the Harvard Law Review...""" .

 

######## AnnotationUnit #############

# Create one annotationUnit per extracted triple

 

# triple representation with confidence and provenance

ann:annotation3632832 a nif:AnnotationUnit ;
   nif:subject dbr:Obama ;
   nif:predicate dbo:alumni ;
   nif:object dbr:Columbia_University ;
   nif:confidence "0.90"^^xsd:decimal ;

# optional detailed comments and provenance
   rdfs:comment "debug and technical messages can go here" ;

nif:provenance [

      prov:startedAtTime "2015-12-19T00:00:00Z"^^xsd:dateTime ;

      prov:endedAtTime "2015-12-19T00:00:02Z"^^xsd:dateTime ;

      prov:wasAssociatedWith ex:entityToolA

 ].

 

# You may also provide further information and point to the exact words & phrases this triple was computed from

 

######## TextSpans in NIF ############

# text occurrence of Obama

<http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=phrase&char=310,315>

# link to annotation Unit

  nif:annotationUnit ann:annotation3632832 ;

# remaining NIF properties

   nif:anchorOf "Obama" ;

   nif:beginIndex "310"^^xsd:nonNegativeInteger ;

   nif:endIndex "315"^^xsd:nonNegativeInteger ;

   nif:referenceContext <http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=context> ;

   a nif:Word .

 

# text occurrence of is a graduate of

<http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=phrase&char=316,332>

# link to annotation Unit

  nif:annotationUnit ann:annotation3632832 ;

# remaining NIF properties

   nif:anchorOf "is a graduate of" ;

   nif:beginIndex "316"^^xsd:nonNegativeInteger ;

   nif:endIndex "332"^^xsd:nonNegativeInteger ;

   nif:referenceContext <http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=context> ;

   a nif:Phrase .

 

# text occurrence of Columbia University

<http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=phrase&char=333,352>

# link to annotation Unit

  nif:annotationUnit ann:annotation3632832 ;

# remaining NIF properties

   nif:anchorOf "Columbia University" ;

   nif:beginIndex "333"^^xsd:nonNegativeInteger ;

   nif:endIndex "352"^^xsd:nonNegativeInteger ;

   nif:referenceContext <http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=context> ;

   a nif:Phrase .
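
The begin and end values in the spans above are plain character offsets into the context's nif:isString, with nif:endIndex exclusive. As a minimal, purely illustrative Python sketch (the context string here is an abbreviated stand-in for the full article text), such offsets could be computed like this:

```python
# Compute NIF-style character offsets of an anchor string within a context.
# nif:beginIndex is the offset of the first character; nif:endIndex is exclusive.
context = ("Born in Honolulu, Hawaii, Obama is a graduate of "
           "Columbia University and Harvard Law School.")

def nif_offsets(context: str, anchor: str) -> tuple[int, int]:
    """Return (beginIndex, endIndex) of the first occurrence of anchor."""
    begin = context.find(anchor)
    if begin == -1:
        raise ValueError(f"anchor {anchor!r} not found in context")
    return begin, begin + len(anchor)

begin, end = nif_offsets(context, "Columbia University")
# The slice context[begin:end] recovers exactly the nif:anchorOf value.
```

The resulting pair fills nif:beginIndex and nif:endIndex and can also be embedded in the span IRI (the &char=begin,end part of the URIs above).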
 

Format of NLP annotations

# create nif URI and add 5 properties:

<http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=phrase&char...
    nif:anchorOf "U.S. military intervention in Iraq" ;
    nif:beginIndex "3401"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
    nif:endIndex "3435"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> ;
    nif:referenceContext <http://en.dbpedia.org/resource/Barack_Obama?dbpv=2016-10&nif=context> ;
    a nif:Phrase ;

# add arbitrary annotations directly (if you only have one)
    itsrdf:taIdentRef <http://dbpedia.org/resource/American-led_intervention_in_Iraq_(2014–present)> .

# use annotationUnits, if you have several alternative annotations of the same type

 

 

 

Contact