Google Summer of Code 2014 / joint proposal for DBpedia and DBpedia Spotlight




Almost every major Web company has now announced work on a knowledge graph, including Google’s Knowledge Graph, Yahoo!’s Web of Objects, Walmart Labs’ Social Genome, Microsoft’s Satori Graph / Bing Snapshots and Facebook’s Entity Graph.

DBpedia is a community-run project that has been working on a free, open-source knowledge graph since 2006. DBpedia currently exists in 97 different languages and is interlinked with many other databases (e.g. Freebase, New York Times, CIA Factbook) and, hopefully with this GSoC, with Wikidata too. The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites for third parties to reuse and repurpose data on the Web, they still require developers to create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.

One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you ever thought of asking the Web for all cities with low criminality, warm weather and open jobs? That's the kind of query we are talking about.
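As a small, concrete illustration (a sketch only, assuming the public DBpedia SPARQL endpoint and the Python SPARQLWrapper library), here is how one could ask for German cities with more than a million inhabitants:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Sketch: ask the public DBpedia endpoint for large German cities.
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?city ?population WHERE {
            ?city a dbo:City ;
                  dbo:country dbr:Germany ;
                  dbo:populationTotal ?population .
            FILTER (?population > 1000000)
        }
    """)
    sparql.setReturnFormat(JSON)
    for result in sparql.query().convert()["results"]["bindings"]:
        print(result["city"]["value"], result["population"]["value"])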

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect them. This is a very simple way to connect “unstructured” text to a structure (a hierarchy of tags). For a more advanced example, see how the BBC created the World Cup 2010 website by interconnecting textual content and facts from their knowledge base. Identifiers and data provided by DBpedia played a major role in building that knowledge base. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy! challenge?

DBpedia Spotlight is an open-source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we are talking about 3.64 million “things” of 320 different “types” with over half a billion “facts” (July 2011).

After a successful GSoC 2013 with DBpedia Spotlight, this year we join forces with the DBpedia Extraction Framework and other products of the DBpedia family. We are excited about our new ideas, and we hope you will get excited too!

1 Steps for candidate students

If you are a GSoC student who wants to apply to our organization, here is a rough guideline on the steps to follow:

  • Subscribe to the DBpedia-GSoC mailing list. All GSoC-related questions (ideas, proposals, technical issues, etc.) must go through this list. This makes it easier for you to search through the archives and for us to follow the discussion.
  • Introduce yourself in the list.
  • Read all the ideas we propose carefully and see if any of them suits you. Note that you can also submit your own idea.
  • The final goal of your proposal is to convince us that you understand how you will handle the task and that you have a specific coding plan, so gather as much information as possible about the ideas you like. To do this, you can search the GSoC archives or ask questions on the GSoC mailing list. Please send a separate mail for each idea question to make it easier for other students to follow.
    • Once you get help, a nice thing to do is to add the archive thread link back to the idea page. This reduces the mentors' need to repeat themselves and lets them focus on giving great answers.
  • Work on some of the warm-up tasks we suggest.
  • Write your proposal.
  • For GSoC-related queries, you should look at the Google Melange help page and the student guide they prepared.

2 Guidelines

As a general rule, we will treat the money Google is going to give us as if we had to pay it ourselves. Therefore, in your proposal you should aim to

  1. convince all mentors that your proposal is worth receiving the money
  2. argue the benefit of your proposal for the DBpedia + Spotlight project

3 Warm Up tasks

These are tasks that potential students might want to try in order to (1) get a feeling for the code, (2) learn initial skills, (3) get in contact with the community and (4) earn an initial good standing.

We have already prepared a few warm-up tasks for DBpedia & DBpedia Spotlight, so go ahead and show us your skills.

4 GSoC-2014 DBpedia Ideas

4.1 Wikimedia Commons extraction

Wikimedia Commons ( is a media file repository making available public domain and freely-licensed educational media content (images, sound and video clips) to everyone, in their own language. It acts as a common repository for the various projects of the Wikimedia Foundation. 
The goal of this task is to extract as much (good) information as possible from Commons. This includes identification of media types, media metadata, licence information and multilingual descriptions.
The candidate should also make any necessary changes to handle Commons extraction from the mappings wiki (, similar to what currently happens for Wikipedia infoboxes, and reuse the existing statistics report ( to track Commons template mappings.
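For illustration only (the real extractor would live inside the extraction framework and operate on full dump pages), the kind of template parsing involved, using the field names of Commons' {{Information}} template, might look like this:

    import re

    # Illustrative sketch: pull basic metadata out of a Commons file description
    # page that uses the {{Information}} template (field names as used on Commons).
    SAMPLE = """
    {{Information
    |description={{en|1=A view of the Matterhorn at sunrise.}}
    |date=2010-07-17
    |author=[[User:Example|Example]]
    |source={{own}}
    }}
    {{self|cc-by-sa-3.0}}
    """

    def extract_fields(wikitext):
        fields = {}
        for key in ("description", "date", "author", "source"):
            match = re.search(r"\|\s*%s\s*=\s*(.+)" % key, wikitext)
            if match:
                fields[key] = match.group(1).strip()
        # licence templates such as {{self|cc-by-sa-3.0}} or {{cc-by-sa-3.0}}
        licence = re.search(r"\{\{(?:self\|)?(cc-[^}|]+)", wikitext, re.IGNORECASE)
        if licence:
            fields["licence"] = licence.group(1)
        return fields

    print(extract_fields(SAMPLE))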
Mentors: Dimitris Kontokostas, Andrea Di Menna (co-mentor), Marco Fossati (co-mentor)

4.2 Crowdsourcing Tests and Rules

The extraction framework needs further improvement in testing. Right now we have some unit tests in place, but they are not enough. The student here will have to set up the mappings wiki so that people can write test cases in wiki markup (special templates); these tests will then be downloaded and run by the framework.
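A deliberately simplified sketch of the workflow, with a made-up {{ExtractionTest}} template and a toy stand-in for the extractor (the real tests would of course exercise the Scala framework against templates stored on the mappings wiki):

    import re

    # Hypothetical test case written as wiki markup on the mappings wiki.
    TEST_CASE = """
    {{ExtractionTest
    |input={{Infobox person|birth_date=1879-03-14}}
    |expected=dbo:birthDate 1879-03-14
    }}
    """

    def parse_test(wikitext):
        source = re.search(r"\|\s*input\s*=\s*(.+)", wikitext).group(1)
        expected = re.search(r"\|\s*expected\s*=\s*(.+)", wikitext).group(1)
        return source, expected

    def toy_extractor(infobox):
        # Stand-in for the real extraction framework.
        match = re.search(r"birth_date\s*=\s*([\d-]+)", infobox)
        return "dbo:birthDate %s" % match.group(1) if match else None

    source, expected = parse_test(TEST_CASE)
    assert toy_extractor(source) == expected, "extraction does not match the expected statement"
    print("test passed")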
Mentors: Dimitris Kontokostas, (Jona Christopher Sahnwaldt?) ... read more

4.3 Extraction using MapReduce

DBpedia is a useful framework for extracting various types of structured data from Wikipedia. However, it is not designed to run in a distributed environment such as Hadoop, which is often the environment of choice for running information extraction and knowledge management systems at scale, in a batch or streaming fashion. In this context, the goal of this project is to port the DBpedia Extraction Framework to Hadoop. Note: since access to a running Hadoop instance is critical for this project, it would be best for the candidate to have access to their own local resources (e.g. at their university). DBpedia might, however, provide limited access to Amazon AWS if it gets a student for this project.
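Purely as an illustration of the map-style processing involved (the actual project would run the existing Scala extractors inside Hadoop rather than reimplement them), a Hadoop Streaming mapper could count infobox templates, assuming the dump has been preprocessed into one "title<TAB>wikitext" record per line:

    import re
    import sys

    # Illustrative Hadoop Streaming mapper: emit (template name, 1) for every
    # infobox found in an article. Assumes input lines of the form
    # "<title>\t<wikitext>" produced by a preprocessing step.
    def main():
        for line in sys.stdin:
            try:
                title, wikitext = line.rstrip("\n").split("\t", 1)
            except ValueError:
                continue
            for template in re.findall(r"\{\{(Infobox[^|}]*)", wikitext):
                print("%s\t%d" % (template.strip(), 1))

    if __name__ == "__main__":
        main()

A matching reducer would then simply sum the counts per template name.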
Mentors: Dimitris Kontokostas, Andrea Di Menna, Sang Venkatraman, Nicolas Torzec Read more...

4.4 Mappings web editor

Create a visual editor for the mappings wiki based on webProtege (
We made a few attempts in the past to create external visual tools for editing the mappings, but we failed to maintain them due to MediaWiki/template changes and dependencies. In this project you will enable webProtege to use the mappings wiki to store and load mappings, and to edit the ontology.
Mentors: Alexandru Todor

4.5 Wikidata (tighter) extraction integration

In GSoC 2013 we made a first draft of the Wikidata integration. This year we plan to continue with more fine-grained extraction. The plan is to create separate mappings for each Wikidata property in order to extract claim information, e.g. the start/end dates of “head of state” statements. This requires creating the proper template definitions in the mappings wiki and then using them in the extraction process.
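To give an idea of the claim-level data involved, here is a rough sketch that reads the qualifiers of one item's statements from the public Wikidata JSON export (the IDs Q142 "France", P35 "head of state", P580 "start time" and P582 "end time" are used purely as examples):

    import json
    import urllib.request

    # Sketch: list "head of state" (P35) statements of an item together with
    # their start/end time qualifiers, straight from the Wikidata JSON export.
    ITEM = "Q142"  # France, as an example
    url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % ITEM
    entity = json.load(urllib.request.urlopen(url))["entities"][ITEM]

    for claim in entity.get("claims", {}).get("P35", []):
        if claim["mainsnak"].get("snaktype") != "value":
            continue  # skip "unknown value" / "no value" statements
        head_of_state = claim["mainsnak"]["datavalue"]["value"]["id"]
        qualifiers = claim.get("qualifiers", {})
        start = [q["datavalue"]["value"]["time"]
                 for q in qualifiers.get("P580", []) if "datavalue" in q]
        end = [q["datavalue"]["value"]["time"]
               for q in qualifiers.get("P582", []) if "datavalue" in q]
        print(head_of_state, start, end)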
Mentors: Dimitris Kontokostas

4.6 Automated Wikidata mappings to DBpedia ontology

In this project, the candidate is expected to automatically produce the following datasets that map the descriptors of Wikidata Items [1] to the DBpedia ontology [2]:

  • Equivalent properties
  • Equivalent classes

The datasets must contain assertions using the built-in OWL properties equivalentProperty [4] and equivalentClass [3] respectively, as per the examples below:
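For instance (an illustrative sketch using the rdflib Python library; the Wikidata IDs P569 "date of birth" and Q5 "human" and the namespace URIs are examples, not prescriptions):

    from rdflib import Graph, URIRef
    from rdflib.namespace import OWL

    # Illustrative sketch of the two kinds of expected assertions.
    # The Wikidata IDs (P569 "date of birth", Q5 "human") and the namespaces
    # below are examples only; the exact URIs should follow the Wikidata exports.
    WD = "http://www.wikidata.org/entity/"
    DBO = "http://dbpedia.org/ontology/"

    g = Graph()
    g.add((URIRef(WD + "P569"), OWL.equivalentProperty, URIRef(DBO + "birthDate")))
    g.add((URIRef(WD + "Q5"), OWL.equivalentClass, URIRef(DBO + "Person")))
    print(g.serialize(format="nt"))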

The datasets will be loaded on the mappings wiki and the community can refine them from then on.
Equivalent properties dataset
A full Wikidata Property summary table can be found in [5] and will serve as input. Each Property is described with metadata that can serve as useful features, since similar metadata also exist for DBpedia ontology properties.
Equivalent classes dataset
It seems that no namespace is dedicated to Item classes in Wikidata, while there is one for properties (see the example above). Hence, Wikidata may be treating classes as individuals, in a fashion comparable to OWL Full [6].
The candidate should first investigate reliable ways to extract classes from Wikidata and decide whether to use them as input.
A set of candidate classes is already available in [7]; it was produced by querying all the unique values of the “instance of” property [8].
Otherwise, we assume the input will be directly derived from the property summary table headers [5].
The goal can be achieved via traditional statistical/probabilistic methods for automatic classification, implementing supervised learning models such as Support Vector Machines [9] or Naive Bayes [10]. Methods may range from Levenshtein distance [11] to string kernels [12], all the way to feature-vector-based algorithms [13]. The feature set should include at least the label, the aliases and the domain of an item. Such metadata may be retrieved from a Wikidata property discussion page, e.g. [14], but this does not apply to all properties.
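At the trivial end of that spectrum, a token-overlap baseline over property labels (the label lists below are hand-made) already illustrates the task:

    # Toy baseline: match Wikidata property labels to DBpedia ontology property
    # labels by token overlap (Jaccard). The labels below are illustrative only.
    wikidata_labels = {"P569": "date of birth", "P19": "place of birth"}
    dbpedia_labels = {"dbo:birthDate": "birth date", "dbo:birthPlace": "birth place"}

    def similarity(a, b):
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

    for pid, wd_label in wikidata_labels.items():
        best = max(dbpedia_labels, key=lambda prop: similarity(wd_label, dbpedia_labels[prop]))
        print("%s (%s) -> %s (score %.2f)"
              % (pid, wd_label, best, similarity(wd_label, dbpedia_labels[best])))

A real solution would combine several such features (labels, aliases, domains) in a supervised classifier, as described above.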
Mentors: Marco Fossati

4.7 Clean DBpedia datasets and import in Wikidata

The student who takes this task will be responsible for two things:
  1. Clean up the DBpedia errors based on the output of Databugger ( or similar tools. This will produce a sparser but cleaner DBpedia dump that will be of general use.
  2. Communicate with the Wikidata community in order to coordinate the import of (parts of) the cleaned datasets, and reuse DBpedia's links to fetch additional data for the Wikidata import.
Mentors: Dimitris Kontokostas, Magnus Knuth (co-mentor)

4.8 Ontology consistency check

From previous GSoC ideas[..]/OntologyCheck?v=l3f
Mentors: Marco Fossati, Magnus Knuth (co-mentor)

4.9 Mappings freshness & Better statistics / reporting tools

Template statistics are currently created on-demand by developers, typically when a new DBpedia is about to be released. 
Editors use template statistics to know which templates still need to be mapped and which would provide the greatest contribution in terms of the number of generated statements.
An automatic and periodic template stats generation process could greatly improve mappings freshness and completeness.
Templates in Wikipedia tend to change over time, possibly leading to outdated mappings in DBpedia. Sometimes template properties are simply renamed but they can also be added/removed.
Currently, the mappings server shows which mapped properties are not found in the actual usage of a template in Wikipedia, but there is no notification system that alerts editors about these inconsistencies.
It would be useful to inform editors about changes in template definitions so that mappings can be updated and the DBpedia output can be re-aligned to the current state of Wikipedia.
Wikipedia uses many stub templates to mark articles down for review. Some of those stub templates provide semi-structured information which can be leveraged to populate ontology properties (e.g. instance type, nationality, etc.). 
Since the template statistics build process excludes those templates, as they do not meet a specific “property ratio rule” (used to ignore templates which generally do not convey meaningful information), they do not appear in the template statistics and are mostly ignored by editors. 
Hence it is crucial to let the framework recognize stub templates and produce statistics for them as well, in order to increase the quality of DBpedia.
Mentors: Dimitris Kontokostas, Andrea Di Menna (co-mentor), Alexandru Todor (co-mentor)

4.10 Linking to external multimedia data sources

The Flickr Wrappr [1] is a script that links a given DBpedia resource to a set of related pictures from Flickr. It basically leverages a third-party API to enrich a resource with data other than text and hypertext.
According to [1, 2], the code is a few years old and is not maintained. However, linking to external multimedia data sources can add dramatic value to DBpedia.
As warm-up tasks, the candidate is first required to:
  • Refactor the Flickr Wrappr code [2]
  • Migrate it to a more visible location in the DBpedia codebase [3]
Then, the candidate must define a set of third-party multimedia content provider APIs that have satisfactory coverage of DBpedia data. He or she must carefully investigate both their terms of use and the licences for data publishing. The selected providers should serve audio, video and photo content. Just to name a few, audio can be retrieved from Bandcamp [4], Grooveshark [5], Rdio [6] or Soundcloud [7], and video from IMDB [8], Rotten Tomatoes [9], Vimeo [10] or YouTube [11].
As the final objective, the candidate is expected to extend the Flickr Wrappr capabilities in order to handle calls to the selected set of APIs.
[1] http://wifo5-03.informatik.uni[..]
Mentors: Marco Fossati, Magnus Knuth (co-mentor)

4.11 New DBpedia Interfaces: Resource Widgets

To enhance the visibility of the DBpedia project, it would be desirable to make DBpedia easier to use for users who are not Semantic Web experts. A step in that direction is the creation of easily configurable snippets (warm-up task –[..]framework/issues/177) that users can embed in their websites. An additional goal is to provide something similar to the Freebase Suggest widget ( This can be a jQuery module that can be reused in web development. A further task of this idea is to wrap this functionality in a Drupal and/or WordPress plugin to further disseminate DBpedia.
Mentors: Magnus Knuth, Dimitris Kontokostas, Patrick Westphal, Andrea Di Menna (co-mentor)

4.12 Scalable mapping development for DBpedia Wiktionary

Wiktionary is a large-scale, multilingual, crowd-sourced dictionary. It features 18,689,141 articles in 171 languages maintained by 4,184 active users. Dictionary entries may contain definitions and examples, part of speech, idioms and proverbs, synonyms, antonyms, hyperonyms and hyponyms, related terms, phonological information in IPA notation or as sound files, word formation, inflection tables, etymology, images, as well as translations into other languages. Wiktionary is an invaluable source of dictionary data.

To make further use of the data, it needs to be transformed from its current semi-structured document format into a semantic data format like RDF. This can be achieved with existing transformation software [2] maintained by the DBpedia project. However, every language edition of Wiktionary is structured differently: articles contain a varying degree of information in varying forms. That is why the conversion software allows the structure of Wiktionary articles to be mapped to the final RDF structure via custom mappings. At the moment, these mappings exist for English, German, French, Russian, Greek and Vietnamese. This means that mappings for 165 languages, representing over 60% of the articles, are still missing.

Mappings are written in XML, using a simple regular expression syntax to match the wiki markup. Up to this point, they have been developed by native speakers who are also versed in XML and programming.
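As a toy example of regex-based mapping (this is not the project's actual XML mapping syntax, just an illustration of the matching step on a German Wiktionary-style heading):

    import re

    # Toy illustration: match the part-of-speech heading used on the German
    # Wiktionary and turn it into a simple statement. The output vocabulary
    # is made up for the example.
    wikitext = "== Haus ({{Sprache|Deutsch}}) ==\n=== {{Wortart|Substantiv|Deutsch}} ==="

    match = re.search(r"\{\{Wortart\|([^|}]+)\|([^|}]+)\}\}", wikitext)
    if match:
        part_of_speech, language = match.group(1), match.group(2)
        print("Haus -- partOfSpeech --> %s (language: %s)" % (part_of_speech, language))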

To make the mapping approach more scalable and allow for better maintenance of existing mappings, the student responsible for this task needs to develop a system that makes mapping easy while taking the diversity of languages into account. This system might be a community project like a mapping wiki, a mapping pipeline, a GUI, or a combination thereof. As a proof of concept, a few new mappings, especially for European languages, should also be developed.

Mentors: Kyungtae Lim, Jim O’Regan (co-mentor)

4.13 Abbreviation Base – A multilingual knowledge base for abbreviations

Abbreviations are a ubiquitous part of everyday life. They are also an omnipresent part of natural language texts, where they present an interesting problem for tokenization, sentence boundary detection and disambiguation. Collecting abbreviations and their meanings in a knowledge base is an attractive use case for aggregating DBpedia data.

Efforts have already been made to extract abbreviation lists from the English, German and Dutch DBpedias via a collection of simple Python and Shell scripts. The data model has also already been developed. The student responsible for this task would improve upon the existing resources and use them to collect abbreviation data from all 119 languages in DBpedia and publish it as Linked Open Data.
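One of the simplest extraction patterns, shown here as a standalone sketch on made-up abstract text, pairs a parenthesised abbreviation with the capitalised words preceding it:

    import re

    # Illustrative sketch: find "Long Form (ABBR)" patterns in abstract text and
    # pair the abbreviation with the words whose initials it matches.
    text = ("The North Atlantic Treaty Organization (NATO) is an intergovernmental "
            "military alliance. The European Union (EU) is a political union.")

    for match in re.finditer(r"((?:[A-Z][\w-]+\s+){2,})\(([A-Z]{2,})\)", text):
        words, abbr = match.group(1).split(), match.group(2)
        candidate = words[-len(abbr):]
        if "".join(w[0] for w in candidate).upper() == abbr:
            print(abbr, "=", " ".join(candidate))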
Mentors: Martin Brümmer

4.14 Support for synchronic digraphia

There are a number of languages on Wikipedia that support more than one writing system (e.g. Serbo-Croatian, Kurdish, Kazakh). The issue of digraphic Wikipedias is best illustrated in the case of information retrieval. Online communities often rely on a single character set for communication and interaction on the Web, even when the language they communicate in is officially digraphic. That means a large portion of the information available online is (and is often expected to be) encoded in this one script. Now, imagine an important piece of information being encoded in the other script. Unless the information retrieval software performs transliteration (conversion from one script to another) on the fly (at retrieval time), many attempts at information extraction will be doomed to fail, as no match will be found. This directly affects common tasks such as keyword search, label-based (SPARQL) querying, named entity recognition (e.g. DBpedia Spotlight), etc.

As it may be unrealistic to impose this requirement on software developers, the only reasonable, yet perhaps not so elegant, workaround is to have the knowledge base keep the information encoded in all possible character sets (the Kazakh Wikipedia edition appears to be trigraphic). Although such an approach would, in the best-case scenario, double the space needed for storing any string literal, there is also the matter of perspective: one could argue that although the information being stored is essentially the same, the very fact that different character sequences are needed to describe the same piece of knowledge makes this problem fall into the domain of multilingualism.

As the current DBpedia Extraction Framework does not take this problem into consideration, we need an extension that does. The general idea is to keep a single IRI but have two or more separate triples per string literal, taking into consideration all MediaWiki syntax constructs and magic words that control the transliterator's behavior (e.g. not all strings are to be transliterated, and not all transliteration rules are straightforward). The extension should be flexible and configurable enough to adapt to multiple languages.
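A minimal sketch of the intended output, assuming Serbian and a tiny hand-written Cyrillic-to-Latin table (the real extension would hook into the extraction framework and honour MediaWiki's transliteration rules and magic words; the IRI below is illustrative):

    # Minimal sketch: emit one label triple per script for the same IRI.
    # The character table is deliberately tiny; BCP 47 script subtags
    # (sr-Cyrl, sr-Latn) distinguish the two literals.
    CYR2LAT = str.maketrans({
        "А": "A", "а": "a", "Б": "B", "б": "b", "В": "V", "в": "v",
        "Г": "G", "г": "g", "Д": "D", "д": "d", "Е": "E", "е": "e",
        "О": "O", "о": "o", "Р": "R", "р": "r",
    })

    def transliterate(text):
        return text.translate(CYR2LAT)

    iri = "<http://sr.dbpedia.org/resource/Beograd>"
    label_cyr = "Београд"
    label_lat = transliterate(label_cyr)  # "Beograd"

    print('%s rdfs:label "%s"@sr-Cyrl .' % (iri, label_cyr))
    print('%s rdfs:label "%s"@sr-Latn .' % (iri, label_lat))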
Mentors: Uroš Milošević

4.15 Tools to enhance the quality of DBpedia (and DBpedia chapters) data

Development of heuristics to measure and enhance the quality of DBpedia (and DBpedia chapter) data. Examples of such heuristics include:
1) The extraction process generates a log file with the difficulties it finds, essentially syntactic errors in Wikipedia entries. The proposed tool would modify the extraction code to provide these suspicious warnings in a more portable format (RDF, XML). It could also send them to collaborative curation tools such as PatchR (
2) Exploitation of cross-lingual information. Many data points are replicated across several languages, for instance the height of Mount Everest or the population of a city like Madrid. A tool could exploit this redundant information to check consistency: if five chapters say that the height of Everest is 8848 m and a given chapter says it is 8847 m, this is a hint of a possible error (see the sketch after this list).
3) Database information checker. Suppose we have a list, or a database, with high-confidence information (e.g. information from a ministry). A tool could check that the terms/values in the list (or DB) match the values stored in DBpedia (or in a given DBpedia chapter). This could help find both wrong and missing information.
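A bare-bones version of heuristic 2, with hard-coded values standing in for what would really be fetched from each chapter's SPARQL endpoint:

    from collections import Counter

    # Sketch of the cross-lingual consistency heuristic: flag chapters whose value
    # for a given (resource, property) pair disagrees with the majority.
    values = {
        "en": 8848, "de": 8848, "fr": 8848, "es": 8848, "it": 8848,
        "nl": 8847,  # suspicious outlier
    }

    majority_value, support = Counter(values.values()).most_common(1)[0]
    for chapter, value in values.items():
        if value != majority_value:
            print("possible error in %s chapter: %s (majority of %d chapters says %s)"
                  % (chapter, value, support, majority_value))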
Mentors: Mariano Rico, Magnus Knuth, Alexandru Todor (co-mentor)

4.16 Pattern Discovery and Knowledge Base Completion

By finding patterns of axioms (e.g., resources in the category “German Scientists” always have the nationality “German”), the DBpedia knowledge base can be completed even for resources that do not have a mapped infobox. The goal is to find significant patterns and apply them to derive new axioms, perhaps even in an interactive process. In a second step, axioms that have strong counter-evidence in the patterns might even be discarded and/or corrected.
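A toy version of the mining step (the resources and their data are hand-made; a real implementation would work over the full category and type/property dumps):

    from collections import defaultdict

    # Toy pattern miner: compute, per category, how often each nationality value
    # occurs among its members, and suggest that value for members lacking it.
    resources = {
        "Albert_Einstein":   {"category": "German_Scientists", "nationality": "Germany"},
        "Max_Planck":        {"category": "German_Scientists", "nationality": "Germany"},
        "Werner_Heisenberg": {"category": "German_Scientists"},  # nationality missing
    }

    value_counts = defaultdict(lambda: defaultdict(int))
    category_sizes = defaultdict(int)
    for data in resources.values():
        category_sizes[data["category"]] += 1
        if "nationality" in data:
            value_counts[data["category"]][data["nationality"]] += 1

    for category, counts in value_counts.items():
        for value, count in counts.items():
            confidence = count / category_sizes[category]
            if confidence < 0.5:
                continue
            for name, data in resources.items():
                if data["category"] == category and "nationality" not in data:
                    print("suggest: %s nationality %s (confidence %.2f)"
                          % (name, value, confidence))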
Mentors: Heiko Paulheim, Marco Fossati (co-mentor), Magnus Knuth (co-mentor)

4.17 Natural language question answering engine

Structured data in DBpedia allows users to ask incredibly fine-grained questions like "Who are the composers of black-and-white movie soundtracks born in British towns with fewer than 30k inhabitants?". This is made possible by SPARQL. However, the complexity of such a query language often requires too much expertise and prevents average users from effectively querying the knowledge base.
A lot of research effort has been carried out in this direction [1, 2, 3, 4]. The main purpose is to process a natural language query with NLP techniques and transform it into SPARQL.
The goal of this project is the implementation of an official DBpedia question answering engine. The successful candidate is expected to avoid duplicate development by porting already available code from existing research projects to the DBpedia codebase, thus publishing it in an open source fashion. The Quepy Python library [5] may be an interesting starting point.
As a first mandatory milestone, the engine must support the English language (i.e. the international DBpedia) and handle a set of simple questions such as "Who is the president of France?".
Ideally, the engine should be designed in a language-agnostic way, in order to enable easy plug-in of non-English linguistic resources. In this way, all the local DBpedia chapters can potentially benefit from it. Most recently, the TOOSO project [6] has gone deep in this direction. It is built from the ground up on two very bold and specific assumptions: that human languages generate complex syntactic structures via the application of simple computational operations, and that these structures naturally adapt to referential models of reality. The successful candidate will also be included in its beta testers list.
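A deliberately naive sketch of the question-to-SPARQL idea for one question shape (real engines such as Quepy build on proper parsing and richer templates; the regex, the role-to-property mapping and dbo:leader are illustrative assumptions):

    import re

    # Naive illustration: turn "Who is the <role> of <thing>?" into a SPARQL query.
    # Mapping the role ("president") to an ontology property is the hard part and
    # is hard-coded here; dbo:leader is just one possible choice.
    PATTERN = re.compile(r"who is the (?P<role>[\w ]+?) of (?P<thing>[\w ]+)\?", re.IGNORECASE)

    def to_sparql(question):
        match = PATTERN.match(question.strip())
        if match is None:
            return None
        resource = match.group("thing").strip().title().replace(" ", "_")
        return ("PREFIX dbo: <http://dbpedia.org/ontology/> "
                "SELECT ?answer WHERE { <http://dbpedia.org/resource/%s> dbo:leader ?answer . }"
                % resource)

    print(to_sparql("Who is the president of France?"))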
Mentors: Marco Fossati

4.18 Fine-grained massive extraction of Wikipedia content

With the advent of the second phase of Wikidata [1], semi-structured data coming from Wikipedia infoboxes are being migrated to Wikidata repositories, which will later serve as the central backbone for automatic infobox population. From this perspective, we aim to experiment with ways to enlarge the DBpedia extraction capabilities, taking into account further unstructured or semi-structured data appearing in Wikipedia articles. For instance, given the article in [2], we want to extract the discography section, as well as the pictures and the songs, which currently do not exist in the corresponding DBpedia resource [3].
The purpose of this project is to investigate approaches for easily plugging-in new extractors in a modular and lightweight fashion.
Solutions include (but are not limited to) the implementation of a graphical interface for writing custom converters for JSONpedia [4]. JSONpedia is an API that provides full access to the content of Wikimedia deployments. It converts Wikimedia information to JSON and enables the creation of scriptable data converters. This potentially allows any Wikipedia article section to be transformed into custom JSON data, which can finally be serialized into RDF.
The web interface will be an Ajax-based REST console for writing, testing and storing converters (i.e. scripts), in a way similar to the classic ScraperWiki platform [5] (see for example [6]). The interface will help verify a converter's code over a small set of articles containing the data the converter is written for.
The Converter Editor main features should include:

  • Write Python scripts with syntax highlighting (porting existing projects such as [7])
  • Test them over a set of templates
  • View their execution results
  • Save them together with a description and tags
  • Look up existing scripts, filtering by description and tags
  • Re-edit / clone existing scripts

Mentors: Marco Fossati, Michele Mostarda (co-mentor)

5 GSoC-2014 DBpedia Spotlight Ideas

For now, our ideas are here:[..]nlLIK3eyFLd7DsI/edit#

As soon as the document is ready, we will port it to this page.


  • Streamline indexing pipelines: have the single-machine and pignlproc pipelines produce the same output formats. Have classes converting between the two formats. Have classes converting from our TSV to training data for OpenNLP, ClearNLP, etc. (see the sketch after this list).
  • Use ClearNLP instead of OpenNLP in our pipelines. Start from a ClearNLP spotter using their NER. Train their NER with our data and compare it with their current NER model. Include the best model as a spotter. Create new spotters and disambiguators using ClearNLP.
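A rough sketch of the last conversion in that list, assuming a simple, hypothetical TSV layout (sentence, surface form, offset, type); the OpenNLP name finder expects training sentences with inline <START:type> ... <END> markup:

    import csv
    import io

    # Sketch: turn a (hypothetical) Spotlight-style annotation TSV into OpenNLP
    # TokenNameFinder training data, which marks entities inline in the sentence.
    TSV = ("Berlin is the capital of Germany .\tBerlin\t0\tplace\n"
           "Berlin is the capital of Germany .\tGermany\t25\tplace\n")

    sentences = {}
    for sentence, surface, offset, entity_type in csv.reader(io.StringIO(TSV), delimiter="\t"):
        sentences.setdefault(sentence, []).append((int(offset), surface, entity_type))

    for sentence, annotations in sentences.items():
        parts, last = [], 0
        for offset, surface, entity_type in sorted(annotations):
            parts.append(sentence[last:offset])
            parts.append("<START:%s> %s <END>" % (entity_type, surface))
            last = offset + len(surface)
        parts.append(sentence[last:])
        print("".join(parts))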

6 Mentors

moved to

7 More Information


DBpedia Spotlight