Google Summer of Code 2013 / joint proposal for DBpedia and DBpedia Spotlight

Almost every major Web company has now announced their work on a knowledge graph, including Google’s Knowledge Graph, Yahoo!’s Web of Objects, Walmart Labs’ Social Genome, Microsoft's Satori Graph / Bing Snapshots and Facebook’s Entity Graph.


DBpedia is a community-run project that has been working on a free, open-source knowledge graph since 2006. DBpedia currently exists in 97 different languages and is interlinked with many other databases (e.g. Freebase, New York Times, CIA Factbook) and, hopefully after this GSoC, with Wikidata too. The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites for third parties to reuse and repurpose data on the Web, they still require that developers create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.

One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you thought of asking the Web about all cities with low criminality, warm weather and open jobs? That's the kind of query we are talking about.
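Queries like this can be posed to DBpedia's public SPARQL endpoint. As a minimal sketch, the following Python snippet assembles such a query. `dbo:City` and `dbo:populationTotal` are real DBpedia ontology terms, but the criminality/weather/jobs criteria from the example would need additional properties, so this sketch restricts itself to population:

```python
# Hedged sketch: build a SPARQL query of the kind described above.
# Only the query string is constructed here; running it would require
# sending it to a SPARQL endpoint such as http://dbpedia.org/sparql.

def build_city_query(max_population: int) -> str:
    """Assemble a SPARQL query selecting cities below a population cap."""
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?city ?population WHERE {{
  ?city rdf:type dbo:City ;
        dbo:populationTotal ?population .
  FILTER (?population < {max_population})
}}
LIMIT 10
""".strip()

query = build_city_query(500_000)
print(query.splitlines()[0])  # → PREFIX dbo: <http://dbpedia.org/ontology/>
```

In a real client you would POST this string to the endpoint and parse the JSON or XML result bindings.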

This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect their blog posts. This is a very simple way to connect “unstructured” text to a structure (hierarchy of tags). For more advanced examples, see how BBC has created the World Cup 2010 website by interconnecting textual content and facts from their knowledge base. Identifiers and data provided by DBpedia were greatly involved in creating this knowledge graph. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy challenge?

DBpedia Spotlight is an open source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we're talking about 3.64 million “things” of 320 different “types” with over half a billion “facts” (July 2011).
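As a toy illustration of the disambiguation step (not Spotlight's actual algorithm), one can score each candidate sense of “Washington” by how many of its typical context words overlap with the words around the mention; the candidate sets below are made up for the example:

```python
# Toy disambiguation sketch: pick the candidate URI whose context words
# overlap most with the words surrounding the mention in the input text.

def disambiguate(mention_context: set, candidates: dict) -> str:
    """Return the candidate URI with the largest context-word overlap."""
    return max(candidates, key=lambda uri: len(candidates[uri] & mention_context))

candidates = {
    "dbpedia:Washington_(state)": {"state", "seattle", "pacific", "coast"},
    "dbpedia:George_Washington": {"president", "general", "revolution"},
    "dbpedia:Washington,_D.C.": {"capital", "congress", "city"},
}
context = {"the", "first", "president", "led", "the", "revolution"}
print(disambiguate(context, candidates))  # → dbpedia:George_Washington
```

Real disambiguation weights context words by how discriminative they are, but the overlap idea carries over.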

After a successful GSoC 2012 with DBpedia Spotlight, this year we join forces with the DBpedia Extraction Framework and other DBpedia-family products. We are excited about our new ideas, and we hope you will get excited too!

1 Steps for candidate students

If you are a GSoC student who wants to apply to our organization, here's a rough guideline on the steps to follow:

  • Subscribe to the DBpedia-GSoC mailing list. All GSoC-related questions (ideas, proposals, technical, etc.) must go through this list. This makes it easier for you to search through the archives and for us to follow the discussion.
  • Introduce yourself in the list.
  • Read carefully all the ideas we propose and see if any of them suits you. Note that you can also submit your own idea.
  • The final goal of your proposal is to convince us that you understand how you will handle the idea and that you have a specific code plan, so gather as much information as possible about the ideas you like. To do this, you can search the GSoC archives or ask questions on the GSoC mailing list. Please send a separate mail for each idea question to make it easier for other students to follow.
    • Once you get help, a nice thing to do would be to add the archive thread link back to the idea page. This will reduce the mentors' need to repeat themselves and let them focus on giving great answers.
  • Work on some of the warm-up tasks we suggest.
  • Write your proposal.
  • For GSoC related queries you should look at the Google-Melange help page and the student guide they prepared.

2 Guidelines

As a general rule, we will treat the money Google is going to give us as if we had to pay it ourselves. Your proposal should therefore aim to

  1. convince all mentors that your proposal is worth receiving the money
  2. argue the benefit of your proposal for the DBpedia + Spotlight project

3 Warm Up tasks

These are tasks that potential students might want to try in order to (1) get a feeling for the code, (2) learn initial skills, (3) get in contact with the community and (4) earn an initial good standing.

We have already prepared a few warm-up tasks for DBpedia & DBpedia Spotlight, so go ahead and show us your skills.

4 GSoC-2013 DBpedia Ideas

4.1 Wikidata + DBpedia

Wikidata is increasingly replacing the infoboxes in Wikipedia. This is a great chance for DBpedia to innovate, as it frees up resources previously bound to scraping infoboxes (with a lot of potential errors and a lot of effort). Now we want to use the data provided by Wikidata and also help the project become the editing interface for DBpedia.
Students applying for this topic should first address how they are going to solve the prerequisite in a professional, sustainable way. Then they should choose one of the remaining high-level topics and elaborate on it.
Mentors: Dimitris, Sebastian

4.2 Extend infobox mapping

This idea is no longer valid: we have already received external contributions for these issues, so it no longer has the scope of a full project. All these issues are now warm-up tasks.
Extend the mapping syntax with the following: a) construct objects by adding prefixes/suffixes (issue #20), b) extend conditional mappings to more complex conditions (issue #19), c) an option to generate inverse properties (issue #32), d) proper handling of multiple templates in a page (issue #17), and e) map categories to classes (issue #21). Finally, add a new extractor that combines the infobox & mapping extractors and only produces triples in the property namespace if they are not mapped (issue #22). All the aforementioned extensions can be considered of similar difficulty. Depending on the workload, (e) could be omitted from the deliverables.
Mentors: Dimitris, Marco

4.3 Continuous extraction

Make the extraction framework always check for an available new dump, download it, extract it and update the static version of DBpedia.
Currently, a new DBpedia release involves a lot of manual work: there are many different scripts and processes that need to be run in a certain order, different spots where configuration arguments must be set – often using complicated syntax – and there are hundreds of things that might go wrong and can only be fixed by human intervention. In the past, this meant that preparing a new DBpedia release took several weeks, if not months.
The goal doesn’t necessarily have to be complete automation. A significant reduction of complexity and manual effort would also be a great improvement. This can also be combined with the Wiktionary2RDF extraction, as well as continuous extraction from other wikis run by Wikimedia.
Mentors: Dimitris, Christopher, Sebastian
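The polling step of such a continuous pipeline could be sketched as follows; `fetch_latest_dump_date` is a hypothetical stand-in for a real check against dumps.wikimedia.org, kept injectable so the control flow can be exercised without network access:

```python
# Hedged sketch of the "check for a new dump" loop body. Real dump names
# on dumps.wikimedia.org carry a YYYYMMDD date stamp, so plain string
# comparison matches date order.

def needs_extraction(latest_dump_date: str, last_processed_date: str) -> bool:
    """True if a dump newer than the last processed one is available."""
    return latest_dump_date > last_processed_date

def run_cycle(fetch_latest_dump_date, last_processed_date: str) -> str:
    """One polling cycle: check, then (conceptually) download and extract."""
    latest = fetch_latest_dump_date()
    if needs_extraction(latest, last_processed_date):
        # download, extract, load ... each step is its own failure point
        # that the framework would have to detect and recover from
        return f"extracted {latest}"
    return "up to date"

print(run_cycle(lambda: "20130401", "20130301"))  # → extracted 20130401
```

The hard part of the project is everything hidden behind the comment: ordering the existing scripts, configuration, and recovery from partial failures.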

4.4 I18n data fusion

Merge data from across different DBpedia / Wikipedia language editions. See this paper for more details.
Mentors: Pablo, Volha

4.5 Type inference to extend coverage

There were a number of papers published recently that use different features (e.g. categories, infoboxes, abstracts, other DBpedia properties, etc.) to infer the type of an entity, and therefore extend the coverage of the ontology. There was one in WoLE2012 by Aleksander Pohl, one in ISWC2012 by Aldo Gangemi (Tipalo), one upcoming in ESWC2013 by U. Trento, etc.
Also, in the dbpedia-discussion list, Christopher has pointed out that we can use the presence of coordinates (geolocation) to automatically infer that something could be a location. Also, Freebase has more types than DBpedia. Creating a clean pipeline that imports all typing information for DBpedia would be extremely useful.
Mentors: Marco, Pablo, Dimitris
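Christopher's coordinate heuristic can be sketched as a simple rule-based pass over triples; the property names and example triples below are illustrative, not the exact DBpedia predicates:

```python
# Illustrative typing pass: any untyped entity carrying geo coordinates is
# at least plausibly a dbo:Place. Property names here are shorthand.

GEO_PROPS = {"geo:lat", "geo:long"}

def infer_types(triples):
    """Yield (subject, inferred_type) for untyped subjects with coordinates."""
    props_by_subject = {}
    typed = set()
    for s, p, o in triples:
        props_by_subject.setdefault(s, set()).add(p)
        if p == "rdf:type":
            typed.add(s)
    for s, props in props_by_subject.items():
        if s not in typed and GEO_PROPS & props:
            yield (s, "dbo:Place")

triples = [
    ("dbpedia:Berlin", "geo:lat", "52.52"),
    ("dbpedia:Berlin", "geo:long", "13.40"),
    ("dbpedia:Kant", "rdf:type", "dbo:Person"),
]
print(list(infer_types(triples)))  # → [('dbpedia:Berlin', 'dbo:Place')]
```

A real pipeline would combine several such signals (categories, infoboxes, Freebase types) and weigh them, rather than firing on a single rule.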

4.6 Design a better / interactive display page

The general idea is to improve the DBpedia resource display page and make it more interactive. For instance, this is what we get for 'Presidency_of_Barack_Obama' in DBpedia, this is what Freebase shows, and this is what others show with just DBpedia data: semantic reports, fluidOps & Graphite. The new interface must work for all I18n DBpedia editions, display labels in the browser's language by default, and maybe fetch data on the fly from other language editions or other datasets.
We would really like to handle information inserted into Wikipedia via User Scripts or similar as well.
Mentor: Dimitris

4.7 Ontology consistency check

As we all know, given the Wiki nature of the DBpedia ontology, anyone can edit it and add/delete/change its classes and properties. This has led to a ‘chaotic’ conceptual structure, which is in complete contrast with the idea behind an ontology, i.e., to provide consistency to the heterogeneous data coming from the different Wikipedia chapters.
The recently proposed automatic approaches for type inference (A. Pohl at WoLE 2012, A. Gangemi at ISWC 2012, A. Aprosio at ESWC 2013) can be a valid starting point to clean the current state of the ontology, at least for the classes hierarchy.
Also, they can be used as a tool to prevent redundancy, i.e., to alert a human contributor when he or she is trying to add some new class or property that is already out there under a similar name.
Mentors: Marco, Julien

4.8 Wiktionary2RDF Assistance GUI (also any other MediaWiki, e.g. TravelWiki)

The recently created generic extractor for Wiktionary can be configured to extract RDF from very heterogeneously structured MediaWiki deployments. The goal is that you do not need a developer with Scala+Maven+Git skills to extract triples; instead, Wiki users can create such configurations with the help of a GUI. The Wiktionary community might be interested in helping test such a GUI and giving feedback (we would still need to ask there, however).
Mentors: Sebastian

4.9 Massive extraction of triples from MediaWikis

The recently created generic extractor for Wiktionary can be configured to extract RDF from very heterogeneously structured MediaWiki deployments. The goal of this task is to extract useful triples from as many MediaWiki deployments as possible. This might be done by brute-force writing of configurations for the existing extractor, or maybe by machine learning.
Mentors: Sebastian

4.10 Crowd-source tests and extraction rules

Extend the DBpedia mappings wiki and the code that accesses it such that the community can also contribute automated extraction tests, data value extraction rules, and ontology data types, in addition to template mapping rules and ontology classes and properties.
DBpedia tries to extract high-quality data from Wikipedia text pages that are filled with dozens or hundreds of different formats and units for physical properties like height, depth, or weight, financial or sociological data like income, population density and so on, and – maybe worst of all – calendar dates.
Mentors: Christopher, Dimitris
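As an illustration of the kind of value-extraction rule the community could contribute, here is a sketch that normalizes lengths written with different units to metres; the patterns and conversion table are examples, not DBpedia's actual rule syntax:

```python
# Illustrative value-extraction rule: parse "<number> <unit>" length strings
# as they might appear in an infobox and normalize them to metres.
import re

UNIT_TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "km": 1000.0}

def parse_length(raw: str):
    """Return a length in metres, or None if the value doesn't parse."""
    match = re.match(r"\s*([\d.]+)\s*(m|cm|ft|km)\b", raw)
    if not match:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_TO_METRES[unit]

print(parse_length("828 m"))    # → 828.0
print(parse_length("2717 ft"))  # feet converted to metres, roughly 828
print(parse_length("tall"))     # → None
```

The point of the idea is that rules and unit tables like these would live on the mappings wiki, alongside automated tests that catch values the rules mis-parse.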

4.11 Interface / Power tool for DBpedia testing metadata

This idea sounds similar to 4.10 but tackles the problem from another perspective. 
We are now building a new framework for debugging DBpedia data with SPARQL queries. 
This is very new and we don't have anything concrete to showcase yet (we hope to in a couple of weeks), but here is what we are going to do:

  1. We are creating various "test SPARQL queries" (e.g. people with birthDate after their deathDate).
  2. We are building a framework that will run all of them against DBpedia (English, I18n, Wikidata), and 
  3. we are enriching the error results with metadata and storing them in a triple store.

The task behind this idea is to create an interface where end users can browse the errors, administrators can change the error metadata (like a power tool), and visual statistics can be generated between different test periods.
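The first test query from the list above (people with birthDate after deathDate) can be illustrated in plain Python over example data instead of SPARQL:

```python
# Minimal sketch of one data-quality test: flag subjects whose recorded
# birthDate falls after their deathDate. The subjects and dates are made up.
from datetime import date

def find_date_violations(people):
    """Return subjects whose birthDate is later than their deathDate."""
    return [s for s, (birth, death) in people.items() if birth > death]

people = {
    "dbpedia:Person_A": (date(1950, 5, 1), date(1920, 3, 2)),  # broken data
    "dbpedia:Person_B": (date(1900, 1, 1), date(1980, 6, 15)),
}
print(find_date_violations(people))  # → ['dbpedia:Person_A']
```

In the framework itself, each such check is a SPARQL query, and violating results are enriched with metadata and stored in the triple store for the interface to browse.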

5 GSoC-2013 DBpedia Spotlight Ideas

5.1 Generalize input formats and add support for Google mention corpus

The indexing pipeline can be extended to use input formats other than the Wikipedia dump. You would create this feature by generalizing from the Wikipedia input format and adding an input format for the new Google mention corpus.
Mentors: Pablo, Max, Jo

5.2 Efficient graph-based disambiguation and general performance improvements

One of the results of last year’s GSoC is an implementation of graph-based disambiguation. You would integrate this implementation with the existing database-backed back-end in an efficient manner. Further, you would benchmark and optimize the general annotation time performance.
Mentors: Pablo, Max, Jo

5.3 Extract the necessary DBpedia data directly from the Wikipedia dump

For creating Spotlight models, we need instance_types.nt, redirects.nt and disambiguations.nt. Since we want these to come from the same Wikipedia dump as the one from which we create the model, integrate the DBpedia extraction into the indexing script in DBpedia Spotlight, so that the files are automatically produced during indexing.

6 Prerequisites

Soft skills:

  • We would like to work with people that are energetic programmers, passionate about open source, and really interested in the topics around DBpedia & DBpedia Spotlight. You don't need to worry about convincing us about this. We can tell from how much preparation went into your proposal.
  • Although the mentors are here to help you, we expect you to be able to search and find answers for most questions for yourself. Search engines like easy questions, mentors like the tough ones. When you ask a question, show that you've looked for the answer before asking.

Programming languages we love:

  • Java: we love cross-platform code and object oriented programming.
  • Scala: adds functional programming to the Java world, and in our opinion allows one to write more concise code, and write it in less time.
  • Linux/Bash: a lot of common tasks can be done with cat/sort/uniq/grep/sed/cut. We use them every day.
  • Python: we commonly write scripts in python for quick, small tasks.
  • R: very convenient for analyzing your data and looking into anything that involves statistics.
  • Your language?: Depending on the project, we are open to other language suggestions, just convince us ;)

You don't need to know all of them. Solid knowledge in Java/Scala is enough for most of what we do. Our build process is based on Maven2.

7 Mentors (alphabetically)

moved to

8 More Information


DBpedia Spotlight