Google Summer of Code 2015 / joint proposal for DBpedia and DBpedia Spotlight
- 1 Introduction
- 2 Steps for Candidate Students
- 3 Guidelines
- 4 Warm Up tasks
- 5 GSoC 2015 DBpedia Ideas
- 5.1 Fact Extraction from Wikipedia Text
- 5.2 New Dynamic Extractors from Wikipedia Content with JSONpedia Faceted Browsing
- 5.3 Parallel processing in DBpedia extraction Framework
- 5.4 Mappings freshness & Better statistics / reporting tools
- 5.5 Improved Mapping Support for the Mappings Wiki
- 5.6 DBpedia Data Error Reporting Tool
- 5.7 Reverse Engineering and Aligning Freebase with DBpedia
- 5.8 DBpedia Live scaling & new interface
- 5.9 Keyword Search on DBpedia
- 5.10 DBpedia Metadata Datasets
- 5.11 DBpedia Schema Enrichment on Web Protege
- 5.12 Deploying a DBpedia Question Answering Engine
- 5.13 Aligning Life-Science Ontologies to DBpedia
- 5.14 Scalable querying of the live DBpedia data stream
- 5.15 DBpedia Spotlight – Better Context Vectors
- 5.16 DBpedia Spotlight – Better Surface Form Matching
- 5.17 DBpedia Spotlight – Better Tools for Model Creation
- 5.18 DBpedia Spotlight – Model Editor/Domain Adaptation tool [not a priority]
- 5.19 DBpedia Spotlight – Confidence/Relevance Scores
- 6 Mentors
- 7 More Information
Almost every major Web company has now announced their work on a Knowledge Graph, including Google’s Knowledge Graph, Yahoo!’s Web of Objects, Walmart Lab’s Social Genome, Microsoft's Satori Graph / Bing Snapshots and Facebook’s Entity Graph.
DBpedia is a community-run project that has been working on a free, open-source knowledge graph since 2006!
DBpedia currently describes 38.3 million “things” of 685 different “types” in 125 languages, with over 3 billion “facts” (September 2014). It is interlinked to many other databases (e.g., Freebase, Wikidata, New York Times, CIA Factbook). The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While the Web2.0 technologies opened up much of the “guts” of websites for third-parties to reuse and repurpose data on the Web, they still require that developers create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.
One can navigate this Web of facts with standard Web browsers, automated crawlers or pose complex queries with SQL-like query languages (e.g., SPARQL). Have you thought of asking the Web about all cities with low criminality, warm weather and open jobs? That's the kind of query we are talking about.
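As a rough illustration of what such a query looks like, the sketch below builds a request for the public DBpedia SPARQL endpoint. The query is far simpler than the one imagined above: it asks for German cities with more than one million inhabitants. The endpoint URL and the helper name `build_request_url` are assumptions of this example, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public endpoint

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:country dbr:Germany ;
        dbo:populationTotal ?population .
  FILTER (?population > 1000000)
}
ORDER BY DESC(?population)
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as a GET request (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

Any HTTP client can then fetch the resulting URL; the result format depends on the `Accept` header sent with the request.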
This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect them. This is a very simple way to connect unstructured text to a structure (a hierarchy of tags). For a more advanced example, see how the BBC created its World Cup 2010 website by interconnecting textual content and facts from its knowledge base. Identifiers and data provided by DBpedia played a major role in creating this knowledge graph. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy challenge?
DBpedia Spotlight is an open source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we're talking about millions of things and hundreds of types.
We regularly grow our community through GSoC and can offer you more and more opportunities.
We are excited about our new ideas, and we hope you will be too!
2 Steps for Candidate Students
If you are a GSoC student who wants to apply to our organization, here is a rough guideline on the steps to follow:
- Subscribe to the DBpedia-GSoC mailing list. All GSoC related questions (ideas, proposals, technical, etc.) must go through this list. This makes it easier for you to search through the archives and for us to follow the discussion.
- Introduce yourself on the list.
- Carefully read all the ideas we propose and see if any of them suits you. Note that you can also submit your own idea.
- The goal of your proposal is to convince us that you understand the idea and how you will tackle it, and that you have a specific code plan. So get as much information as possible about the ideas you like. To do this, you can search the GSoC archives or ask questions on the GSoC mailing list. Please send a separate mail for each idea question to make it easier for other students to follow.
- Once you get help, it would be nice to add the link to the archive thread back to the idea page. This saves mentors from repeating themselves and lets them focus on giving great answers.
- Work on some of the Warm-Up tasks we suggest.
- Write your proposal.
- For GSoC related queries you should look at the Google Summer of Code 2015 FAQs and the student guide they prepared.
As a general rule, we will treat the money Google is going to give us as if we had to pay it ourselves. Therefore, in your proposal you should aim to
- convince all mentors that your proposal is worth receiving the money
- argue the benefit of your proposal for the DBpedia + DBpedia Spotlight project
4 Warm Up tasks
These are tasks that potential students might want to try in order to (1) get a feeling for the code, (2) learn initial skills, (3) get in contact with the community, and (4) earn an initial good standing.
5 GSoC 2015 DBpedia Ideas
5.1 Fact Extraction from Wikipedia Text
The DBpedia Extraction Framework is fairly mature when dealing with Wikipedia semi-structured content such as infoboxes, links and categories.
However, unstructured content (typically text) plays the most crucial role because of the amount of knowledge it can deliver, and few efforts have been made to extract structured data from it.
For instance, given a Wikipedia article, we want to extract a set of meaningful facts and structure them as machine-readable statements. The following sentence:
“In Euro 1992, Germany reached the final, but lost 0–2 to Denmark”
could yield these statements:
<Germany, defeat, Denmark>
<defeat, score, 0-2>
<defeat, winner, Denmark>
<defeat, competition, Euro 1992>
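As a minimal sketch of the target output (not the actual extractor), a detected frame can be flattened into statements like the ones above. The `Frame` class and the `frame_to_triples` function are hypothetical names introduced for illustration; a real extractor would also have to map these strings onto DBpedia URIs.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame evoked by a verb, with its participants (hypothetical structure)."""
    name: str       # e.g. "defeat"
    subject: str    # entity the sentence is about
    obj: str        # entity on the other side of the relation
    elements: dict = field(default_factory=dict)  # frame element -> filler


def frame_to_triples(frame):
    """Flatten a frame into (subject, predicate, object) statements."""
    triples = [(frame.subject, frame.name, frame.obj)]
    # Each frame element becomes a statement about the frame itself.
    for role, filler in frame.elements.items():
        triples.append((frame.name, role, filler))
    return triples


defeat = Frame(
    name="defeat",
    subject="Germany",
    obj="Denmark",
    elements={"score": "0-2", "winner": "Denmark", "competition": "Euro 1992"},
)

for triple in frame_to_triples(defeat):
    print("<%s, %s, %s>" % triple)
```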
This project will:
- Apply Natural Language Processing techniques to the Wikipedia textual corpus;
- Harvest structured facts out of it;
- Automatically populate DBpedia with novel statements.
In other words, the main objective is the implementation of a new text extractor.
It will take as input a Wikipedia corpus and perform the following steps:
- Verb extraction and ranking
- Frame Classifier Training
- Frame Extraction
The linguistic theory behind this is Frame Semantics, with FrameNet being the most mature implementation [2].
The construction of the training set will be carried out with a crowdsourcing approach as per [3, 4].
[2] Baker, C.F.: Framenet: A knowledge base for natural language processing. ACL 2014 – http://www.aclweb.org/anthology/W/W14/W14-3001.pdf
[3] Fossati, M., Giuliano, C., Tonelli, S.: Outsourcing FrameNet to the Crowd. ACL 2013 – http://www.aclweb.org/anthology/P13-2130
[4] Fossati, M., Tonelli, S., Giuliano, C.: Frame Semantics Annotation Made Easy with DBpedia. Crowdsourcing the Semantic Web at ISWC 2013 – http://ceur-ws.org/Vol-1030/paper-03.pdf
Tags: Python, Natural Language Processing, Machine Learning, Relation Extraction, Frame Semantics
Mentors: Marco, Michele, Dinesh
5.2 New Dynamic Extractors from Wikipedia Content with JSONpedia Faceted Browsing
DBpedia provides solid extractors that handle Wikipedia semi-structured content like infoboxes, links and categories.
However, a huge amount of knowledge can still be harvested from the body of Wikipedia articles.
In this project, we aim to extend DBpedia's extraction capabilities by taking further unstructured data into account. For instance, given an article about a musical artist, we want to extract the discography section, as well as the pictures and the songs, which currently do not exist in the corresponding DBpedia resource.
In GSoC 2014, we implemented a faceted data store on top of JSONpedia [4, 5] and wrote an experimental plugin for the DBpedia Extraction Framework.
This year, the main objective is to write new extractors that consume the JSONpedia faceted data store.
JSONpedia is a framework designed to simplify the access to MediaWiki contents by transforming everything into JSON. It provides a Java library, a REST service and command line tools to parse, convert, enrich and store WikiText documents. The converted JSON documents are stored into ElasticSearch, which enables advanced faceted browsing support and makes JSONpedia a massive data scraping facility.
Tags: ElasticSearch, Groovy/Python, Faceted Browser, Dynamic Extractors
Mentors: Michele, Marco
5.3 Parallel processing in DBpedia extraction Framework
In last year’s GSoC we had a successful project that ported the DBpedia extraction framework to Apache Spark. This idea is an extension of that work: refine the project and prepare preconfigured setups that run on AWS, GCloud and Docker images.
Tags: Scala, Spark, Hadoop, AWS, GCloud, Docker
Mentors: Dimitris, Alexandru, Sang
5.4 Mappings freshness & Better statistics / reporting tools
Template statistics are currently created on demand by developers, typically when a new DBpedia release is about to be published.
Editors use template statistics to know which templates need to be mapped and which would be the greatest contribution in terms of number of generated statements.
An automatic and periodic template stats generation process could greatly improve mappings freshness and completeness.
Templates in Wikipedia tend to change over time, possibly leading to outdated mappings in DBpedia. Sometimes template properties are simply renamed, but they can also be added or removed.
Currently, the mappings server shows which mapped properties are not found in the actual usage of a template in Wikipedia, but there is no notification system that alerts editors to these inconsistencies.
It would be useful to inform editors about changes in template definitions so that mappings could be updated and the DBpedia output could be re-aligned to the current Wikipedia status.
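The check described above amounts to a set comparison. The following is a minimal sketch, assuming the mapped properties and the properties observed in current template usage are available as plain sets; the function name and the sample property names are hypothetical and not part of the mappings server.

```python
def find_mapping_drift(mapped_properties, used_properties):
    """Compare a template's mapped properties with those seen in Wikipedia.

    Returns a dict with:
      - "stale": mapped properties no longer used (renamed or removed)
      - "unmapped": properties in use that have no mapping yet
    """
    mapped = set(mapped_properties)
    used = set(used_properties)
    return {
        "stale": sorted(mapped - used),
        "unmapped": sorted(used - mapped),
    }


# Hypothetical example: "birth_place" was renamed to "birthplace" in the template.
drift = find_mapping_drift(
    mapped_properties={"name", "birth_place", "occupation"},
    used_properties={"name", "birthplace", "occupation", "nationality"},
)
print(drift)
```

A periodic job could run such a check per template and notify editors whenever the `stale` list is not empty.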
Wikipedia uses many stub templates to mark articles down for review. Some of those stub templates provide semi-structured information which can be leveraged to populate ontology properties (e.g. instance type, nationality, etc.).
Since the template statistics build process excludes those templates, as they do not meet a specific “property ratio rule” (used to ignore templates which generally do not convey meaningful information), they do not appear in the template statistics and are mostly ignored by editors.
Hence it is crucial to let the framework recognize stub templates and produce statistics for those as well to increase the quality of DBpedia.
Tags: Java/Scala, PHP, RDF
5.5 Improved Mapping Support for the Mappings Wiki
The DBpedia Mappings Wiki is one of the central components of the DBpedia extraction framework. It consists of a MediaWiki instance where users can contribute mapping rules in different languages that associate Wikipedia template properties with DBpedia ontology properties and map Wikipedia infobox templates to DBpedia ontology classes [1].
One of the problems editors face when creating new mappings is the need to analyse the existing Wikipedia infobox definition, its usage examples and the existing mappings of that infobox in other languages, as well as to look for existing properties and classes in the DBpedia ontology. This workflow requires opening 5 to 10 windows simultaneously.
The scope of this project is to create a MediaWiki extension that blends information from Wikipedia and the DBpedia Mappings Wiki with extended search capabilities for the DBpedia ontology in one workbench. Furthermore, it should provide extended mapping validation and error reporting capabilities to the user.
[1] Claus Stadler: Community-Driven Engineering of the DBpedia Infobox Ontology and DBpedia Live Extraction
5.6 DBpedia Data Error Reporting Tool
A common problem of knowledge bases in general, and of DBpedia in particular, is bad data. This can result either from wrong information in Wikipedia or from information being incorrectly parsed by the DBpedia extraction framework. Currently there is no way to mark extracted data in DBpedia as being incorrect (or missing).
The task is to create a tool and to modify the Linked Data interface of DBpedia so that particular triples can be marked as incorrect and errors can be reported. Users could be offered the possibility of submitting a bug report integrated with the GitHub issue tracker.
Furthermore, the interface should be integrated with Wikipedia/Wikidata so that users can go to the corresponding Wikipedia article page. Users can then investigate whether the information is wrong in Wikipedia and correct it directly or mark it as an infobox extraction error.
Mentors: Alexandru, Magnus
5.7 Reverse Engineering and Aligning Freebase with DBpedia
Freebase and DBpedia are both crowdsourcing-based information extraction frameworks. However, due to its powerful commercial backing, Freebase arguably has better data quality and schema coverage than DBpedia, and has therefore enjoyed wider industry adoption. After more than 7 years of Freebase's existence, Google has decided to shut it down. The data will be transferred to Wikidata, but the exact details remain unknown.
The idea of this project is to align the Freebase schema and the DBpedia ontology and to produce a new ontology that conforms to both. This means mapping existing ontology properties and classes from DBpedia to Freebase and introducing new classes and properties where needed. It also implies a deeper analysis of the instance type assignment in Freebase, as well as a triple-level analysis of the extracted data after the alignment has been completed, in order to infer differences between the preprocessing and curation steps of the two frameworks.
The goal of this project is to allow tools that are based on Freebase to migrate more easily to DBpedia.
Tags: Java/Scala, RDF/OWL, Machine Learning, Freebase
Mentors: Alexandru, Marco
5.8 DBpedia Live scaling & new interface
DBpedia Live offers real-time RDF extraction from Wikipedia. At the moment, the extraction keeps a cache of all extracted pages in order to produce proper diffs. This cache is stored in a relational MySQL database. While this works well for other languages, it creates a bottleneck for English. We want to move this cache to a NoSQL database such as MongoDB.
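To make the role of the cache concrete, here is a minimal sketch of how a diff could be derived from the cached triples of a page and a fresh extraction. It uses in-memory sets instead of a database, the sample values are made up, and `diff_triples` is a hypothetical helper, not the DBpedia Live implementation.

```python
def diff_triples(cached, extracted):
    """Return (added, removed) triples between the cache and a new extraction."""
    cached, extracted = set(cached), set(extracted)
    added = extracted - cached
    removed = cached - extracted
    return added, removed


# Hypothetical cache entry and re-extraction for one page.
cached = {
    ("dbr:Berlin", "dbo:populationTotal", "3400000"),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
}
extracted = {
    ("dbr:Berlin", "dbo:populationTotal", "3500000"),
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
}

added, removed = diff_triples(cached, extracted)
print("added:", added)
print("removed:", removed)
```

Whatever store replaces MySQL has to support this read-compare-write cycle per page edit, so lookup by page identifier is the access pattern that matters most.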
In addition we would like to produce a new interface where people can see what is currently being extracted.
Tags: NoSQL, (Node)JS, Java/Scala
Mentors: Dimitris, Magnus
5.9 Keyword Search on DBpedia
In last year’s GSoC we successfully investigated an architecture for question answering on DBpedia based on the TBSL approach [1]. We showed that the approach can easily be ported to Chinese. In this year’s GSoC, we aim to develop a more Google-like (i.e., scalable and keyword-based) approach to search on DBpedia [2]. In particular, we aim to develop an approach that generates interpretations of keyword queries as SPARQL queries, verbalizes them using SPARQL2NL (http://aksw.org/projects/SPARQL2NL) [3] and verbalizes the results using the entity verbalization approach provided in SPARQL2NL. The system is to be developed primarily for English. The architecture should, however, allow for extensions to other languages.
[1] Christina Unger, Lorenz Bühmann, Jens Lehmann, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, and Philipp Cimiano. 2012. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web (WWW '12). ACM, New York, NY, USA, 639-648.
[2] Denis Lukovnikov and Axel-Cyrille Ngonga Ngomo. 2014. SESSA – Keyword-Based Entity Search through Coloured Spreading Activation. Natural Language Interfaces for the Web of Data Workshop at the International Semantic Web Conference (ISWC 2014). Best Paper Award.
[3] Axel-Cyrille Ngonga Ngomo, Lorenz Bühmann, Christina Unger, Jens Lehmann, and Daniel Gerber. 2013. Sorry, I don't speak SPARQL: translating SPARQL queries into natural language. In Proceedings of the 22nd international conference on World Wide Web (WWW '13). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 977-988.
Tags: Java, Information Retrieval, Keyword Search, Scalability, Natural Language Generation
Mentors: Axel, Dinesh
5.10 DBpedia Metadata Datasets
One of the supposed strengths of RDF is the ability to store metadata (e.g. sources for a fact, the date of those sources, which extractor was used for extraction, etc.). Yet DBpedia, as one of the core Linked Data projects, does not make significant use of this (mostly because of shortcomings of RDF itself). This matters in particular now that Wikidata information, which contains a lot of metadata, is becoming available. In a first stage of the task, information about different annotation models (e.g. RDF annotations, OWL annotations, the Wikidata model, the model at http://arxiv.org/pdf/1406.3399.pdf, etc.) should be gathered. The main task is then to provide an extension of the DBpedia extraction framework that allows this metadata to be extracted in one selected format.
Tags: annotations, metadata, Wikidata
Mentors: Jens, Dimitris
5.11 DBpedia Schema Enrichment on Web Protege
DBpedia is constructed using a crowdsourcing approach in which Wikipedia structures are mapped to the DBpedia ontology. The choice of mapping is partially subjective, and the mapping is flexible in that it allows users to create new properties. As a result, properties with the same semantics are represented using different constructs, which in turn makes querying and reusing DBpedia more difficult. Therefore, there is a need to provide editors with more powerful tools for managing and enriching the DBpedia ontology.
This will partially be facilitated by DBpedia moving to Web Protégé. With Web Protégé, users have richer functionality that allows them to maintain a cleaner ontology structure. This approach will require only limited changes to other parts of the DBpedia architecture.
Nevertheless, Web Protégé itself provides only limited support for automatically detecting problems and enriching the ontology. In this task, the idea is to take DL-Learner, a data and schema analysis tool, and implement a plugin for Web Protégé. DL-Learner will provide suggestions to mapping editors on how the ontology could be extended and can also pinpoint particular problems. With only limited effort, this will integrate state-of-the-art machine learning algorithms into the DBpedia community process to significantly improve the quality of DBpedia and the ease of access to it. A similar DL-Learner plugin for the regular Protégé version already exists [4,5], but integrating it into a web application framework and adapting it to the DBpedia workflow requires further engineering and implementation effort.
Mentors: Jens, Lorenz
5.12 Deploying a DBpedia Question Answering Engine
Follow-up on the QA work from earlier years: http://wiki.dbpedia.org/gsoc2014/ideas#h359-22
The main goal in this task is to actually deploy a question answering engine for DBpedia, implement feedback elements and connect this to the DBpedia community. Currently, most (or all) deployments are scientific prototypes whereas here we want to deploy an engine which is highly available, responsive and up-to-date. Access to server hardware for this task will be provided by the mentors.
Tags: Question Answering
Mentors: Jens, Axel
5.13 Aligning Life-Science Ontologies to DBpedia
The goal of this project is to better align DBpedia with life science ontologies. The task will consist of a) improving the extraction framework for life science data, b) interlinking this data with various life science ontologies and c) using the links to validate schema and instance data on both sides.
Mentors: Jens, Patrick, Robert
5.14 Scalable querying of the live DBpedia data stream
Every time something is updated in Wikipedia’s infoboxes, new triples are generated for DBpedia Live. It can take some time, however, until those triples are available for querying: official datasets are released roughly every year. Your goal in this project is to make DBpedia Live instantaneously available through the Triple Pattern Fragments interface. The current interface (fragments.dbpedia.org) runs on top of a binary file format that needs complete regeneration at every update. Can you make this faster and shorten the deployment cycle, either by algorithmic design or by an intelligent streaming architecture? And can you implement this in a programming language of your choice? Your work will allow anybody in the world to query the latest news as it happens!
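One possible direction is to keep an index that is updated triple by triple instead of being regenerated. The following is a minimal in-memory sketch of that idea. The class and method names are hypothetical, it ignores persistence and paging, and it does not describe how the current interface is implemented.

```python
from collections import defaultdict


class TriplePatternIndex:
    """In-memory index answering single triple patterns, updated incrementally."""

    def __init__(self):
        self.triples = set()
        # One lookup table per position: term -> triples containing it there.
        self.by_position = [defaultdict(set) for _ in range(3)]

    def add(self, triple):
        self.triples.add(triple)
        for position, term in enumerate(triple):
            self.by_position[position][term].add(triple)

    def remove(self, triple):
        self.triples.discard(triple)
        for position, term in enumerate(triple):
            self.by_position[position][term].discard(triple)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        result = None
        for position, term in enumerate((s, p, o)):
            if term is None:
                continue
            candidates = self.by_position[position].get(term, set())
            result = candidates if result is None else result & candidates
        return set(self.triples) if result is None else set(result)


index = TriplePatternIndex()
index.add(("dbr:Berlin", "dbo:country", "dbr:Germany"))
index.add(("dbr:Paris", "dbo:country", "dbr:France"))
index.remove(("dbr:Paris", "dbo:country", "dbr:France"))  # a live update
print(index.match(p="dbo:country"))
```

Each update touches only the entries of the affected triple, so the cost of an edit does not depend on the size of the whole dataset.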
Tags: Scalability, Stream Processing, Indexing, Querying
Mentors: Ruben, Dimitris
5.15 DBpedia Spotlight – Better Context Vectors
Benchmarking DBpedia Spotlight against some datasets indicates that it faces some problems with regard to disambiguation. At the moment, the context of an entity is represented by simple counts of the terms around links to that entity. It would be interesting to investigate the following:
– Does smoothing/pruning offer a significant improvement in Spotlight's performance?
– What distributional methods can be used to represent context (e.g. word2vec / GloVe)? Do they offer a significant performance improvement?
– Is there any other metric for measuring the similarity between entity candidates and the context around the mention?
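As a baseline for the questions above, the count-based representation can be sketched as follows. This is a simplified illustration, not Spotlight's actual scoring: the use of cosine similarity, the function names and the sample counts are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(mention_context, candidates):
    """Pick the candidate whose context vector is closest to the mention's."""
    mention_vector = Counter(mention_context)
    return max(candidates, key=lambda name: cosine(mention_vector, candidates[name]))


# Hypothetical context counts, as if collected around links to each entity.
candidates = {
    "Washington_(state)": Counter({"seattle": 5, "state": 4, "coast": 2}),
    "George_Washington": Counter({"president": 6, "general": 3, "army": 2}),
}

context = ["the", "first", "president", "and", "army", "general"]
print(disambiguate(context, candidates))
```

Smoothing, pruning or dense vectors such as word2vec would replace the `Counter` representation, while the comparison step could stay the same.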
Tags: Scala, Java, Natural Language Processing, Machine Learning, Entity Context
Mentors: Joachim, Pablo, Thiago, David
5.16 DBpedia Spotlight – Better Surface Form Matching
Spotlight extracts entities by first locating the word sequences (surface forms, mentions) that might designate a Wikipedia entity. Then, it looks up the Wikipedia entities that have been associated with that specific surface form and calculates which entity is the best match in the current context. The association between surface forms and candidate entities rests solely on the links found in Wikipedia. This captures the way in which people actually annotated data in Wikipedia, but fails to capture other ways in which people might use words to refer to entities. For example, suppose the surface form 'The giants' has been used in a link to The New York Giants and nothing else, whereas the surface form 'Giants' has been used in a link to The San Francisco Giants and nothing else. This means that if the surface form 'The giants' is extracted in the spotting phase, it would be impossible for Spotlight to link that surface form to the San Francisco Giants (regardless of context), because that entity simply doesn't figure as a link candidate for that surface form. Can you think of ways to address the following:
– How to deal with linguistic variation (lowercase/uppercase surface forms, determiners, accents, Unicode) in such a way that the right generalisations can be made and some form of probabilistic structure can be determined in a principled way.
– Improve the memory footprint of the stores that hold the surface forms and their associated entities.
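To illustrate the first point, the sketch below merges the candidates of surface forms that differ only in case, accents or a leading determiner. It is one naive option among many: the determiner list, the function names and the link counts are assumptions, and aggressive normalization can also merge forms that should stay apart.

```python
import unicodedata
from collections import defaultdict

DETERMINERS = {"the", "a", "an"}  # assumption: English only


def normalize(surface_form):
    """Lowercase, strip accents and drop a leading determiner."""
    decomposed = unicodedata.normalize("NFKD", surface_form)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = without_accents.lower().split()
    if len(tokens) > 1 and tokens[0] in DETERMINERS:
        tokens = tokens[1:]
    return " ".join(tokens)


def merge_candidates(link_counts):
    """Group candidate entities under normalized surface forms.

    link_counts maps (surface form, entity) -> number of links observed.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for (surface_form, entity), count in link_counts.items():
        merged[normalize(surface_form)][entity] += count
    return merged


# Hypothetical link counts mirroring the example in the text.
link_counts = {
    ("The giants", "New_York_Giants"): 3,
    ("Giants", "San_Francisco_Giants"): 7,
}

merged = merge_candidates(link_counts)
print(dict(merged["giants"]))
```

After merging, both teams are candidates for either spelling, and the original counts remain available as evidence for a prior.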
Tags: Scala, Java, Natural Language Processing, Machine Learning
Mentors: Joachim, Pablo, Thiago, David
5.17 DBpedia Spotlight – Better Tools for Model Creation
DBpedia Spotlight's current tools for building Named Entity Recognition models (pignlproc) were designed for older versions of Apache Hadoop and Apache Pig. Running them often results in problems, and generating models for large datasets requires many hours on expensive hardware. Moreover, they generate data that is geared towards the data structures used within Spotlight. We propose to rebuild this tool so that it can (i) run more efficiently, (ii) be maintained more easily, and (iii) aggregate more information about surface forms and entities, so that it can be used in the development of DBpedia Spotlight or be reused by other entity-linking projects (e.g. what information is extracted by Wikipedia Miner, TAGME and others?). Apache Spark seems to be the technology of choice to tackle these issues.
Tags: Entity Linking, Spark, Java, Scala, Wikipedia, Natural Language Processing
Mentors: Joachim, Pablo, Thiago, David
5.18 DBpedia Spotlight – Model Editor/Domain Adaptation tool [not a priority]
Given that we are already running a web server, it would be great to have a viewer of the model on top of it, with some endpoints that could be queried to get:
- an entity’s surface forms, context vectors and probabilities;
- the candidates attached to a surface form, etc.
This would be useful for digging into the model when someone reports weird behaviour, as well as for assessing the model creation.
To make it more interesting it could be transformed into an editor for "domain adaptation" in which it would consume a corpus and then introduce some changes into the model.
Tags: Java, Scala, Natural Language Processing
Mentors: Joachim, Pablo, Thiago, David
5.19 DBpedia Spotlight – Confidence/Relevance Scores
DBpedia Spotlight takes a confidence value as input and applies it in three filters: (i) one related to the probability of a surface form, (ii) another related to the score of the second-best candidate entity, and (iii) another related to a semantic measure. The problem is that no single value unifies these three filters into a measure of the confidence the system has in the entity it finally selects. This makes it hard to automate the calculation of precision-recall metrics, and it also leaves the system unable to tell how relevant an entity is to the text. In that light, you could try to answer the following:
(i) Is it worthwhile to have a unifying value for the selected candidate? How would you combine the values from the filters that are already in place?
(ii) Can the notion of entity relevance be equated with that of confidence?
The community has some ideas on how to measure the relevance of an entity within an extraction. A simple baseline is already implemented, but that could be improved further. An annotated corpus with ranked entities could be created so we can compare different approaches.
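One way to approach question (i) is to fold the three signals into a single number. The sketch below uses a weighted geometric mean, which is purely an assumption made for illustration and not something Spotlight implements. The function name, the margin definition and the requirement that all inputs lie in [0, 1] are hypothetical as well.

```python
import math


def combined_confidence(spot_probability, top_score, second_score, semantic_score,
                        weights=(1.0, 1.0, 1.0)):
    """Combine three filter signals into one value in [0, 1].

    Assumes every input already lies in [0, 1]. The margin term rewards a
    clear gap between the best and second-best candidate.
    """
    margin = 0.0 if top_score <= 0 else (top_score - second_score) / top_score
    signals = (spot_probability, margin, semantic_score)
    if any(s <= 0 for s in signals):
        return 0.0
    total = sum(weights)
    # Weighted geometric mean: one weak signal pulls the whole score down.
    log_mean = sum(w * math.log(s) for w, s in zip(weights, signals)) / total
    return math.exp(log_mean)


score = combined_confidence(0.9, 0.8, 0.2, 0.7)
print(round(score, 3))
```

An annotated corpus, as suggested above, would be needed to tune the weights or to decide whether a learned combination works better than a fixed formula.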
Tags: Scoring functions, Scala, Java, Natural Language Processing, Named entity recognition and disambiguation
Mentors: Joachim, Pablo, Thiago, David
7 More Information
- More about our project proposal
- Where can I apply? Directly at the GSoC 2015 website.
- Where can I get more information about GSoC 2015? Directly at the GSoC FAQs.
- What should I include in my application? We've put together a template application for you.
- Documentation: DBpedia (public wiki, github wiki (better))
- Open issues / feature requests
- Our users mailing list archives
- Our developers mailing list archives