DBpedia & DBpedia Spotlight Proposal for the Google Summer of Code 2013
DBpedia & DBpedia Spotlight
DBpedia (http://dbpedia.org) and DBpedia Spotlight (http://spotlight.dbpedia.org) are two projects with strong ties in topic, data and community, which is why we decided to join forces and apply together this year.
Almost every major Web company has now announced work on a knowledge graph, including Google’s Knowledge Graph, Yahoo!’s Web of Objects, Walmart Labs’ Social Genome, Microsoft's Satori Graph / Bing Snapshots and Facebook’s Entity Graph.
DBpedia is a community-run project that has been working on a free, open-source knowledge graph since 2006. DBpedia currently exists in 97 different languages and is interlinked with many other databases (e.g. Freebase, New York Times, CIA Factbook) and, we hope, through this GSoC with Wikidata too. The knowledge in DBpedia is exposed through a set of technologies called Linked Data. Linked Data has been revolutionizing the way applications interact with the Web. While Web 2.0 technologies opened up much of the “guts” of websites for third parties to reuse and repurpose data on the Web, they still require developers to create one client per target API. With Linked Data technologies, all APIs are interconnected via standard Web protocols and languages.
One can navigate this Web of facts with standard Web browsers or automated crawlers, or pose complex queries with SQL-like query languages (e.g. SPARQL). Have you thought of asking the Web about all cities with low crime rates, warm weather and open jobs? That's the kind of query we are talking about.
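To give a feel for what such a query looks like, here is a minimal sketch in Python that only builds a SPARQL query and the matching request URL, without sending anything over the network. The helper names (`build_city_query`, `build_request_url`) are hypothetical, and the endpoint URL, the `dbo:City` class and the `dbo:populationTotal` property are assumptions about the public DBpedia service that should be checked against the current ontology before use. Population stands in here for the richer criteria mentioned above (crime, weather, jobs), which would need additional datasets.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed address of the public DBpedia SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"


def build_city_query(min_population: int, limit: int = 10) -> str:
    """Build a SPARQL query for cities above a population threshold.

    dbo:City and dbo:populationTotal are assumed ontology terms.
    """
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {{
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  FILTER (?population > {int(min_population)})
}}
ORDER BY DESC(?population)
LIMIT {int(limit)}
""".strip()


def build_request_url(query: str) -> str:
    """Encode the query as an HTTP GET URL for the endpoint."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_city_query(1_000_000, limit=5)
    print(query)
    print(build_request_url(query))
```

The resulting URL could be fetched with any HTTP client; the point of the sketch is that the same standard query language and protocol work against any Linked Data endpoint, rather than one client per API.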
This new Web of interlinked databases provides useful knowledge that can complement the textual Web in many ways. See, for example, how bloggers tag their posts or assign them to categories in order to organize and interconnect them. This is a very simple way to connect "unstructured" text to a structure (a hierarchy of tags). For a more advanced example, see how the BBC created its World Cup 2010 website by interconnecting textual content and facts from its knowledge base. Identifiers and data provided by DBpedia played a large part in creating this knowledge graph. Or, more recently, did you see that IBM's Watson used DBpedia data to win the Jeopardy! challenge?
DBpedia Spotlight is an open-source (Apache license) text annotation tool that connects text to Linked Data by marking names of things in text (we call that Spotting) and selecting between multiple interpretations of these names (we call that Disambiguation). For example, “Washington” can be interpreted in more than 50 ways, including a state, a government or a person. You can already imagine that this is not a trivial task, especially when we're talking about 3.64 million “things” of 320 different “types” with over half a billion “facts” (as of July 2011).
After a successful GSoC 2012 with DBpedia Spotlight, this year we join forces with the DBpedia Extraction Framework and other DBpedia-family products. We are excited about our new ideas, and we hope you will be too!
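The two steps can be illustrated with a toy sketch in Python. This is not DBpedia Spotlight's actual implementation: the tiny candidate table, the context words and the word-overlap score are all made up for illustration, and a real system works with millions of candidates and a far richer model.

```python
from collections import Counter

# Hypothetical toy knowledge base: name -> candidate meaning -> context words.
CANDIDATES = {
    "Washington": {
        "Washington_(state)": ["state", "seattle", "pacific", "northwest"],
        "George_Washington": ["president", "general", "revolution"],
        "Washington,_D.C.": ["capital", "congress", "government"],
    },
}


def spot(text):
    """Spotting: return (name, character offset) pairs for known names."""
    found = []
    for name in CANDIDATES:
        start = text.find(name)
        while start != -1:
            found.append((name, start))
            start = text.find(name, start + 1)
    return found


def disambiguate(name, text):
    """Disambiguation: pick the candidate whose context words occur most
    often in the text (a simple overlap count, assumed for illustration)."""
    words = Counter(w.strip(".,;:!?").lower() for w in text.split())
    scores = {
        candidate: sum(words[w] for w in context)
        for candidate, context in CANDIDATES[name].items()
    }
    return max(scores, key=scores.get)


def annotate(text):
    """Run spotting, then disambiguate every spotted name."""
    return [(name, offset, disambiguate(name, text)) for name, offset in spot(text)]


if __name__ == "__main__":
    print(annotate("Congress met in Washington, the capital, to debate the budget."))
    print(annotate("Washington was the first president and a general."))
```

Even in this toy form, the same surface string resolves to different meanings depending on the surrounding words, which is the core difficulty that grows with the number of candidate interpretations.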
Organization home page url
Main organization license
GNU GPL v2 & Apache license
If you chose "veteran" in the dropdown above, please summarize your involvement and the successes and challenges of your participation. Please also list your pass/fail rate for each year
During GSoC 2012, we had the pleasure and honor of working with 4 students to enhance DBpedia Spotlight in time performance, accuracy and extra functionality. The core model we use for automatic disambiguation is based on a large vector space model of words, and one student project (by Chris Hokamp) included processing all the data on Hadoop, as well as analyzing the dimensions of this model using techniques such as Latent Semantic Analysis and Explicit Semantic Analysis. A second project (by Joachim Daiber) implemented a probabilistic interpretation of the disambiguation model, and provided a key-value store implementation that allows for efficiency and flexibility in modifying the scoring techniques. Our third project (by Dirk Weissenborn) added topical classification to our model, along with live updating/training of the models as Wikipedia changes (or news items are released), so that DBpedia Spotlight can be kept up to date with the world as soon as events happen. Finally, the fourth project (by Liu Zhengzhong, a.k.a. Hector) provided an implementation of collective disambiguation. In this approach, each of the things found in the input text contributes to finding the meaning of the other things in the same text, through graph algorithms that benefit from the structure of our knowledge base.
Together, these four projects greatly enhanced DBpedia Spotlight towards achieving its objective of serving as a flexible tool that can cater to many different applications interested in connecting documents to structured data.
Why is your organization applying to participate in Google Summer of Code 2013? What do you hope to gain by participating?
We are interested in seeing research and development both on the DBpedia & DBpedia Spotlight software itself and on applications built around it. We hope to gain new developers who become familiar with the core, as well as platform-oriented developers who can help us steer research-oriented ideas in the directions needed by the applications that build on our software.
What is the URL for your Ideas list?
What is the main development mailing list for your organization?
What criteria did you use to select your mentors for this year's program? Please be as specific as possible.
Our mentors are reliable members of the DBpedia and DBpedia Spotlight community. All but two of them have been active for more than a year, and all have submitted hundreds (of thousands) of commits to our code base or the mappings wiki. They all hold a stake in the DBpedia project, as their daily work or academic projects (e.g. a PhD thesis) rely in some form on the output produced by DBpedia and DBpedia Spotlight.
List of mentors (alphabetically):
- Christopher Sahnwaldt (DBpedia, jcsahnwaldt) lives in Berlin and tries to make a living writing software. He was paid for writing and running code for DBpedia in 2009 and 2012; at other times he volunteers while working on other projects. Just to confuse his friends and colleagues, he sometimes uses his other first name, Jona.
- Dirk Weissenborn (DBpedia Spotlight) was a successful student in last year's GSoC.
- Julien Cojan (DBpédia en français) lives in Nice. He works at an INRIA lab and maintains the French chapter of DBpedia.
- Joachim Daiber (DBpedia Spotlight, jodaiber) studies Natural Language Processing in Prague and Groningen. He co-maintains the DBpedia Spotlight project.
- Marco Fossati (DBpedia, hell.j.fox) lives in Trento, Italy. He works in the Web of Data unit at Fondazione Bruno Kessler (FBK) and at SpazioDati. He is responsible for the Italian DBpedia data curation.
- Sebastian Hellmann (DBpedia, kurzum) lives in Leipzig and develops the NLP Interchange Format. He was a GSoC mentor two years ago for DBpedia (on behalf of Apertium) and developed the initial version of the DBpedia-Live extraction. Furthermore, he (co-)chaired the Open Knowledge Conference 2011 and the Linked Data Cup 2012, and is a member of over a dozen open source projects on Google Code and SourceForge.
- Max Jakob (DBpedia Spotlight, maxjakob) lives in Berlin. He works at Neofonie GmbH and is co-creator of DBpedia Spotlight. He also formerly maintained the DBpedia project.
- Dimitris Kontokostas (DBpedia, jimkont) lives in Veria, Greece. He is a researcher at the AKSW group of Leipzig University and co-maintains the DBpedia project.
- Gerard Kuys (DBpedia) is a historian of social and economic history who has wandered into information organisation (structure, findability, knowledge models, ICT and content management). He contributes to the mappings wiki and would like to have more features when editing. He is a member of the Werkgroep Wiki Loves Monuments at Wikipedia.
- Pablo N. Mendes (DBpedia & DBpedia Spotlight, pablomendes) is Brazilian, now living in Berlin. He is researching Linked Data and Information Extraction. He co-created and maintains the projects DBpedia Spotlight, DBpedia Portuguese, Twarql, and Cuebee.
- Jimmy O'Regan (DBpedia, jimregan) lives in Ireland. He has been involved in GSoC for a number of years with Apertium, and is involved in the internationalization of DBpedia.
- Charalampos Bratsas (DBpedia, char.brat) lives in Greece. He leads the R&D efforts of the Greek Linked Open Data, which is currently part of the global 2011 LOD cloud with datasets such as DBpedia in Greek, Hellenic Police, Hellenic Fire Brigade and many others.
- Lydia Pintscher (Wikidata, not a mentor but a point of contact for Wikidata) lives in Berlin. She does community communications for Wikidata at Wikimedia Deutschland.
- Volha Bryl (DBpedia, volha) lives in Germany, is a researcher at the University of Mannheim, and works on DBpedia data fusion across multiple languages.
What is your plan for dealing with disappearing students?
We will ask the students for their schedule well in advance of the program start. Any unscheduled absence of more than 72 hours will be reported to the administrators, who will attempt to contact the student. If these attempts fail, the project will be frozen, and we will contact Google.
What is your plan for dealing with disappearing mentors?
Each student will be assigned at least two main mentors, who will both follow the student's progress. In the event of one disappearing, the other will continue. Additionally, co-mentors will be assigned to support the main mentors and can step up as a replacement. We have intentionally limited the pool of mentors to those who feel they can dedicate their time to an entire project, though the projects can expect support and additional mentoring from other community members who cannot make this time commitment.
What steps will you take to encourage students to interact with your project's community before and during the program?
In addition to documenting the internals, we have been taking steps towards making the project more community-centered: increased use of wikis, the creation of an IRC channel, and more open discussion on the mailing list. The steps towards internationalizing the software have also increased its scope, and we are seeing early interest due to that.
In an initial phase, we will introduce students to key members of the community. The students can join our telcos. Early on, we will also ask them to contribute to micro tasks (e.g. improving documentation, fixing bugs, answering support emails), which can earn them an initial good standing. Of course, this is optional and will not be enforced strictly, but it would help break the ice.
During the application period, we will build a small web application demonstrating how our tool could be used to help students explore GSoC project proposals:
We hope this will stimulate applicants to learn more about the project.
During the program, we hope to maintain our IRC channel and mailing list as open channels where the students can discuss their work in a low-pressure environment.
After the program, we intend to encourage our students to co-write academic papers about their work, to further both the students' academic careers and our projects' own academic interests.
What will you do to encourage that your accepted students stick with the project after Google Summer of Code concludes?
We try to be very open and transparent in all our operations. Every external contribution is publicly visible, and we believe that this creates a stronger bond within the community and an incentive to participate further. As two of this year's mentors were students last year, we believe this model to be effective.
We are furthermore planning to create a more dedicated attribution page, as well as badges/positions to bond community members to the project. DBpedia has a very high reputation throughout the Web of Data and in the RDF world. We are proud of our best practices, working software and high-quality data, which are the reason for the wide use of our work. Affiliation will give any student concrete career advantages (besides learning best practices in Semantic Web standards).
Are you a new organization who has a Googler or other organization to vouch for you? If so, please list their name(s) here
Are you an established or larger organization who would like to vouch for a new organization applying this year? If so, please list their name(s) here