ChangeLog
DBpedia 3.8 (08/2012)
- Core Framework:
- Cleaned up Maven POM files
- Some new Scala classes need at least JDK 7. The framework would run, but does not compile, with earlier versions. Since this is inconvenient for some users, we plan to deploy the DBpedia framework in the main Maven repository, so users don't have to compile the framework anymore.
- Cleaned up handling of URIs / IRIs
- To save file system space, (optionally) compress DBpedia triple files while writing and decompress Wikipedia XML dump files while reading
- Triple file format selection is more configurable and flexible
- In addition to N-Triples and N-Quads, the framework can write triple files in Turtle and Turqle formats
- We download namespace names and redirect marker words for all Wikipedia editions semi-automatically and use them to improve parser precision
- Users can download the ontology and mappings from the mappings wiki and store them in files to avoid downloading them for each extraction, which takes a lot of time and makes extraction results less reproducible
- Preparation time and effort for abstract extraction is minimized, extraction time is reduced to a few milliseconds per page
- Using some bit twiddling, we can now load all ~200 million inter-language links into a few GB of RAM and analyze them
- During extraction, we use IRIs with 'local' domains (e.g. http://en.dbpedia.org/resource/Boötes) and, if desired, convert them to 'generic' domain and/or URI syntax (e.g. http://dbpedia.org/resource/Bo%C3%B6tes) during serialization
- To allow extracting data from multiple Wikipedia dumps for a language, the extraction uses the same folder and file name structure as the Wikipedia dump files
- Download code is more flexible, configurable and intelligent. It semi-automatically checks for new dumps and only downloads files that have changed.
- Cleaner Extractor class hierarchy
- Extraction steps that looked for links between different Wikipedia editions were replaced by more powerful post-processing scripts
- Article count ranges, e.g. 10000–100000, can now be used in all configuration files as an alternative to language lists, e.g. en,de,fr
- Introduced a simple Scala DSL for simple but efficient parsing and transformation of XML, based on the Streaming API for XML.
- Reduced use of Scala default arguments, which more often than not lead to errors of omission
- Reduced use of Scala Option, which often leads to code bloat without adding benefits
- Using more mutable collections and fewer wrapper classes improves performance and simplifies code
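The conversion from a 'local' IRI to a 'generic' URI mentioned in the list above can be illustrated with a minimal Python sketch. It is not the framework's Scala code: the function name `iri_to_generic_uri` and the set of characters left unescaped are assumptions made for this example only.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_generic_uri(iri: str, generic_host: str = "dbpedia.org") -> str:
    """Swap the language-specific host for the generic one and
    percent-encode non-ASCII characters in the path as UTF-8 bytes."""
    parts = urlsplit(iri)
    # Assumption: "/" , existing "%" escapes, commas and ampersands stay as they are.
    path = quote(parts.path, safe="/%,&")
    return urlunsplit((parts.scheme, generic_host, path, parts.query, parts.fragment))


print(iri_to_generic_uri("http://en.dbpedia.org/resource/Bo\u00f6tes"))
# prints http://dbpedia.org/resource/Bo%C3%B6tes
```

Doing this only at serialization time, as the notes describe, means the extraction itself can work with one IRI form throughout.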
- Data / Internationalization:
- Mapping-based datasets are now also available for Arabic, Basque, Bengali, Bulgarian, Czech, and Korean
- New language chapters: French, Italian and Japanese.
- As percent-encoding is strongly discouraged by the RDF specification, DBpedia's URI encoding avoids it as much as possible
- We now use IRIs for all languages except English, which uses URIs for backwards compatibility
- Turtle files (.ttl, .tql) always use IRIs, even for English
- New dataset with sameAs links between URI and IRI versions of a resource to improve backwards compatibility and compatibility between chapters
- New inter-language link datasets for all languages:
- interlanguage_links contains all inter-language links from one Wikipedia edition to others
- interlanguage_links_same_as contains bijective inter-language links, using property owl:sameAs
- interlanguage_links_see_also contains non-bijective inter-language links, using property rdfs:seeAlso
- Internationalized dataset files are located in the main directory for a language, not in a separate -i18n folder as in release 3.7, and their names are directly derived from the dataset name.
- Canonicalized dataset files are also located in the main directory for a language, but their names are constructed by appending '_en_uris' to the dataset name.
- In addition to N-Triples (.nt) and N-Quads (.nq) files, we also offer the datasets in Turtle (.ttl) and 'Turqle' (.tql) formats
- New dataset with transitive redirects – multiple redirects have been resolved, redirect cycles have been removed
- We now resolve redirects in all datasets where the objects are also DBpedia resources
- For cases in which the original, unredirected link target is needed, we also offer dataset files named '_unredirected'.
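The split of inter-language links into owl:sameAs and rdfs:seeAlso datasets described above might look roughly like the following Python sketch. It assumes that "bijective" means the target page links back to the source page; the function name and the page identifiers are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[str, str]  # (source page, target page)


def split_links(links: Iterable[Link]) -> Tuple[List[Link], List[Link]]:
    """Return (same_as, see_also): links with and without a matching back-link."""
    link_set: Set[Link] = set(links)
    same_as = sorted((s, t) for s, t in link_set if (t, s) in link_set)
    see_also = sorted((s, t) for s, t in link_set if (t, s) not in link_set)
    return same_as, see_also


# Hypothetical example data.
links = {
    ("en:Berlin", "de:Berlin"),
    ("de:Berlin", "en:Berlin"),
    ("en:Foo", "de:Bar"),  # no link back from de:Bar
}
same_as, see_also = split_links(links)
print(same_as)   # both directions of the Berlin link
print(see_also)  # the one-way link
```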
- Bugfixes:
- Fixed handling of URIs/IRIs
- When extracting raw infobox properties, make sure that predicate URI can be used in RDF/XML by appending an underscore if necessary
- Check that extracted date is valid (e.g. February never has 30 days) and its format is valid according to its XML Schema type, e.g. xsd:gYearMonth
- Remove HTML character entities in abstracts
- Parser now accepts space as grouping separator in most numbers
- Linksets: fix several URI and N-Triples encoding bugs
- Freebase links: use correct syntax for Freebase RDF URIs
- Fixed remaining N-Triples / Turtle encoding bugs
- Fixed thread-safety bug in date and number format parsers that caused values to be mixed between extracted pages
- Several Scala classes contained code to find the related classes (equivalent and base classes) of an ontology class. Moved this code to one class and fixed bugs.
- Fixed bug in download code that led to broken files
- page_ids and revision_ids datasets now use the DBpedia resource as subject URI, not the Wikipedia page URL
- use foaf:isPrimaryTopicOf instead of foaf:page for link from DBpedia resource to Wikipedia page
- In Java, the upper case for 'ß' is 'SS', but MediaWiki titles may start with 'ß'. Similar for some ligatures like 'ﬁ'. We no longer convert these to upper case if they appear as the first character in DBpedia URIs. (Some datasets were extracted before this bug was fixed.)
- Redirects extracted from older Wikipedia dumps were checked into the source code repository. This led to subtle and surprising bugs, so we removed them. Redirects are now extracted on demand and then stored in the extraction directory.
- Decode HTML-encoding and URI-encoding in Wikipedia link targets
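The 'ß' fix in the list above can be illustrated in Python, whose `str.upper()` expands 'ß' and the 'ﬁ' ligature in the same way the notes describe for Java. This is a sketch only; the function name `capitalize_title` and the exact rule (skip capitalization when the upper-case form is longer than one character) are assumptions, not the framework's code.

```python
def capitalize_title(title: str) -> str:
    """Upper-case the first character unless that would expand it."""
    if not title:
        return title
    upper = title[0].upper()
    # '\u00df' (sharp s) becomes 'SS' and '\ufb01' (fi ligature) becomes 'FI';
    # applying that would change the title, so such characters are left alone.
    if len(upper) != 1:
        return title
    return upper + title[1:]


print(capitalize_title("berlin"))      # prints Berlin
print(capitalize_title("\u00dfeta"))   # unchanged, still starts with the sharp s
```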
- Ontology:
- Since the DBpedia 3.7 release, the DBpedia community added 40 classes and 132 properties on the mappings wiki. The DBpedia 3.8 ontology encompasses 359 classes and 1775 properties (800 object properties, 859 datatype properties using standard units, 116 datatype properties using specialized units)
- Mappings:
- Mappings for several new languages: Arabic, Basque, Bengali, Bulgarian, Czech, Hindi, Japanese, Korean. (Mapping-based datasets are not available for Hindi and Japanese because when the 3.8 extraction ran, there weren't any mappings yet for these languages.)
- Major improvements in server responsiveness by keeping more data in memory, especially for statistics
- Users can specify revision ID when extracting samples from Wikipedia pages, not just page title
- Radically reduced effort required to add a mapping namespace for a new language by removing duplicated code in several places
- New mappings wiki syntax for labels and comments to make adding a mapping namespace for a new language easier
- Links to external datasets:
- Created new Freebase links based on Freebase dump from June 2012
Here is a list of available datasets, along with the number of languages they were generated for:
| Dataset | DBpedia 3.8 | DBpedia 3.7 |
| Ontology Infobox Types | 22 | 16 |
| Ontology Infobox Properties | 22 | 16 |
| Ontology Infobox Properties (Specific) | 22 | 16 |
| Titles | 111 | 97 |
| Short Abstracts | 111 | 97 |
| Extended Abstracts | 111 | 97 |
| Images | 8 | 6 |
| Geographic Coordinates | 111 | 97 |
| Raw Infobox Properties | 111 | 97 |
| Raw Infobox Property Definitions | 111 | 97 |
| Homepages | 12 | 10 |
| Persondata | 2 | 2 |
| PND | 3 | 2 |
| Inter-Language Links | 111 | 4 |
| Bijective Inter-Language Links | 111 | – |
| Non-bijective Inter-Language Links | 111 | – |
| Articles Categories | 111 | 1 |
| Categories (Labels) | 111 | 1 |
| Categories (Skos) | 111 | 1 |
| External Links | 111 | 1 |
| Links to Wikipedia Article | 111 | 97 |
| Wikipedia Pagelinks | 111 | 97 |
| Redirects | 111 | 1 |
| Transitive Redirects | 111 | – |
| Disambiguation links | 12 | 9 |
| IRI-same-as-URI links | 111 | – |
| Page IDs | 111 | 1 |
| Revision IDs | 111 | 1 |
| Revision URIs | 111 | – |
DBpedia 3.7 (08/2011)
- Framework:
- Redirects are resolved in a post-processing step, which increases inter-connectivity by 13% (applied to the English datasets)
- Extractor configuration using the dependency injection principle
- Simple threaded loading of mappings in server
- Improved international language parsing support thanks to the members of the Internationalization Committee
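Resolving redirects in a post-processing step, as noted in the list above, could work along the lines of this Python sketch. Following redirect chains to their final target and dropping redirect cycles mirrors what the 3.8 notes say about transitive redirects; the function names and the dictionary representation are assumptions for illustration.

```python
from typing import Dict


def resolve_redirects(redirects: Dict[str, str]) -> Dict[str, str]:
    """Map every redirect source to its final target; drop sources that end in a cycle."""
    resolved = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        while target in redirects:
            if target in seen:  # redirect cycle: no final target exists
                target = None
                break
            seen.add(target)
            target = redirects[target]
        if target is not None:
            resolved[source] = target
    return resolved


def redirect_object(obj: str, resolved: Dict[str, str]) -> str:
    """Replace a triple's object by its redirect target, if it has one."""
    return resolved.get(obj, obj)


# Hypothetical page names.
resolved = resolve_redirects({"A": "B", "B": "C", "X": "Y", "Y": "X"})
print(resolved)                        # {'A': 'C', 'B': 'C'}
print(redirect_object("A", resolved))  # C
```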
- Bugfixes:
- Encode homepage URLs to conform with N-Triples spec
- Correct reference parsing
- Recognize MediaWiki parser functions
- Raw infobox extraction produces more object properties again
- skos:related for category links starting with ":" and having an anchor text
- Restrict objects to Main namespace in Mapping Extractor
- Double rounding (e.g. a person's height should not be 1800.00000001 cm)
- Start position in abstract extractor
- Server can handle template names containing a slash
- Encoding issues in YAGO dump
- Ontology:
- owl:equivalentClass and owl:equivalentProperty mappings to http://schema.org
- Note that the ontology is now a directed acyclic graph. Classes can have multiple superclasses, which was important for the mappings to schema.org. A taxonomy can still be constructed by ignoring all superclasses but the one that is specified first in the list and is considered the most important.
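Constructing a taxonomy from the directed acyclic graph by keeping only the first-listed superclass, as the ontology note above describes, can be sketched in Python as follows. The class names and function names are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional


def to_taxonomy(superclasses: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Keep only the first-listed superclass of every class (None for roots)."""
    return {cls: (parents[0] if parents else None)
            for cls, parents in superclasses.items()}


def path_to_root(cls: str, taxonomy: Dict[str, Optional[str]]) -> List[str]:
    """Walk from a class up to its root along the single remaining superclass."""
    path = [cls]
    while taxonomy.get(path[-1]) is not None:
        path.append(taxonomy[path[-1]])
    return path


# Hypothetical class hierarchy with one class that has two superclasses.
dag = {
    "Thing": [],
    "Agent": ["Thing"],
    "Mammal": ["Thing"],
    "Person": ["Agent", "Mammal"],
}
taxonomy = to_taxonomy(dag)
print(path_to_root("Person", taxonomy))  # ['Person', 'Agent', 'Thing']
```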
- Mappings:
- Dynamic statistics for infobox mappings showing the overall and individual coverage of the mappings in each language
- ConstantProperty mappings
- Language specification for string properties in PropertyMappings
- Multiplication factor in PropertyMappings
- Links to external datasets:
- New links: UMBEL, EUNIS, LinkedMDB, (GeoSpecies)
- Updates for: Freebase, WordNet, OpenCyc, New York Times, Drugbank, Diseasome, Flickrwrapper, Sider, Factbook, DBLP, Eurostat, Dailymed, Revyu
DBpedia 3.6 (01/2011)
- Ontology mappings in six languages.
- Improved parsing coverage.
- Lists of elements in Infobox property values.
- Missing repeated links in Infoboxes.
- Flag templates.
- Various improvements on Internationalization.
- Improved recognition of
- Wikipedia namespace identifiers.
- Wikipedia language codes.
- Category hierarchies.
- Bugfix: encoding of complete range of Unicode code points (up to 0x10ffff).
- 16-bit code points start with '\u', code points larger than 16-bits start with '\U'.
- Bugfix: URI encoding is now closer to the one in Wikipedia.
- Commas and ampersands do not get percent-encoded anymore.
- Disambiguation links for acronyms.
- Wikilinks consisting of multiple words: If the starting letters of the words appear in correct order (with possible gaps) and cover all acronym letters.
- Wikilinks consisting of a single word: If the case-insensitive longest common subsequence with the acronym is equal to the acronym.
- New extractor: Geo-Related.
- More XSD datatypes.
- xsd:positiveInteger
- xsd:nonNegativeInteger
- xsd:nonPositiveInteger
- xsd:negativeInteger
- Several property name changes.
- Person Data
- http://dbpedia.org/ontology/deathPlace replaces http://dbpedia.org/property/deathPlace
- http://dbpedia.org/ontology/deathDate replaces http://dbpedia.org/property/death
- http://dbpedia.org/ontology/birthPlace replaces http://dbpedia.org/property/birthPlace
- http://dbpedia.org/ontology/birthDate replaces http://dbpedia.org/property/birth
- http://xmlns.com/foaf/0.1/givenName replaces http://xmlns.com/foaf/0.1/givenname
- Category Data
- http://purl.org/dc/terms/subject replaces http://www.w3.org/2004/02/skos/core#subject
- Additionally, http://www.w3.org/2004/02/skos/core#related is extracted as a relation between categories.
- Page IDs
- Revision IDs
- Page Links
- External Links
- Redirects
- Disambiguations
- Person Data
- Updated Freebase links.
- They now refer to mids because guids have been deprecated.
- Updated YAGO links.
- Thanks to Johannes Hoffart.
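The two rules for acronym disambiguation links listed in the 3.6 notes above can be sketched in Python. A case-insensitive longest common subsequence equal to the acronym is the same as the acronym being a subsequence of the word, so one helper covers both rules. Treating the multi-word rule as case-insensitive as well, and the function names, are assumptions of this sketch.

```python
def is_subsequence(needle: str, haystack: str) -> bool:
    """True if the characters of needle appear in haystack in order, gaps allowed."""
    remaining = iter(haystack)
    return all(ch in remaining for ch in needle)


def matches_acronym(acronym: str, link_text: str) -> bool:
    acronym = acronym.lower()
    words = link_text.lower().split()
    if not acronym or not words:
        return False
    if len(words) == 1:
        # Single word: the acronym must be a subsequence of the word.
        return is_subsequence(acronym, words[0])
    # Multiple words: the acronym must be a subsequence of the word initials.
    initials = "".join(word[0] for word in words)
    return is_subsequence(acronym, initials)


print(matches_acronym("FBI", "Federal Bureau of Investigation"))  # True
print(matches_acronym("TV", "Television"))                        # True
print(matches_acronym("FBI", "Federal Reserve"))                  # False
```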
DBpedia 3.5.1 (04/2010)
- Generated new Freebase links.
- Links to external data sets validated for Virtuoso import.
- Abstract quality improved (fewer infobox occurrences).
- Inter-language link detection corrected.
- Datatypes are now located in the http://dbpedia.org/datatype/ namespace (was http://dbpedia.org/ontology/).
- Recognition of disambiguation pages fixed
- Removed image references to dummy images (e.g., http://en.wikipedia.org/wiki/Image:Replace_this_image.svg)
DBpedia 3.5 (04/2010)
- The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are defined as part of the extraction mappings.
- The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned.
- In order to enable the DBpedia user community to extend and refine the infobox to ontology mappings, the mappings can be edited on the newly created wiki hosted on http://mappings.dbpedia.org/.
- At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates.
- On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, it contains 259 classes and about 1,200 properties.
- The ontology properties extracted from the infoboxes are now split into 2 data sets. (For details, see the Datasets page.)
- The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g., length). The range of a property is either an XSD schema type or a dimension of measurement, in which case the value is normalized to the respective SI unit.
- The Ontology Infobox Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit (e.g., the property height is specialized on the class Person using the unit centimetres instead of metres).
- The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping.
- In addition to the new mapping based extractor, three new extractors have been implemented:
- Page Id Extractor, which extracts the Wikipedia page ID of each page
- Revision Extractor, which extracts the latest revision of a page
- PNDExtractor, which extracts PND (Personennamendatei) identifiers
- The data set now provides labels, abstracts, page links, and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
- In addition to the N-Triples datasets, N-Quads datasets are provided which include a provenance URI for each statement. The provenance URI denotes the Wikipedia origin of the extracted triple. (For details, see the Datasets page.)
- The URI for extended abstracts was changed from http://dbpedia.org/property/abstract to http://dbpedia.org/ontology/abstract.
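The difference between the two ontology property datasets described in the 3.5 notes above (values normalized to SI units versus values in a class-specific unit) can be illustrated with a small Python sketch. The unit table, the function names, and the use of `Decimal` to avoid floating-point rounding noise are assumptions of this example, not the framework's implementation.

```python
from decimal import Decimal

# Assumed conversion factors to the SI unit of length (metres).
METRES_PER_UNIT = {
    "metre": Decimal("1"),
    "centimetre": Decimal("0.01"),
    "kilometre": Decimal("1000"),
    "foot": Decimal("0.3048"),
}


def to_si(value: str, unit: str) -> Decimal:
    """Normalize a length to metres, as in the generic properties dataset."""
    return Decimal(value) * METRES_PER_UNIT[unit]


def to_specific(si_value: Decimal, unit: str) -> Decimal:
    """Express a length in metres in a class-specific unit, e.g. centimetres for a person's height."""
    return si_value / METRES_PER_UNIT[unit]


height_m = to_si("6", "foot")
print(height_m)                             # 1.8288
print(to_specific(height_m, "centimetre"))  # 182.88
```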
DBpedia 3.4 (11/2009)
- update to Wikipedia dumps generated 2009 September 24–29
- number of ontology instances (mapping based extraction) rose to 1.17 million, overall number of instances to 2.9 million.
- labels, abstracts, generic template properties etc. are available in 91 different languages now (see http://downloads.dbpedia.org/3.4/)
- introduced strict and loose infobox ontology in order to meet different requirements of client applications (see http://wiki.dbpedia.org/Datasets#h18-11)
- As Wikipedia has moved to dual-licensing, we also dual-license DBpedia. DBpedia 3.4 is thus licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License
- also use templates that are redirects to known templates
- improved template to ontology mapping
- normalize Wikipedia units of measurement to ontology units of measurement
- ontology uses owl:Thing instead of Resource
- faster and more accurate abstract extraction, less redundant code
- store Wikipedia dump timestamps
- fixed some problems in importing Wikipedia dump into MySQL
- abstract extraction now needs fewer patches in the Wikipedia installation
- abstract extraction now needs one Wikipedia installation, not one per language
- fixed redirect extractor
- fixed some encoding problems
- made extraction more resilient to MySQL problems
- bugfix in link extractor: don't extract empty links
- converted mapping configuration from Excel to CSV files
- fixed some bugs in mapping.csv
- added mappings for more templates and classes (cars, space stuff, awards, species, ...)
- re-implemented SQL iterator classes to make them more flexible while minimizing code redundancy
- handle 'typed' templates like {{Geobox|River}}
- detect redirect and disambig pages using magic words for each language
- fixed our use of foaf:img / foaf:depiction
- generate xsd:gYear, xsd:gMonth, ... values
DBpedia 3.3 (07/2009)
- update to Wikipedia dumps generated in May 2009
- more accurate abstract extraction
- labels and abstracts in 80 languages (see http://downloads.dbpedia.org/3.3/)
- infobox extraction bugfixes
- new links to Dailymed, Diseasome, Drugbank, Sider, TCM
- updated OpenCyc links
DBpedia 3.2 (11/2008)
- update to Wikipedia dumps generated in October 2008
- new DBpedia Ontology introduced.
- new infobox extraction framework to populate the ontology.
- added initial infobox to ontology mappings.
- datatype extraction code improved.
- abstract extraction code improved.
- new Freebase to DBpedia RDF links.
- updated OpenCyc to DBpedia RDF links.
DBpedia 3.1 (08/2008)
- update to Wikipedia dumps generated in June 2008
- YAGO mapping improvements:
- YAGO itself has improved their algorithms
- the DBpedia-YAGO mapping is now generated by a YAGO converter and should have much better quality compared to the previous release
- Geo Extractor improvements:
- Geo-coordinates are now also provided in the W3C Geospatial Vocabulary using GeoRSS Simple encoding. The Basic Geo (WGS84 lat/long) Vocabulary is still supported due to its ease of use in SPARQL queries.
- International wikis are now included in the geo-coordinate extraction
- Support for further geo templates such as the proposed 'Coordinate' format
- Bugs fixed:
- #1871653 Too long URIs by infoboxes extractor cause import problems
- #1964434 illegal \ char in URIs
- #1970387 Filter out references
- #1964632 Illegal URIs
- #1947512 timespan extraction
- Fixed unwanted MySQL connection pooling and corrected database names for infobox and image extractors
- Fixed internal encoding of international page IDs
DBpedia 3.0 (02/2008)
DBpedia 3.0 comes with the following changes (includes those changes between DBpedia 2.0 and DBpedia 3.0RC):
- multi-language improvements: extractors now applied to up to 14 different languages (not all extractors work on all languages)
- redirects data set available
- image copyright issues:
- the image extractor tries not to extract non-free images anymore (however, we cannot guarantee that it will not still happen)
- most of the extracted image URLs now contain an additional triple: $image dc:rights $wikiPageDescribingRights; always link back to the corresponding wiki page if you use images in your DBpedia based applications
- experimental (and still buggy) alternative DBpedia class hierarchy system:
- close to the Wikipedia category system but with several filters applied to it (categories which are bad candidates for OWL classes are to some degree filtered out, circles in the hierarchy are removed, administrative categories removed, etc.)
- improvements in extraction code:
- package structure in extraction code improved
- new Global Extractor Interface for non-article dependent extractions
- URI Exception for erroneous URIs
- new Linked Data Sets available:
- Links to Cyc
- Links to the flickr wrappr
- Links to Wikicompany
- Bugs fixed:
- #1818011: Labels for resources with colon character
- #1793163: HTML linebreaks are lost
- #1829160: Incorrect assignment of pages to categories
- #1819301: Missing plural redirects
- #1814938: Duplicates in pagelinks
- #1797810: Persondata dump should be labeled as German
- #1813011: Extra label in category wiki links
- #1871653: Too long URIs by infoboxes extractor cause import problems
- #1817019: Incorrect capitalization for XML Schema Datatypes
- #1730445: DBpedia browser page title = "テレビプロデューサー"
- #1724322: rudi völler – 404 links
- #1722279: Language code within Chinese Abstracts
- URIs with leading digit escaped by _
- Person Data Extractor: wrong date format (leading 0)
- Triples with over-sized erroneous URIs will not be extracted
- ... and many more ...
- Feature Requests incorporated:
- Extraction from Disambiguation Pages
- Extraction from Redirect Pages
- #1860862 Ordering of given- and surname in Personendaten Extractor
DBpedia 2.0 (09/2007)
- Improved the Data Quality
- Third Classification Schema Added: concepts are now also classified by associating them to WordNet synsets
- Geo-Coordinates: data set contains geo-coordinates for geographic locations using the W3C Basic Geo Vocabulary
- RDF Links to other Open Data Sets: The data set now contains 440,000 external RDF links into the
- Geonames,
- Musicbrainz,
- WordNet,
- World Factbook,
- Eurostat,
- Book Mashup,
- DBLP Bibliography, and
- Project Gutenberg data sets.
DBpedia 1.0 (03/2007)
Initial Release of the DBpedia Data Sets, including:
- better short abstracts (stuff like unnecessary brackets has been removed from the abstracts)
- new extended abstracts for each concept (up to 3000 characters long)
- abstracts in 10 languages (German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, and Chinese)
- 2.8 million new links to external Web pages
- Cleaner infobox data
- 10,000 additional RDF links into the Geonames database.
- 9000 new RDF links between books in DBpedia and data about them provided by the RDF Book Mashup
- 200 RDF links between computer scientists in DBpedia and their publications in the DBLP database
- New classification information for geographic places using DBpedia terms and Geonames feature codes
Last Modification: 2012-08-06 08:20:58 by Christopher Sahnwaldt