ChangeLog

Dataset overview table


Here is a list of available datasets, along with the number of languages they were generated for:


Dataset                               DBpedia 3.9    DBpedia 3.8    DBpedia 3.7
Mapping-based Types                   24             22             16
Mapping-based Properties              24             22             16
Mapping-based Properties (Specific)   24             22             16
Titles                                119            111            97
Short Abstracts                       119            111            97
Extended Abstracts                    119            111            97
Images                                10             8              6
Geographic Coordinates                119            111            97
Raw Infobox Properties                119            111            97
Raw Infobox Property Definitions      119            111            97
Homepages                             12             12             10
Persondata                            4              2              2
PND                                   3              3              2
Inter-Language Links                  119            111            4
Bijective Inter-Language Links        – (see below)  111            –
Non-bijective Inter-Language Links    – (see below)  111            –
Articles Categories                   119            111            1
Categories (Labels)                   119            111            1
Categories (Skos)                     119            111            1
External Links                        119            111            1
Links to Wikipedia Article            119            111            97
Wikipedia Pagelinks                   119            111            97
Redirects                             119            111            1
Transitive Redirects                  119            111            –
Disambiguation links                  13             12             9
IRI-same-as-URI links                 119            111            –
Page IDs                              119            111            1
Revision IDs                          119            111            1
Revision URIs                         119            111            –

DBpedia 3.9 (09/2013)

  • Core Framework:
    • moved source code repository from SourceForge to GitHub
    • bumped Scala version to 2.10.2
    • unified handling of zipped files
    • handle new MediaWiki page formats (Andrea Di Menna)
    • refined rules for URIs of sub-resources, e.g., for Wikipedia pages having multiple infoboxes
    • smarter CPU load handling to avoid crashes during abstract extraction
    • enable download through proxy with authentication (Julien Plu)
    • new URI policy that rejects URIs longer than 500 Unicode characters (Andrea Di Menna) – see the sketch at the end of this section
    • added mapping parameter: extract only first or last element of a list (Andrea Di Menna)
    • added compareTo(), equals(), hashCode(), canEqual() to class Quad
    • added class DeduplicatingDestination to drop duplicate triples (not yet used for the 3.9 release) – see the sketch at the end of this section
  • Data / Internationalization:
    • Wikipedia interlanguage links are now maintained on Wikidata. We generated the interlanguage links datasets from a Wikidata dump prepared by Daniel Kinzler, not from Wikipedia dumps. This also means that we do not have to split them into bijective and non-bijective links. We still extracted the few remaining old interlanguage links from Wikipedia dumps and offer the datasets on the download server, as they might be useful in special cases.
    • Mapping-based datasets are now also available for Indonesian and Japanese
    • New language chapter: Dutch
    • We enabled the following additional languages on the mappings wiki: Esperanto, Estonian, Indonesian, Slovak, Urdu, Chinese
    • links between DBpedia and Wikidata properties can now be added on the mappings wiki
    • renamed “infobox” dataset files to “raw_infobox” to avoid confusion
    • Arabic extraction/parsing configurations (Ahmed Ktob)
    • new specialized extractor for population list pages on French Wikipedia (Julien Plu)
    • improved decimal number parsing (Julien Cojan)
    • extraction was based on mappings, classes, and properties defined on the mappings wiki as of 2013-07-02
  • Bugfixes:
    • improved length parsing for foot / inch
    • mysql.sh now takes cleaner, less error-prone command-line arguments
  • Ontology:
    • since the DBpedia 3.8 release, the DBpedia community added 170 classes and 558 properties on the mappings wiki. The DBpedia 3.9 ontology encompasses 529 classes and 2217 properties (927 object properties, 1290 datatype properties using standard units, 116 datatype properties using specialized units)
  • Links to external datasets:
    • created new Freebase links based on Freebase dump from May 2013
    • created Wikidata links based on Wikidata dump from June 2013
    • created new YAGO links based on YAGO dumps from December 2012
    • created new GeoNames links based on GeoNames dumps from September 2013

DBpedia 3.8 (08/2012)

  • Core Framework:
    • Cleaned up Maven POM files
    • Some new Scala classes need at least JDK 7. The framework runs, but does not compile, with earlier versions. Since this is inconvenient for some users, we plan to deploy the DBpedia framework to the main Maven repository so that users no longer have to compile it themselves.
    • Cleaned up handling of URIs / IRIs
    • To save file system space, the framework can (optionally) compress DBpedia triple files while writing and decompress Wikipedia XML dump files while reading
    • Triple file format selection is more configurable and flexible
    • In addition to N-Triples and N-Quads, the framework can write triple files in Turtle and 'Turqle' formats
    • We download namespace names and redirect marker words for all Wikipedia editions semi-automatically and use them to improve parser precision
    • Users can download ontology and mappings from mappings wiki and store them in files to avoid downloading them for each extraction, which takes a lot of time and makes extraction results less reproducible
    • Preparation time and effort for abstract extraction is minimized; extraction time is reduced to a few milliseconds per page
    • Using some bit twiddling, we can now load all ~200 million inter-language links into a few GB of RAM and analyze them (see the sketch at the end of this section)
    • During extraction, we use IRIs with 'local' domains (e.g., http://en.dbpedia.org/resource/Boötes) and, if desired, convert them to 'generic' domain and/or URI syntax (e.g., http://dbpedia.org/resource/Bo%C3%B6tes) during serialization (see the sketch at the end of this section)
    • To allow extracting data from multiple Wikipedia dumps for a language, the extraction uses the same folder and file name structure as the Wikipedia dump files
    • Download code is more flexible, configurable, and intelligent. It semi-automatically checks for new dumps, and only downloads files that have changed.
    • Cleaner Extractor class hierarchy
    • Extraction steps that looked for links between different Wikipedia editions were replaced by more powerful post-processing scripts
    • Article count ranges (e.g., “10000–100000”) can now be used in all configuration files as an alternative to language lists (e.g., “en,de,fr”)
    • Introduced a small Scala DSL for simple but efficient parsing and transformation of XML, based on the Streaming API for XML.
    • Reduced use of Scala default arguments; they more often than not lead to errors of omission
    • Reduced use of Scala Option, which often leads to code bloat without adding benefits
    • Using more mutable collections and fewer wrapper classes improves performance and simplifies code
  • Data / Internationalization:
    • Mapping-based datasets are now also available for Arabic, Basque, Bengali, Bulgarian, Czech, and Korean
    • New language chapters: French, Italian, and Japanese.
    • As percent-encoding is strongly discouraged by the RDF specification, DBpedia's URI encoding avoids it as much as possible
    • We now use IRIs for all languages except English, which uses URIs for backwards compatibility
      • Turtle files (.ttl, .tql) always use IRIs, even for English
    • New dataset with sameAs links between URI and IRI versions of a resource to improve backwards compatibility and compatibility between chapters
    • new inter-language link datasets for all languages:
      • interlanguage_links contains all inter-language links from one Wikipedia edition to others
      • interlanguage_links_same_as contains bijective inter-language links, using the property owl:sameAs
      • interlanguage_links_see_also contains non-bijective inter-language links, using the property rdfs:seeAlso
    • Internationalized dataset files are located in the main directory for a language, not in a separate -i18n folder as in release 3.7, and their names are directly derived from the dataset name.
    • Canonicalized dataset files are also located in the main directory for a language, but their names are constructed by appending _en_uris to the dataset name.
    • In addition to N-Triples (.nt) and N-Quads (.nq) files, we also offer the datasets in Turtle (.ttl) and 'Turqle' (.tql) formats
    • New dataset with transitive redirects – multiple redirects have been resolved, redirect cycles have been removed (see the sketch at the end of this section)
    • We now resolve redirects in all datasets where the objects are also DBpedia resources
      • For cases in which the original, unredirected link target is needed, we also offer dataset files named '_unredirected'.
  • Bugfixes:
    • Fixed handling of URIs/IRIs
    • When extracting raw infobox properties, make sure that predicate URI can be used in RDF/XML by appending an underscore if necessary
    • Check that an extracted date is valid (e.g., February never has 30 days) and that its format is valid according to its XML Schema type, e.g., xsd:gYearMonth (see the sketch at the end of this section)
    • Remove HTML character entities in abstracts
    • Parser now accepts space as grouping separator in most numbers
    • Linksets: fix several URI and N-Triples encoding bugs
    • Freebase links: use correct syntax for Freebase RDF URIs
    • Fixed remaining N-Triples / Turtle encoding bugs
    • Fixed thread-safety bug in date and number format parsers that caused values to be mixed between extracted pages
    • Several Scala classes contained code to find the related classes (equivalent and base classes) of an ontology class. Moved this code to one class and fixed bugs.
    • Fixed bug in download code that led to broken files
    • page_ids and revision_ids datasets now use the DBpedia resource, not the Wikipedia page URL, as subject URI 
    • use foaf:isPrimaryTopicOf instead of foaf:page for link from DBpedia resource to Wikipedia page
    • In Java, the upper case of 'ß' is the two-character 'SS', but MediaWiki titles may start with 'ß'. The same applies to some ligatures. We no longer convert such characters to upper case if they appear as the first character in DBpedia URIs. (Some datasets were extracted before this bug was fixed.) See the sketch at the end of this section.
    • Redirects extracted from older Wikipedia dumps were checked into the source code repository. This led to subtle and surprising bugs, so we removed them. Redirects are now extracted on demand and then stored in the extraction directory.
    • Decode HTML-encoding and URI-encoding in Wikipedia link targets
  • Ontology:
    • Since the DBpedia 3.7 release, the DBpedia community added 40 classes and 132 properties on the mappings wiki. The DBpedia 3.8 ontology encompasses 359 classes and 1775 properties (800 object properties, 859 datatype properties using standard units, 116 datatype properties using specialized units)
  • Mappings:
    • Mappings for several new languages: Arabic, Basque, Bengali, Bulgarian, Czech, Hindi, Japanese, Korean. (Mapping-based datasets are not available for Hindi and Japanese because when the 3.8 extraction ran, there weren't any mappings yet for these languages.)
    • Major improvements in server responsiveness by keeping more data in memory, especially for statistics
    • Users can specify revision ID, as well as page title, when extracting samples from Wikipedia pages
    • Radically reduced effort required to add a mapping namespace for a new language by removing duplicated code in several places
    • New mappings wiki syntax for labels and comments to make adding a mapping namespace for a new language easier
  • Links to external datasets:
    • Created new Freebase links based on Freebase dump from June 2012

DBpedia 3.7 (08/2011)

  • Framework:
    • Redirects are resolved in a post-processing step, which increased inter-connectivity by 13% (applied to the English datasets)
    • Extractor configuration using the dependency injection principle
    • Simple threaded loading of mappings in server
    • Improved international language parsing support thanks to the members of the Internationalization Committee
  • Bugfixes:
    • Encode homepage URLs to conform with N-Triples spec
    • Correct reference parsing
    • Recognize MediaWiki parser functions
    • Raw infobox extraction produces more object properties again
    • skos:related for category links starting with “:” and having an anchor text
    • Restrict objects to Main namespace in Mapping Extractor
    • Double rounding (e.g., a person's height should not be 1800.00000001 cm) – see the sketch at the end of this section
    • Start position in abstract extractor
    • Server can handle template names containing a slash
    • Encoding issues in YAGO dump
  • Ontology :
    • owl:equivalentClass and owl:equivalentProperty mappings to http://schema.org
    • Note that the ontology is now a directed acyclic graph. Classes can have multiple superclasses, which was important for the mappings to schema.org. A taxonomy can still be constructed by ignoring all superclasses except the one that is specified first in the list, which is considered the most important (see the sketch at the end of this section).
  • Mappings:
  • Links to external datasets:
    • New links: UMBEL, EUNIS, LinkedMDB, GeoSpecies
    • Updates for: Freebase, WordNet, OpenCyc, New York Times, DrugBank, Diseasome, flickr wrappr, SIDER, World Factbook, DBLP, Eurostat, DailyMed, Revyu

DBpedia 3.6 (01/2011)


DBpedia 3.5.1 (04/2010)

DBpedia 3.5 (04/2010)

  • The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are defined as part of the extraction mappings.
  • The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned.
  • In order to enable the DBpedia user community to extend and refine the infobox to ontology mappings, the mappings can be edited on the newly created wiki hosted on http://mappings.dbpedia.org/.
    • At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates.
    • On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, it contains 259 classes and about 1,200 properties.
  • The ontology properties extracted from the infoboxes are now split into 2 data sets. (For details, see the Datasets page.)
    • The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g., length). The range of a property is either an XML Schema type or a dimension of measurement, in which case the value is normalized to the corresponding SI unit (see the sketch at the end of this section).
    • The Ontology Infobox Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit (e.g., the property height is specialized on the class Person using the unit centimetres instead of metres).
  • The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping.
  • In addition to the new mapping based extractor, three new extractors have been implemented:
    • Page ID Extractor, extracting the Wikipedia page ID of each page
    • Revision Extractor, extracting the latest revision of each page
    • PND Extractor, extracting PND (Personennamendatei) identifiers
  • The data set now provides labels, abstracts, page links, and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
  • In addition to the N-Triples datasets, N-Quads datasets are provided which include a provenance URI for each statement. The provenance URI denotes the Wikipedia origin of the extracted triple. (For details, see the Datasets page.) A schematic example follows at the end of this section.
  • The URI for extended abstracts was changed from http://dbpedia.org/property/abstract to http://dbpedia.org/ontology/abstract.

DBpedia 3.4 (11/2009)


  • update to Wikipedia dumps generated September 24–29, 2009
  • number of ontology instances (mapping-based extraction) rose to 1.17 million, the overall number of instances to 2.9 million.
  • labels, abstracts, generic template properties etc. are available in 91 different languages now (see http://downloads.dbpedia.org/3.4/)
  • introduced strict and loose infobox ontology in order to meet different requirements of client applications (see http://wiki.dbpedia.org/Datasets#h18-11)
  • As Wikipedia has moved to dual-licensing, we also dual-license DBpedia. DBpedia 3.4 is thus licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License
  • also use templates that are redirects to known templates
  • improved template to ontology mapping
  • normalize Wikipedia units of measurement to ontology units of measurement
  • ontology uses owl:Thing instead of Resource
  • faster and more accurate abstract extraction, less redundant code
  • store Wikipedia dump timestamps
  • fixed some problems in importing Wikipedia dump into MySQL
  • abstract extraction now needs fewer patches in the Wikipedia installation
  • abstract extraction now needs one Wikipedia installation, not one per language
  • fixed redirect extractor
  • fixed some encoding problems
  • made extraction more resilient to MySQL problems
  • bugfix in link extractor: don't extract empty links
  • converted mapping configuration from Excel to CSV files
  • fixed some bugs in mapping.csv
  • added mappings for more templates and classes (cars, space stuff, awards, species, ...)
  • re-implemented SQL iterator classes to make them more flexible while minimizing code redundancy
  • handle 'typed' templates like {{Geobox|River}}
  • detect redirect and disambig pages using magic words for each language
  • fixed our use of foaf:img / foaf:depiction
  • generate xsd:gYear, xsd:gMonth, ... values

DBpedia 3.3 (07/2009)


  • update to Wikipedia dumps generated in May 2009
  • more accurate abstract extraction
  • labels and abstracts in 80 languages (see http://downloads.dbpedia.org/3.3/)
  • infobox extraction bugfixes
  • new links to Dailymed, Diseasome, Drugbank, Sider, TCM 
  • updated Open Cyc links

DBpedia 3.2 (11/2008)


  • update to Wikipedia dumps generated in October 2008
  • new DBpedia Ontology introduced.
  • new infobox extraction framework to populate the ontology.
  • added initial infobox to ontology mappings.
  • datatype extraction code improved.
  • abstract extraction code improved.
  • new Freebase to DBpedia RDF links.
  • updated Open Cyc to DBpedia RDF links.

DBpedia 3.1 (08/2008)

  • update to Wikipedia dumps generated in June 2008
  • YAGO mapping improvements:
    • YAGO itself improved its algorithms
    • the DBpedia-YAGO mapping is now generated by a YAGO converter and should have much better quality compared to the previous release
  • Geo Extractor improvements:
  • Bugs fixed:
    • #1871653: Too long URIs by infoboxes extractor cause import problems
    • #1964434: illegal \ char in URIs
    • #1970387: Filter out references
    • #1964632: Illegal URIs
    • #1947512: timespan extraction
    • Fixed unwanted MySQL connection pooling and corrected database names for infobox and image extractors
    • Fixed internal encoding of international page IDs

DBpedia 3.0 (02/2008)

DBpedia 3.0 comes with the following changes (includes those changes between DBpedia 2.0 and DBpedia 3.0RC):

  • multi-language improvements: extractors now applied to up to 14 different languages (not all extractors work on all languages)
  • redirects data set available
  • image copyright issues:
    • the image extractor tries not to extract non-free images anymore (however, we cannot guarantee that it will not still happen)
    • most of the extracted image URLs now contain an additional triple: $image dc:rights $wikiPageDescribingRights; always link back to the corresponding wiki page if you use images in your DBpedia-based applications
  • experimental (and still buggy) alternative DBpedia class hierarchy system:
    • close to the Wikipedia category system, but with several filters applied to it (categories which are bad candidates for OWL classes are to some degree filtered out, cycles in the hierarchy are removed, administrative categories are removed, etc.)
  • improvements in extraction code:
    • package structure in extraction code improved
    • new Global Extractor Interface for non-article dependent extractions
    • URI Exception for erroneous URIs
  • new Linked Data Sets available:
    • Links to Cyc
    • Links to the flickr wrappr
    • Links to Wikicompany
  • Bugs fixed:
    • #1818011: Labels for resources with colon character
    • #1793163: HTML linebreaks are lost
    • #1829160: Incorrect assignment of pages to categories
    • #1819301: Missing plural redirects
    • #1814938: Duplicates in pagelinks
    • #1797810: Persondata dump should be labeled as German
    • #1813011: Extra label in category wiki links
    • #1871653: Too long URIs by infoboxes extractor cause import problems
    • #1817019: Incorrect capitalization for XML Schema Datatypes
    • #1730445: DBpedia browser page title = "テレビプロデューサー"
    • #1724322: rudi völler – 404 links
    • #1722279: Language code within Chinese Abstracts
    • URIs with leading digit escaped by _
    • Person Data Extractor: wrong date format (leading 0)
    • Triples with over-sized erroneous URIs will not be extracted
    • Incorrect assignment of pages to categories
    • ... and many more ...
  • Feature Requests incorporated:
    • Extraction from Disambiguation Pages
    • Extraction from Redirect Pages
    • #1860862: Ordering of given name and surname in Personendaten Extractor

DBpedia 2.0 (09/2007)

  • Improved the Data Quality
  • Third Classification Schema Added: concepts are now also classified by associating them with WordNet synsets
  • Geo-Coordinates: data set contains geo-coordinates for geographic locations using the W3C Basic Geo Vocabulary
  • RDF Links to other Open Data Sets: The data set now contains 440,000 external RDF links into the 
    • Geonames,
    • MusicBrainz,
    • WordNet,
    • World Factbook,
    • Eurostat,
    • Book Mashup,
    • DBLP Bibliography, and 
    • Project Gutenberg data sets.

DBpedia 1.0 (03/2007)

Initial Release of the DBpedia Data Sets, including:


  • better short abstracts (stuff like unnecessary brackets has been removed from the abstracts)
  • new extended abstracts for each concept (up to 3000 characters long)
  • abstracts in 10 languages (German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese, and Chinese)
  • 2.8 million new links to external Web pages
  • Cleaner infobox data
  • 10,000 additional RDF links into the Geonames database.
  • 9000 new RDF links between books in DBpedia and data about them provided by the RDF Book Mashup
  • 200 RDF links between computer scientists in DBpedia and their publications in the DBLP database
  • New classification information for geographic places using DBpedia terms and Geonames feature codes

 
