Data Management Plan extension of DataID

January 26, 2016 by Markus Freudenberg

In recent years, Data Management Plans (DMPs) have become a required part of research project proposals at most major research funding institutions. A DMP outlines how data is handled both during a research project and after the project is completed. Defining a suitable metadata format for (digital) datasets is an important requirement of a DMP. DataID offers a comprehensive metadata format to describe datasets and all their manifestations.

This use case introduces an extension to the DataID ontology that describes a Data Management Plan for digital data extensively and in a universal way. Our goal is to aid researchers in drafting and implementing a DMP during the proposal phase, while the project is ongoing, and during the long-term implementation of the DMP afterwards. Funding organisations are granted special rights with regard to the DMP metadata and receive updates on the status of its implementation. All parties will have access to compact documentation automatically generated from the DataID metadata, which is kept up to date by the DataID versioning system.

The ongoing ontology development of this extension can be followed here: GitHubPage. You can also raise your issues with the ontology there.

Status, Limitation and Adoption

The DataID project is actively maintained and developed. For the Data Management Plan subproject of DataID (DataID-DMP) we have implemented a proof-of-concept version for H2020, which is working and is used by several EU projects. If you are interested in an extension of data management plans for other funding agencies, please contact M. Freudenberg.

 

Guidelines and Checklists

We distilled a shortlist of requirements of a DMP using the following guidelines and checklists of different organisations:

Requirements of a Data Management Plan

The following requirements were distilled from an extensive list of DMP guidelines and checklists of different research funding bodies (e.g. H2020, NSF etc.) and cover most of the demands raised for a DMP pertaining to digital datasets. Our approach to meeting these requirements is detailed later in this document.

  1. Describe how data will be shared, including repositories and access procedures and embargo periods (if any).

  2. Describe the procedures that will be put in place for long-term preservation of the data.

  3. Describe the types of data and metadata, as well as identifiers used.

  4. Provide copyright and license information, including other possible limitations to the reusability of the data.

  5. Outline the rights and obligations of all parties as to their roles and responsibilities in the management and retention of research data.

  6. Provision for changes in the hierarchy of involved agents and responsibilities (e.g. a Principal Investigator (PI) leaving the project).

  7. Provide progress reports in regard to the implementation of the DMP.

  8. Include provenance information on how datasets were used, collected or generated in the course of the project. Reference standards and methods applied.

  9. Include statements on the usefulness of data for the wider public needs or possible exploitations for the likely purposes of certain parties.

  10. Provide assistance for dissemination purposes of (open) data, making it easy to discover it on the web.

  11. Is the metadata interoperable allowing data exchange between different meta data formats, researchers and organisations?

  12. Project costs associated with implementing the DMP during and after the project. Justify the prognosticated costs.

  13. Support the data management life cycle for all data that will be collected, processed or generated.

 

re3data.org

The listed requirements make it obvious that a detailed description of the repositories that retain the described data will be central to the DMP metadata description. For this reason we initiated a cooperation with re3data.org to make use of its thorough schema for repositories.

The re3data registry currently lists over 1,500 research repositories, making it the largest and most comprehensive registry of data repositories available on the web. By providing a detailed metadata description of repositories, the registry helps researchers, funding bodies, publishers and research organisations to find an appropriate data repository for different purposes. Initiated by multiple German research organisations and funded by the German Research Foundation from 2012 until 2015, re3data is now a service of DataCite. In 2014 re3data merged with the DataBib registry for research data repositories into one service.

To include the XML-based schema of re3data in our DataID-DMP extension we had to transform it into an ontology, introducing a semantic interface with the DCAT vocabulary. This offers the necessary means to link back to the dataset descriptions of the datasets stored inside a repository.

DMP-DataID Ontology

The first draft of the DMP Ontology is available from http://dataid.dbpedia.org/ns/dmp#

DataID core, the Activities & Plans extension and the re3data ontology are the foundational components of the DMP-DataID extension. On top of these we added further semantics to meet the requirements listed above.


 

Implementation and Resolution of Requirements

Each requirement raised in the context of DataID-DMP is listed below, followed by our solution.

Describe how data will be shared, including repositories and access procedures and embargo periods (if any).

A description of the repositories involved in a DMP is provided by the concept r3d:Repository, including exact documentation of APIs and access procedures. More detailed information on the type of data, or on additional software necessary to access the data, is captured with dataid:Distribution (see also the next requirement).

Describe the procedures that will be put in place for long-term preservation of the data.

Multiple dmp:PreservationPlan entities can describe different approaches to the preservation of different datasets or provide temporal scaling (e.g. regarding embargo periods). Besides textual statements about general goals and provisions for security and backup, the dataid-acp:planned property points to the specific tasks put in place to preserve data in the long term. This is one of the more notable pieces of provenance information provided by a DMP.

Describe the types of data and metadata, as well as identifiers used.

DataID itself is a rich metadata format for describing any kind of data in RDF, providing the possibility of linking the metadata description with any other resource on the web. In addition to the unique URI of a DataID, any additional identifier can be added to the description. This requirement is therefore intrinsic to DataID.

Provide copyright and license information, including other possible limitations to the reusability of the data.

As in DataID core, information about licenses and other limitations is provided via dct:license and dct:rights, or via the complementary properties of the re3data ontology concerning access and other policies. The license and policy vocabularies of ODRL and METASHARE are expressive enough to state any limitation that applies.

Outline the rights and obligations of all parties as to their roles and responsibilities in the management and retention of research data.

Modeling a DMP with DataID allows very detailed descriptions of the Agents (persons, organisations etc.) involved, their role with regard to a dataset, and the rights and responsibilities their roles entail. Additional roles, rights, responsibilities and events are added with the DMP ontology to accommodate the unique use case of a DMP. A hierarchical structure of AgentRoles and Responsibilities provides additional semantics. (For more on the subject of Agents and Responsibilities, see the DataID documentation.)

Provision for changes in the hierarchy of involved agents and responsibilities (e.g. a Principal Investigator (PI) leaving the project).

Contingency plans are addressed by a hierarchy of agent roles, by time restrictions on the Authorization an agent holds (if necessary) and by direct substitute regulations.
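The interplay of role hierarchy, time-restricted authorizations and substitutes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Authorization class, its fields (rank, valid_from, valid_until, substitute) and the function responsible_agent are hypothetical names chosen for this sketch and are not terms of the DataID or DMP ontologies.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Authorization:
    """Hypothetical stand-in for an authorization held by an agent."""
    agent: str
    role: str
    rank: int                           # assumption: lower number = higher in the role hierarchy
    valid_from: date
    valid_until: Optional[date] = None  # None means no time restriction
    substitute: Optional[str] = None    # agent who takes over once the authorization has ended

    def is_active(self, on: date) -> bool:
        if on < self.valid_from:
            return False
        return self.valid_until is None or on <= self.valid_until

    def has_expired(self, on: date) -> bool:
        return self.valid_until is not None and on > self.valid_until


def responsible_agent(auths: List[Authorization], on: date) -> Optional[str]:
    """Return the agent in charge on a given day, or None if there is none.

    An active authorization contributes its own agent; an expired one
    contributes its named substitute (if any) at the same rank. The
    candidate with the highest position in the hierarchy wins.
    """
    candidates = []
    for a in auths:
        if a.is_active(on):
            candidates.append((a.rank, a.agent))
        elif a.has_expired(on) and a.substitute is not None:
            candidates.append((a.rank, a.substitute))
    if not candidates:
        return None
    return min(candidates)[1]
```

With a PI whose authorization ends on a fixed date and who names a substitute, the function returns the PI before that date and the substitute afterwards. Without a named substitute, responsibility falls to the next active agent in the hierarchy.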

Provide progress reports in regard to the implementation of the DMP.

The integrated versioning system of DataID keeps track of changes to the datasets, as well as the meta data. In the case of DMP-DataID it documents changes to the DMP (preservation, repositories etc.) as well. An overview of these changes and the ongoing implementation will be provided to granter and grantee.
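As a rough illustration of how such an overview could be derived, the sketch below compares two versions of a DMP description, each modelled as a plain set of (subject, predicate, object) triples. It does not show the actual DataID versioning system: the function progress_report, the ex: identifiers and the property ex:hasPreservationPlan are assumptions made for this example. Only dataid-acp:planned is taken from this document.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def progress_report(old: Set[Triple], new: Set[Triple]) -> Dict[str, List[Triple]]:
    """Summarise what changed between two versions of the DMP metadata."""
    return {
        "added": sorted(new - old),    # statements present only in the newer version
        "removed": sorted(old - new),  # statements dropped since the older version
    }


# Two hypothetical versions of the same DMP description.
v1 = {
    ("ex:dmp", "ex:hasPreservationPlan", "ex:plan1"),
    ("ex:plan1", "dataid-acp:planned", "ex:task-nightly-backup"),
}
v2 = v1 | {
    ("ex:plan1", "dataid-acp:planned", "ex:task-archive-copy"),
}

report = progress_report(v1, v2)
for change, triples in report.items():
    for triple in triples:
        print(change, *triple)
```

A plain set difference is enough here because every statement is a ground triple. Descriptions containing blank nodes would need a more careful comparison.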

Include provenance information on how datasets were used, collected or generated in the course of the project. Reference standards and methods applied.

Extensive documentation of provenance is ensured by the use of the PROV-O vocabulary and of the concepts and properties introduced by the Activities & Plans extension of DataID, which provide the means for describing the sources and origin activities of datasets (e.g. how a dataset was created, who was involved and in what capacity, which tools were instrumental to that process etc.). Standards that were observed in these processes can be linked to via dct:conformsTo.

Include statements on the usefulness of data for the wider public needs or possible exploitations for the likely purposes of certain parties.

Helpful information on usefulness, reusability and other subjects for possible users of the portrayed datasets is added to the dataid:Dataset concept: dmp:usefulness, dmp:reuseAndIntegration, dmp:exploitation etc.
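To give an impression of what such statements look like as RDF, the following sketch writes a few of them as N-Triples using only the Python standard library. The dmp: namespace is the one given in this document and dct: is the Dublin Core terms namespace. The dataset IRI, the literal texts and the assumption that these dmp: properties take plain string values are placeholders for illustration only.

```python
from typing import Iterable, Tuple

DMP = "http://dataid.dbpedia.org/ns/dmp#"  # namespace given in this document
DCT = "http://purl.org/dc/terms/"          # Dublin Core terms

# (subject IRI, predicate IRI, object, object_is_iri)
Statement = Tuple[str, str, str, bool]


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(statements: Iterable[Statement]) -> str:
    """Serialise statements as N-Triples, one statement per line."""
    lines = []
    for s, p, o, is_iri in statements:
        obj = f"<{o}>" if is_iri else f'"{escape_literal(o)}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


dataset = "http://example.org/dataset/survey-2016"  # placeholder IRI
statements = [
    (dataset, DMP + "usefulness", "Placeholder: who benefits from this data.", False),
    (dataset, DMP + "exploitation", "Placeholder: foreseen exploitation.", False),
    (dataset, DCT + "license", "http://creativecommons.org/licenses/by/4.0/", True),
]
print(to_ntriples(statements))
```

In practice an RDF library would normally handle serialisation. The hand-written serialiser only keeps the sketch free of dependencies.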

Provide assistance for dissemination purposes of (open) data, making it easy to discover it on the web.

The dmp:DataManagementPlan concept provides the most general level of textual statements about the DMP itself or the planned dissemination process, as well as the necessary references to related projects. By using DataID, publishers have the ability to present their data directly on the DataID portal, thereby making the metadata visible to search engine crawlers. Optionally, all dataset metadata can be pushed directly to datahub.io as an additional representation on a well-known web site used by many researchers for publishing datasets.

Is the metadata interoperable allowing data exchange between different meta data formats, researchers and organisations?

By modularizing the DataID ontology we solved the conundrum posed by the seemingly contradictory aspects of EXTENSIBILITY and INTEROPERABILITY of metadata schemas, which arises from the ever-expanding requirements placed on a metadata schema when describing data for different domains and usages (see paper). Since DataID is represented in RDF and is therefore part of the Linked Data Web, interchange between different DataIDs and the data they represent is a foundational principle of the whole DataID endeavour.

Project costs associated with implementing the DMP during and after the project. Justify the prognosticated costs.

The concept dmp:BudgetItem is an optional tool to list costs pertaining to activities, responsibilities (and consequently the costs of agents) and any entity involved in a plan such as dmp:PreservationPlan. Together with dmp:approxCost and dmp:justification it satisfies this requirement.
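A minimal sketch of how budget items might be aggregated and checked is shown below. The Python class mirrors dmp:BudgetItem, dmp:approxCost and dmp:justification only loosely: the field names, the use of a single unnamed currency and the helper functions total_cost and unjustified are assumptions made for this illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Iterable, List


@dataclass(frozen=True)
class BudgetItem:
    """Illustrative mirror of dmp:BudgetItem; field names are assumptions."""
    subject: str          # activity, responsibility or plan the cost pertains to
    approx_cost: Decimal  # loosely corresponds to dmp:approxCost (one currency assumed)
    justification: str    # loosely corresponds to dmp:justification


def total_cost(items: Iterable[BudgetItem]) -> Decimal:
    """Sum the approximate costs of all budget items."""
    return sum((item.approx_cost for item in items), Decimal("0"))


def unjustified(items: Iterable[BudgetItem]) -> List[str]:
    """Return the subjects of all items that lack a justification."""
    return [item.subject for item in items if not item.justification.strip()]


items = [
    BudgetItem("long-term storage", Decimal("1200.00"), "Repository hosting fee after project end."),
    BudgetItem("metadata curation", Decimal("300.50"), ""),
]
print(total_cost(items))
print(unjustified(items))
```

Decimal is used instead of float so that summed amounts do not pick up binary rounding errors.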

Support the data management life cycle for all data that will be collected, processed or generated.

The data management life cycle is fully supported by the DataID life cycle (keywords: versioning, validation, representation, exploitation etc.; see here).
 

 

DMP Deliverable auto-generation

A prototype DMP auto-generation toolset is already in place and is applied in the H2020 ALIGNED project. In ALIGNED we use auto-generated documentation directly in the Data Management Plan deliverables as well as on the website.

(Please note: the DMP extract of ALIGNED makes use of an early concept version of the DMP ontology, which has been extended significantly since.)

TODO add links to the deliverable

TODO add links to github https://github.com/dbpedia/dataid/tree/master/dmp (ideally two links, one to the dev version and a release tag)

TODO create and add a minimal example

 

Additional Benefits

Since every DMP-DataID is at its core also a regular DataID, every benefit derived from using DataID extends to this use case. By making use of the existing DataID service stack and its API, DMP proposals will undergo the same validation process as normal DataIDs, thereby guaranteeing compliance with the standards defined by this ontology. Drafting and updating of DMP-DataIDs will be supported by the DataID-Generator web site and the integrated versioning system. A DMP overview page produces a succinct summary of the current state of a DMP (its plans and the state of implementation) and will be visible to granter and grantee alike. Additional AgentRoles, Actions and custom DMP event types provide active messaging for Data Management Plan life cycle events (please refer to the DataID documentation for more details on Agents, Actions and Events).

 

Example: A chosen repository of a PreservationPlan is not accessible for a significant amount of time:

This triggers an urgent request to the maintainers of the affected datasets to supply an alternative download repository or to provide a temporary justification for the problem. Non-compliance with this request would consequently be regarded as a violation of the DMP and, as such, reported to the granter of the research project.
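The escalation logic of this example can be sketched as a small decision function. It is not the DataID event implementation: the names Action and outage_action are hypothetical, and the two thresholds (how long an outage counts as "significant" and how long maintainers have to respond) are assumed values, since this document does not define them.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "repository reachable recently, nothing to do"
    REQUEST_ALTERNATIVE = "ask maintainers for an alternative repository or a justification"
    WAIT = "request sent, waiting for the maintainers"
    REPORT_VIOLATION = "report DMP violation to the granter"


def outage_action(
    last_reachable: datetime,
    now: datetime,
    request_sent_at: Optional[datetime] = None,
    max_outage: timedelta = timedelta(days=7),          # assumed threshold
    response_deadline: timedelta = timedelta(days=14),  # assumed threshold
) -> Action:
    """Decide what to do about a repository that may be unreachable."""
    if now - last_reachable <= max_outage:
        return Action.NONE
    if request_sent_at is None:
        return Action.REQUEST_ALTERNATIVE
    if now - request_sent_at > response_deadline:
        return Action.REPORT_VIOLATION
    return Action.WAIT
```

A monitoring job could call this function for every repository referenced by a PreservationPlan and turn the returned action into the corresponding message to maintainers or granter.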

 

Contribution and road ahead

We are interested in any additional input regarding DMP documentation requirements and implementation ideas.

Please contact us via the DataID mailing list:
Subscribe | mailto:dbpedia-dataid@lists.sourceforge.net

After completing the draft phase of the DMP ontology we are going to extend the DataID Service and Generator page to include the DMP use case. This will allow researchers to state their DMP for digital data in detail and produce sufficient summaries (e.g. in PDF format) to supplement their project documentation.

By making extensive use of the PROV-O vocabulary we provide information about which activities generated, collected or processed a dataset. PROV-O also enables authors to describe the standards used in these processes, which documents are involved (e.g. publications describing the procedure), which software or tool was instrumental, and other useful information. While this is already possible with the DataID vocabulary itself, additional restrictions shall compel grantees to supply these important details. The class 'PreservationPlan' outlines the plans of a project with regard to the preservation and archiving of datasets during the project and after it has ended. Repositories provide vital information on how, where and for what period of time datasets are stored, as well as details on access restrictions, API summaries and much more. Additional properties about licensing, usefulness, exploitation possibilities and general statements about the openness of datasets (amongst others) provide the context needed in DMP documents. A detailed list of all classes and properties, and of their role in the overall DMP, will be provided when the ontology leaves the draft stage.

 

URL to the DBpedia Use Case: 

http://wiki.dbpedia.org/use-cases/data-management-plan-extension-dataid