A Simple Guide for Data Publishing Infrastructure Providers

As an infrastructure provider, facilitating data sharing is vital for enhancing research accessibility and collaboration within the scientific community. This guide outlines essential recommendations and considerations for improving data publishing practices in your repositories. It takes into consideration that most researchers publish their research data alongside an article published in an academic journal.

This guide is based on our standards for data publishing for infrastructure providers. You can view the full list of standards at the end of this article.

[Figure: a flowchart for data publishing for infrastructure providers]

1. Metadata Should Be Part of a Dataset

Research data repositories should include standardised, machine-readable metadata in datasets downloaded by researchers and exchanged with other resources. Generic and technical metadata are associated with the dataset during its upload process, while domain-specific repositories extract metadata from analytical data files provided by researchers along their lab workflows (e.g., Chemotion ELN). Once data is retrieved, this metadata should remain intact in the downloaded package, with descriptive DataCite metadata as a minimum requirement. One solution to ensure reliable file transfer is to use BagIt, which enables inclusion of metadata in downloaded datasets.
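To illustrate how metadata can travel inside the downloaded package, here is a minimal sketch that writes a BagIt 1.0 bag using only the Python standard library. The function names, the `metadata/datacite.xml` location, and the choice of SHA-256 are assumptions made for this example; a production repository would more likely rely on an established BagIt library and its own packaging profile.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(bag_dir: Path, manifest_name: str, files: list[Path]) -> None:
    """Write one '<checksum>  <path relative to the bag>' line per file."""
    lines = [
        f"{sha256_of(f)}  {f.relative_to(bag_dir).as_posix()}" for f in sorted(files)
    ]
    (bag_dir / manifest_name).write_text("\n".join(lines) + "\n", encoding="utf-8")


def make_bag(bag_dir: Path, payload: dict[str, bytes], datacite_xml: str) -> None:
    """Create a minimal bag: payload in data/, DataCite record as a tag file."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    for name, content in payload.items():
        (data_dir / name).write_bytes(content)

    # Bag declaration required by the BagIt specification.
    bagit_file = bag_dir / "bagit.txt"
    bagit_file.write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # Assumption: the descriptive metadata is stored as metadata/datacite.xml.
    meta_dir = bag_dir / "metadata"
    meta_dir.mkdir()
    datacite_file = meta_dir / "datacite.xml"
    datacite_file.write_text(datacite_xml, encoding="utf-8")

    # The payload manifest lists the data files; the tag manifest lists
    # the bag's own bookkeeping and metadata files.
    payload_files = [p for p in data_dir.rglob("*") if p.is_file()]
    write_manifest(bag_dir, "manifest-sha256.txt", payload_files)
    write_manifest(
        bag_dir,
        "tagmanifest-sha256.txt",
        [bagit_file, bag_dir / "manifest-sha256.txt", datacite_file],
    )
```

Because the DataCite record is listed in the tag manifest, a recipient can check that the metadata arrived unchanged together with the data files.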

2. Incorporate Structured Domain-Specific Metadata

Research data repositories should incorporate structured, domain-specific metadata in datasets downloaded by researchers and exchanged with other resources. Beyond generic schemes like DataCite's metadata schema, including Schema.org metadata through tools like RO-Crate, or combining RO-Crate with BagIt, enhances both the reliability and the richness of metadata. Because domain-specific metadata capture the context of a particular domain, they improve the relevance and accuracy of the data and, consequently, its reusability.
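As a rough illustration, the sketch below builds the content of an `ro-crate-metadata.json` file as a Python dictionary, following the general layout of RO-Crate 1.1 (a metadata descriptor, a root dataset, and file entities). The function name and its parameters are hypothetical. The chemistry entity in the usage note uses the Schema.org type `MolecularEntity`; whether that term resolves in the RO-Crate context, or needs to be added to `@context`, is an assumption that should be verified.

```python
import json


def build_ro_crate_metadata(name, description, date_published, license_url,
                            files, domain_entity=None):
    """Return a minimal RO-Crate 1.1 style JSON-LD document as a dict."""
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": f} for f in files],
    }
    graph = [
        {
            # Metadata descriptor that identifies the file as an RO-Crate.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        root,
    ]
    graph.extend({"@id": f, "@type": "File"} for f in files)
    if domain_entity is not None:
        # Domain-specific context is attached to the dataset by reference.
        root["about"] = {"@id": domain_entity["@id"]}
        graph.append(domain_entity)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}
```

A repository could call it like this and write the result next to the data files, for instance inside the `data/` directory of a BagIt bag:

```python
crate = build_ro_crate_metadata(
    name="NMR spectra of ethanol",
    description="Illustrative dataset description.",
    date_published="2024-01-01",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    files=["spectrum.jdx"],
    domain_entity={
        "@id": "#compound",
        "@type": "MolecularEntity",  # assumed Schema.org term, see above
        "name": "ethanol",
        "molecularFormula": "C2H6O",
    },
)
print(json.dumps(crate, indent=2))
```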

3. Inform Authors on Legal Issues on Dataset Abstracts

Repositories should inform authors about possible rights conflicts when the abstract of a related article is reused as the description of a dataset. Because of copyright concerns, a separate description written for the dataset is recommended instead.

4. Group Licenses According to Type

Licenses offered by research data repositories should be grouped into categories distinguishing between research data licenses and software licenses; this simplifies selection processes for users navigating lengthy lists.
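One way to picture the grouping is a small lookup that a license picker could render as two separate sections. The list of licenses, the category labels, and the function name are assumptions for this sketch; the identifiers follow the SPDX short-identifier style.

```python
from collections import defaultdict

# Hypothetical license catalogue of a repository.
LICENSES = [
    {"id": "CC0-1.0", "category": "Research data licenses"},
    {"id": "CC-BY-4.0", "category": "Research data licenses"},
    {"id": "CC-BY-SA-4.0", "category": "Research data licenses"},
    {"id": "MIT", "category": "Software licenses"},
    {"id": "Apache-2.0", "category": "Software licenses"},
]


def group_licenses(licenses):
    """Group license identifiers by category, keeping the catalogue order."""
    grouped = defaultdict(list)
    for entry in licenses:
        grouped[entry["category"]].append(entry["id"])
    return dict(grouped)


for category, ids in group_licenses(LICENSES).items():
    print(f"{category}: {', '.join(ids)}")
```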

5. Encourage Creative Commons Licenses

To streamline license selection, infrastructure providers should direct authors towards Creative Commons licenses when publishing research data. Licenses no more restrictive than CC BY are strongly recommended.

6. Promote Least Restrictive Creative Commons Licenses

To foster openness of shared datasets, repositories should suggest licenses such as CC0 or CC BY, for example by pre-selecting such a license, over more restrictive alternatives like CC BY-SA or CC BY-NC-ND, which may inhibit reuse. Shared works should be as open as possible, giving others maximum freedom to reuse them.
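A pre-selection rule of this kind can be sketched in a few lines. The ordering below simply encodes the ranking used in this guide (CC0 and CC BY before CC BY-SA and CC BY-NC-ND); the identifiers and the function name are assumptions for illustration.

```python
# Ordered from least to most restrictive, following the ranking in this guide.
RESTRICTIVENESS = ["CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-ND-4.0"]


def preselected_license(offered):
    """Return the least restrictive known license among those offered, or None."""
    known = [lic for lic in offered if lic in RESTRICTIVENESS]
    if not known:
        return None
    return min(known, key=RESTRICTIVENESS.index)
```

The author can still choose a different license; the rule only determines which option is selected when the form is first displayed.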

7. Creators, Authors, and Contributors

Researchers typically refer to individuals involved in the publication of results as authors, particularly when working with scientific publishers. However, DataCite differentiates between authors and contributors, with contributors being assigned specific roles. To minimize confusion for researchers publishing their data, repositories should assist by labeling the field for creators in their (DataCite) metadata editors as Authors/Creators.

8. Include Publisher Information in DataCite Metadata

Research data repositories should include their name as 'publisher', along with a 'publisherIdentifier', in each dataset's DataCite metadata. These fields should be populated automatically by the repository system and should not be editable by users.

Example:

Publisher: RADAR4Chem

publisherIdentifier: http://doi.org/10.17616/R31NJNAY

publisherIdentifierScheme: re3data

schemeURI: https://re3data.org/

In XML:

<publisher xml:lang="en" publisherIdentifier="http://doi.org/10.17616/R31NJNAY" publisherIdentifierScheme="re3data" schemeURI="https://re3data.org/">RADAR4Chem</publisher>

This ensures clear identification of where each dataset has been published, for both humans and machines.
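Since the publisher fields are meant to be filled in by the repository system rather than by users, they can be generated from a fixed configuration. The following sketch builds the element from the example above with Python's standard `xml.etree.ElementTree` module; the function name is an assumption, and a real system would add the element to a complete, namespaced DataCite record.

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the reserved "xml:" prefix (used for xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"


def publisher_element(name, identifier, scheme, scheme_uri, lang="en"):
    """Build a DataCite-style <publisher> element with identifier attributes."""
    element = ET.Element(
        "publisher",
        {
            f"{{{XML_NS}}}lang": lang,
            "publisherIdentifier": identifier,
            "publisherIdentifierScheme": scheme,
            "schemeURI": scheme_uri,
        },
    )
    element.text = name
    return element


publisher = publisher_element(
    "RADAR4Chem",
    "http://doi.org/10.17616/R31NJNAY",
    "re3data",
    "https://re3data.org/",
)
print(ET.tostring(publisher, encoding="unicode"))
```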

9. Provide Collection DOIs to Wrap Multiple Datasets

Research data repositories should provide a Collection DOI that wraps relevant research data objects associated with a single article intended for publication. While field-specific repositories may offer DOIs for individual chemical reactions or molecules, multidisciplinary repositories provide DOIs for entire published datasets. To facilitate manuscript submission processes, each repository should enable authors to generate a Collection DOI that encompasses pertinent data referenced in their data availability statements.
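One possible way to represent such a collection is a DataCite record of type Collection that points to its member datasets through `HasPart` relations. The sketch below uses attribute names in the style of DataCite's JSON metadata; the function names and the DOIs with the `10.1234` prefix are placeholders for illustration, not real identifiers.

```python
def collection_record(collection_doi, title, member_dois):
    """Describe a collection that wraps several dataset DOIs."""
    return {
        "doi": collection_doi,
        "titles": [{"title": title}],
        "types": {"resourceTypeGeneral": "Collection"},
        "relatedIdentifiers": [
            {
                "relatedIdentifier": member,
                "relatedIdentifierType": "DOI",
                "relationType": "HasPart",
            }
            for member in member_dois
        ],
    }


def data_availability_statement(record):
    """Produce a single sentence that cites only the Collection DOI."""
    return (
        "The data supporting this article are available at "
        f"https://doi.org/{record['doi']}."
    )


record = collection_record(
    "10.1234/collection-1",  # placeholder DOI
    "Data for: Example article",
    ["10.1234/reaction-1", "10.1234/molecule-7"],  # placeholder DOIs
)
print(data_availability_statement(record))
```

Authors then cite one identifier in the manuscript, while each reaction or molecule keeps its own DOI.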

10. Provide Access to Read-Only Datasets Under Review

To facilitate inclusion of datasets in manuscript review processes, research data repositories should provide access to datasets under review through an accessible URL while maintaining non-editable status until the review has been completed.

11. Utilize URLs for Reviewer Access

URLs for accessing datasets under review should have the access credentials encoded within them rather than requiring separate login information. This avoids potential errors in communication between reviewers and submission systems.
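One common way to encode credentials in a URL is a signed, expiring token. The sketch below derives the token with HMAC-SHA256 from a server-side secret; the URL layout (`/review/<dataset id>`), the parameter names, and the function names are assumptions for this example, not the scheme of any particular repository.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(secret: bytes, dataset_id: str, expires: int) -> str:
    """Sign the dataset id and expiry time with the repository's secret."""
    message = f"{dataset_id}:{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_review_url(base_url: str, dataset_id: str, secret: bytes, expires: int) -> str:
    """Return a reviewer URL that carries its own access token."""
    query = urlencode(
        {"expires": expires, "token": _signature(secret, dataset_id, expires)}
    )
    return f"{base_url}/review/{dataset_id}?{query}"


def is_valid_review_url(url: str, secret: bytes, now: int) -> bool:
    """Check the token and the expiry time (both as Unix timestamps)."""
    parts = urlsplit(url)
    dataset_id = parts.path.rsplit("/", 1)[-1]
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        token = params["token"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(secret, dataset_id, expires)
    # Constant-time comparison avoids leaking how much of the token matched.
    return now <= expires and hmac.compare_digest(
        token.encode("utf-8"), expected.encode("utf-8")
    )
```

The reviewer only needs the link from the submission system. The repository can let it expire once the review has been completed.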

12. Allow Metadata Corrections and Updates

Research data repositories should permit researchers to correct and update metadata, since errors can occur during initial entry. Versioning of data and metadata may enhance transparency regarding changes made over time while enriching overall dataset descriptions according to the FAIR principles (F2). This also allows authors to update datasets with the corresponding article DOIs in cases where the data has been published before article acceptance or publication.
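Metadata versioning can be as simple as an append-only list of snapshots. The class and field names in this sketch are hypothetical; it only illustrates that a correction adds a new version instead of overwriting the previous one.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MetadataHistory:
    """Append-only history of a dataset's metadata."""

    versions: list = field(default_factory=list)

    def commit(self, metadata: dict, note: str) -> int:
        """Store a snapshot and return its version number (starting at 1)."""
        self.versions.append(
            {
                "version": len(self.versions) + 1,
                "note": note,
                "metadata": copy.deepcopy(metadata),
            }
        )
        return len(self.versions)

    def current(self) -> dict:
        """Return an editable copy of the latest metadata."""
        return copy.deepcopy(self.versions[-1]["metadata"])


history = MetadataHistory()
history.commit({"title": "Sythesis of compound 7"}, "initial upload")

corrected = history.current()
corrected["title"] = "Synthesis of compound 7"
history.commit(corrected, "fixed typo in title")
```

Earlier versions stay available, so readers can see what changed and when.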

13. Provide Access to Metadata During Embargo Period

Datasets published under an embargo period should restrict access to the data while ensuring that the metadata remain accessible (publish with embargo). This practice keeps relevant information about the dataset retrievable, in accordance with the FAIR principles (A2), while granting authors first rights to their data.
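The access rule itself is small enough to state as code. This is a minimal sketch with a hypothetical function name, assuming the embargo is stored as an end date and that the data become available on that date.

```python
from datetime import date
from typing import Optional


def access_decision(embargo_end: Optional[date], today: date) -> dict:
    """Metadata are always accessible; data only once the embargo has ended."""
    under_embargo = embargo_end is not None and today < embargo_end
    return {"metadata": True, "data": not under_embargo}
```

For example, `access_decision(date(2025, 1, 1), date(2024, 6, 1))` grants access to the metadata only, while the same call with a date on or after the end of the embargo grants access to both.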

14. Contribute to and Use Scholix.org Framework

Research data repositories should contribute to and utilize Scholix.org, as it provides a framework for improving links between scientific literature and research data across various digital objects. Scholix hubs such as DataCite or OpenAIRE contribute valuable information on connected digital objects, allowing academic publishers to discover corresponding datasets even after articles have been published.
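In practice, a repository that registers its DOIs with DataCite would typically contribute such links by recording the article DOI as a related identifier in the dataset's metadata. The sketch below assumes this route and uses the DataCite relation type `IsSupplementTo`; the function names and the placeholder DOI are illustrative, and how a given hub harvests and exposes the link should be checked against its documentation.

```python
def article_link(article_doi: str) -> dict:
    """Express 'this dataset supplements that article' as a related identifier."""
    return {
        "relatedIdentifier": article_doi,
        "relatedIdentifierType": "DOI",
        "relationType": "IsSupplementTo",
    }


def add_article_link(metadata: dict, article_doi: str) -> dict:
    """Return a copy of the metadata with the article link added once."""
    updated = dict(metadata)
    links = list(metadata.get("relatedIdentifiers", []))
    link = article_link(article_doi)
    if link not in links:
        links.append(link)
    updated["relatedIdentifiers"] = links
    return updated


dataset = {"doi": "10.1234/dataset-1"}  # placeholder DOI
dataset = add_article_link(dataset, "10.1234/article-1")  # placeholder DOI
```

Adding the link after the article has appeared is a metadata update, so the dataset itself does not need to be republished.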

Resources and Further Reading

NFDI4Chem - Deliverable D3.3.1: Gap analysis report for selected repositories
CoreTrustSeal Requirements 2023-2025
COAR Community Framework for Good Practices in Repositories, Version

Standards

Research data repositories should include the metadata in datasets downloaded by researchers and exchanged with other resources.
Research data repositories should include structured, domain-specific metadata in datasets downloaded by researchers and exchanged with other resources.
Research data repositories should inform authors and data curators about possible rights conflicts for the abstract field in the dataset's DataCite metadata.
In research data repositories, licences should be grouped into research data licences and software licences.
Research data repositories should encourage researchers to choose a Creative Commons licence to simplify the landscape of licences and their choice.
Research data repositories should suggest licences such as CC0 or CC BY by pre-selecting such a licence rather than more restrictive licences such as CC BY-SA or even CC BY-NC-ND, which can inhibit reuse.
Research data repositories should label fields for creators in their (DataCite) metadata editor as Authors/Creators.
Research data repositories should provide the repository name as 'publisher' as well as a 'publisherIdentifier' in a dataset's DataCite metadata.
Research data repositories should provide a Collection DOI to wrap research data objects that are relevant to a single article that is to be published.
Research data repositories should provide access to datasets under review.
URLs to access datasets under review should have the access credentials encoded within the URL.
Research data repositories should allow researchers to correct and update the metadata of datasets.
Datasets published with an embargo period should have inaccessible data but accessible metadata.
Research data repositories should contribute to and use Scholix.org.

Main authors: ORCID:0000-0003-4480-8661, ORCID:0000-0002-6243-2840
