For Infrastructure Providers
Metadata should be part of a dataset
Research data repositories should include the metadata in datasets downloaded by researchers and exchanged with other resources.
Generic and technical metadata is attached to the dataset during the upload process. For generic, multidisciplinary repositories, researchers provide additional metadata via a metadata editor. For field-specific repositories, the metadata is extracted from the analytical data files and provided by researchers along their lab workflows, as is the case for Chemotion ELN. When data is retrieved from a repository, this metadata should not be lost but should be included in the downloaded package. At a minimum, the descriptive DataCite metadata should be included.
BagIt, a set of hierarchical file system conventions, is one solution that enables reliable file transfer and allows metadata to be included in downloaded datasets, as is already the case for RADAR.
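As a minimal sketch of the idea, the following Python function writes a BagIt 1.0 bag with the payload under data/ and the DataCite record as a tag file. The function name, the flat payload names, and the tag file location metadata/datacite.xml are assumptions made for illustration; the layout actually used by RADAR may differ.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir: Path, payload: dict[str, bytes], datacite_xml: str) -> None:
    """Write a minimal BagIt 1.0 bag: payload under data/, metadata as a tag file."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Bag declaration required by the BagIt specification (RFC 8493).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # Write each payload file and record one manifest line per file.
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )

    # Assumption: the descriptive metadata travels in a metadata/ tag directory.
    meta_dir = bag_dir / "metadata"
    meta_dir.mkdir(exist_ok=True)
    (meta_dir / "datacite.xml").write_text(datacite_xml, encoding="utf-8")
```

The checksums in the manifest let the recipient verify the transfer, while the tag file keeps the metadata with the data.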
Structured domain-specific metadata should be part of a dataset
Research data repositories should include structured, domain-specific metadata in datasets downloaded by researchers and exchanged with other resources.
Besides metadata following generic schemes such as DataCite's metadata scheme, domain-specific metadata should be part of each dataset. This metadata should also be provided in datasets downloaded by researchers for reuse or exchanged with other resources.
One solution is to include Schema.org metadata using RO-Crate, or even to combine RO-Crate and BagIt. While BagIt focusses on reliable transfer, RO-Crate focusses on rich metadata.
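The following sketch shows what a small ro-crate-metadata.json could look like when generated in Python. It follows the RO-Crate 1.1 structure of a metadata descriptor plus a root dataset; the function name and parameters are hypothetical, and domain-specific terms would be added on top of this skeleton.

```python
import json


def ro_crate_metadata(
    name: str, description: str, date_published: str, license_url: str, files: list[str]
) -> str:
    """Return the content of ro-crate-metadata.json for a simple dataset."""
    graph = [
        {
            # Metadata descriptor that points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset described with Schema.org terms.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One File entity per payload file referenced from the root dataset.
    graph += [{"@id": path, "@type": "File"} for path in files]
    return json.dumps(
        {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph},
        indent=2,
    )
```

When RO-Crate and BagIt are combined, a file like this would typically sit inside the bag's payload next to the data files it describes.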
Collection DOIs
Research data repositories should provide a Collection DOI to wrap research data objects that are relevant to a single article that is to be published.
Field-specific research data repositories may provide DOIs to reference individual chemical reactions, molecules, and their analytical data. Generic, multidisciplinary research data repositories provide DOIs for whole published datasets, while more than one published dataset may be relevant to study results published via an article. In other words, many DOIs may be relevant to a published article, whereas a data availability statement can reasonably list only a few. To facilitate the process of manuscript submission and article publication, each repository should allow authors to generate a Collection DOI that wraps the relevant data and can be referenced in the data availability statement.
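One way to express such a wrapper in DataCite metadata is a record of resource type Collection that points to its members via related identifiers. The sketch below builds such a record as a Python dictionary; the property names follow the DataCite Metadata Schema, while the function name and the choice of HasPart as the relation type are assumptions for illustration.

```python
def collection_metadata(
    title: str, creators: list[str], publisher: str, year: int, member_dois: list[str]
) -> dict:
    """Build DataCite-style metadata for a Collection DOI wrapping member DOIs."""
    return {
        "titles": [{"title": title}],
        "creators": [{"name": name} for name in creators],
        "publisher": publisher,
        "publicationYear": year,
        "types": {"resourceTypeGeneral": "Collection"},
        # Each wrapped data object is listed as a part of the collection.
        "relatedIdentifiers": [
            {
                "relatedIdentifier": doi,
                "relatedIdentifierType": "DOI",
                "relationType": "HasPart",
            }
            for doi in member_dois
        ],
    }
```

Authors would then cite only the Collection DOI in the data availability statement, and the individual DOIs stay reachable through the collection record.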
Embargo period and metadata accessibility
Datasets published with an embargo period should have inaccessible data but accessible metadata.
While an embargo period restricts access to a published dataset, the metadata should remain accessible, i.e. the dataset is published with an embargo. This allows relevant information on the dataset to be retrieved and enables FAIR (A2), while guaranteeing first rights to the data to the authors. Findable and accessible metadata is required to link articles and datasets, as DOIs to datasets need to be validated before the metadata record of articles gets updated.
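The rule can be sketched as a small access check: metadata is always returned, files only once the embargo has ended. The record layout (doi, metadata, files, embargo_until) is a hypothetical one chosen for this example, not the data model of any particular repository.

```python
from datetime import date


def public_view(record: dict, today: date) -> dict:
    """Return what an anonymous visitor may see of a published record."""
    # Metadata stays accessible regardless of the embargo.
    view = {"doi": record["doi"], "metadata": record["metadata"]}
    embargo_until = record.get("embargo_until")
    if embargo_until is None or today >= embargo_until:
        view["files"] = record["files"]
    else:
        # Data is withheld, but the visitor learns when it becomes available.
        view["files"] = []
        view["available_from"] = embargo_until.isoformat()
    return view
```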
Scholix.org
Research data repositories should contribute to and use Scholix.org.
Scholix provides a framework for improving the links between scientific literature and research data, as well as between datasets, with the goal of establishing a high-level interoperability framework for exchanging information about these links. Scholix hubs, such as DataCite or OpenAIRE, contribute information on their metadata records, which contain information on connected digital objects. Academic publishers use this information to discover datasets that correspond to an article but were published after it, which allows the article's metadata to be updated with links to the dataset. Conversely, repositories should use Scholix.org to find related datasets and articles and to link them to datasets published in their own infrastructure by updating the metadata of those datasets with related identifiers.
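The last step, updating a dataset's metadata with newly discovered links, might look like the sketch below. The input link shape (target_doi, relation) is a simplified, hypothetical structure and not the Scholix schema itself; the output entries use DataCite's relatedIdentifiers property names.

```python
def add_related_identifiers(metadata: dict, links: list[dict]) -> dict:
    """Return a copy of the metadata with new links added as related identifiers."""
    related = list(metadata.get("relatedIdentifiers", []))
    # DOIs are case-insensitive, so compare them in lower case.
    seen = {(r["relatedIdentifier"].lower(), r["relationType"]) for r in related}
    for link in links:
        key = (link["target_doi"].lower(), link["relation"])
        if key in seen:
            continue  # Skip links that are already recorded.
        related.append(
            {
                "relatedIdentifier": link["target_doi"],
                "relatedIdentifierType": "DOI",
                "relationType": link["relation"],
            }
        )
        seen.add(key)
    updated = dict(metadata)
    updated["relatedIdentifiers"] = related
    return updated
```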
Datasets under review
Research data repositories should provide access to datasets under review.
To include datasets in the review process of manuscripts, repositories should provide access to reviewers. Repositories are strongly encouraged to provide a status 'in review' in addition to the statuses 'draft' and 'published'. A dataset under review should not be editable but should be accessible via a URL, while the DOI is not yet discoverable and the DOI metadata is not yet accessible, as the dataset has not yet been published. Such datasets under review might be assigned an (internally) reserved DOI.
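The three statuses and the rules attached to them can be sketched as a small state model. The allowed transitions below (for example, returning a dataset from review to draft) are assumptions for illustration; the text above only fixes that a dataset in review is read-only and that its DOI is not yet discoverable.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in review"
    PUBLISHED = "published"


# Assumed transitions: a reviewed dataset may go back to draft or be published.
ALLOWED = {
    Status.DRAFT: {Status.IN_REVIEW, Status.PUBLISHED},
    Status.IN_REVIEW: {Status.DRAFT, Status.PUBLISHED},
    Status.PUBLISHED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Move to the target status or raise if the step is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def is_editable(status: Status) -> bool:
    # Only drafts may be changed; datasets under review are frozen.
    return status is Status.DRAFT


def doi_is_discoverable(status: Status) -> bool:
    # The DOI and its metadata become public only on publication.
    return status is Status.PUBLISHED
```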
Reviewer access to datasets under review
URLs to access datasets under review should have the access credentials encoded within the URL.
Review links for datasets under review should have the access credentials encoded in the URL, rather than requiring a separate login and password. This avoids the need to send this information to the reviewer via the submission system or a letter to the editor, which has already been shown to be an error-prone process.
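A minimal sketch of such a link, assuming a hypothetical /review/<dataset id> path and a token query parameter, could use a random token generated with Python's secrets module. A real repository would also store the token server-side, let it expire, and revoke it on publication.

```python
import hmac
import secrets
from urllib.parse import parse_qs, urlencode, urlparse


def make_review_url(base_url: str, dataset_id: str) -> tuple[str, str]:
    """Create a review URL that carries its own access token."""
    token = secrets.token_urlsafe(32)  # Unguessable, URL-safe random token.
    query = urlencode({"token": token})
    return f"{base_url}/review/{dataset_id}?{query}", token


def token_matches(url: str, expected_token: str) -> bool:
    """Check the token in the URL against the stored one in constant time."""
    supplied = parse_qs(urlparse(url).query).get("token", [""])[0]
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

The editor or author only has to pass on a single link, and the reviewer needs no account.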
Publisher and PublisherIdentifier in DataCite Metadata
Research data repositories should provide the repository name as 'publisher' as well as a 'publisherIdentifier' in a dataset's DataCite metadata.
DOI prefixes are specific to each registrant. As each registrant may host more than one repository, the prefix is not necessarily specific to an individual repository. As stated by DataCite's documentation on the publisher field, the required publisher field is used to formulate the citation of the data and should therefore be the repository name. In addition, a repository identifier such as the re3data identifier should be provided to unambiguously identify the repository. Both fields should be automatically populated by the repository and should not be editable by the submitter.
In the case of RADAR4Chem the relationship would be described as follows:
publisher: RADAR4Chem
publisherIdentifier: http://doi.org/10.17616/R31NJNAY
publisherIdentifierScheme: re3data
schemeURI: https://re3data.org/
In XML:
<publisher xml:lang="en" publisherIdentifier="http://doi.org/10.17616/R31NJNAY" publisherIdentifierScheme="re3data" schemeURI="https://re3data.org/">RADAR4Chem</publisher>
This ensures humans as well as machines can trace and interpret where the data has been published.
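Because both fields should be populated automatically, a repository could generate the element from its own configuration. The sketch below does this with Python's xml.etree.ElementTree; the function name is hypothetical, and the element is built without the DataCite namespace to mirror the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URI that ElementTree serialises with the reserved "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def publisher_element(
    name: str, identifier: str, scheme: str, scheme_uri: str, lang: str = "en"
) -> str:
    """Serialise a DataCite publisher element with its identifier attributes."""
    element = ET.Element(
        "publisher",
        {
            f"{{{XML_NS}}}lang": lang,
            "publisherIdentifier": identifier,
            "publisherIdentifierScheme": scheme,
            "schemeURI": scheme_uri,
        },
    )
    element.text = name
    return ET.tostring(element, encoding="unicode")


xml = publisher_element(
    "RADAR4Chem",
    "http://doi.org/10.17616/R31NJNAY",
    "re3data",
    "https://re3data.org/",
)
```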
Metadata corrections and updates
Research data repositories should allow researchers to correct and update the metadata of datasets.
As the process of adding metadata via a metadata editor can be error-prone, creators should be allowed to correct and update the PID metadata as well as any other metadata. For full transparency, metadata may be versioned, and the record may indicate to viewers that changes were made. Updating metadata contributes to FAIR (F2) by enhancing the richness of metadata, and also allows creators to add related identifiers for recently published related datasets and articles.
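Versioned metadata can be sketched as keeping a snapshot of the previous state whenever a correction is applied. The record layout with metadata and history keys is a hypothetical one chosen for this example.

```python
import copy
from datetime import datetime, timezone


def update_metadata(record: dict, changes: dict) -> dict:
    """Apply changes to record['metadata'] and archive the previous version."""
    snapshot = {
        "metadata": copy.deepcopy(record["metadata"]),
        "replaced_at": datetime.now(timezone.utc).isoformat(),
    }
    record.setdefault("history", []).append(snapshot)
    record["metadata"].update(changes)
    return record


def was_modified(record: dict) -> bool:
    # Lets the landing page indicate to viewers that changes were made.
    return bool(record.get("history"))
```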
Legal issues on dataset abstracts
Research data repositories should inform authors and data curators about possible rights conflicts for the abstract field in the dataset's DataCite metadata.
Datasets should have their own description as an abstract. Copying the related article abstract can result in a rights conflict with the respective academic publisher. This is particularly the case in German law. Authors and data curators should be informed of this conflict, e.g. via a tooltip when adding metadata using a metadata editor.
Group licences
In research data repositories, licences should be grouped into research data licences and software licences.
Repositories often provide confusingly long lists of licences. Furthermore, the licences suitable for datasets differ from those suitable for software. Grouping the licences will greatly assist authors in selecting from the correct set.
Encourage Creative Commons licences
Research data repositories should encourage researchers to choose a Creative Commons licence to simplify the landscape of licences and their choice.
In line with grouping licences into software and dataset licences, pointing authors towards a Creative Commons (CC) licence simplifies the selection process. When publishing research data, Creative Commons licences that are no more restrictive than CC BY are strongly recommended.
Promote the least restrictive Creative Commons licences
Research data repositories should suggest licences such as CC0 or CC BY by pre-selecting such a licence rather than more restrictive licences such as CC BY-SA or even CC BY-NC-ND, which can inhibit reuse.
Shared data should be as open as possible and as closed as necessary, with the intent of allowing and enabling others to reuse and build upon the work. Repositories should therefore encourage open licences such as CC0 and CC BY rather than more restrictive licences such as CC BY-SA or even CC BY-NC-ND, which can make data reuse difficult.
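In a metadata editor, grouping and pre-selection could be driven by a small configuration like the sketch below. The licences are given as SPDX identifiers; the particular software licences listed and the choice of CC BY 4.0 as the pre-selected default are assumptions for illustration.

```python
# Licences grouped by the kind of object they are suited for.
LICENCE_GROUPS = {
    "Research data": ["CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-ND-4.0"],
    "Software": ["MIT", "Apache-2.0", "GPL-3.0-or-later"],
}

# Assumed default: an open licence is pre-selected for research data.
DEFAULTS = {"Research data": "CC-BY-4.0"}


def licence_options(group: str) -> list[dict]:
    """Return the licence choices for one group, marking the pre-selected one."""
    default = DEFAULTS.get(group)
    return [
        {"id": spdx_id, "preselected": spdx_id == default}
        for spdx_id in LICENCE_GROUPS[group]
    ]
```

Authors can still pick a more restrictive licence, but they have to make that choice deliberately.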
Creators and contributors
Research data repositories should label fields for creators in their (DataCite) metadata editor as Authors/Creators.
Researchers are used to talking about authors when publishing results with a scientific publisher. DataCite, on the other hand, distinguishes between creators and contributors, with the latter also being assigned a role. To avoid confusion for researchers who want to publish their data, repositories should guide researchers when adding metadata by labelling the field for creators in their (DataCite) metadata editors as Authors/Creators.
Main authors: ORCID: 0000-0003-4480-8661, ORCID: 0000-0002-6243-2840