Data format standard
Research data are key to any science. As part of FAIR practice, these data should be accompanied by valuable metadata and have to be exchanged in standardised and open formats. In chemistry these data include experimental parameters and measurement results, chemical structures, properties of compounds and descriptions of reactions. Whenever research data are published, stored, shared or reused, chemists have to decide on a format suitable for the purpose and consider the long-term stages of the data life cycle. Repositories and databases often accept only specific formats to ensure comparability and completeness of the data and metadata provided. If chemists are aware of data standards during data acquisition and research documentation, later conversions will be less challenging. Electronic lab notebooks can support the scientist in the early stages of data management.
Definitions
A successful data format standard is defined by the data model and the data representation. A specification of a representation may contain the model implicitly.
The data model describes how data are organised, which information they contain, the data types (e.g. text, numbers, lists), the relationships between these components, and rules for component and data integrity. Conceptually, the model and its components can be organised in different ways, e.g. flat, multidimensional, as a network or hierarchically. The meaning of the components used has to be described in an unambiguous way. Ontologies can be consulted to ensure a consistent usage of the components in the context of the model. As the data model is abstract, it is necessary to implement representations of it, such as a database or a data file on a computer.
Digital files containing data are only one way to represent a data model. Other representations would be response data from API requests or results from a database query.
File formats, as representations of data models, can be categorised by different criteria: proprietary vs. openly specified, binary vs. text, simple vs. complex, or flat vs. n-dimensional. For long-term storage, a compact binary format may be the first choice, while for further processing with cheminformatics tools a format based on standards like Comma Separated Values (CSV), Extensible Markup Language (XML) or JavaScript Object Notation (JSON) may be more favourable because it is supported by most programming languages. When thinking about data strategies, open formats should be considered. Proprietary formats may prevent further reuse due to licensing issues, poor documentation and lacking vendor support. Complex data often need more complex formats to express the meaning of and relationships between the data.
The specification of a data file standard may describe the whole format from scratch or can build on generic formats which are standardised elsewhere. Such generic formats include CSV, XML or JSON. For some of these generic formats, a formal specification format, called a schema format, is available, which streamlines the specification of derived formats. XML or JSON files can be described by schemas. If the format cannot be described formally by a general or existing schema, it is important to create an unambiguous new specification that is clear in its description of syntax and grammar as well as in the definition of the components of the underlying data model. File formats should also be extensible to allow their usage for new techniques and under future requirements. Data file standards not fulfilling these criteria may fork into incompatible variants or will soon become outdated and be replaced by new standards.
Any software importing or exporting a file format has to include readers and/or writers for this format, often accompanied by a validator for checking documents for adherence to the standard. Integration of a format into software is more likely if developers can rely on an implementation available as a library, package or module for their programming language or environment. Thus, acceptance of a format also depends on the availability and support of implementations. Formats are more sustainable if they have a free licence and are open source. For formats described by schemas there are often well-maintained implementations of the generic format, including validation, so that only the schema file has to be provided. Implementations of formats which are not openly specified and lack vendor support are sometimes only feasible through reverse engineering, whose legality is debatable. Only a complete and unambiguous specification allows for well-crafted implementations.
Generic format standards
There are several generic formats to store data in a structured way. Their specifications do not define any domain-specific elements, but some fields for document metadata, like a version, may be required to conform to the specification. To create domain-specific document formats, the elements, including their semantic descriptions, have to be specified separately. Some of these generic formats have a corresponding schema format, which allows a formalised description of documents. Schemas describe the data types, relations and order of the elements and their attributes. They also provide fields for describing the data, which add semantic meaning in the context of the format.
Comma-separated values (CSV) is a simple ASCII text format for tabular data. The values are stored as a single table with the columns separated by commas. The first line of the table may be interpreted as the table header. The format does not define any metadata, but these are often added as comments above the header. Although the specification does not mention comments, it is common to interpret lines starting with a hash sign ("#") (or sometimes another character) as comment lines. There are variants of the format which use tabulators, semicolons or other characters to separate the values. Many vendors use the CSV format or variants of it as an export format, because it is human-readable and can easily be converted to other table formats like Excel sheets. Due to the lack of defined metadata and column descriptions it may be challenging to compare data from CSV files of different vendors, as column titles or the units of values can often only be understood in the context of the instrument and software.
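As an illustration, a vendor-style CSV export might look like the following sketch; the comment lines, column titles and units are invented for this example and will differ between instruments and software:

```csv
# Instrument: UV/Vis spectrometer (hypothetical example)
# Operator: J. Doe
# Date: 2023-05-04
wavelength_nm,absorbance
200,0.012
201,0.015
202,0.021
```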
Extensible Markup Language (XML) is a markup language to structure documents. The document is organised by elements and their attributes, and the scope of an element is marked by start and end tags. Elements can be nested and may include values. This allows for a representation of complex data models, including multidimensional data and hierarchies. Because an XML file is a text file it is readable for humans, but the repeated tags distract from the information contained in the values. Because of this verbosity, XML files tend to grow quickly and are often huge compared to other formats. For large datasets, XML therefore allows the values to be stored as Base64-encoded binary data. XML allows a straightforward definition of specific formats through schemas, and XML Schema Definition (XSD) is the most common standard for these. Implementations for reading, writing and validating XML files exist in most programming languages. For these reasons, many file formats in chemistry are based on XML.
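A minimal, hypothetical XML fragment may illustrate how nested elements and attributes represent hierarchical data; the element names are invented for this example and do not follow any particular chemistry schema:

```xml
<sample id="S-001">
  <name>caffeine</name>
  <measurement technique="UV/Vis">
    <dataPoint wavelength="273" unit="nm" absorbance="0.82"/>
    <dataPoint wavelength="274" unit="nm" absorbance="0.85"/>
  </measurement>
</sample>
```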
Like XML, the JavaScript Object Notation (JSON) was created to exchange data on the World Wide Web. Where XML has element tags for structuring the document, JSON uses braces, brackets and commas. Being less verbose, the format is easier for humans to read, and files may be smaller. With JSON Schema there is a draft for a schema format to describe JSON documents. Even though JSON is less common for file formats, it is often used when data are exchanged via Application Programming Interfaces (APIs).
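The same hypothetical record as in the XML sketch above, expressed in JSON, shows how braces and brackets replace the element tags:

```json
{
  "sample": {
    "id": "S-001",
    "name": "caffeine",
    "measurement": {
      "technique": "UV/Vis",
      "dataPoints": [
        { "wavelength": 273, "unit": "nm", "absorbance": 0.82 },
        { "wavelength": 274, "unit": "nm", "absorbance": 0.85 }
      ]
    }
  }
}
```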
Single datasets may be bundled to be stored as one unit, or large datasets may be split up into smaller files for better handling. Container formats allow the relationships between the data files to be retained. This can be achieved by interlinking metadata in the data files or metafiles, by grouping files in containers or archives like zip or tar, or by structured container file formats. An example of the container approach is HDF5, which combines a file-system-like structure with metadata. There are many implementations and tools for HDF5, and some formats for chemical data are based on it.
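A minimal Python sketch using the h5py package may illustrate the file-system-like grouping and attribute metadata in HDF5; the group names, attributes and values are chosen for this example only:

```python
import h5py
import numpy as np

# Create an HDF5 container with a file-system-like group hierarchy.
with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sample_001/uvvis")            # nested groups, like folders
    grp.attrs["operator"] = "J. Doe"                     # metadata stored as attributes
    grp.attrs["instrument"] = "hypothetical spectrometer"
    # Store the measured data as named datasets inside the group.
    grp.create_dataset("wavelength_nm", data=np.arange(200, 401))
    grp.create_dataset("absorbance", data=np.random.random(201))

# Read the data back and inspect the metadata.
with h5py.File("experiment.h5", "r") as f:
    dset = f["sample_001/uvvis/absorbance"]
    print(f["sample_001/uvvis"].attrs["operator"], dset.shape)
```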
Format standards in chemistry
Many formats used in chemistry are based on one of the generic formats from the last section. The following table lists some common formats used by chemists.
Format | Data type | Maintainer | Parent Format | Specification |
---|---|---|---|---|
JCAMP-DX | multiple | IUPAC | ASCII, Text | open |
AnIML | multiple | ASTM | XML | open |
netCDF | multiple | UCAR | CDF | open |
CSV | multiple | IETF-RFC | ASCII, Text | open |
ASCII | multiple | | self-explanatory | (open) |
ISA | multiple | ISA Commons Community | TSV or JSON | open |
UDM | multiple | Pistoia Alliance | XML | open |
ADF | multiple | Allotrope | HDF5+RDF | for members |
mzML | mass spectrometry | HUPO/PSI | XML | open |
ANDI-MS | mass spectrometry | ASTM International | netCDF | open |
nmrML | NMR | COSMOS | XML | open |
NMReDATA | NMR | NMReDATA Initiative | SDF | open |
Bruker FID | NMR | Bruker | (binary) | proprietary |
mnova | NMR | Mestrelab | (binary) | proprietary |
Bruker OPUS | spectroscopy | Bruker | (binary) | proprietary |
Perkin Elmer | spectroscopy | Perkin Elmer | ASCII, Text | proprietary |
ThermoFisher GRAMS | spectroscopy | ThermoFisher | (binary) | proprietary |
The JCAMP-DX format can be applied to a wide range of spectral and analytical data. It was developed by the Joint Committee on Atomic and Molecular Physical Data (JCAMP), initially as a format for IR spectroscopy data in 1988, and is now maintained under the auspices of IUPAC. A general standard has been proposed which can be used for different spectroscopic and spectrometric methods. Additionally, dedicated standards for electron magnetic resonance spectroscopy, nuclear magnetic resonance (NMR), chromatography and mass spectrometry have been published. Because the standard does not provide native support for ontologies or controlled vocabularies, and each implementation may use its own extensions, files from different sources can be incompatible. A Java reference implementation of the format is available, and there are libraries for other programming languages such as Python, R, JavaScript and MATLAB. As JCAMP-DX is accepted as the exchange format for many analytical methods, it is widely supported by software for spectral analytics.
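The labelled-data-record syntax of JCAMP-DX can be sketched as follows; the labels and values are illustrative and heavily abridged, not a complete, valid record:

```text
##TITLE= example IR spectrum (illustrative)
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##XYDATA= (X++(Y..Y))
400 0.912 0.910 0.905 0.899
...
##END=
```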
The XML-based Analytical Information Markup Language (AnIML) was created as an ASTM International standard and covers different analytical techniques. The standard comprises schema definitions for a generic core and for technique-specific documents. Thus, it is possible to define technique documents for various analytical measurements. As AnIML is fully specified by its XML schemas, it can readily be implemented in any language with XML support. No reference implementation is available, but among the (few) open-source implementations Jmol/JSmol (formerly JSpecView) can import and visualise AnIML documents. For developers working with the Python programming language, a library is under development to create, parse and validate AnIML files. There is also support by BSSN Software (now Merck), which promotes the format, often in combination with the device interface SiLA (Standardisation in Lab Automation).
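Because AnIML is fully specified by XML schemas, a generic XSD validator is sufficient to check documents. A minimal sketch using the Python lxml package, with placeholder file names assumed for illustration:

```python
from lxml import etree

# Load the schema (e.g. the AnIML core schema) and the document to be checked;
# both file names are placeholders for this example.
schema = etree.XMLSchema(etree.parse("animl-core.xsd"))
doc = etree.parse("measurement.animl")

if schema.validate(doc):
    print("document conforms to the schema")
else:
    # Report line numbers and messages for all validation errors.
    for error in schema.error_log:
        print(error.line, error.message)
```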
NetCDF is a binary file format and software interface, mainly defined by its implementations by the Unidata community. It is an abstract model which can be extended by self-describing objects, so it can be flexibly adapted to specific use cases. A family of ANDI (ANalytical Data Interchange) formats based on netCDF is specified by ASTM (see also ANDI-MS below).
The ISA (Investigation-Study-Assay) framework, which originated in the bioscience community, defines the hierarchical ISA data model to store metadata on project context and study details as well as analytical measurement data. As the abstract ISA data model encourages the user to annotate any parameter or value with ontology terms, it fosters well-described datasets. Implementations are available as tab-separated value files (ISA-Tab) or as JSON (ISA-JSON). The ISA API is a Python library implementing the model for usage with the ISA formats. The model is also applied in repositories such as MetaboLights. Moreover, journals such as Scientific Data or GigaScience use the ISA data model to describe complex experimental setups covered in the manuscripts.
Format standards for analytical Data
Experimental data obtained with spectroscopic methods such as infrared, Raman and UV/Vis spectroscopy are often comparatively small in size and straightforward in their structure. Vendors store the raw data in proprietary formats, either as binary data or in ASCII. These can be exported as (or converted to) Excel spreadsheets or plain text tables with x,y pairs (or a similar format). A header section may include metadata.
Users have to further process such data for their specific needs, and currently no overarching specifications exist. Repositories may convert these to a specified format, e.g. Chemotion ELN will convert text and Excel files to JCAMP-DX.
There are a few vendor formats which are popular for data exchange: GRAMS SPC, Perkin Elmer SP and Bruker OPUS files are supported not only by the format creators, but also by other vendors and instrument-agnostic software tools for e.g. statistical analysis.
There were efforts to create a dedicated format for ultraviolet-visible spectroscopy data, called SpectroML, which has now been superseded by the more general AnIML. Harmonisation between instrument vendors and adoption of an open standard still need to be achieved.
Data formats for nuclear magnetic resonance spectroscopy (NMR)
NMR is an indispensable analytical technique providing rich information on bonding and structure as well as on the interaction and abundance of molecules in samples. Until now, it has been common practice to publish the spectra as images in supplementary materials, usually as PDF files. Additionally, a list of shifts is reported, sometimes referred to as NMR text.
However, the raw data containing the free induction decays (FIDs), the initially processed spectra and the instrument metadata are usually not published, although they would allow for reanalysis and reuse. The importance of providing both FID raw data and extracted NMR spectra has previously been demonstrated extensively.
All instrument vendors have developed their own (binary) raw data formats. Since the FID data itself is mainly time-response data with a straightforward structure, most of the vendor formats are supported by the software used within the NMR community. Many vendors have also agreed to import and export the JCAMP-DX format, which has a specification for FID raw data and is recommended by IUPAC.
The JCAMP-DX format can also be used for the exchange, import and export of multidimensional spectra. Because of the open and extensible nature of JCAMP-DX on the one hand and the lack of a controlled vocabulary on the other, different flavours of the NMR implementation of the format already exist; hence, validation or import might be challenging.
Inspired by the mass spectrometry format mzML (see below), the nmrML standard was initially developed for metabolomics data, but can also be used for any other kind of NMR data. nmrML is an XML-based format for FID raw data of 1D as well as 2D NMR spectra. Due to the explicit syntax specification of this format and the underlying controlled vocabulary (nmrCV), data files can be validated. It is used as a storage format for NMR data in the MetaboLights data repository.
The open NMReDATA format is maintained by the NMReDATA initiative. An NMR record in NMReDATA format includes the instrument (raw) data, an SDFile and, since version 2, also spectral data in JCAMP-DX format, collected in a folder which can be compressed as a zip archive for data exchange. The SDFile contains the chemical structure and the actual NMReDATA as standardised SDF tags. These tags cover chemical shifts, couplings, signal assignments and lists of 2D correlations, to mention only a few. NMReDATA can be used for 1D and 2D spectra and contains a core set of NMR parameters. The format allows raw data, extracted data and structures to be recorded together, which is not fully supported by existing formats. It is machine- and human-readable at the same time, and allows flexibility and extensions. FAIRness is the overall principle behind it. Members of the NMReDATA initiative include open-source projects such as NMRShiftDB2 and Cheminfo.org, commercial NMR software vendors such as MestreLab, NOMAD, C6H6 and ACD/Labs, and device vendors like Bruker.
Recently, several additional open standard formats have been developed by the NMR community. A great effort came from protein structure determination by NMR, a field where the specifics and size of the macromolecules required special data formats.
Derived from the self-defining STAR format is the more protein-specific NMR-STAR format, which is used by the BioMagResBank (BMRB) and defines over 4600 data item tags describing data and metadata, organised in more than 300 categories and 80 category groups. The NMR Exchange Format (NEF) was developed for the storage of NMR data in the wwPDB and is more accessible for software developers because it reduces this complexity. Additionally, it is extensible with application-specific tags. As NMR-STAR and NEF are both derived from the STAR format, they are interconvertible and are the only formats accepted by the wwPDB and BMRB. The Collaborative Computing Project for NMR (CCPN) is developing NEF based on its data model for usage within its protein-NMR-focused software tools.
Data standards in Mass Spectrometry
A distinction that is rather important in the different disciplines of chemistry is whether a particular spectrum is the data of interest, or whether a set of spectra is to be represented. In the former case, text-based file formats like JCAMP-DX, Mascot Generic File (MGF) or National Institute of Standards and Technology mass spectrometry (NIST MSP) may be sufficient. However, for entire runs using, e.g., LC-MS or GC-MS with hundreds of chromatography-resolved spectra, more efficient file formats have been developed. The netCDF (Network Common Data Form) based Analytical Data Interchange Protocol for Mass Spectrometry (ANDI-MS) is an ASTM International standard. It was initially developed as an Analytical Instrument Association (AIA) standard as a follow-up to the ANDI for Chromatographic Data specification. Technically, it builds upon netCDF, a generic and highly efficient container format. The ANDI-MS specification defines which elements are needed to encode mass spectrometry data.
More complex MS experiments require capturing a rich set of instrumental settings such as per-scan polarity, isolation windows and collision energies. Several formats (mzXML, mzData) were developed in the early days of proteomics and have since been merged into mzML by the Proteomics Standards Initiative (PSI). Despite the term proteomics in its name, many of the PSI standards can also be used for analytical data from samples beyond proteomics. The XML-based mzML data format is a widely accepted standard for analytical mass spectrometry data, recommended by several societies and infrastructures for data exchange and archival. There is also a wide range of tools, including converters and spectra viewers, and software libraries to work with mzML files. The use of the PSI-MS ontology as a controlled vocabulary, combined with data validators, provides excellent interoperability between consumers and producers of mzML, regardless of the instrument vendor or analysis software.
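An abridged fragment may illustrate how mzML annotates a spectrum with cvParam elements that reference the PSI-MS controlled vocabulary; the fragment is schematic and omits mandatory parts such as the binary data arrays:

```xml
<spectrum index="0" id="scan=1" defaultArrayLength="1024">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
  <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" value=""/>
  <!-- binaryDataArrayList with m/z and intensity arrays omitted -->
</spectrum>
```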
The XML-based nature of these formats ensures that the data remain readable by most, if not all, computer systems and programming languages in the long term. To improve performance for fast random access and parallel processing of data, the same data model was used in several formats like mz5, Toffee and mzMLb, which are based on the HDF5 container format, itself considered a successor to netCDF.
Data Standards in X-ray Crystallography
Crystal structure analyses by X-ray diffraction are fundamental techniques in chemistry to determine the atomic and molecular structure of materials. These techniques measure the angles and intensities of a diffracted X-ray beam and calculate structural information from the data. In the case of single-crystal measurements, the raw datasets can be very large, while other methods like powder X-ray diffraction produce only two-dimensional raw data. Data from the latter are therefore exchanged in simple text files, exported from the instrument vendor software.
With the Crystallographic Information File (CIF), there is a common exchange format for crystallographic data, which is developed and maintained by the International Union of Crystallography. The CIF is an implementation of the STAR file format and thus a text file organised in data blocks whose contents are described by data names or tags. These data names are defined in plain-text dictionaries, which use a controlled language and are readable by humans and computers. Besides the core dictionary with tags relevant for small-molecule and inorganic crystals, there are dictionaries for special applications like powder X-ray diffraction or macromolecular crystals (mmCIF). The possibility to extend the format by adding new dictionaries makes it ready for new methods and applications. Furthermore, with the Crystallographic Information Framework there is also a data model which relies on the same principles as the file format and can be adapted to specific applications.
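A shortened CIF fragment illustrates the data block and tag structure; the values are invented for this example and many mandatory items are omitted:

```text
data_example_structure
_cell_length_a                    10.123
_cell_length_b                    11.456
_cell_length_c                     7.890
_symmetry_space_group_name_H-M   'P 21/c'
loop_
  _atom_site_label
  _atom_site_fract_x
  _atom_site_fract_y
  _atom_site_fract_z
  C1  0.1234  0.5678  0.9012
  O1  0.2345  0.6789  0.0123
```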
Databases and repositories such as the Cambridge Structural Database (CSD) and the Crystallography Open Database (COD) accept only CIF as the format for depositing crystallographic data.
Data Standards in X-Ray Absorption and Fluorescence Spectroscopy
X-ray absorption (XAS) as well as X-ray fluorescence (XRF) spectroscopy generates simple spectra, with the monochromatic X-ray radiation on the abscissa and the absorption of the sample on the ordinate. The spectra can be exported in formats based on CSV, with multiple columns, but the units (e.g. energy, wavelength), the column format and the included metadata often depend on the software used for the measurement at the beamline or instrument, which interferes with interoperability. To compare XAS and XRF data measured at different beamlines and with different devices, it is also important to include instrument and calibration parameters in the dataset. For larger sets of XAS data, the HDF5 format is considered the standard format and is already used by some beamline software, such as BLISS at the ESRF.
For the interchange of single X-ray absorption spectra, the XAFS Data Interchange (XDI) format was proposed, which combines a dictionary of relevant metadata and the data table in a text file. Thus, it is readable by humans and computers and compatible with most of the existing software accepting x,y tables. The authors of the format also provide an implementation in C and bindings for several other programming languages such as Fortran, Perl and Python. The format has already been accepted for import to the reference sample database at Diamond Light Source and the X-ray Absorption Data Library of the International X-ray Absorption Society.
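An XDI file combines a hash-prefixed metadata header with a plain data table, roughly as in the following abridged and invented sketch; field names and values are illustrative only:

```text
# XDI/1.0
# Element.symbol: Cu
# Element.edge: K
# Column.1: energy eV
# Column.2: mutrans
# -------------
#  energy   mutrans
  8970.0    0.2738
  8980.0    0.3041
  8990.0    0.8964
```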
There is no existing standard for XRF files, so most of the software tools provide some kind of import dialog to select matching columns and units from text files exported by the vendor software or can read data contained in HDF5 files.
Format standards for structural Data
Chemical structures can be described as a simple ASCII string with the SMILES format, which has been an open standard (OpenSMILES) since 2007. For more information about SMILES and its variants read the simplified molecular-input line-entry system (SMILES) article in this knowledge base.
Another one-line encoding of structures, mainly intended for searching information about compounds, is the International Chemical Identifier (InChI), which also has an open specification and is described separately.
A simple text format for chemical structures is the XYZ format, which contains only the coordinates of the atoms of a molecule. As the format lacks a formal specification, XYZ implementations may differ and result in incompatible files.
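A commonly used convention, although not formally specified, is a first line with the atom count, a free-text comment line, and then one line per atom with the element symbol and Cartesian coordinates in ångström, e.g. for a water molecule:

```text
3
water molecule (illustrative)
O   0.0000   0.0000   0.1173
H   0.0000   0.7572  -0.4692
H   0.0000  -0.7572  -0.4692
```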
The Chemical Table File format was originally created by MDL Information Systems, now BIOVIA. It is now an open standard, but downloading the specification requires registration. The files are text-based, consisting of a header and several data blocks with the structural information for the molecules.
The structure data file (SDF) is a derivative of the chemical table file; it can include multiple compounds in one file together with associated data for each record. The SDF format was extended by NMReDATA to store NMR data and chemical structure in one file format.
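In an SDF, each record consists of a molfile block followed by data items in angle-bracket tags and is terminated by `$$$$`; the sketch below is abridged, and the tag name is hypothetical (NMReDATA defines its own standardised tag names):

```text
  ... molfile header and atom/bond tables (abridged) ...
M  END
> <MELTING_POINT>
238.0

$$$$
```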
The Chemical Markup Language (CML) is based on XML and is an approach to store diverse chemical data in one universal file format. Although it is also capable of storing analytical data, it is mainly used for structural information and reactions.