
Data Preparation Protocol for ODAM Compliance

The purpose of this protocol is to describe all the steps involved in collecting, preparing and annotating the data from an experiment associated with an experimental design (DoE), so that the user can then benefit from the services offered by ODAM. The overall approach is based on good data management practices concerning data structuring and the description of structural metadata.

Indeed, the strong point of the approach is to define metadata in depth, i.e. at the level of the data itself (column-level metadata such as factors, variables, ...) and not just descriptive metadata on top of the dataset. Having structural metadata thus allows datasets to reach a higher level of interoperability and greatly facilitates functional interconnection and analysis in a broader context.

Based on an example

  • In order to illustrate the different stages of this protocol, we have chosen an example from an experiment on tomato fruits grown in a greenhouse. The aim of this study was to build a model of fruit growth. A certain amount of data was required for this; here we limit ourselves to a subset of it in order to keep the dataset small.

  • See the complete example:

1 - Data Gathering

In our data subset example, we have 5 data files, one per type of object (plants, harvests, samples, compounds and enzymes).

  • 5 different entities within the study, each corresponding to a data table file:

    • plants, harvests, samples, compounds and enzymes
  • 2 factors:

    • Treatment, Development stages
  • 53 quantitative variables:

    • compounds (12) + enzymes (38) + weight, height, diameter (3)

1 - First, we put them all in the same directory, named after the study or project (e.g. the project acronym with a suffix corresponding to the study)



2 - Data subset files must be compliant with the TSV standard (Tab-Separated Values).

An ODAM dataset is therefore a bundle containing a set of TSV files. The TSV files are simple tables containing the data of the dataset. In choosing this format, we follow the 5-star Open Data scheme, considered good practice and a necessary, indispensable step towards "Linked Open Data".



Note: Data files must have the extension 'txt' in order to distinguish them from metadata files (see below).

Advice: To be sure to have the right format, copy the data from the spreadsheet, paste it into a new file, then save it in TSV format (separator: a tab character).
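As an alternative to the copy-and-paste route, the export can also be scripted. The sketch below is only a minimal illustration: it assumes pandas (and an Excel reader such as openpyxl) is installed, and the file name 'samples.xlsx' is a hypothetical placeholder for your own spreadsheet.

    # Minimal sketch: export one spreadsheet sheet to a tab-separated .txt file.
    # 'samples.xlsx' is a hypothetical input name -- adapt it to your own data.
    import pandas as pd

    df = pd.read_excel("samples.xlsx")   # read the original spreadsheet sheet
    df.to_csv("samples.txt",             # data files keep the .txt extension
              sep="\t",                  # TSV: a tab character as separator
              index=False,               # do not add an extra index column
              na_rep="NA")               # write missing values as 'NA'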

2 - Data structure and organization

Since all the experimental data tables were generated as part of an experiment associated with a Design of Experiment (DoE), each file thus contains data acquired sequentially as the experiment progressed.



There must therefore be a link between the files, i.e. information that connects them together. In most cases (if not all), this information corresponds to identifiers[4] that precisely reference, within the experiment, each of the elements belonging to the same observation entity[1] forming a coherent observation unit. For example, each plant and each sample has its own identifier, and each of these entities corresponds to a separate data file.


Well organized data means that each data table must be correctly structured, i.e.:

  • Each variable forms a column,
  • Each observation forms a line,
  • Each type of "observational unit" (defined as an entity) forms a table, i.e. a file,
  • Each data table file must have a column defined as an identifier (similar to a primary key) corresponding to each observation of the entity (e.g. plant, sample, …),
  • Missing values can either be an empty cell or have the value 'NA',
  • Header names must be short and without special characters: use only alphanumerical characters, with the underscore character as word separator,
  • The file should only contain data in matrix form and nothing else, i.e. no annotation on top, bottom, or sides.
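The rules above can be checked programmatically before going further. The sketch below is a minimal, illustrative check in Python; the file name 'samples.txt' and the identifier column name 'SampleID' are hypothetical and should be adapted to your own subsets.

    # Minimal sketch: basic structure checks for one data subset file.
    # 'samples.txt' and 'SampleID' are hypothetical names -- adapt them.
    import csv
    import re

    def check_subset(path, id_column):
        with open(path, newline="") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        header, data = rows[0], rows[1:]
        # header names: alphanumerical characters and underscore only,
        # and they should not start with a digit
        for name in header:
            if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
                print(f"bad header name: {name!r}")
        # the identifier column must exist and hold unique, non-missing values
        idx = header.index(id_column)
        ids = [row[idx] for row in data]
        if len(set(ids)) != len(ids) or any(v in ("", "NA") for v in ids):
            print(f"column {id_column!r} is not a unique, complete identifier")

    check_subset("samples.txt", "SampleID")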


Example of the ‘samples.txt’ file:



The files generated during data collection have to be organized according to an entity-relationship model, similar to relational database management systems (RDBMS). Indeed, each entity[1] corresponds to a type of collected data (samples, compounds, ...) with which a set of attributes[2] is associated, i.e. a set of variables that may include observed or measured variables (quantitative or qualitative), controlled independent variables (factors) and an identifier.

Then, for each subset, a link is established with the subset from which it was obtained, so that the links can be interpreted as "obtained from", as shown in the figure below:



We have to organize the data subsets so that links can be established between them. In practice, this means adding, when necessary, a column (colored in green in the figure below) containing the identifiers corresponding to the entity to which we want to connect the subset. Note that this duplication of identifiers must be the only redundant information across all data subsets. A sketch of how such a link can be checked is given just below.
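As a complement to the figure, the sketch below illustrates how such an "obtained from" link can be verified: every identifier used as a link in the child subset must exist in the parent subset. The file names 'plants.txt' and 'samples.txt' and the shared column name 'PlantID' are hypothetical.

    # Minimal sketch: check an 'obtained from' link between two subsets.
    # 'plants.txt', 'samples.txt' and 'PlantID' are hypothetical names.
    import csv

    def read_column(path, column):
        with open(path, newline="") as f:
            rows = list(csv.reader(f, delimiter="\t"))
        idx = rows[0].index(column)
        return [row[idx] for row in rows[1:]]

    parent_ids = set(read_column("plants.txt", "PlantID"))
    child_links = read_column("samples.txt", "PlantID")

    # every link identifier in the child subset must refer to an existing parent
    missing = [v for v in child_links if v not in parent_ids]
    if missing:
        print("unknown parent identifiers:", sorted(set(missing)))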



3 - Structural Metadata

ODAM provides a model for structuring both data and metadata that facilitates data handling and analysis.

Whatever the kind of experiment, this assumes a design of experiment (DoE) involving individuals, samples or other objects as the main objects of study, and producing several experimental data tables. It also assumes the observation of dependent variables resulting from the effects of some controlled independent variables (factors[3]).

Moreover, each object of study usually has its own identifier[4], and the variables can be quantitative[5] or qualitative[6].

The study can involve either one entity or several kinds of entities, but in the latter case a relationship must exist between the entities, which we assume to be of the "obtained from" type as described above.

Thus, the samples data table can be viewed according to the partition into categories we have just introduced, as shown below:


a data entity (data subset 'samples') consisting of its attributes (columns) divided by category (identifier, factor, quantitative, qualitative)


The four categories

  • Identifier: precisely references, within the experiment, each of the elements belonging to the same observation entity forming a coherent observation unit. For example, each plant and each sample has its own identifier.

  • Factor: a factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. Treatments (control vs. stress), genotype (WT vs. mutant), the course of time (development stages) or even tissues, are typical factors of experiments.

  • Quantitative: Quantitative data are values that describe a measurable quantity, in the form of numbers that can be calculated.

  • Qualitative: Qualitative data describe qualities or characteristics. They answer questions such as "what type" or "what category". These values are no longer numbers, but a set of modalities. These values cannot be calculated.



In order to make the data ODAM-compliant, two specific files (subsets & attributes) are needed to describe the structural metadata of the whole dataset, i.e. some minimal but relevant metadata. These two required metadata files are named s_subsets.tsv and a_attributes.tsv.

(1) The subset metadata file (s_subsets.tsv) makes it possible to associate each data subset with a key concept corresponding to the main entity of the subset, each subset being stored as a file (the grey rectangle). It also defines, for each subset, the link to the subset from which it originates. (2) The attribute metadata file (a_attributes.tsv) allows each attribute (concept/variable) to be annotated with some minimal but relevant metadata, such as its description (with its unit), its data type and its category. The category, defined by a controlled vocabulary (CV), is used to specify the type of each variable. In each of these two files (subsets and attributes), it is possible to annotate each of the terms with unambiguous definitions through links to accessible definitions (standardized CV terms).

Thus constructed, this metadata constitutes a dictionary describing each file (subsets) as well as all the columns of the tables (attributes), offering a better guarantee of correct (re)use of the data by users who did not produce it. This metadata therefore lets non-expert users explore and visualize your data. By making data interoperable and reusable by both humans and machines, it also encourages data dissemination according to the FAIR principles.




s_subsets.tsv

  • A metadata file that associates each data subset with a key concept corresponding to the main entity of the subset, and defines the relations of type "obtainedFrom" between these concepts.
  • The full version of this metadata file for the FRIM1 dataset can be accessed online using the API:

A column : Unique rank number of the data subset

B column : father rank

  • the rank of the data subset (father rank) from which this data subset was obtained, implying an 'obtained from' relationship between the two data subsets

C column : short name of the data subset

  • i.e. the entity name associated with the subset, in the form of a short name

  • only the alphanumerical characters and the underscore are allowed (i.e. 'a-z', 'A-Z', '0-9' and '_').

D column : The identifier attribute

  • should be the only attribute declared as 'identifier' in the 'category' column in the a_attributes.tsv file (D column)

  • should be available as a column item in the corresponding data subset file

E column : names of the files

  • only the alphanumerical characters, the underscore and the dot are allowed ( i.e. 'a-z', 'A-Z', '0-9' and '_' , '.' )

  • Moreover, these names should not start with a digit!

F column : description of the entity

  • the allowed characters are: 0-9 a-z A-Z , : + * () [] {} - % ! | / . ?

G & H columns : annotations based on ontology

  • use an ontology term (G) along with its corresponding URL (H)

  • Sites such as BioPortal or AgroPortal are great sources for finding controlled vocabulary (CV) terms based on ontologies

  • These annotations are optional, but at least one must be specified so that the data table has exactly 8 columns. A minimal good practice is therefore to put a NULL annotation in the first one, e.g. 'NULL' (G), '/voc/null' (H).
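To make the layout concrete, here is a purely illustrative sketch of two s_subsets.tsv rows for the tomato example. The ranks, identifiers, file names and descriptions are hypothetical, the columns A to H are separated by tab characters (rendered here as whitespace), and the sketch assumes no header row; check the full FRIM1 example available through the API for the exact layout.

    1   0   plants    PlantID    plants.txt    Plant description and growth conditions   NULL   /voc/null
    2   1   samples   SampleID   samples.txt   Samples harvested on the plants           NULL   /voc/null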




a_attributes.tsv

  • A metadata file allowing each attribute (concept/variable) to be annotated with some minimal but relevant metadata
  • The full version of this metadata file for the FRIM1 dataset can be accessed online using the API:

A column : Short names of the data subsets

  • must be declared in the s_subsets.tsv file (C column) and vice versa.

  • only the alphanumerical characters and the underscore are allowed (i.e. 'a-z', 'A-Z', '0-9' and '_').

B column : attributes

  • short name of the variables (data table column names)

  • only the alphanumerical characters and the underscore are allowed (i.e. 'a-z', 'A-Z', '0-9' and '_').

  • a set of variables that may include observed or measured variables (quantitative or qualitative), controlled independent variables (factors) and identifiers.

  • one and only one attribute (B column) must be declared as 'identifier' in the 'category' column (D column) per data subset (A column)

  • This column can be easily filled by copy-paste from the data table files, as shown below:

C column : Entry

  • only the alphanumerical characters and the underscore are allowed (i.e. 'a-z', 'A-Z', '0-9' and '_').

D column : Category

  • has a limited choice of words: the set of terms is fixed, namely 'identifier', 'factor', 'quantitative', 'qualitative'. Leave the cell blank otherwise.
  • Controlled independent variables (experimental factors) whose effects on the dependent variables are studied must be defined as 'factor'.
  • Each entity identifier must be defined as 'identifier'.
  • Observed or measured variables can be defined as 'quantitative' or 'qualitative'.
  • An external identifier which serves as a link (to the parent subset) must have an empty cell.

E column : data types

  • the allowed names are restricted to 'numeric' or 'string'. All 'quantitative' variables must be of 'numeric' type, and it is preferable that 'qualitative' variables be of 'string' type.

F column : description of the attribute

  • the allowed characters are: 0-9 a-z A-Z , : + * () [] {} - % ! | / . ?

  • If a unit must be specified for a variable, it can be added in brackets at the end of the description text

G & H columns : annotations based on ontology

  • use an ontology term (G) along with its corresponding URL (H)

  • Sites such as BioPortal or AgroPortal are great sources for finding controlled vocabulary (CV) terms based on ontologies

  • These annotations are optional, but at least one must be specified so that the data table has exactly 8 columns. A minimal good practice is therefore to put, for all attributes of the 'identifier' category, the corresponding annotation from the EDAM ontology, i.e. 'identifier' (G), http://edamontology.org/data_0842 (H) or 'Sample ID' (G), http://edamontology.org/data_3273 (H).
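As for the subsets file, a short, purely illustrative sketch can help to visualize the expected content. The Python code below writes a few hypothetical a_attributes.tsv rows for a 'samples' subset; the attribute names, entries and descriptions are invented for the example, and the sketch assumes no header row (check the full FRIM1 example through the API for the exact layout).

    # Purely illustrative sketch: write a few a_attributes.tsv rows.
    # The 8 columns are: subset (A), attribute (B), entry (C), category (D),
    # type (E), description (F), CV term (G), CV term URL (H).
    # All names and descriptions below are hypothetical.
    import csv

    rows = [
        ["samples", "SampleID", "sample", "identifier", "string",
         "Sample identifier", "Sample ID", "http://edamontology.org/data_3273"],
        # identifier used only as a link to the parent subset:
        # its category cell (D) is left empty
        ["samples", "PlantID", "plant", "", "string",
         "Identifier of the plant the sample was obtained from",
         "identifier", "http://edamontology.org/data_0842"],
        ["samples", "Treatment", "treatment", "factor", "string",
         "Watering treatment (control or stress)", "NULL", "/voc/null"],
        ["samples", "FW", "fw", "quantitative", "numeric",
         "Fresh weight of the sample (g)", "NULL", "/voc/null"],
    ]

    with open("a_attributes.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)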




4 - Additional information

Although descriptive metadata have to be associated with a suitable data repository in order to support data publishing (see Publish your data), it is nevertheless possible to add descriptive information about the dataset.

  • This information must be provided in Markdown format, a text format with a simple formatting syntax. The file must be named 'infos.md' and placed in the same directory as the dataset.

  • It is also possible to add images. To do this, you must create a directory named 'images'. To reference these images in the infos.md file, use the @@IMAGE@@ macro as a path. This macro will be automatically replaced by the correct URL when the page is loaded. An example is given below:

    ![frim1](@@IMAGE@@/tomato_icon.png) <font size="+3"> [Tomato][1] </font>
    
  • In the same way, it is also possible to add links to PDF files. To do this, you must create a directory named 'pdf'. To add links to PDF files in the infos.md file, use the @@PDF@@ macro as a path. This macro will be automatically replaced by the correct URL when the page is loaded. An example is given below:

    * [Protocol](@@PDF@@/protocol.pdf)
    



5 - Final checking

So now our ODAM dataset directory should look like this:
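As an illustrative sketch only (the directory name and the image/PDF file names are those used in the examples above and should be adapted to your own project):

    frim1/
    ├── plants.txt
    ├── harvests.txt
    ├── samples.txt
    ├── compounds.txt
    ├── enzymes.txt
    ├── s_subsets.tsv
    ├── a_attributes.tsv
    ├── infos.md
    ├── images/
    │   └── tomato_icon.png
    └── pdf/
        └── protocol.pdf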



To complete this phase of data preparation, here is a list of points to check, summarized below:

Note 1: A directory named after the dataset should actually be created in the data repository; be careful with the spelling, see note 6.

Note 2: The s_subsets.tsv and a_attributes.tsv files should be present in the data repository.

Note 3: All data subset files declared in the s_subsets.tsv file (col. E) should be available in the data repository.

Note 4: To be sure to have the right format, copy the data from the spreadsheet, paste it into a new file, then save it in TSV format (separator: a tab character).

Note 5: 1) All subsets in the a_attributes.tsv file (col. A) should be declared in the s_subsets.tsv file (col. C); 2) all subsets in the s_subsets.tsv file (col. C) should be declared in the a_attributes.tsv file (col. A); 3) all attribute names in the a_attributes.tsv file (col. B) should be available as a column in the corresponding data subset file declared in the s_subsets.tsv file (col. E).

Note 6: Be careful with the spelling: 1) for data subset file names (col. E in s_subsets.tsv), identifier names (col. D in s_subsets.tsv), attribute names (col. B in a_attributes.tsv), subset short names (col. C in s_subsets.tsv and col. A in a_attributes.tsv) and entry names (col. C in a_attributes.tsv), only the alphanumerical characters and the underscore are allowed (i.e. 'a-z', 'A-Z', '0-9' and '_'); moreover, these names should not start with a digit. 2) For category names (col. D in a_attributes.tsv), the set of terms and their spelling are fixed, namely 'identifier', 'factor', 'quantitative', 'qualitative'. 3) For types (col. E in a_attributes.tsv), the allowed names are restricted to 'numeric' or 'string'. 4) For descriptions, the allowed characters are: 0-9 a-z A-Z , : + * () [] {} - % ! | / . ?

Note 7: Identifiers declared in the s_subsets.tsv file (col. D): 1) should be declared as 'identifier' in the 'category' column of the a_attributes.tsv file (col. D); 2) should be available as a column in the corresponding data subset file; 3) should be the only attribute declared as 'identifier' for the corresponding data subset in the a_attributes.tsv file (col. D).

Note 8: Each subset having a 'father_rank' greater than 0 in the s_subsets.tsv file (col. B): 1) should include in its data file a column corresponding to the identifier of the subset to which it is linked (i.e. the subset whose rank in col. A equals this father rank); 2) should have that linked-subset identifier with no category (i.e. an empty cell) in the a_attributes.tsv file (col. D), except if the subset and the linked subset have the same identifier.

Fortunately, all of these checks can be done for you.

This assumes that you have installed ODAM software (see Installation) and that you know how to use the API (see Web API).
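For a quick local pre-check before running the ODAM validation, a minimal sketch such as the one below can cover notes 3 and 5. It assumes the column layout described in this protocol, no header row in the two metadata files, and that it is run from the dataset directory; it is an illustration, not the ODAM checker itself.

    # Minimal, illustrative cross-check of s_subsets.tsv and a_attributes.tsv
    # (covers notes 3 and 5 above). Assumes the column layout described in this
    # protocol, no header row in the metadata files, and execution from the
    # dataset directory.
    import csv
    import os

    def read_tsv(path):
        with open(path, newline="") as f:
            return [row for row in csv.reader(f, delimiter="\t") if row]

    subsets = read_tsv("s_subsets.tsv")        # cols A-H: rank, father rank, name, identifier, file, ...
    attributes = read_tsv("a_attributes.tsv")  # cols A-H: subset, attribute, entry, category, ...

    subset_names = {row[2] for row in subsets}
    attr_subset_names = {row[0] for row in attributes}

    # note 5 (1 and 2): subset names must match between the two metadata files
    print("only in a_attributes.tsv:", attr_subset_names - subset_names)
    print("only in s_subsets.tsv:", subset_names - attr_subset_names)

    for rank, father, name, ident, filename, *rest in subsets:
        # note 3: every declared data subset file must exist
        if not os.path.isfile(filename):
            print("missing data file:", filename)
            continue
        header = read_tsv(filename)[0]
        # note 5 (3): every declared attribute must be a column of its data file
        for attr in (row[1] for row in attributes if row[0] == name):
            if attr not in header:
                print(f"attribute {attr!r} not found in {filename}")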


  1. A data entity is an object in a data model. Data is typically designed by breaking things down into their smallest parts that are useful for representing data relationships. For example, a plant can generate a list of samples. Each sample can be associated with several types of analytical variables. All three objects: plant, sample and type of analytical variables are considered data entities. 

  2. Data attributes are characteristics of a data object. From data science view, they are the features of a data entity. They exist most often as a column in a data table. 

  3. A factor of an experiment is a controlled independent variable; a variable whose levels are set by the experimenter. Treatments (control vs. stress), genotype (WT vs. mutant), the course of time (development stages) or even tissues, are typical factors of experiments. 

  4. Identifiers precisely reference within the experiment each of the elements belonging to the same observation entity forming a coherent observation unit. For example, each plant and each sample has its own identifier. 

  5. Quantitative data are values that describe a measurable quantity, in the form of numbers that can be calculated. 

  6. Qualitative data describe qualities or characteristics. They answer questions such as "what type" or "what category". These values are no longer numbers, but a set of modalities. These values cannot be calculated. 

  7. Broman, K.W. and Woo, K.H. (2018) Data Organization in Spreadsheets, The American Statistician, doi:10.1080/00031305.2017.1375989