Large datasets

Limitations

There are some limitations when managing data with ODAM. The two main ones stem from the use of SQLite3, namely:

  • The maximum number of attributes (i.e. columns/variables) in a data subset (i.e. an experimental data table) is 32767.
  • The maximum number of data subsets (i.e. experimental data tables) is 64.

Despite these limits, ODAM can handle the situations encountered in most experiments. Nevertheless, some advice can be given to work around them.

1 - Limiting the number of variables

It can happen that an experimental data table exceeds the limit of 32767 variables, e.g. with omics data (mass spectrometry). In this case, you can first adjust the detection threshold (signal-to-noise ratio) when processing the raw data, and then keep only the most significant variables, e.g. selected by ANOVA for each experimental factor, as sketched below.
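As an illustration, here is a minimal R sketch of such a selection. It assumes a data frame dat holding one data subset, an identifier column SampleID and an experimental factor passed separately; these names, as well as the p-value cutoff, are purely illustrative and not part of ODAM itself.

    # Rank the variables of 'dat' by one-way ANOVA p-value against one
    # experimental factor (most significant first). 'SampleID' is assumed
    # to be the identifier column and is left out of the ranking.
    rank_by_anova <- function(dat, design_factor, id_cols = "SampleID") {
      vars <- setdiff(colnames(dat), id_cols)
      pvals <- sapply(vars, function(v) {
        summary(aov(dat[[v]] ~ design_factor))[[1]][["Pr(>F)"]][1]
      })
      sort(pvals)   # increasing p-value, i.e. most significant variables first
    }

    # Example: keep only the variables below an arbitrary p-value cutoff
    # pvals    <- rank_by_anova(dat, samples$treatment)
    # selected <- names(pvals[pvals < 0.01])

The same ranking can be repeated for each experimental factor, keeping the union of the selected variables.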

2 - Limiting the number of data tables

A possible alternative is to check whether the dataset could be split into two or even three datasets gathered in a collection, i) depending on the type of data, or ii) by moving some data into a sort of "annex" dataset, as sketched below.
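The following R sketch illustrates the idea of splitting a wide table into a main table and an "annex" table sharing the same sample identifier, so that each one can be deposited as a subset of a separate dataset within a collection. The file names, the SampleID column and the "NMR_" prefix used to identify one type of data are hypothetical placeholders, not ODAM conventions.

    # Split one wide table into a main table and an "annex" table
    dat <- read.delim("alldata.tsv", check.names = FALSE)

    # Columns moved to the annex dataset, here picked by a (hypothetical)
    # name prefix corresponding to one type of data
    annex_vars <- grep("^NMR_", colnames(dat), value = TRUE)

    main  <- dat[, setdiff(colnames(dat), annex_vars)]
    annex <- dat[, c("SampleID", annex_vars)]

    write.table(main,  "main_subset.tsv",  sep = "\t", quote = FALSE, row.names = FALSE)
    write.table(annex, "annex_subset.tsv", sep = "\t", quote = FALSE, row.names = FALSE)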

Large datasets in the data explorer

The number of variables displayed in the data explorer has been limited to 1000 in order to keep the interface responsive, i.e. to maintain short response times.

The chosen approach consists of reordering the variables of large data subsets by decreasing significance. A selection of the most significant variables can be made, e.g. using ANOVA. The first 1000 variables displayed in the data explorer are then the ones most related to the experimental design. To do this, an R script has been written and added to the GitHub repository; you can easily adapt it to base the selection on other statistical approaches. This operation must be carried out once all the data have been formatted and deposited in the ODAM repository.
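As a rough idea of what such a reordering looks like, here is a minimal R sketch (not the script from the repository) that sorts the columns of a subset file by increasing ANOVA p-value, keeping the identifier and factor columns first. The file name and the SampleID and treatment column names are placeholders for illustration.

    # Reorder the columns of a data subset by decreasing significance
    dat <- read.delim("data_subset.tsv", check.names = FALSE)

    id_cols <- c("SampleID", "treatment")   # identifier / factor columns
    vars    <- setdiff(colnames(dat), id_cols)

    # One-way ANOVA p-value of each variable against the experimental factor
    pvals <- sapply(vars, function(v)
      summary(aov(dat[[v]] ~ dat$treatment))[[1]][["Pr(>F)"]][1])

    # Identifiers first, then variables sorted by increasing p-value
    dat <- dat[, c(id_cols, names(sort(pvals)))]

    write.table(dat, "data_subset.tsv", sep = "\t", quote = FALSE, row.names = FALSE)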

Note that this is only a rearrangement of variables: no variable is removed from the data subset, which can therefore still be downloaded in its entirety, either via the API or via the data explorer.