Lukas Mueller edited this page Feb 9, 2024 · 8 revisions

Datasets

Introduction to Datasets

A special data type, called "Datasets", is available to create fine-grained definitions of data for use in an analysis tool such as solGS, the Heritability tool, or the Stability tool.

The Wizard, and accordingly the dataset concept, views the data in the database as an N-dimensional cube whose sides represent the different dimensions of the data. Dimensions include concepts such as 'location', 'year', 'breeding program', 'trials', 'accessions', etc. Given a set of dimensions and a list of objects in each dimension, the dataset is the intersection of all the selections in the N-dimensional cube, which the database can retrieve. Based on a given intersection, the database can also generate the list of corresponding objects in any other dimension.
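The cube idea can be illustrated with a minimal Python sketch. It is not the database implementation: the observation rows, the dimension names used as keys, and the helper functions `intersect` and `project` are all hypothetical and exist only to show how a selection narrows the data and how another dimension can then be derived from the result.

```python
# Hypothetical observations: each row is one point in the cube.
OBSERVATIONS = [
    {"accession": "A1", "location": "Ithaca", "year": 2021, "trial": "T1"},
    {"accession": "A1", "location": "Geneva", "year": 2022, "trial": "T2"},
    {"accession": "A2", "location": "Ithaca", "year": 2022, "trial": "T3"},
    {"accession": "A3", "location": "Ithaca", "year": 2021, "trial": "T1"},
]


def intersect(rows, selection):
    """Keep rows that match the selected items in every specified dimension."""
    return [
        row for row in rows
        if all(row[dim] in items for dim, items in selection.items())
    ]


def project(rows, dimension):
    """List the distinct objects of one dimension found in the given rows."""
    return sorted({row[dimension] for row in rows})


# Two dimensions are specified; the others are derived from the intersection.
selection = {"location": {"Ithaca"}, "year": {2021}}
subset = intersect(OBSERVATIONS, selection)
accessions = project(subset, "accession")
trials = project(subset, "trial")
```

Here `accessions` and `trials` are unspecified dimensions computed from the intersection of the `location` and `year` selections.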

User interfaces for Datasets

Datasets are usually generated in the Wizard. In the Wizard, up to 4 dimensions can be selected, and individual items can be selected within each dimension to specify, for example, accessions that have been grown in certain locations and seasons. The Dataset specifies the intersection of all the dimensions and selected items, and can calculate any of the unspecified dimensions.

Dataset details can be viewed on the dataset detail page. When logged in, a link to the dataset overview page is available in the upper right corner of the webpage. Clicking on the Dataset of interest opens the dataset detail page. The detail page shows all the selected criteria, as well as the distribution of the phenotypic data tied to this dataset, with the trait to display chosen from a trait select box.

Dataset outlier selection

From the distribution of the phenotypic data, outliers can be selected using several different methods.

  • A slider excludes data points that lie more than a chosen multiple of the standard deviation from the mean or from the median.
  • The Rosner method of outlier selection can be applied.
  • [To Do] The outliers can be selected graphically by selecting areas on the graph.
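The slider-based rule in the first bullet can be sketched as follows. This is an illustration under assumptions rather than the tool's actual code: the function name `flag_outliers`, its parameters, the use of the sample standard deviation, and the example values are all hypothetical.

```python
import statistics


def flag_outliers(values, multiplier, center="mean"):
    """Return indices of values farther than multiplier * SD from the center."""
    if center == "mean":
        mid = statistics.mean(values)
    elif center == "median":
        mid = statistics.median(values)
    else:
        raise ValueError("center must be 'mean' or 'median'")
    # Assumption: sample standard deviation; the tool may use another estimator.
    cutoff = multiplier * statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mid) > cutoff]


# Hypothetical trait values with one suspicious measurement at the end.
plant_heights = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 30.0]
by_mean = flag_outliers(plant_heights, 2.0, center="mean")
by_median = flag_outliers(plant_heights, 2.0, center="median")
```

Raising the multiplier widens the accepted band, so fewer points are flagged; choosing the median as the center makes the band less sensitive to the extreme values themselves.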

Once outliers are selected, they can be stored in the dataset. In most analyses that accept datasets, the user can choose to include or disregard the outlier selections.

Database implementation

User-defined datasets are stored in the sgn_people.sp_dataset table. The dataset info is stored as jsonb.
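As a rough illustration of what a JSON-serialized dataset definition could look like, the sketch below round-trips a selection through JSON. The key names (`categories`, `locations`, `years`, `outliers`) are hypothetical and are not taken from the actual sp_dataset schema.

```python
import json

# Hypothetical dataset definition; the key names are illustrative only
# and do not reflect the real structure stored in sp_dataset.
dataset_definition = {
    "categories": ["locations", "years"],
    "locations": ["Ithaca"],
    "years": ["2021", "2022"],
    "outliers": [],
}

# Serialize to a JSON string, as would be handed to a jsonb column.
serialized = json.dumps(dataset_definition, sort_keys=True)

# Reading the value back restores the same selection.
restored = json.loads(serialized)
```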

Perl Classes

The main Perl class for working with Datasets is CXGN::Dataset.

CXGN::Dataset contains accessors to define the dimensions and the items in each dimension. The accessors are simply named for the dimension they represent, such as years, locations, etc.

The dataset can retrieve any other dimension that corresponds to the selected criteria using the retrieve_ functions. For example, retrieve_years will retrieve all the years that are in the dataset. This works if years have previously been defined as a dimension with specific items (years) in it, but it also works if years have not been selected as a dimension. In that case, the dimension is calculated to match all the selected dimensions. For example, if a list of locations and a list of accessions have been selected, the retrieve_years call will retrieve only the years in which the given accessions were grown in fields at the given locations.
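The accessor and retrieve_ behaviour described above can be modelled with a small Python stand-in. This is not CXGN::Dataset and does not reproduce its API: the class `DatasetSketch`, its in-memory rows, and the convention that `None` means "dimension not specified" are assumptions made for illustration.

```python
class DatasetSketch:
    """Hypothetical stand-in for the behaviour described; not CXGN::Dataset."""

    DIMENSIONS = ("years", "locations", "accessions")

    def __init__(self, rows):
        self._rows = rows       # each row maps a dimension name to one value
        self.years = None       # None means the dimension is not specified
        self.locations = None
        self.accessions = None

    def _matching_rows(self):
        """Apply every specified dimension as a filter on the rows."""
        rows = self._rows
        for dim in self.DIMENSIONS:
            selected = getattr(self, dim)
            if selected is not None:
                rows = [r for r in rows if r[dim] in selected]
        return rows

    def retrieve_years(self):
        """Years present in the intersection of all specified dimensions."""
        return sorted({r["years"] for r in self._matching_rows()})


# Hypothetical field records.
rows = [
    {"years": 2020, "locations": "Ithaca", "accessions": "A1"},
    {"years": 2021, "locations": "Ithaca", "accessions": "A2"},
    {"years": 2022, "locations": "Geneva", "accessions": "A1"},
    {"years": 2023, "locations": "Ithaca", "accessions": "A3"},
]

ds = DatasetSketch(rows)
ds.locations = ["Ithaca"]
ds.accessions = ["A1", "A2"]
derived_years = ds.retrieve_years()   # years not specified: derived from the rest

ds.years = [2021, 2023]
explicit_years = ds.retrieve_years()  # years specified: acts as one more filter
```

In the first call, years are computed from the locations and accessions alone; in the second, the explicit year list narrows the result further.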