Splitting a Data Set

Command: Tools -> Split Data...

During data analysis it is often necessary to create two or more disjoint subsets from a common set of data, which can then be used as training and test sets. DataLab therefore provides three ways of creating such subsets: (1) splitting the variables (columns), (2) splitting the objects (rows), and (3) creating a test set and a training set. The size of the data sets is controlled by the scroll bar at the left center. The mode of selection can be random, blocked, or interleaved.

Choosing Tools -> Split Data... displays a set-up box which allows the user to set the number of files to be created and the mode of sampling (random, blocked, or interleaved; columnwise vs. rowwise). The subsets are created from the current data matrix and are stored in the current working directory in the ASC format.
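The difference between the two deterministic sampling modes can be illustrated with a small sketch. It is not DataLab's implementation: the function name split_indices is hypothetical, and it assumes that "blocked" means contiguous runs of rows and "interleaved" means round-robin assignment, which this page does not spell out.

```python
def split_indices(n_rows, n_subsets, mode="blocked"):
    """Return n_subsets disjoint lists of row indices covering range(n_rows).

    Assumed semantics (not confirmed by the DataLab documentation):
      blocked     - consecutive rows stay together in one subset
      interleaved - row i goes to subset i mod n_subsets
    """
    indices = list(range(n_rows))
    if mode == "interleaved":
        # Every n_subsets-th row, starting at offset k, goes to subset k.
        return [indices[k::n_subsets] for k in range(n_subsets)]
    if mode == "blocked":
        # Split into contiguous chunks; the first chunks absorb any remainder.
        size, extra = divmod(n_rows, n_subsets)
        subsets, start = [], 0
        for k in range(n_subsets):
            stop = start + size + (1 if k < extra else 0)
            subsets.append(indices[start:stop])
            start = stop
        return subsets
    raise ValueError(f"unknown mode: {mode!r}")


print(split_indices(7, 3, "blocked"))      # [[0, 1, 2], [3, 4], [5, 6]]
print(split_indices(7, 3, "interleaved"))  # [[0, 3, 6], [1, 4], [2, 5]]
```

The same index lists could be applied to columns instead of rows to mimic a columnwise split.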

The names of the subsets are generated automatically from the file template by appending a two-digit decimal number.
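As a rough illustration of this naming scheme, the sketch below builds file names from a template. The helper subset_names is hypothetical; the starting number (1) and the ".asc" extension are assumptions, since this page only states that a two-digit number is appended and that the ASC format is used.

```python
def subset_names(template, n_subsets, start=1, extension=".asc"):
    """Build subset file names by appending a zero-padded two-digit counter.

    'start' and 'extension' are assumptions made for this example only.
    """
    return [f"{template}{k:02d}{extension}"
            for k in range(start, start + n_subsets)]


print(subset_names("wine", 3))  # ['wine01.asc', 'wine02.asc', 'wine03.asc']
```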

The process of subset creation is started by clicking the command Do It.

Hint: The random selection option does not fully randomise the data, because the last subset receives all data that have not been selected into any of the other subsets. Consequently, the last subset is sorted if the original data matrix is sorted.
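The effect described in this hint can be reproduced with a minimal sketch. It is only an illustration under assumptions: the function random_split_with_remainder and its parameters are hypothetical, and DataLab's actual random number generator and drawing procedure are not documented here.

```python
import random


def random_split_with_remainder(n_rows, sizes, seed=0):
    """Draw len(sizes) random subsets; whatever is left forms the last subset.

    The leftover rows are never shuffled, so they keep their original order.
    """
    rng = random.Random(seed)
    remaining = list(range(n_rows))
    subsets = []
    for size in sizes:
        chosen = rng.sample(remaining, size)   # random draw without replacement
        subsets.append(chosen)
        taken = set(chosen)
        remaining = [i for i in remaining if i not in taken]
    subsets.append(remaining)                  # last subset: all unselected rows
    return subsets


parts = random_split_with_remainder(10, [3, 3], seed=1)
print(parts[-1] == sorted(parts[-1]))  # True: the last subset stays in order
```

If the order within the last subset matters, one option would be to shuffle the rows of the data matrix before splitting.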

Last Update: 2012-Jul-28