DataLab is a compact statistics package aimed at exploratory data analysis. Please visit the DataLab Web site for more information.


Splitting a Data Set

Command: Tools -> Split Data...

During data analysis it is often necessary to create two or more disjoint subsets from a common set of data, for example to use them as training and test sets. DataLab therefore provides three ways of creating such subsets: (1) splitting the variables (columns), (2) splitting the objects (rows), and (3) creating a test set and a training set. The size of the subsets is controlled by the scroll bar at the left center of the set-up box. The mode of selection can be random, blocked, or interleaved.
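The three selection modes can be illustrated with a short Python sketch. This is not DataLab's implementation: the function name split_indices, the seed parameter, and the meaning given to each mode (blocked = contiguous runs, interleaved = every k-th item, random = a shuffled deal) are assumptions made for illustration. In particular, the random mode here is a plain shuffle and does not model how DataLab fills the last subset (see the hint at the end of this page).

```python
import random


def split_indices(n_items, n_subsets, mode="blocked", seed=0):
    """Partition the indices 0..n_items-1 into n_subsets disjoint lists.

    The indices may stand for rows (objects) or columns (variables).
    """
    indices = list(range(n_items))
    if mode == "interleaved":
        # Subset i receives items i, i + n_subsets, i + 2*n_subsets, ...
        return [indices[i::n_subsets] for i in range(n_subsets)]
    if mode == "blocked":
        # Contiguous runs; the first subsets absorb any remainder.
        size, extra = divmod(n_items, n_subsets)
        subsets, start = [], 0
        for i in range(n_subsets):
            stop = start + size + (1 if i < extra else 0)
            subsets.append(indices[start:stop])
            start = stop
        return subsets
    if mode == "random":
        # Shuffle once with a fixed seed, then deal the indices out.
        rng = random.Random(seed)
        rng.shuffle(indices)
        return [indices[i::n_subsets] for i in range(n_subsets)]
    raise ValueError(f"unknown mode: {mode!r}")


print(split_indices(10, 3, mode="blocked"))
print(split_indices(10, 3, mode="interleaved"))
```

In every mode the subsets are disjoint and together cover all items, which is the property the text above requires.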

After you choose Tools/Split Data, a set-up box is displayed in which you set the number of files to be created and the mode of sampling (random, blocked, or interleaved selection; column-wise vs. row-wise). The subsets are created from the current data matrix and are stored in the current working directory in the ASC format. If you select the option "Training/Test Set", you can activate the option "Create all mutually exclusive sets", which creates not just one test set/training set pair but all possible combinations that are mutually exclusive.
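One plausible reading of "Create all mutually exclusive sets" is that the test sets of the generated pairs do not overlap and together cover the whole data matrix, as in k-fold cross-validation. The sketch below follows that reading. It is an assumption, not a description of DataLab's algorithm; the function name exclusive_pairs and the interleaved assignment of rows to test sets are hypothetical.

```python
def exclusive_pairs(n_rows, n_pairs):
    """Return n_pairs (training, test) index pairs with disjoint test sets."""
    pairs = []
    for k in range(n_pairs):
        # Assumed: interleaved selection, test set k takes rows k, k + n_pairs, ...
        test = list(range(k, n_rows, n_pairs))
        test_rows = set(test)
        # The training set is every row that is not in this test set.
        train = [i for i in range(n_rows) if i not in test_rows]
        pairs.append((train, test))
    return pairs


for train, test in exclusive_pairs(6, 3):
    print("training:", train, "test:", test)
```

Within each pair the training and test sets are disjoint, and no row appears in more than one test set.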

The names of the subsets are created automatically from the file template by appending a three-digit decimal number.
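The naming scheme can be sketched as follows. The zero padding, the starting number 1, and the ".asc" extension are assumptions for illustration, as is the function name subset_filenames; the text above only states that a three-digit number is appended to the file template.

```python
def subset_filenames(template, n_files, start=1, extension=".asc"):
    """Build subset file names by appending a three-digit number to a template."""
    return [
        f"{template}{number:03d}{extension}"
        for number in range(start, start + n_files)
    ]


print(subset_filenames("iris", 3))
```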

Subset creation is started by clicking the command Do It.

Hint: The random selection option does not imply a randomisation of the data, because the last subset contains all data which have not been selected into any of the other subsets. As a result, the last subset is sorted if the original data matrix is sorted.
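The effect described in the hint can be reproduced with a small Python sketch. It assumes that all subsets except the last are drawn at random without replacement and that the last subset simply keeps the leftover rows in their original order. The function name random_split_with_remainder and its parameters are hypothetical, and the order of rows inside the randomly drawn subsets is not specified by the text above.

```python
import random


def random_split_with_remainder(rows, subset_size, n_subsets, seed=0):
    """Draw n_subsets - 1 random subsets; the last one keeps the leftovers."""
    rng = random.Random(seed)
    remaining = list(range(len(rows)))
    subsets = []
    for _ in range(n_subsets - 1):
        chosen = rng.sample(remaining, subset_size)
        subsets.append([rows[i] for i in chosen])
        chosen_rows = set(chosen)
        # Keep the rows that were not drawn, in their original order.
        remaining = [i for i in remaining if i not in chosen_rows]
    # The last subset is whatever was never selected, still in original order.
    subsets.append([rows[i] for i in remaining])
    return subsets


sorted_rows = list(range(100, 120))  # a sorted data matrix, one value per row
subsets = random_split_with_remainder(sorted_rows, subset_size=5, n_subsets=3)
last = subsets[-1]
print(last == sorted(last))
```

Because the leftover rows are never reordered, the last subset inherits the sort order of the original data.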