Random Forest Classifier

DataLab is a compact statistics package for exploratory data analysis.

Command: Math -> Random Forest -> Create Model

Random forests can be used for both classification and regression. The DataLab random forest tool automatically recognizes dichotomous target variables that contain only the values 0 and 1 (binary variables) and switches to classification mode for such targets. The two modes differ in which performance metrics are calculated and in how they are calculated. In classification mode DataLab displays an additional tab, "Classification Results".
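
The mode selection described above can be illustrated with a minimal Python sketch. It is not DataLab's implementation: the function names are hypothetical, and the requirement that both values 0 and 1 actually occur in the target is an assumption.

```python
def is_binary_target(values):
    """Return True if the target contains exactly the values 0 and 1."""
    # Assumption: a constant column (only 0s or only 1s) does not count
    # as dichotomous, so both values must be present.
    return set(values) == {0, 1}


def choose_mode(values):
    """Pick the analysis mode from the target variable (hypothetical helper)."""
    return "classification" if is_binary_target(values) else "regression"


print(choose_mode([0, 1, 1, 0]))     # binary target
print(choose_mode([0.2, 1.7, 3.0]))  # continuous target
```

Because Python treats 0.0 and 0 as equal, a target stored as floating point numbers (0.0 and 1.0) is recognized as binary as well.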

The random forest classification tool provides several tabs showing different kinds of information:

Classification Results: This tab is only visible if the target variable is binary. It shows the classifier performance metrics, the corresponding confusion matrix and the ROC curve.
Actual vs. Estimated: This tab shows the estimated target values plotted against the actual values.
Residuals: This tab shows the corresponding residuals.
Variable Importance: The variable importance plot indicates which descriptors contribute most to the model. You can move a horizontal threshold in the bar graph to select the most important ones. Clicking the "copy variables" button replaces the current set of descriptors with the most important ones.
R Scan: The R Scan tab lets you find the best resampling parameter R. Click "Start R Scan" to scan the entire allowed range of R values. Please note that the R Scan results depend on the number of trees, so you have to repeat the scan whenever you change the number of trees.
Tree Scan: The Tree Scan tab lets you find the optimum number of trees. Click "Start Tree Scan" to scan the number of trees. Please note that the Tree Scan results depend on the R parameter, so you have to repeat the scan whenever you change the R parameter.
Cross Validation: This tab calculates the cross-validated results for repeated experiments.
Details: This tab lists the results as plain text, which can easily be copied for reporting purposes.
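
To make the content of the "Classification Results" tab more concrete, the following Python sketch computes a confusion matrix, a few common performance metrics and the area under the ROC curve for a small invented data set. The threshold of 0.5, the choice of metrics and all function names are assumptions for illustration; this page does not state how DataLab computes these values.

```python
def confusion_matrix(actual, estimated, threshold=0.5):
    """Count TP, FP, TN, FN for 0/1 targets and continuous estimates."""
    tp = fp = tn = fn = 0
    for y, score in zip(actual, estimated):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(actual, estimated):
    """(false positive rate, true positive rate) for each distinct threshold.

    Assumes that both classes occur in `actual`.
    """
    points = [(0.0, 0.0)]
    for t in sorted(set(estimated), reverse=True):
        tp, fp, tn, fn = confusion_matrix(actual, estimated, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented example data: actual classes and estimated values
actual = [1, 0, 1, 1, 0, 0]
estimated = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

tp, fp, tn, fn = confusion_matrix(actual, estimated)
print("TP, FP, TN, FN:", tp, fp, tn, fn)        # 2 1 2 1
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("accuracy:", (tp + tn) / len(actual))
print("AUC:", auc(roc_points(actual, estimated)))
```

Each point of the ROC curve corresponds to one threshold applied to the estimated values, which is why the confusion matrix function is reused for every threshold.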

How To: Training a random forest classifier is straightforward. Please follow these steps:

Select the descriptors and the target variable. The target variable has to be binary (containing only 0s and 1s).
Set the parameters R and Number of Trees. In most cases, the default values will work well.
Click "Calculate" to start the calculation of the random forest model.
If you are satisfied with the model, save it to disk so that you can apply it to unknown data.
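
For readers who want to see the role of the two parameters outside the DataLab user interface, the steps above can be mirrored by a toy Python sketch. It rests on several assumptions that are not taken from this page: R is interpreted as the fraction of training rows drawn (with replacement) for each tree, each "tree" is reduced to a single split on one randomly chosen descriptor, and the model is stored as JSON. All names and data are hypothetical, and the code is not DataLab's algorithm.

```python
import json
import os
import random
import tempfile


def stump_vote(stump, row):
    """Class (0 or 1) predicted by a single one-split tree."""
    above = row[stump["feature"]] > stump["threshold"]
    return stump["polarity"] if above else 1 - stump["polarity"]


def fit_stump(rows, targets, feature):
    """Pick the threshold and polarity with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for polarity in (0, 1):
            stump = {"feature": feature, "threshold": threshold,
                     "polarity": polarity}
            errors = sum(stump_vote(stump, row) != y
                         for row, y in zip(rows, targets))
            if best is None or errors < best[0]:
                best = (errors, stump)
    return best[1]


def train_forest(rows, targets, r=0.5, n_trees=25, seed=0):
    """Train n_trees stumps, each on a resampled fraction r of the rows."""
    rng = random.Random(seed)
    sample_size = max(1, round(r * len(rows)))  # assumed meaning of R
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(sample_size)]
        feature = rng.randrange(len(rows[0]))   # one random descriptor per tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Fraction of trees voting for class 1."""
    return sum(stump_vote(stump, row) for stump in forest) / len(forest)


# Step 1: descriptors and a binary target (invented, well separated data)
rows = [[1.0 + 0.1 * i, 2.0 - 0.1 * i] for i in range(10)] + \
       [[3.0 + 0.1 * i, 4.0 - 0.1 * i] for i in range(10)]
targets = [0] * 10 + [1] * 10

# Steps 2 and 3: set R and the number of trees, then calculate the model
forest = train_forest(rows, targets, r=0.5, n_trees=25)

# Step 4: save the model to disk and reload it to classify unknown data
path = os.path.join(tempfile.gettempdir(), "toy_forest_model.json")
with open(path, "w") as handle:
    json.dump(forest, handle)
with open(path) as handle:
    restored = json.load(handle)

print(predict(restored, [0.5, 0.5]))  # low value: most trees vote for class 0
print(predict(restored, [3.5, 3.5]))  # high value: most trees vote for class 1
```

In this sketch a smaller R makes the individual trees differ more from each other, while a larger number of trees averages out their differences. This interplay is one way to see why the results of a scan over one parameter depend on the value of the other.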