The following collection of data sets has been prepared for DataLab; the files (in IDT format) can be loaded directly into DataLab. However, some of the data sets cannot be used with the evaluation copy of DataLab, because the evaluation copy is restricted to a maximum of 500 values (the number of columns times the number of rows). The column "Statistical Methods" gives a few hints as to which questions could be explored with the corresponding data set using a particular statistical method.
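The size restriction can be checked before loading a file. The following Python sketch is only an illustration: the function name `fits_evaluation_copy` is hypothetical, the limit is assumed to be inclusive, and the column counts in the examples are illustrative assumptions, not the actual layout of the data sets.

```python
EVAL_LIMIT = 500  # maximum number of values in the evaluation copy (from the text above)


def fits_evaluation_copy(n_rows: int, n_cols: int, limit: int = EVAL_LIMIT) -> bool:
    """Return True if a data matrix of n_rows x n_cols stays within the limit.

    Assumption: the limit is inclusive, i.e. exactly 500 values are still allowed.
    """
    if n_rows < 0 or n_cols < 0:
        raise ValueError("row and column counts must be non-negative")
    return n_rows * n_cols <= limit


# Illustrative checks; the column counts are assumptions.
small_ok = fits_evaluation_copy(15, 2)    # 15 rows x 2 columns = 30 values
large_ok = fits_evaluation_copy(185, 3)   # 185 rows x 3 columns = 555 values
```

Under these assumptions a data set with 185 rows already exceeds the limit with only three columns, whereas a set with 15 rows fits easily.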
||Some characteristic variables of bananas. The bananas were obtained from various supermarkets; they were weighed (both the whole bananas and their skins), and their geometric dimensions were determined (two measures of length, and the diameter at the broadest location).
||Linear Regression: Try to find a model in order to estimate the weight of a banana from its length.
||Geometric distances of 100 genuine and 100 forged banknotes. The data have been taken, courtesy of H. Riedwyl, from the book B. Flury, H. Riedwyl, Angewandte multivariate Statistik, G. Fischer Verlag, Stuttgart (1983).
||Discriminant Analysis: Try to develop a classifier which is able to discriminate between genuine and forged banknotes.
||This data set contains the boiling points and some physico-chemical properties of 185 chemical substances.
||Stepwise Regression: Try to find a model which predicts the boiling points using MLR.
ANOVA: Does the boiling point depend on the number of branches in the molecule?
PLS: Create an optimal PLS model and compare it to the MLR model obtained by stepwise regression.
|Countries of the World
||Some demographic and economic data of the countries of the world around 1989. The data have been obtained from the CIA Factbook (1989).
||Multiple Regression: Which factors does life expectancy depend on? Which of them have a positive influence, and which a negative one?
Cluster Analysis: Which countries are most similar to Austria?
||Frequencies of 2-character combinations obtained from two nearly identical statistical textbooks, one written in German, the other written in English: see Grundlagen der Statistik and Fundamentals of Statistics. The set of variables has been reduced to the 180 most abundant character combinations.
||Principal Component Analysis: Check whether PCA indicates any differences between the two books.
PLS Discriminant Analysis: Develop a binary classifier which is able to discriminate between English and German texts; find out which of the two-character combinations are most important for distinguishing the two languages.
||Number of traded lynx pelts in Canada between 1821 and 1910. The data have been obtained from Elton, C. and M. Nicholson: "The ten-year cycle in numbers of the lynx in Canada", Journal of Animal Ecology 11 (1942):215-244
||Auto-correlation and Fourier Transform: What is the approximate population cycle length?
||The data set contains the results of chemical analyses of 32 mineral waters and the geographical coordinates of their sources. The data of the analyses have been taken from the labels of the water bottles.
||Multiple Linear Regression: Which constituents of the mineral waters play a role in forming the solid residues?
Cluster Analysis: Which mineral waters are most similar?
||Artificial dataset for simple regression exhibiting three different structures of the residuals.
||Linear Regression: What is the effect of non-symmetric residuals on the results of a linear regression? See the DataLab blog for more details (in German).
||Resistance thermometers use the electric resistance of a thin platinum wire to determine the temperature. The data set consists of 15 calibration measurements, two of them being slightly incorrect. Because the deviations of the incorrect values are small, the errors can be seen only in the residual plot.
||Parabolic Regression: Compare the calibration curves obtained by parabolic regression with and without the erroneous measurements.
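
The comparison suggested for the resistance thermometer data can be tried outside DataLab as well. The following Python sketch is a minimal illustration using only the standard library. It does not use the actual data set: the 15 calibration points are synthetic, the coefficients in `TRUE_COEF`, the temperature scale (units of 100 °C) and the size and position of the two errors in `BAD` are made-up assumptions, and the helper names are hypothetical. The parabola is fitted by solving the normal equations of the least-squares problem.

```python
def solve_linear(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_parabola(xs, ys):
    """Least-squares coefficients [a, b, c] of y = a + b*x + c*x**2."""
    s = [sum(x ** k for x in xs) for k in range(5)]              # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    gram = [[s[i + j] for j in range(3)] for i in range(3)]      # normal equations
    return solve_linear(gram, t)


def residuals(coef, xs, ys):
    """Observed minus fitted values."""
    a, b, c = coef
    return [y - (a + b * x + c * x * x) for x, y in zip(xs, ys)]


# Synthetic calibration data (assumption, NOT the DataLab data set).
TRUE_COEF = (100.0, 39.0, -0.6)          # made-up coefficients a, b, c
xs = [0.1 * i for i in range(15)]        # temperature in units of 100 degrees C
ys = [TRUE_COEF[0] + TRUE_COEF[1] * x + TRUE_COEF[2] * x * x for x in xs]
BAD = {4: 0.05, 10: -0.05}               # two slightly incorrect measurements
for i, err in BAD.items():
    ys[i] += err

# Fit once with all points and once without the erroneous ones.
fit_all = fit_parabola(xs, ys)
keep = [i for i in range(len(xs)) if i not in BAD]
fit_clean = fit_parabola([xs[i] for i in keep], [ys[i] for i in keep])

# Residuals of all 15 points relative to each calibration curve.
res_all = residuals(fit_all, xs, ys)
res_clean = residuals(fit_clean, xs, ys)

print("all points:    ", fit_all)
print("without errors:", fit_clean)
```

In this synthetic setting the fit without the two erroneous points recovers the assumed coefficients, and the residuals in `res_clean` are essentially zero except at the two perturbed positions. With real measurements the remaining points carry noise as well, so the picture is less clear-cut and a residual plot remains the more practical tool.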