Random Forests are widely used for data prediction and interpretation purposes. They
show many appealing characteristics, such as the ability to deal with high-dimensional data,
complex interactions and correlations. Furthermore, missing values can easily be processed
by the built-in procedure of surrogate splits. However, little is known about the
properties of recursive partitioning in missing data situations. Therefore, extensive
simulation studies and empirical evaluations have been conducted to gain deeper insight.
In addition, new methods have been developed to enhance methodology and solve current
issues of data interpretation, prediction and variable selection.
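The surrogate split mechanism mentioned above can be illustrated with a minimal sketch: when the primary split variable is missing for an observation, the tree routes it using a surrogate variable whose split best mimics the primary one on the complete cases. The data, threshold, and function names below are hypothetical illustrations, not the implementation studied in this work.

```python
import random

random.seed(0)

# Toy data: x1 and x2 are correlated; roughly 20% of x1 is missing (None).
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.3) for v in x1]              # correlated with x1
x1 = [v if random.random() > 0.2 else None for v in x1]  # inject missingness

PRIMARY_T = 0.0  # hypothetical primary split: x1 <= 0 goes left

def best_surrogate_threshold(x_sur, x_pri, t_pri):
    """Pick the threshold on the surrogate variable that best
    reproduces the primary split on the complete cases."""
    cases = [(s, p) for s, p in zip(x_sur, x_pri) if p is not None]
    best_t, best_agree = None, -1
    for t, _ in cases:  # candidate thresholds: observed surrogate values
        agree = sum((s <= t) == (p <= t_pri) for s, p in cases)
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t, best_agree / len(cases)

t_sur, agreement = best_surrogate_threshold(x2, x1, PRIMARY_T)

def send_left(i):
    """Route observation i: primary split if x1 is observed, else surrogate."""
    if x1[i] is not None:
        return x1[i] <= PRIMARY_T
    return x2[i] <= t_sur

routed = [send_left(i) for i in range(n)]  # every case is routed, none dropped
```

The key point of the sketch is the last line: unlike complete case analysis, no observation is discarded, which is why surrogate splits are counted among the appealing properties of recursive partitioning.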
A variable’s relevance in a Random Forest can be assessed by means of importance
measures. Unfortunately, existing methods cannot be applied when the data contain
missing values. Thus, one of the most appreciated properties of Random Forests – its
ability to handle missing values – is lost when such measures are computed. This work
presents a new approach that is designed to deal with missing values in an intuitive and
straightforward way, yet retains widely appreciated qualities of existing methods. Results
indicate that it meets sensible requirements and shows good variable ranking properties.
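For orientation, the classical importance measure that the new approach extends is the permutation importance: the drop in prediction accuracy after a variable's values are permuted, which destroys its association with the response. The stand-in predictor and toy data below are hypothetical; the actual measure proposed in this work is not reproduced here.

```python
import random

random.seed(1)

# Toy setup: one relevant and one irrelevant predictor.
n = 500
x_rel = [random.gauss(0, 1) for _ in range(n)]
x_irr = [random.gauss(0, 1) for _ in range(n)]
y = [int(v > 0) for v in x_rel]  # the label depends on x_rel only

def predict(a, b):
    return int(a > 0)  # a stand-in for a fitted tree's prediction

def accuracy(xa, xb):
    return sum(predict(a, b) == t for a, b, t in zip(xa, xb, y)) / n

def permutation_importance(xa, xb, which):
    """Drop in accuracy after permuting one predictor column."""
    base = accuracy(xa, xb)
    cols = [list(xa), list(xb)]
    random.shuffle(cols[which])
    return base - accuracy(cols[0], cols[1])

imp_rel = permutation_importance(x_rel, x_irr, 0)  # large positive drop
imp_irr = permutation_importance(x_rel, x_irr, 1)  # no drop at all
```

The irrelevant variable scores zero because the predictor never consults it; this is the ranking behaviour a sensible importance measure must preserve, including under missing values.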
Variable selection with Random Forests is usually based on importance measures.
An extensive review of the corresponding literature led to the development of a new
approach that rests on a profound theoretical framework and satisfies important
statistical properties. A comparison with eight other popular methods showed that it
controls the test-wise and family-wise error rates, provides higher power to distinguish
relevant from non-relevant variables, and leads to models located among the best performing ones.
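Importance-based selection procedures of the kind compared above are often framed as permutation tests: a variable is selected only if its observed importance exceeds what would arise by chance under the null hypothesis of no association. The sketch below illustrates this generic scheme with a toy predictor and an arbitrary 5% level; it is not the specific testing procedure developed in this work.

```python
import random

random.seed(2)

n = 300
x = [random.gauss(0, 1) for _ in range(n)]
y = [int(v > 0) for v in x]  # x is clearly relevant

def predict(v):
    return int(v > 0)  # stand-in for a fitted model

def importance(xs):
    """Permutation importance: accuracy drop after shuffling xs."""
    base = sum(predict(v) == t for v, t in zip(xs, y)) / n
    perm = xs[:]
    random.shuffle(perm)
    return base - sum(predict(v) == t for v, t in zip(perm, y)) / n

obs = importance(x)

# Null distribution: importances computed after destroying the
# variable/response association by permuting the variable itself.
null = []
for _ in range(200):
    x_perm = x[:]
    random.shuffle(x_perm)
    null.append(importance(x_perm))

p_value = sum(v >= obs for v in null) / len(null)
selected = p_value <= 0.05  # select the variable if the test rejects
```

Comparing the observed importance against such a null distribution is what allows a selection rule to control the test-wise error rate, the property highlighted in the comparison above.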
Alternative ways to handle missing values are imputation and complete case
analysis. Yet it is unknown to what extent these approaches are able
to provide sensible variable rankings and meaningful variable selections. Investigations
showed that complete case analysis leads to inaccurate variable selection as it may in-
appropriately penalize the importance of fully observed variables. By contrast, the new
importance measure decreases for variables with missing values and therefore causes se-
lections that accurately reflect the information given in actual data situations. Multiple
imputation leads to an assessment of a variable’s importance and to selection frequencies
that would be expected for completely observed data. In several performance
evaluations the best prediction accuracy emerged from multiple imputation, closely fol-
lowed by the application of surrogate splits. Complete case analysis clearly performed
worst.
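The contrast between the two alternative strategies is easy to see in miniature: complete case analysis discards every row with a missing entry, while imputation fills the gaps and retains the full sample. The column below is a hypothetical illustration, and single mean imputation stands in for the multiple imputation evaluated above, which would instead draw several plausible values per gap.

```python
# A toy column with missing entries (None); values are illustrative.
col = [2.0, None, 4.0, 6.0, None, 8.0]

# Complete case analysis: drop every row with a missing entry.
complete = [v for v in col if v is not None]           # 4 of 6 rows survive

# Mean imputation: fill missing entries with the observed mean.
mean = sum(complete) / len(complete)                   # (2+4+6+8)/4 = 5.0
imputed = [v if v is not None else mean for v in col]  # all 6 rows retained
```

The loss of rows under complete case analysis, and the distortion it can introduce when missingness is not completely at random, is consistent with its poor performance reported above.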