Missing data values are an unfortunate but frequent occurrence in many predictive modeling situations. For example, demographic information obtained for marketing analysis may have hundreds of variables, but not all of the information will be available (or even relevant) for every person. In medical studies, some tests may be performed for some patients but not others.
Specifying missing values in input data
There are three ways to denote a missing value in an input data record:
- Leave the column blank.
- Put a single period (‘.’) in the column without any numbers around it.
- Put a question mark (‘?’) in the column.
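The three conventions above can be checked with a small helper. This is only an illustrative sketch and not DTREG's own parser; the function name `is_missing` is hypothetical, and it assumes that surrounding whitespace in a column is ignored.

```python
def is_missing(field: str) -> bool:
    """Return True if a raw text column denotes a missing value.

    Assumption: leading and trailing whitespace is not significant.
    """
    stripped = field.strip()
    # Blank column, a lone period, or a lone question mark.
    return stripped in ("", ".", "?")


# A period that is part of a number (".5", "0.5") is an ordinary value.
row = ["34", "", " . ", "?", "0.5", ".5"]
flags = [is_missing(column) for column in row]
print(flags)  # [False, True, True, True, False, False]
```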
Types of missing variables
DTREG recognizes three types of variables: target, predictor, and weight. If the target or weight variables for a data record have missing values, the data record is unconditionally excluded from the analysis. Also, if all of the predictor variables have missing values, the data record is excluded. But if some predictor variables are available but others have missing values, DTREG provides four methods for handling the data records with missing values:
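The exclusion rules in this paragraph can be summarized as a short decision function. The sketch below is an assumption-laden illustration: the name `record_status` and its return labels are hypothetical, missing values are represented as `None`, and a weight variable is assumed to be in use.

```python
from typing import Optional, Sequence


def record_status(target: Optional[float],
                  weight: Optional[float],
                  predictors: Sequence[Optional[float]]) -> str:
    """Classify one data record according to the rules described above."""
    # A missing target or weight always excludes the record.
    if target is None or weight is None:
        return "exclude"
    # A record with no usable predictor values is also excluded.
    if all(value is None for value in predictors):
        return "exclude"
    # Some predictors missing: one of the four handling methods applies.
    if any(value is None for value in predictors):
        return "handle_missing"
    return "use"


print(record_status(1.0, 1.0, [3.2, None, 7.0]))  # handle_missing
```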
Exclude the data row
The simplest way to deal with records having some missing predictor variable values is to exclude those rows from the analysis. If there are many data rows available and the percentage of rows with missing values is small, then this may be the best method. Excluding rows is fast, and it prevents any error from being introduced due to the missing values.
Replace missing values with median/mode values
The second approach is to replace missing predictor values by the median value of the variable. For categorical predictors, the mode (most frequent category) is used for the replacement. Using the median/mode introduces some error into the model for that variable, but it allows the non-missing values of the other predictors to contribute to the model.
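As a rough illustration of median/mode replacement, the following sketch uses Python's standard `statistics` module. The helper names `fill_value` and `impute` and the sample data are hypothetical; missing values are represented as `None`.

```python
from statistics import median, mode


def fill_value(values, categorical):
    """Median of the non-missing values, or the mode for a categorical variable."""
    present = [v for v in values if v is not None]
    return mode(present) if categorical else median(present)


def impute(values, categorical=False):
    """Return a copy of values with each None replaced by the median/mode."""
    fill = fill_value(values, categorical)
    return [fill if v is None else v for v in values]


ages = [23, None, 31, 47, None, 35]      # continuous predictor
colors = ["red", "blue", None, "red"]    # categorical predictor

print(impute(ages))                      # [23, 33.0, 31, 47, 33.0, 35]
print(impute(colors, categorical=True))  # ['red', 'blue', 'red', 'red']
```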
Surrogate variables
The most sophisticated method is to use surrogate variables to impute the predictor values that are missing. A surrogate variable is another predictor variable that is associated (correlated) with the primary predictor variable. DTREG fits a linear or polynomial function to estimate the missing variable value based on the available value of the surrogate variable.
Before the model building process starts, DTREG examines each potential surrogate variable for each primary predictor variable and computes the association between the variables. Continuous predictors, and categorical predictors with exactly two categories, may have surrogates and may be used as surrogates. Categorical variables with more than two categories can neither have surrogates nor be used as surrogates; for these, the mode is used as the replacement value.
If there are n eligible variables, then n*(n-1) potential matches must be evaluated. For each potential variable pair, the association is calculated. The association measures how closely the variables are related. Association values range from 0 (no association) to 100 (perfect association). The surrogates with the highest association are connected to the primary predictor. So each predictor has a different set of surrogate variable functions.
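The pairing and ranking step can be sketched as follows. This is not DTREG's implementation: `rank_surrogates`, the `max_surrogates` limit, the variable names, and the association values in the lookup table are all hypothetical, and the table is symmetric only to keep the example short.

```python
from itertools import permutations


def rank_surrogates(variables, association, max_surrogates=2):
    """For each primary predictor, list its best surrogates, strongest first.

    `association(primary, candidate)` is any function returning 0..100.
    """
    scored = {}
    # permutations(..., 2) yields all n*(n-1) ordered (primary, candidate) pairs.
    for primary, candidate in permutations(variables, 2):
        score = association(primary, candidate)
        scored.setdefault(primary, []).append((score, candidate))
    return {
        primary: [name for _, name in
                  sorted(pairs, key=lambda pair: pair[0], reverse=True)[:max_surrogates]]
        for primary, pairs in scored.items()
    }


variables = ["age", "income", "height"]
# Hypothetical association values for illustration only.
table = {
    frozenset(("age", "income")): 64.0,
    frozenset(("age", "height")): 12.0,
    frozenset(("income", "height")): 30.0,
}
ranking = rank_surrogates(variables, lambda a, b: table[frozenset((a, b))])
print(ranking["height"])  # ['income', 'age']
```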
The method used to compute the association depends on the type of the predictor:
- Continuous predictors – Linear regression is used to fit a function: predictor=f(surrogate). The association is then computed as 100 times the proportion of the variance of the predictor explained by the function. So if the function output exactly matches the predictor, the association is 100.
- Categorical predictors – A slightly different method is used to compute the association for categorical predictors with two categories. If the potential surrogate is also categorical, the values of the predictor and the surrogate are compared and the proportion of the values that match (have the same category) is computed; call this MatchProportion. Then association is computed using the formula: Association=200*abs(MatchProportion-0.5).
If the proportion of matching rows is 0.5, then the association is 0, because there is a 50/50 chance of a match. If the proportion matching is either 1.0 or 0.0, then the association is 100. A match proportion below 0.5 means that the variables are associated in the opposite direction. A match proportion of 0.0 means that the category values are exactly opposite; hence, the predictor value can be imputed by reversing the category value of the surrogate. If the primary predictor is categorical and the surrogate is continuous, a function is fitted to the 0/1 predictor values and a threshold of 0.5 is used to convert the value computed by the function to the predictor category value.
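Both association measures can be illustrated with a few lines of plain Python. The sketch assumes a simple one-variable linear fit for the continuous case, for which the proportion of variance explained equals the squared correlation; the function names are hypothetical and rows with missing values are assumed to have been removed beforehand.

```python
def continuous_association(predictor, surrogate):
    """100 * R^2 of the linear fit predictor = f(surrogate)."""
    n = len(predictor)
    mean_x = sum(surrogate) / n
    mean_y = sum(predictor) / n
    sxx = sum((x - mean_x) ** 2 for x in surrogate)
    syy = sum((y - mean_y) ** 2 for y in predictor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(surrogate, predictor))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains nothing
    return 100.0 * (sxy * sxy) / (sxx * syy)


def categorical_association(predictor, surrogate):
    """Association = 200 * abs(MatchProportion - 0.5) for two-category variables."""
    matches = sum(1 for p, s in zip(predictor, surrogate) if p == s)
    match_proportion = matches / len(predictor)
    return 200.0 * abs(match_proportion - 0.5)


print(continuous_association([3, 5, 7, 9], [1, 2, 3, 4]))    # 100.0
print(categorical_association([0, 1, 0, 1], [0, 1, 0, 0]))   # 50.0
print(categorical_association([0, 1, 0, 1], [1, 0, 1, 0]))   # 100.0
```

The last call shows the reversed case: every row disagrees, so the match proportion is 0.0 and the association is still 100.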
When a predictor variable is encountered with a missing value, DTREG examines each of the associated surrogate variables looking for one that has a non-missing value on that data row. The surrogates are examined in order of decreasing association. When a surrogate variable is found with a non-missing value, the surrogate function is used to compute the replacement value for the variable. If all surrogates have missing values, the median/mode is used to replace the missing value.
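The lookup order described above might be sketched like this. The function `impute_from_surrogates`, the variable names, and the fitted functions are hypothetical placeholders; the surrogate list is assumed to be sorted by decreasing association already.

```python
def impute_from_surrogates(row, surrogates, fallback):
    """Return a replacement value for a missing predictor on one data row.

    row        -- dict mapping variable name to value (None if missing)
    surrogates -- list of (surrogate_name, fitted_function), best first
    fallback   -- median (or mode) of the primary predictor
    """
    for name, function in surrogates:
        value = row.get(name)
        if value is not None:
            # First surrogate with a usable value wins.
            return function(value)
    # Every surrogate is missing on this row.
    return fallback


# Hypothetical fitted functions for a primary predictor "income".
income_surrogates = [
    ("age", lambda age: 1000 * age + 5000),
    ("years_employed", lambda years: 2500 * years + 20000),
]

print(impute_from_surrogates({"age": None, "years_employed": 4},
                             income_surrogates, fallback=42000))  # 30000
print(impute_from_surrogates({"age": None, "years_employed": None},
                             income_surrogates, fallback=42000))  # 42000
```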
Surrogate variables are used (1) during the model building process, (2) when using the Score function to predict values for a data file, and (3) when the DTREG Class Library is used to predict values. If the Translate procedure is used to generate source code for a model, surrogate variable calculations are included in the generated source code.
Surrogate splitters
A surrogate splitter is similar to a surrogate variable, but it is specialized for decision tree based models – Single Trees, TreeBoost, and Decision Tree Forests.
When a decision tree is created, each predictor variable is evaluated at each split point to determine how well it can partition the values. After the best predictor has been determined, other candidate predictors are examined and the splits generated by them are compared with the primary split. The association is computed by comparing the split generated by the predictor with the primary predictor. The best surrogate splitters are stored along with the primary splitter. If the primary splitter value is missing, surrogate splits are examined looking for a non-missing value on a surrogate predictor.
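The text does not give the exact formula used to compare a candidate split with the primary split. One plausible way to picture it, shown purely as an assumption, is the fraction of rows that both splits send to the same side. The function name `split_agreement`, the thresholds, and the data below are hypothetical.

```python
def split_agreement(primary_values, primary_threshold,
                    candidate_values, candidate_threshold):
    """Fraction of rows sent to the same side by both splits.

    Assumption: this agreement measure is illustrative only and is not
    taken from DTREG. Rows where either value is missing are skipped.
    """
    agree = total = 0
    for p, c in zip(primary_values, candidate_values):
        if p is None or c is None:
            continue
        total += 1
        # "Left" means value <= threshold for both splits.
        if (p <= primary_threshold) == (c <= candidate_threshold):
            agree += 1
    return agree / total if total else 0.0


primary = [1, 2, 3, 4, 5, 6]           # primary splitter, split at <= 3
candidate = [10, 20, 35, 40, 50, 60]   # candidate surrogate, split at <= 30
agreement = split_agreement(primary, 3, candidate, 30)
print(round(agreement, 3))  # 0.833
```

Under this assumption, candidates with the highest agreement would be kept as surrogate splitters for that particular split.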
One of the key differences between surrogate variables and surrogate splitters is that a different set of surrogate splitters is stored for each split. So the same predictor may have different surrogate splitter variables at different split points in the decision tree. In contrast, surrogate variables are computed once before the model building process begins, and the same set of surrogate variables is always used for a particular predictor variable.