Probabilistic (PNN) and
General Regression Neural Networks (GRNN) have similar
architectures, but there is a fundamental difference: Probabilistic
networks perform classification where the target variable is
categorical, whereas general regression neural networks perform
regression where the target variable is continuous. If you select a
PNN/GRNN network, DTREG will automatically select the correct type of
network based on the type of target variable.
DTREG also provides Multilayer Perceptron
Neural Networks and Cascade Correlation
Neural Networks.
PNN and GRNN networks have advantages and disadvantages compared to
Multilayer Perceptron networks:
- It is usually much faster to train a PNN/GRNN network than a multilayer
perceptron network.
- PNN/GRNN networks often are more accurate than multilayer perceptron
networks.
- PNN/GRNN networks are relatively insensitive to outliers
(wild points).
- PNN networks generate accurate predicted target
probability scores.
- PNN networks approach Bayes optimal
classification.
- PNN/GRNN networks are slower than multilayer perceptron networks at
classifying new cases.
- PNN/GRNN networks require more memory space
to store the model.
How PNN/GRNN networks work
Although the implementation is very
different, probabilistic neural networks are conceptually similar to
K-Nearest Neighbor (k-NN) models. The basic idea is that a predicted
target value of an item is likely to be about the same as other items
that have close values of the predictor variables. Consider this
figure:
Assume that each case in the training set has two predictor variables,
x and y.
The cases are plotted using their x,y coordinates as shown
in the figure. Also assume that the target variable has two
categories: positive, which is denoted by a square, and
negative, which
is denoted by a dash. Now, suppose we are trying to predict the value
of a new case represented by the triangle with predictor values
x=6, y=5.1. Should we predict the target as positive or negative?
Notice that the triangle is positioned almost exactly on top of a dash
representing a negative value. But that dash is in a fairly unusual
position compared to the other dashes which are clustered below the
squares and left of center. So it could be that the underlying
negative value is an odd case.
The nearest neighbor classification performed for this example depends
on how many neighboring points are considered. If 1-NN is used and
only the closest point is considered, then clearly the new point
should be classified as negative since it is on top of a known
negative point. On the other hand, if 9-NN classification is used and
the closest 9 points are considered, then the effect of the
surrounding 8 positive points may overbalance the close negative
point.
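The dependence on k can be reproduced with a small sketch. The coordinates below are hypothetical, chosen only to mimic the layout described above (one odd negative case surrounded by eight positives, with the other negatives clustered below and to the left); they are not taken from the original figure, and `knn_predict` is an illustrative helper rather than anything DTREG provides.

```python
import math
from collections import Counter

# Hypothetical training cases: ((x, y), category).
training = [
    ((6.0, 5.0), "negative"),   # the unusual negative case near the new point
    ((5.0, 5.5), "positive"), ((5.5, 6.0), "positive"),
    ((6.0, 6.2), "positive"), ((6.5, 6.0), "positive"),
    ((7.0, 5.5), "positive"), ((7.0, 4.8), "positive"),
    ((6.5, 4.2), "positive"), ((5.4, 4.4), "positive"),
    ((2.0, 2.0), "negative"), ((2.5, 1.5), "negative"),
    ((3.0, 2.5), "negative"), ((1.5, 3.0), "negative"),
]

def knn_predict(point, k):
    """Majority vote among the k training cases closest to point."""
    by_distance = sorted(training, key=lambda case: math.dist(point, case[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

new_case = (6.0, 5.1)
prediction_1nn = knn_predict(new_case, 1)   # only the closest point votes
prediction_9nn = knn_predict(new_case, 9)   # the 8 surrounding positives outvote it
print(prediction_1nn, prediction_9nn)
```

With this layout the single closest case decides the 1-NN result, while the 9-NN vote is dominated by the surrounding positive cases.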
A probabilistic neural network builds on this foundation and
generalizes it to consider all of the other points. The distance is
computed from the point being evaluated to each of the other points,
and a radial basis function (RBF)
(also called a kernel function) is
applied to the distance to compute the weight (influence) for each
point. The radial basis function is so named because the radius
distance is the argument to the function.
Weight = RBF(distance)
The further some other point is from the new point, the less influence
it has.
Radial Basis Function
Different types of radial basis functions could be used, but the most
common is the Gaussian function:
If there is more than one predictor variable, then the RBF function
has as many dimensions as there are variables. Here is an RBF function
for two variables:
The best predicted value for the new point is found by summing the
values of the other points weighted by the RBF function.
The peak of the radial basis function is always centered on the point
it is weighting. The sigma value (σ) of the function determines the
spread of the RBF function; that is, how quickly the function declines
as the distance increases from the point.
With larger sigma values and more spread, distant points have a
greater influence.
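The effect of sigma can be illustrated with a short sketch. It assumes the common Gaussian form exp(-d^2 / (2 sigma^2)); the exact scaling constant DTREG uses is not stated here, and the name `gaussian_rbf` is hypothetical.

```python
import math

def gaussian_rbf(distance, sigma):
    """Gaussian weight for a point at the given distance (assumed form)."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

distances = [0.0, 1.0, 2.0, 4.0]

# Small sigma: the weight collapses quickly as distance grows.
narrow = [gaussian_rbf(d, sigma=0.5) for d in distances]

# Large sigma: distant points keep a noticeable share of the influence.
wide = [gaussian_rbf(d, sigma=2.0) for d in distances]

for d, n, w in zip(distances, narrow, wide):
    print(f"distance={d:.1f}  sigma=0.5 -> {n:.6f}  sigma=2.0 -> {w:.6f}")
```

In both cases the weight is 1 at distance zero and falls monotonically; only the rate of decline changes with sigma.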
The primary work of training a PNN or GRNN network is selecting the
optimal sigma values to control the spread of the RBF functions.
DTREG uses the conjugate gradient algorithm to compute the optimal
sigma values.
Suppose our goal is to fit the following function:
If the sigma values are too large, then the model will not be able to
closely fit the function, and you will end up with a fit like this:
If the sigma values are too small, the model will overfit the data
because each training point will have too much influence:
DTREG allows you to select whether a single sigma value should be used
for the entire model, or a separate sigma for each predictor variable,
or a separate sigma for each predictor variable and target category.
DTREG uses the jackknife method of evaluating sigma values during the
optimization process. This measures the error by building the model
with all training rows except for one and then evaluating the error
with the excluded row. This is repeated for all rows, and the error
is averaged.
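The jackknife (leave-one-out) evaluation can be sketched for a single-sigma GRNN with one predictor. Two things here are assumptions made for illustration: the data is a made-up sine curve with a small deterministic wobble, and a simple scan over candidate sigma values stands in for the conjugate gradient optimization that DTREG uses. All function names are hypothetical.

```python
import math

def grnn_predict(x, train_x, train_y, sigma):
    """Weighted average of the training targets, using Gaussian weights."""
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2)) for xi in train_x]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; fall back to the mean target
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total

def leave_one_out_error(train_x, train_y, sigma):
    """Mean squared error when each row is predicted from all the other rows."""
    squared_errors = []
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        predicted = grnn_predict(train_x[i], rest_x, rest_y, sigma)
        squared_errors.append((predicted - train_y[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical data: 25 samples of sin(x) with a small repeating offset.
train_x = [i * 0.25 for i in range(25)]
train_y = [math.sin(x) + 0.05 * ((i % 3) - 1) for i, x in enumerate(train_x)]

# Simple scan over candidate values (a stand-in for conjugate gradient).
candidate_sigmas = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
errors = {s: leave_one_out_error(train_x, train_y, s) for s in candidate_sigmas}
best_sigma = min(errors, key=errors.get)

for s in candidate_sigmas:
    print(f"sigma={s:<5} leave-one-out MSE={errors[s]:.5f}")
print("selected sigma:", best_sigma)
```

Because each row is excluded from the model that predicts it, a training point cannot simply vote for itself, so the measured error is not driven to zero by shrinking sigma.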
Architecture of a PNN/GRNN Network
In 1990, Donald F. Specht proposed a method to formulate the
weighted-neighbor method described above in the form of a neural
network. He called this a “Probabilistic Neural Network”.
Here is a diagram of a PNN/GRNN network:
All PNN/GRNN networks have four layers:
- Input layer — There is one neuron in the input layer for each
predictor variable. In the case of categorical variables, N-1 neurons
are used where N is the number of categories. The input neurons (or
processing before the input layer) standardize the range of the
values by subtracting the median and dividing by the interquartile
range. The input neurons then feed the values to each of the neurons
in the hidden layer.
- Hidden layer — This layer has one neuron for each
case in the training
data set. The neuron stores the values of the predictor variables for
the case along with the target value. When presented with the x
vector of input values from the input layer, a hidden neuron computes
the Euclidean distance of the test case from the neuron’s center point
and then applies the RBF kernel function using the sigma value(s).
The resulting value is passed to the neurons in the pattern layer.
- Pattern layer / Summation layer —
The next layer in the network is
different for PNN networks and for GRNN networks. For PNN networks
there is one pattern neuron for each category of the target variable.
The actual target category of each training case is stored with each
hidden neuron; the weighted value coming out of a hidden neuron is fed
only to the pattern neuron that corresponds to the hidden neuron’s
category. The pattern neurons add the values for the class they
represent (hence, it is a weighted vote for that category).
For GRNN networks, there are only two neurons in the pattern layer.
One neuron is the denominator summation unit; the other is the
numerator summation unit. The denominator summation unit adds up the
weight values coming from each of the hidden neurons. The numerator
summation unit adds up the weight values multiplied by the actual
target value for each hidden neuron.
- Decision layer —
The decision layer is different for PNN and GRNN
networks. For PNN networks, the decision layer compares the weighted
votes for each target category accumulated in the pattern layer and
uses the largest vote to predict the target category.
For GRNN networks, the decision layer divides the value accumulated in
the numerator summation unit by the value in the denominator summation
unit and uses the result as the predicted target value.
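The four layers can be traced in a minimal sketch. It assumes a single sigma, the Gaussian form exp(-d^2 / (2 sigma^2)), continuous predictors only, and a tiny made-up data set; the function names are hypothetical and this is not DTREG's implementation.

```python
import math
import statistics

def fit_scaler(rows):
    """Input layer setup: per-variable median and interquartile range."""
    columns = list(zip(*rows))
    medians = [statistics.median(c) for c in columns]
    iqrs = []
    for c in columns:
        q1, _, q3 = statistics.quantiles(c, n=4)
        iqrs.append((q3 - q1) or 1.0)  # guard against a zero IQR
    return medians, iqrs

def standardize(row, scaler):
    medians, iqrs = scaler
    return [(v - m) / s for v, m, s in zip(row, medians, iqrs)]

def hidden_layer(x, centers, sigma):
    """One Gaussian activation per stored training case."""
    return [math.exp(-math.dist(x, c) ** 2 / (2.0 * sigma ** 2)) for c in centers]

def pnn_predict(x, centers, labels, sigma):
    """Pattern layer sums activations per category; decision layer takes the max."""
    votes = {}
    for activation, label in zip(hidden_layer(x, centers, sigma), labels):
        votes[label] = votes.get(label, 0.0) + activation
    return max(votes, key=votes.get), votes

def grnn_predict(x, centers, targets, sigma):
    """Numerator and denominator summation units, then their ratio."""
    activations = hidden_layer(x, centers, sigma)
    numerator = sum(a * t for a, t in zip(activations, targets))
    denominator = sum(activations)
    return numerator / denominator

# Hypothetical training data: two predictors per case.
rows = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["negative"] * 3 + ["positive"] * 3      # categorical target (PNN)
targets = [1.0, 1.2, 1.1, 5.0, 5.4, 5.2]          # continuous target (GRNN)

scaler = fit_scaler(rows)
centers = [standardize(r, scaler) for r in rows]
query = standardize((6.2, 6.4), scaler)

category, votes = pnn_predict(query, centers, labels, sigma=0.5)
value = grnn_predict(query, centers, targets, sigma=0.5)
print(category, votes, value)
```

The same hidden layer serves both network types; only the way the activations are summed and turned into a decision differs.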
Removing unnecessary neurons
One disadvantage of PNN/GRNN models compared to multilayer perceptron
networks is that PNN/GRNN models
are large because there is one neuron for each training row.
This causes the model to run slower than multilayer perceptron networks
when using scoring to predict values for new rows.
DTREG provides an option to remove unnecessary neurons from the model after the model has been constructed.
Removing unnecessary neurons has three benefits:
- The size of the stored model is reduced.
- The time required to apply the model during scoring is reduced.
- Removing neurons often improves the accuracy of the model.
Removing unnecessary neurons is an iterative process.
Leave-one-out validation is used to measure the error of the model with each neuron removed.
The neuron that causes the least increase in error (or possibly the largest reduction in error) is then removed
from the model. The process is repeated with the remaining neurons until the stopping criterion is reached.
When unnecessary neurons are removed, the “Model Size” section of the analysis report shows how the error changes
with different numbers of neurons. You can see a graphical chart of this by clicking Chart/Model size.
There are three criteria that can be selected to guide the removal of neurons:
- Minimize error – If this option is selected, then DTREG removes neurons as long as the leave-one-out error remains constant or decreases. It stops when it finds a neuron whose removal would cause the error to increase above the minimum found.
- Minimize neurons – If this option is selected, DTREG removes neurons until the leave-one-out error would exceed the error for the model with all neurons.
- # of neurons – If this option is selected, DTREG removes the least significant neurons until only the specified number of neurons remain.
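The iterative removal can be sketched as a greedy backward elimination. The sketch below makes several assumptions for illustration: a one-predictor GRNN with a fixed sigma, made-up data, and the "Minimize neurons" stopping rule (stop when the best available removal would push the leave-one-out error above that of the full model). How DTREG handles a row whose own neuron is still in the model is not described above; here that neuron is simply skipped when predicting its own row. The names `predict`, `loo_error`, and `prune` are hypothetical.

```python
import math

def predict(x, neurons, sigma):
    """GRNN-style weighted average over the retained (center, target) neurons."""
    weights = [math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c, _ in neurons]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed; fall back to the mean target
        return sum(t for _, t in neurons) / len(neurons)
    return sum(w * t for w, (_, t) in zip(weights, neurons)) / total

def loo_error(rows, kept, sigma):
    """Mean squared error over all rows; a row never predicts itself."""
    errors = []
    for i, (x, y) in enumerate(rows):
        neurons = [rows[j] for j in kept if j != i]
        errors.append((predict(x, neurons, sigma) - y) ** 2)
    return sum(errors) / len(errors)

def prune(rows, sigma, min_neurons=2):
    """Greedy removal; returns kept row indices and (size, error) history."""
    min_neurons = max(min_neurons, 2)  # keep at least one neuron per prediction
    kept = list(range(len(rows)))
    baseline = loo_error(rows, kept, sigma)
    history = [(len(kept), baseline)]
    while len(kept) > min_neurons:
        # Try removing each neuron and pick the removal that hurts least.
        trials = [(loo_error(rows, [k for k in kept if k != j], sigma), j)
                  for j in kept]
        best_error, victim = min(trials)
        if best_error > baseline:  # "Minimize neurons" stopping rule
            break
        kept.remove(victim)
        history.append((len(kept), best_error))
    return kept, history

# Hypothetical data: 25 samples of sin(x) with a small repeating offset.
rows = [(i * 0.25, math.sin(i * 0.25) + 0.05 * ((i % 3) - 1)) for i in range(25)]
kept, history = prune(rows, sigma=0.3)
for size, error in history:
    print(f"neurons={size:2d}  leave-one-out MSE={error:.5f}")
```

The `history` list plays the role of the "Model Size" report: one entry per model size with the corresponding error. Swapping the stopping test for a fixed target size would give the "# of neurons" behavior.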
The PNN/GRNN Property Page
Controls for PNN and GRNN analyses are provided on the following
DTREG screen: