Logistic Regression is a type of predictive model that can
be used when the target variable is a categorical variable with two categories
– for example live/die, has disease/doesn’t have disease, purchases product/doesn’t
purchase, wins race/doesn’t win, etc. A
logistic regression model does not involve decision trees and is more akin to
nonlinear regression such as fitting a polynomial to a set of data values.
Logistic regression can be used only with two types of
target variables:
1. A categorical
target variable that has exactly two categories (i.e., a binary or dichotomous
variable).
2. A continuous
target variable that has values in the range 0.0 to 1.0 representing
probability values or proportions.
As an example of logistic regression, consider a study whose
goal is to model the response to a drug as a function of the dose of the drug
administered. The target (dependent)
variable, Response, has a value 1 if the patient is successfully treated by the
drug and 0 if the treatment is not successful.
Thus the general form of the model is:
Response = f(dose)
The input data for Response will have the value 1 if the
drug is effective and 0 if the drug is not effective. The value of Response predicted by the model
represents the probability of achieving an effective outcome, P(Response=1|Dose). As with all probability values, it is in the range 0.0 to 1.0.
One obvious question is “Why not simply use linear
regression?” In fact, many studies have
done just that, but there are two significant problems:
1. There are no
limits on the values predicted by a linear regression, so the predicted
response might be less than 0 or greater than 1 – clearly nonsensical as a
response probability.
2. The response
usually is not a linear function of
the dosage. If a minute amount of the
drug is administered, no patients will respond.
Doubling the dose to a larger but still minute amount will not yield any
positive response. But as the dosage is
increases a threshold will be reached where the drug begins to become
effective. Incremental increases in the
dosage above the threshold usually will elicit an increasingly positive
effect. However, eventually a saturation
level is reached, and beyond that point increasing the dosage does not increase
the response.
The logistic regression dose-response curve has an S (sigmoidal) shape such as shown here:
Notice that all of the Response values are 0 or 1. The Dose varies from 0 to 25. Below a dose of 9 all of the Response values
are 0. Above a dose of 10 all of the
response values are 1.
The logistic model formula computes the probability of
the selected response as a function of the values of the predictor variables.
If a predictor variable is categorical variable with
two values, then one of the values is assigned the value 1 and the other is
assigned the value 0. Note that DTREG
allows you to use any value for categorical variables such as “Male” and “Female”, and it converts
these symbolic names into 0/1 values. So
you don’t have to be concerned with recoding categorical values.
If a predictor variable is a categorical variable with more
than two categories, then a separate dummy variable is generated to represent
each of the categories except for one which is excluded. The value of the dummy variable is 1 if the
variable has that category, and the value is 0 if the variable has any other
category; hence, no more than one dummy variable will be 1. If the variable has the value of the excluded
category, then all of the dummy variables generated for the variable are
0. DTREG automatically generates the
dummy variables for categorical predictor variables; all you have to do is
designate variables as being categorical.
In summary, the logistic formula has each continuous
predictor variable, each dichotomous predictor variable with a value of 0 or 1,
and a dummy variable for every category of predictor variables with more than
two categories less one category.
The form of the logistic model formula is:
P = 1/(1+exp(-(B0 + B1*X1 + B2*X2 + ... + Bk*Xk)))
Where B0 is a constant and Bi
are coefficients of the predictor variables (or dummy variables in the case of
multi-category predictor variables). The
computed value, P, is a probability
in the range 0 to 1. The exp() function is e raised to a
power. You can exclude the B0
constant by turning
off the option “Include
constant (intercept) term” on the logistic regression model property
page.
============ Logistic Regression
Parameters ============
Predict: DeathPenalty = 1 (Yes)
Number of
parameters calculated = 4
Number of data
rows used = 147
Wald confidence intervals are computed for 95% probability.
Log likelihood
of model = -88.142490
Deviance (-2 *
Log likelihood) = 176.284981
Akaike's Information Criterion (AIC) = 184.284981
Bayesian
Information Criterion (BIC) = 196.246711
The summary statistics begin by showing the name of the
target variable and the category of the target whose probability is being
predicted by the model. You can select
the category on the logistic regression property page for the analysis.
The log likelihood
of the model is the value that is maximized by the process that computes the
maximum likelihood value for the Bi
parameters.
The Deviance is
equal to -2*log-likelihood.
Akaike’s Information Criterion (AIC) is
-2*log-likelihood+2*k where k is the number of estimated parameters.
The Bayesian
Information Criterion (BIC) is -2*log-likelihood + k*log(n) where k is the number
of estimated parameters and n is the
sample size. The Bayesian Information
Criterion is also known as the Schwartz
criterion.
------------------ Computed Parameter (Beta) Values ------------------
Variable
Parameter Std.
Error Pr. Chi Sq. Lower C.I.
Upper C.I.
-------------- ----------
---------- ----------- ----------
----------
BlackDefendant 0.5952
0.3939 0.1308 -0.1769 1.3673
WhiteVictim 0.2565 0.4002 0.5216 -0.5279 1.0408
Serious 0.1871 0.0612 0.0022 0.0671 0.3070
Constant -2.6516 0.6748
< 0.0001 -3.9742 -1.3291
The computed beta parameters are the maximum likelihood
values of the Bi parameters
in the logistic regression model formula (see above). By using them in an equation with the
corresponding values of the predictor (X)
variables, you can compute the expected probability, P, for an observation.
In addition to the maximum likelihood value, the standard
error for the estimate is displayed and the Chi squared probability that the
true value of the parameter is not zero.
The last two columns display the Wald upper
and lower confidence intervals. You can select the confidence interval
percentage range on the Logistic Regression property page.
If a predictor variable is categorical, then a dummy
variable is generated for each category except for one. In this case, there is a Bi
parameter for each dummy variable, and the categories are
shown indented under the names of the variables like this:
--------------- Computed Parameter (Beta) Values ---------------
Variable Parameter Std. Error Pr. Chi Sq.
Lower C.I. Upper
C.I.
--------- ----------
---------- ----------- ----------
----------
Class
Crew
0.8845 0.1643 < 0.0001 0.5624 1.2065
First
1.7733 0.1896 < 0.0001 1.4016 2.1450
Second
0.7742 0.1921 < 0.0001 0.3977 1.1507
Age
Adult
-1.0225 0.2726 0.0002 -1.5568
-0.4881
Sex
Male
-2.2831 0.1534 < 0.0001 -2.5838
-1.9825
Constant 1.1915 0.2765
< 0.0001 0.6495 1.7334
------ Likelihood Ratio Statistics ------
Variable
L. Ratio DF Pr. Chi Sq.
-------------- ----------
---- -----------
BlackDefendant 2.321
1 0.12763
WhiteVictim
0.413 1 0.52020
Serious 10.234 1
0.00138
Constant 18.609 1
0.00002
If you enable the option “Compute likelihood ratio significance tests” on the
logistic regression property page, then a table similar to the one shown above
will be printed. The likelihood ratio
significance tests are computed by performing a logistic regression with each
parameter omitted from the model and comparing the log likelihood ratio for the
model with and without the parameter.
These significance tests are considered to be more reliable than the Wald significance test.
However, since the logistic regression must be recomputed with each
predictor omitted, the computation time increases in proportion to the number
of predictor variables. If a predictor
variable is a categorical variable with multiple categories, the significance
test is performed with all of the categories included and all of them excluded.
An iterative Newton-Raphson
algorithm is used to calculate the maximum likelihood values of the
parameters. This procedure uses the
partial second derivatives of the parameters in the Hessian matrix to guide incremental
parameter changes in an effort to maximize the log likelihood value for the
likelihood function. The algorithm
iterates until the absolute value of the largest parameter change is less than
the value specified for “Tolerance”
on the logistic regression property page.
Most logistic regression analyses converge to a solution in
a dozen or so iterations, but you may occasionally run into one that does not
converge. If this happens, try enabling
the option “Use Firth’s
procedure” on the logistic regression
property page. Firth’s procedure slows
down the calculations, but it usually results in achieving convergence. Note: if Firth’s procedure is enabled,
unbiased parameter values are calculated which may be somewhat different than
what you would get with Firth’s procedure turned off.
The Hessian matrix with the partial second derivatives
of the parameter values is used to guide the convergence process. If the Hessian matrix is singular, the
logistic regression procedure will be unsuccessful and a warning message will
be displayed.
Complete separation
is a condition where one predictor or a linear combination of predictors
perfectly predicts the target value. For
example, consider a situation where every value of the Response target variable
is 0 if Dose is less than 10 and every value is 1 if Dose is greater than
10. Then the value of Response can be
perfectly predicted by checking if Dose is less than or greater than 10. In this case it is impossible to compute the
maximum likelihood values for the Bi
parameters because the slope of the logistic function would be infinite.
At the beginning of each logistic regression analysis,
a check is made for complete separation on each predictor variable. If complete separation is detected, a report
will be generated similar to this:
----------- Report On Separation of Variables -----------
Warning:
Complete separation of target values occurs on Age
The example above indicates that values of the target
variable are completely determined by the Age predictor variable. If separation occurs for a particular
category of a multi-category predictor variable, the category will be shown in
brackets after the variable name, for example “Race[2]”.
Quasi-complete
separation occurs when values of the target
variable overlap or are tied at a single or only a few values of a predictor
variable. The analysis does not check
for quasi-complete separation, but the symptoms are extremely large calculated
values for the Bi parameters or
large standard errors. The analysis also
may fail to converge.
If complete or quasi-complete separation is detected, the
predictor variable(s) showing separation should be removed from the analysis.