DTREG

    Logistic Regression



    Introduction to Logistic Regression

     

    Logistic Regression is a type of predictive model that can be used when the target variable is a categorical variable with two categories – for example live/die, has disease/doesn’t have disease, purchases product/doesn’t purchase, wins race/doesn’t win, etc.  A logistic regression model does not involve decision trees and is more akin to nonlinear regression such as fitting a polynomial to a set of data values.

     

    Logistic regression can be used only with two types of target variables:

     

    1.  A categorical target variable that has exactly two categories (i.e., a binary or dichotomous variable).

    2.  A continuous target variable that has values in the range 0.0 to 1.0 representing probability values or proportions.

     

    As an example of logistic regression, consider a study whose goal is to model the response to a drug as a function of the dose of the drug administered.  The target (dependent) variable, Response, has a value 1 if the patient is successfully treated by the drug and 0 if the treatment is not successful.  Thus the general form of the model is:

     

       Response = f(dose)

     

    The input data for Response will have the value 1 if the drug is effective and 0 if the drug is not effective.  The value of Response predicted by the model represents the probability of achieving an effective outcome, P(Response=1|Dose).  As with all probability values, it is in the range 0.0 to 1.0.

     

    One obvious question is “Why not simply use linear regression?”  In fact, many studies have done just that, but there are two significant problems:

     

    1.  There are no limits on the values predicted by a linear regression, so the predicted response might be less than 0 or greater than 1 – clearly nonsensical as a response probability.

     

    2.  The response usually is not a linear function of the dosage.  If a minute amount of the drug is administered, no patients will respond.  Doubling the dose to a larger but still minute amount will not yield any positive response.  But as the dosage is increased, a threshold will be reached where the drug begins to become effective.  Incremental increases in the dosage above the threshold usually will elicit an increasingly positive effect.  However, eventually a saturation level is reached, and beyond that point increasing the dosage does not increase the response.
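
    The first problem can be seen in a small sketch.  The dose and response values below are made up for illustration (they are not from any DTREG data set), and the straight line is fitted by ordinary least squares:

```python
# Hypothetical 0/1 dose-response data, for illustration only.
doses = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
responses = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

n = len(doses)
mean_x = sum(doses) / n
mean_y = sum(responses) / n

# Ordinary least-squares slope and intercept for Response = a + b*Dose.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(doses, responses))
sxx = sum((x - mean_x) ** 2 for x in doses)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def linear_prediction(dose):
    """Predicted 'probability' from the straight-line fit."""
    return intercept + slope * dose


for dose in (2, 11, 20):
    print(dose, round(linear_prediction(dose), 3))
```

    With this data the fitted line gives a value below 0 at a dose of 2 and a value above 1 at a dose of 20, neither of which can be a probability.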

     

    The Dose-Response Curve

     

    The logistic regression dose-response curve has an S (sigmoidal) shape such as shown here:

    Notice that all of the Response values are 0 or 1.  The Dose varies from 0 to 25.  Below a dose of 9 all of the Response values are 0.  Above a dose of 10 all of the Response values are 1.

     

    The Logistic Model Formula

     

    The logistic model formula computes the probability of the selected response as a function of the values of the predictor variables.

     

    If a predictor variable is a categorical variable with two values, then one of the values is assigned the value 1 and the other is assigned the value 0.  Note that DTREG allows you to use any value for categorical variables such as “Male” and “Female”, and it converts these symbolic names into 0/1 values.  So you don’t have to be concerned with recoding categorical values.

     

    If a predictor variable is a categorical variable with more than two categories, then a separate dummy variable is generated to represent each of the categories except for one which is excluded.  The value of the dummy variable is 1 if the variable has that category, and the value is 0 if the variable has any other category; hence, no more than one dummy variable will be 1.  If the variable has the value of the excluded category, then all of the dummy variables generated for the variable are 0.  DTREG automatically generates the dummy variables for categorical predictor variables; all you have to do is designate variables as being categorical.

     

    In summary, the logistic formula has one term for each continuous predictor variable, one term for each dichotomous predictor variable (coded 0 or 1), and, for each predictor variable with more than two categories, one dummy variable for every category except the excluded one.
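
    The dummy-variable coding described above can be sketched as follows.  The function name, the category names, and the choice of excluded category are assumptions made for illustration; how DTREG picks the excluded category is not described here.

```python
def make_dummies(values, excluded):
    """Return (categories, rows): one 0/1 column per category except `excluded`."""
    categories = sorted(set(values) - {excluded})
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical four-category predictor; "Third" is the excluded category.
observed = ["Crew", "First", "Second", "Third", "First"]
categories, rows = make_dummies(observed, excluded="Third")

print(categories)
for value, row in zip(observed, rows):
    print(value, row)
```

    Each row contains at most one 1, and the row for the excluded category is all zeros.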

     

    The form of the logistic model formula is:

    P = 1/(1+exp(-(B0 + B1*X1 + B2*X2 + ... + Bk*Xk)))

    Where B0 is a constant and Bi are coefficients of the predictor variables (or dummy variables in the case of multi-category predictor variables).  The computed value, P, is a probability in the range 0 to 1.  The exp() function is e raised to a power.  You can exclude the B0 constant by turning off the option “Include constant (intercept) term” on the logistic regression model property page.
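
    As a minimal sketch, the formula can be written as a short function.  The function name and the coefficient values are hypothetical; passing constant=0.0 corresponds to a model fitted without the B0 term.

```python
import math


def logistic_probability(values, coefficients, constant=0.0):
    """P = 1/(1+exp(-(B0 + B1*X1 + ... + Bk*Xk))); constant=0.0 omits B0."""
    linear = constant + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical one-predictor model: B0 = -5.0, B1 = 0.5.
for dose in (0, 5, 10, 15, 20):
    p = logistic_probability([dose], [0.5], constant=-5.0)
    print(dose, round(p, 4))
```

    The printed probabilities follow an S-shaped curve and equal 0.5 where the linear part B0 + B1*X1 is zero (a dose of 10 with these assumed coefficients).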

     

    Output Generated for a Logistic Regression Analysis

    Summary statistics for the model

     

    ============  Logistic Regression Parameters  ============

     

    Predict: DeathPenalty = 1  (Yes)

     

    Number of parameters calculated = 4

    Number of data rows used = 147

     

    Wald confidence intervals are computed for 95% probability.

     

    Log likelihood of model = -88.142490

    Deviance (-2 * Log likelihood) = 176.284981

    Akaike's Information Criterion (AIC) = 184.284981

    Bayesian Information Criterion (BIC) = 196.246711

     

    The summary statistics begin by showing the name of the target variable and the category of the target whose probability is being predicted by the model.  You can select the category on the logistic regression property page for the analysis.

     

    The log likelihood of the model is the value that is maximized by the process that computes the maximum likelihood value for the Bi parameters. 

    The Deviance is equal to -2*log-likelihood.

     

    Akaike’s Information Criterion (AIC) is -2*log-likelihood+2*k where k is the number of estimated parameters.

     

    The Bayesian Information Criterion (BIC) is -2*log-likelihood + k*log(n) where k is the number of estimated parameters, n is the sample size, and log is the natural logarithm.  The Bayesian Information Criterion is also known as the Schwarz criterion.
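
    These three statistics can be checked by hand from the log likelihood.  The sketch below plugs in the values from the summary listing above (log likelihood -88.142490, 4 parameters, 147 data rows); the variable names are illustrative.

```python
import math

log_likelihood = -88.142490  # "Log likelihood of model" from the listing
k = 4                        # number of parameters calculated
n = 147                      # number of data rows used

deviance = -2.0 * log_likelihood
aic = deviance + 2 * k
bic = deviance + k * math.log(n)  # math.log is the natural logarithm

print(round(deviance, 5), round(aic, 5), round(bic, 5))
```

    The results agree with the Deviance, AIC and BIC lines of the listing to within rounding.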

     

     

    Computed Beta Parameters

      ------------------  Computed Parameter (Beta) Values  ------------------

     

       Variable      Parameter  Std. Error  Pr. Chi Sq.  Lower C.I.  Upper C.I.

    --------------  ----------  ----------  -----------  ----------  ----------

    BlackDefendant     0.5952      0.3939       0.1308     -0.1769      1.3673

    WhiteVictim        0.2565      0.4002       0.5216     -0.5279      1.0408

    Serious            0.1871      0.0612       0.0022      0.0671      0.3070

    Constant          -2.6516      0.6748     < 0.0001     -3.9742     -1.3291

     

    The computed beta parameters are the maximum likelihood values of the Bi parameters in the logistic regression model formula (see above).  By using them in an equation with the corresponding values of the predictor (X) variables, you can compute the expected probability, P, for an observation.
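
    As a worked example, the sketch below applies the beta values from the table above to a single observation.  The observation itself is hypothetical (it is not a row from the original data), and the variable names are illustrative.

```python
import math

# Beta values copied from the "Computed Parameter (Beta) Values" table.
betas = {"BlackDefendant": 0.5952, "WhiteVictim": 0.2565, "Serious": 0.1871}
constant = -2.6516

# Hypothetical observation, chosen only to illustrate the arithmetic.
observation = {"BlackDefendant": 1, "WhiteVictim": 1, "Serious": 5.0}

# Linear part: B0 + B1*X1 + B2*X2 + B3*X3
linear = constant + sum(betas[name] * observation[name] for name in betas)

# Logistic transform gives P(DeathPenalty = 1) for this observation.
probability = 1.0 / (1.0 + math.exp(-linear))

print(round(linear, 4), round(probability, 3))
```

    For this observation the linear part is -0.8644, which corresponds to a predicted probability of roughly 0.30.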

     

    In addition to the maximum likelihood value, the standard error of the estimate is displayed, along with the chi-squared p-value (Pr. Chi Sq.) for the test of the hypothesis that the true value of the parameter is zero; a small value indicates that the parameter differs significantly from zero.  The last two columns display the lower and upper limits of the Wald confidence interval.  You can select the confidence interval percentage range on the Logistic Regression property page.
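
    The table columns are related by simple arithmetic.  The sketch below assumes the usual Wald construction (estimate divided by standard error treated as a standard normal value, and a 95% interval of plus or minus 1.959964 standard errors); DTREG's internal code is not shown in this document, but these assumptions reproduce the table values to within rounding.

```python
import math


def wald_summary(estimate, std_error, z_crit=1.959964):
    """Return (p_value, lower, upper) for a Wald test and confidence interval."""
    z = estimate / std_error
    # Two-sided normal tail probability; equals the 1-df chi-squared p-value of z*z.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_value, estimate - z_crit * std_error, estimate + z_crit * std_error


# (name, parameter, standard error) copied from the table above.
table = [
    ("BlackDefendant", 0.5952, 0.3939),
    ("WhiteVictim", 0.2565, 0.4002),
    ("Serious", 0.1871, 0.0612),
    ("Constant", -2.6516, 0.6748),
]

for name, estimate, std_error in table:
    p_value, lower, upper = wald_summary(estimate, std_error)
    print(f"{name:<15} {p_value:8.4f} {lower:9.4f} {upper:9.4f}")
```

    Small differences in the last decimal place are expected because the table inputs are themselves rounded to four places.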

     

    If a predictor variable is categorical, then a dummy variable is generated for each category except for one.  In this case, there is a Bi parameter for each dummy variable, and the categories are shown indented under the names of the variables like this:

     

      ---------------  Computed Parameter (Beta) Values  ---------------

     

    Variable    Parameter  Std. Error  Pr. Chi Sq.  Lower C.I.  Upper C.I.

    ---------  ----------  ----------  -----------  ----------  ----------

    Class

         Crew     0.8845      0.1643     < 0.0001      0.5624      1.2065

        First     1.7733      0.1896     < 0.0001      1.4016      2.1450

       Second     0.7742      0.1921     < 0.0001      0.3977      1.1507

    Age

        Adult    -1.0225      0.2726       0.0002     -1.5568     -0.4881

    Sex

         Male    -2.2831      0.1534     < 0.0001     -2.5838     -1.9825

    Constant      1.1915      0.2765     < 0.0001      0.6495      1.7334

     

     

    Likelihood Ratio Statistics

      ------  Likelihood Ratio Statistics  ------

     

       Variable      L. Ratio    DF   Pr. Chi Sq.

    --------------  ----------  ----  -----------

    BlackDefendant       2.321    1       0.12763

    WhiteVictim          0.413    1       0.52020

    Serious             10.234    1       0.00138

    Constant            18.609    1       0.00002

     

    If you enable the option “Compute likelihood ratio significance tests” on the logistic regression property page, then a table similar to the one shown above will be printed.  The likelihood ratio significance tests are computed by performing a logistic regression with each parameter omitted from the model and comparing the log likelihood of the model with and without the parameter.  These significance tests are considered to be more reliable than the Wald significance test.  However, since the logistic regression must be recomputed with each predictor omitted, the computation time increases in proportion to the number of predictor variables.  If a predictor variable is a categorical variable with multiple categories, the significance test compares the model with all of the categories included against the model with all of them excluded.
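
    A minimal sketch of the likelihood ratio calculation for a single parameter (1 degree of freedom) follows.  The full-model log likelihood is taken from the summary listing; the reduced-model log likelihood is not printed by the program and is back-calculated here from the L. Ratio value for Serious, so treat it as an assumed input.

```python
import math


def likelihood_ratio_test(ll_full, ll_reduced):
    """Return (statistic, p_value) for a 1-degree-of-freedom likelihood ratio test."""
    statistic = max(0.0, 2.0 * (ll_full - ll_reduced))
    # Chi-squared upper tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


ll_full = -88.142490            # log likelihood with all parameters
ll_without_serious = -93.259490  # assumed: back-calculated from L. Ratio = 10.234

statistic, p_value = likelihood_ratio_test(ll_full, ll_without_serious)
print(round(statistic, 3), round(p_value, 5))
```

    The erfc shortcut applies only when one parameter is dropped; a multi-category predictor removes several dummy variables at once and needs a chi-squared distribution with a matching number of degrees of freedom.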

     

     

    Computational Issues for Logistic Regression

    Failure to Converge

    An iterative Newton-Raphson algorithm is used to calculate the maximum likelihood values of the parameters.  This procedure uses the second partial derivatives of the log likelihood with respect to the parameters, collected in the Hessian matrix, to guide incremental parameter changes in an effort to maximize the log likelihood value.  The algorithm iterates until the absolute value of the largest parameter change is less than the value specified for “Tolerance” on the logistic regression property page.
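
    The iteration can be sketched for the simplest case of one predictor plus a constant.  This is an illustrative implementation under stated assumptions, not DTREG's code: the function name, the starting values of zero, the iteration limit, and the data are all hypothetical.

```python
import math


def fit_logistic(xs, ys, tolerance=1e-8, max_iterations=50):
    """Newton-Raphson fit of P = 1/(1+exp(-(b0 + b1*x))); returns (b0, b1, iterations)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iterations + 1):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # negative Hessian of the log likelihood
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:
            raise ArithmeticError("singular Hessian matrix")
        # Newton step: solve the 2x2 system (negative Hessian) * delta = gradient.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        # Stop when the largest parameter change is below the tolerance.
        if max(abs(d0), abs(d1)) < tolerance:
            return b0, b1, iteration
    raise ArithmeticError("failed to converge")


# Hypothetical data with overlapping 0 and 1 responses.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, iterations = fit_logistic(xs, ys)
print(round(b0, 4), round(b1, 4), iterations)
```

    The stopping rule mirrors the “Tolerance” setting described above.  With more predictors the same step requires solving a larger linear system, which is usually delegated to a linear algebra library.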

     

    Most logistic regression analyses converge to a solution in a dozen or so iterations, but you may occasionally run into one that does not converge.  If this happens, try enabling the option “Use Firth’s procedure” on the logistic regression property page.  Firth’s procedure slows down the calculations, but it usually results in achieving convergence.  Note: if Firth’s procedure is enabled, bias-reduced parameter values are calculated, which may differ somewhat from those you would get with Firth’s procedure turned off.

     

    Singular Hessian Matrix

     

    The Hessian matrix of second partial derivatives of the log likelihood with respect to the parameters is used to guide the convergence process.  If the Hessian matrix is singular (it cannot be inverted), the logistic regression procedure will be unsuccessful and a warning message will be displayed.

     

    Complete and Quasi-Complete Separation of Values

    Complete separation is a condition where one predictor or a linear combination of predictors perfectly predicts the target value.  For example, consider a situation where every value of the Response target variable is 0 if Dose is less than 10 and every value is 1 if Dose is greater than 10.  Then the value of Response can be perfectly predicted by checking if Dose is less than or greater than 10.  In this case it is impossible to compute the maximum likelihood values for the Bi parameters because the slope of the logistic function would be infinite.

     

    At the beginning of each logistic regression analysis, a check is made for complete separation on each predictor variable.  If complete separation is detected, a report will be generated similar to this:

     

    -----------  Report On Separation of Variables  -----------

     

    Warning: Complete separation of target values occurs on Age

     

    The example above indicates that values of the target variable are completely determined by the Age predictor variable.  If separation occurs for a particular category of a multi-category predictor variable, the category will be shown in brackets after the variable name, for example “Race[2]”.
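
    One simple way to test a single numeric predictor for complete separation is sketched below.  This is an assumption about how such a check could work, not a description of DTREG's own test, and it does not detect separation by a linear combination of several predictors.  The function name and data are hypothetical.

```python
def completely_separates(xs, ys):
    """True if every x with target 0 lies strictly on one side of every x with target 1."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    if not zeros or not ones:
        return False  # only one target category present; nothing to separate
    return max(zeros) < min(ones) or max(ones) < min(zeros)


# Hypothetical data: Response is 0 below a dose of 10 and 1 above it.
doses = [2, 5, 8, 9, 11, 14, 20]
separated = [0, 0, 0, 0, 1, 1, 1]
overlapping = [0, 0, 1, 0, 1, 0, 1]

print(completely_separates(doses, separated))
print(completely_separates(doses, overlapping))
```

    The first call reports separation because the largest dose with Response 0 is smaller than the smallest dose with Response 1; the second does not because the two groups overlap.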

     

    Quasi-complete separation occurs when values of the target variable overlap or are tied at a single or only a few values of a predictor variable.  The analysis does not check for quasi-complete separation, but the symptoms are extremely large calculated values for the Bi parameters or large standard errors.  The analysis also may fail to converge.

     

    If complete or quasi-complete separation is detected, the predictor variable(s) showing separation should be removed from the analysis.

     

    DTREG Logistic Regression Control Screen


