SAS
Module 5 Classification Analysis
Classification:
- Similar to a regression model, but the dependent variable is a categorical attribute
- A common special case is binary classification
- Generally, we are more interested in estimating the probability that the output belongs to each category level (e.g., 65% for class 1, 35% for class 0)
Logistic Regression Model: used to calculate these probabilities
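As a minimal sketch of how a logistic regression model turns a regressor value into a probability, the snippet below applies the logistic function to a linear predictor. The function name and the coefficient values are hypothetical and chosen only for illustration; they are not fitted from any data in these notes.

```python
import math

def logistic_probability(x, intercept, slope):
    """Return P(Y=1 | X=x) for a one-regressor logistic model."""
    log_odds = intercept + slope * x          # linear predictor (log-odds)
    return 1.0 / (1.0 + math.exp(-log_odds))  # logistic function maps to (0, 1)

# Hypothetical coefficients, not estimated from real data.
p1 = logistic_probability(x=2.0, intercept=-1.0, slope=0.8)
p0 = 1.0 - p1
print(f"P(Y=1|X)={p1:.2f}, P(Y=0|X)={p0:.2f}")  # P(Y=1|X)=0.65, P(Y=0|X)=0.35
```

The two probabilities always sum to 1, so in the binary case it is enough to model P(Y=1|X).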
Classifier:
- We always need to select a classification threshold probability that determines how entities are assigned to predicted classes. For example, if P(Y=1|X) > 0.5, the predicted class is 1; otherwise it is 0.
- The threshold also controls the tradeoff between false positives and false negatives, so it should be chosen to keep the misclassification rate low rather than fixed at 0.5 by default.
- To make this tradeoff, calculate the Sensitivity and Specificity at each candidate cutoff, draw the ROC chart, and calculate the ROC separation (KS-Youden). The cutoff with the largest KS-Youden value is the best threshold to select.
Sensitivity = TP/(TP+FN): the percentage of actual positives that are correctly identified
Specificity = TN/(TN+FP): the percentage of actual negatives that are correctly identified
KS-Youden = Sensitivity - (1-Specificity)
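The threshold search described above can be sketched in a few lines. This is an illustrative example, not SAS output: the labels, predicted probabilities, candidate cutoffs, and function names are all made up for the demonstration.

```python
def confusion_counts(y_true, probs, threshold):
    """Count TP, FP, TN, FN when predicting 1 if probability > threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def ks_youden(y_true, probs, threshold):
    """KS-Youden = Sensitivity - (1 - Specificity) at one cutoff."""
    tp, fp, tn, fn = confusion_counts(y_true, probs, threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity - (1 - specificity)

# Hypothetical scored data: actual class and predicted P(Y=1|X).
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
probs = [0.05, 0.15, 0.25, 0.35, 0.42, 0.45, 0.55, 0.62, 0.72, 0.82]

# Evaluate a grid of candidate cutoffs and keep the one with the largest KS-Youden.
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = max(candidates, key=lambda t: ks_youden(y_true, probs, t))
print(best, ks_youden(y_true, probs, best))  # 0.6 0.75
```

On this toy data the default cutoff of 0.5 is not the best one: moving to 0.6 removes the last false positive without losing any more true positives.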
Lift Chart: a chart that evaluates how much better the selected model is than random selection and how far it is from the best possible model. The closer the model's lift line is to the best-model line, the better the selected model.
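One way to see the three lines of such a chart is to compute them directly. The sketch below assumes the cumulative-gains form of the chart (share of all positives captured at each depth of the ranked list); the data and function name are hypothetical, and SAS may bin or label the chart differently.

```python
def cumulative_gains(y_true, probs, n_bins=5):
    """Return (depth, model_gain, best_gain) for each bin of the ranked list.

    depth      : fraction of entities selected so far; this is also the
                 expected gain of random drawing
    model_gain : fraction of all positives captured by the model's ranking
    best_gain  : fraction captured by a perfect ranking (all positives first)
    """
    # Rank entities from highest to lowest predicted probability.
    ranked = [y for _, y in sorted(zip(probs, y_true), key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    total_pos = sum(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        k = round(n * b / n_bins)  # number of entities selected at this depth
        depth = k / n
        model_gain = sum(ranked[:k]) / total_pos
        best_gain = min(k, total_pos) / total_pos
        rows.append((depth, model_gain, best_gain))
    return rows

# Hypothetical scored data: actual class and predicted P(Y=1|X).
y_true = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
probs = [0.05, 0.15, 0.25, 0.35, 0.42, 0.45, 0.55, 0.62, 0.72, 0.82]

rows = cumulative_gains(y_true, probs)
for depth, model_gain, best_gain in rows:
    lift = model_gain / depth  # how many times better than random drawing
    print(f"depth={depth:.1f} random={depth:.2f} model={model_gain:.2f} best={best_gain:.2f} lift={lift:.2f}")
```

At a depth of 20% the model captures 50% of the positives (a lift of 2.5 over random) and matches the best model; at 40% it captures 75% while the best model would already have captured all of them.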
Multiple Logistic Regression:
- Similar to single logistic regression, but it includes multiple regressors and their coefficients
- Sometimes there are confounding issues: a regressor behaves differently in single logistic regression than in multiple logistic regression because some regressors in the multiple model may be correlated in a way that distorts the true relationship
- In SAS, we can use the “Group By” function to fit the model separately for each level of a categorical variable
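The confounding effect and the value of grouping can be illustrated with a small worked example. It relies on the fact that, for a single binary regressor X, the fitted logistic slope equals the log odds ratio of the 2x2 table of X against Y. The counts, the confounder Z, and the function name below are all hypothetical.

```python
import math

def log_odds_ratio(events_x0, n_x0, events_x1, n_x1):
    """Log odds ratio of Y=1 for X=1 versus X=0 (the logistic slope for binary X)."""
    odds_x0 = events_x0 / (n_x0 - events_x0)
    odds_x1 = events_x1 / (n_x1 - events_x1)
    return math.log(odds_x1 / odds_x0)

# Hypothetical counts: {z_level: {x_level: (events with Y=1, observations)}}.
# Z raises the event rate and is also correlated with X (most X=1 rows have Z=1).
counts = {
    0: {0: (10, 100), 1: (2, 20)},
    1: {0: (10, 20), 1: (50, 100)},
}

# Ignoring Z: pool the counts over both levels of Z before comparing X=1 to X=0.
pooled = {
    x: (sum(counts[z][x][0] for z in counts), sum(counts[z][x][1] for z in counts))
    for x in (0, 1)
}
crude = log_odds_ratio(*pooled[0], *pooled[1])

# Grouping by Z: compute the same slope separately within each level of Z.
by_group = {z: log_odds_ratio(*counts[z][0], *counts[z][1]) for z in counts}

print(f"slope ignoring Z: {crude:.2f}")
for z, slope in by_group.items():
    print(f"slope within Z={z}: {slope:.2f}")
```

Ignoring Z, X appears to have a strong positive effect (slope about 1.34), yet within each level of Z the slope is 0. The apparent effect of X comes entirely from its correlation with Z, which is the kind of distortion a grouped analysis helps to expose.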