Before we begin
This post introduces five classification algorithms, namely
Linear discriminant analysis (LDA),
Quadratic discriminant analysis (QDA),
Logistic regression (LR),
Support vector machines (SVM),
K-nearest neighbour (KNN).
To show how these five algorithms work, we will use a simulated data example: first explain the idea behind each method, then build the model in R, then check how well the model fits, and finally compare the different algorithms (a sketch of the data set-up is given right below).
My first draft was written in English, so I will keep the rest of this post in English; a Chinese version may follow later.
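Before diving into the algorithms, we need the data. The lines below are only a minimal sketch of how a two-class data set like the one used here could be simulated and split into trainSet and testSet; the names X1, X2, Group, trainSet and testSet match the ones used later, but the class means, sample sizes and the 80/20 split are illustrative assumptions of mine, not necessarily the settings behind the results shown in this post.

# Hypothetical data set-up (parameters below are assumptions, not the post's actual settings)
library(MASS)    # lda(), qda(), mvrnorm()
library(caret)   # createDataPartition(), confusionMatrix()
library(dplyr)   # the %>% pipe used later

set.seed(123)
n0 <- 300; n1 <- 700                          # roughly a 30/70 class mix
g0 <- mvrnorm(n0, mu = c(0, 0),     Sigma = diag(2))
g1 <- mvrnorm(n1, mu = c(1.5, 1.5), Sigma = diag(2))
simData <- data.frame(X1    = c(g0[, 1], g1[, 1]),
                      X2    = c(g0[, 2], g1[, 2]),
                      Group = factor(rep(c(0, 1), c(n0, n1))))

# 80/20 split into training and test sets (the split ratio is also an assumption)
trainIdx <- createDataPartition(simData$Group, p = 0.8, list = FALSE)
trainSet <- simData[trainIdx, ]
testSet  <- simData[-trainIdx, ]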
Linear discriminant analysis (LDA)
Description of the method:
The LDA algorithm starts by finding the directions that maximize the separation between classes, then uses these directions to predict the class of new observations. These directions, called linear discriminants, are linear combinations of the predictor variables.
LDA assumes that the predictors are normally distributed (Gaussian distribution) and that the classes have class-specific means but a common variance/covariance matrix.
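To make this concrete, the following sketch computes the two-class linear discriminant scores by hand. Under the equal-covariance Gaussian assumption, an observation x is assigned to the class k with the largest score delta_k(x) = x' inv(Sigma) mu_k - 0.5 * mu_k' inv(Sigma) mu_k + log(pi_k), where mu_k are the class means, Sigma is the pooled covariance matrix and pi_k are the class priors. This is not the code used later (we will simply call lda()), just an illustration of the rule, assuming the trainSet/testSet objects sketched above.

# Sketch: two-class LDA scores computed by hand (illustrative only)
X  <- as.matrix(trainSet[, c("X1", "X2")])
y  <- trainSet$Group
mu0 <- colMeans(X[y == 0, ]); mu1 <- colMeans(X[y == 1, ])
# Pooled (within-class) covariance matrix, shared by both classes under LDA
S    <- ((sum(y == 0) - 1) * cov(X[y == 0, ]) +
         (sum(y == 1) - 1) * cov(X[y == 1, ])) / (nrow(X) - 2)
Sinv <- solve(S)
prior0 <- mean(y == 0); prior1 <- mean(y == 1)

# Linear discriminant score delta_k(x) for each test point, one per class
Xtest  <- as.matrix(testSet[, c("X1", "X2")])
delta0 <- Xtest %*% Sinv %*% mu0 - 0.5 * drop(t(mu0) %*% Sinv %*% mu0) + log(prior0)
delta1 <- Xtest %*% Sinv %*% mu1 - 0.5 * drop(t(mu1) %*% Sinv %*% mu1) + log(prior1)
predClass <- ifelse(delta1 > delta0, 1, 0)   # assign the class with the larger score

With default priors, these hand-computed predictions should essentially match what predict() on an lda() model returns below.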
Analysis and results:
We use the lda() function from the MASS package to build the model on trainSet and then make predictions on testSet. The prediction object contains class, the predicted class of each observation, which we use to compute the confusion matrix with caret's confusionMatrix().
From the output below we find:
- The model reaches an accuracy of 0.71 on testSet, which is only moderately good;
- Sensitivity is 0.27 and Specificity is 0.89, so Sensitivity is very low;
- In the confusion matrix, of the 59 actual Group0 points, 43 were predicted as Group1, so most of them were misallocated; this is just another way of reading the Sensitivity (1 - 43/59 ≈ 0.27). Of the 141 actual Group1 points, only 15 were predicted as Group0, so only a small fraction were misallocated; this is another way of reading the Specificity (1 - 15/141 ≈ 0.89). Again, Specificity is good but Sensitivity is far too low (a quick manual check of these numbers follows right after this list).
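As a sanity check, the reported Sensitivity, Specificity and Accuracy can be recomputed directly from the four cells of the confusion matrix (a small sketch; note that caret treats the first factor level, Group0, as the "positive" class here):

# Manual check of the metrics reported by confusionMatrix() below
TP <- 16; FN <- 43    # actual Group0: 16 correctly predicted, 43 sent to Group1
TN <- 126; FP <- 15   # actual Group1: 126 correctly predicted, 15 sent to Group0
sensitivity <- TP / (TP + FN)                    # 16 / 59   ≈ 0.27
specificity <- TN / (TN + FP)                    # 126 / 141 ≈ 0.89
accuracy    <- (TP + TN) / (TP + FN + TN + FP)   # 142 / 200 = 0.71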
> model1 <- lda(Group ~ X1+X2, data = trainSet)
> prediction1 <- model1 %>% predict(testSet)
> confusionMatrix(as.factor(prediction1$class),as.factor(testSet$Group))
Confusion Matrix and Statistics
          Reference
Prediction   0   1
         0  16  15
         1  43 126
Accuracy : 0.71
95% CI : (0.6418, 0.7718)
No Information Rate : 0.705
P-V