Scikit-Learn and TensorFlow (Aurélien Géron, 2017) Study Notes, Chapter 3

Classification


1. MNIST

Functions: fetch_mldata(), permutation(), SGDClassifier(), fit(), predict(), StratifiedKFold(), cross_val_score(), cross_val_predict()
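A minimal loading sketch following the book's code. Note that fetch_mldata() was removed in later Scikit-Learn releases; on versions 0.20 and up, fetch_openml('mnist_784', version=1) is the replacement:

    import numpy as np
    from sklearn.datasets import fetch_mldata

    # Load the 70,000 MNIST digits (784 pixels each)
    mnist = fetch_mldata('MNIST original')
    X, y = mnist["data"], mnist["target"]

    # The book's split: first 60,000 images for training, last 10,000 for testing
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    # MNIST is ordered by digit, so shuffle the training set
    # to make cross-validation folds representative
    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]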

2. Training a Binary Classifier

Stochastic Gradient Descent (SGD) classifier: handles training instances independently, one at a time, which makes it well suited to very large datasets and to online learning.

Batch Gradient Descent: by contrast, uses the whole training set to compute the gradient at every step, which is slow on large datasets.

Question: after some_digit_image = some_digit.reshape(28, 28), why does the displayed image not change when the shape parameters are modified?
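A short sketch of the chapter's 5-detector, which the later performance-measure sections reuse (random_state is fixed only for reproducibility):

    from sklearn.linear_model import SGDClassifier

    # Binary targets: True for 5s, False for every other digit
    y_train_5 = (y_train == 5)
    y_test_5 = (y_test == 5)

    # Train a binary "is it a 5?" classifier with stochastic gradient descent
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)

    some_digit = X_train[0]               # any 784-pixel image vector
    print(sgd_clf.predict([some_digit]))  # True if the model thinks it is a 5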

3. Performance Measures

   (1) Measuring Accuracy Using Cross-Validation

    This demonstrates why accuracy is generally not the preferred performance measure
    for classifiers, especially when you are dealing with skewed datasets (i.e., when some
    classes are much more frequent than others).
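A sketch of the demonstration: cross-validated accuracy looks impressive, yet a dummy model that never predicts 5 scores about 90% as well, because only about 10% of the images are 5s:

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.model_selection import cross_val_score

    # Accuracy of the 5-detector over 3 folds: typically around 95%
    print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))

    class Never5Classifier(BaseEstimator):
        """Always predicts 'not 5', to show how accuracy can mislead."""
        def fit(self, X, y=None):
            return self
        def predict(self, X):
            return np.zeros((len(X),), dtype=bool)

    # Still roughly 90% accuracy, since ~90% of digits are not 5s
    print(cross_val_score(Never5Classifier(), X_train, y_train_5,
                          cv=3, scoring="accuracy"))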

     (2) Confusion Matrix

     A much better way to evaluate the performance of a classifier is to look at the confusion matrix.


      Functions: precision_score(), recall_score(), f1_score()
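A sketch tying these functions together, following the book's flow; cross_val_predict() returns out-of-fold predictions, so each instance is scored by a model that never saw it during training:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score)

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    print(confusion_matrix(y_train_5, y_train_pred))  # [[TN, FP], [FN, TP]]
    print(precision_score(y_train_5, y_train_pred))   # TP / (TP + FP)
    print(recall_score(y_train_5, y_train_pred))      # TP / (TP + FN)
    print(f1_score(y_train_5, y_train_pred))          # harmonic mean of the two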

  (3) Precision/Recall Tradeoff

     Functions: sgd_clf.decision_function(), cross_val_predict(), precision_recall_curve()
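A sketch of choosing a threshold by hand: decision_function() exposes raw scores instead of hard predictions, and precision_recall_curve() sweeps every possible threshold over them:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # Collect decision scores rather than class predictions
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")

    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Raising the threshold raises precision but lowers recall
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend()
    plt.show()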

  (4) The ROC Curve (receiver operating characteristic curve), also called the sensitivity curve

  Functions: roc_curve(), roc_auc_score(), RandomForestClassifier(), cross_val_predict()

   The ROC curve plots the true positive rate (another name for recall) against the false positive rate.

Consider a binary classification problem, where each instance is classified as either positive or negative. Four outcomes are possible: if a positive instance is predicted positive, it is a true positive (TP); if a negative instance is predicted positive, it is a false positive (FP). Likewise, a negative instance predicted negative is a true negative (TN), and a positive instance predicted negative is a false negative (FN).

Two quantities follow from this contingency table. The true positive rate (TPR), TPR = TP / (TP + FN), is the proportion of all positive instances that the classifier correctly identifies. The false positive rate (FPR), FPR = FP / (FP + TN), is the proportion of all negative instances that the classifier mistakes for positives. A third quantity, the true negative rate (TNR), also called specificity, is TNR = TN / (FP + TN) = 1 - FPR.
In a binary model that outputs continuous scores, suppose a threshold has been fixed, say 0.6: instances scoring above it are classified as positive, the rest as negative. Lowering the threshold to 0.5 certainly identifies more positives, raising the proportion of positives detected (the TPR), but it also turns more negative instances into false positives, raising the FPR. The ROC curve exists to visualize exactly this trade-off.
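A sketch of plotting the ROC curve for the 5-detector and comparing it against a random forest, as the chapter does; RandomForestClassifier has no decision_function(), so predict_proba() supplies the scores:

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import roc_curve, roc_auc_score

    # ROC of the SGD 5-detector, reusing y_scores from the previous sketch
    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

    # For the forest, use the probability of the positive class as the score
    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                        method="predict_proba")
    y_scores_forest = y_probas_forest[:, 1]
    fpr_forest, tpr_forest, _ = roc_curve(y_train_5, y_scores_forest)

    plt.plot(fpr, tpr, "b:", label="SGD")
    plt.plot(fpr_forest, tpr_forest, label="Random Forest")
    plt.plot([0, 1], [0, 1], "k--")  # diagonal = a purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.legend(loc="lower right")
    plt.show()

    print(roc_auc_score(y_train_5, y_scores))         # AUC for SGD
    print(roc_auc_score(y_train_5, y_scores_forest))  # AUC for the forest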
(5) AUC (Area Under the Curve)
The AUC (Area Under the Curve) is defined as the area under the ROC curve, which obviously cannot exceed 1. Since a useful ROC curve lies above the line y = x, the AUC generally falls between 0.5 and 1. The AUC is used as an evaluation criterion because the ROC curves themselves often do not show clearly which classifier is better, whereas as a single number it does: the classifier with the larger AUC performs better.
Hopefully you now know how to train binary classifiers, choose the appropriate metric for your task, evaluate your classifiers using cross-validation, select the precision/recall tradeoff that fits your needs, and compare various models using ROC curves and ROC AUC scores.
4. Multiclass Classification (also called multinomial classification)
(1) One-versus-all (OvA) strategy (also called one-versus-the-rest)
Perform multiclass classification using multiple binary classifiers: when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.

     (2) One-versus-one (OvO) strategy

      Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. With N classes this takes N × (N - 1) / 2 classifiers, so for the MNIST problem it means training 10 × 9 / 2 = 45 binary classifiers! When you want to classify an image, you have to run it through all 45 classifiers and see which class wins the most duels.

      The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

      For most binary classification algorithms, however, OvA is preferred. Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO). Both cases appear in the sketch after the functions list below.

    Functions: decision_function(), OneVsOneClassifier(), predict_proba()
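A sketch of both strategies with the SGD classifier: fitting it directly on the 10-class labels silently runs OvA (ten binary classifiers, ten scores per image), while OneVsOneClassifier forces OvO:

    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsOneClassifier

    # Scikit-Learn applies OvA automatically: 10 binary classifiers under the hood
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train)
    print(sgd_clf.decision_function([some_digit]))  # 10 scores, one per class
    print(sgd_clf.predict([some_digit]))            # class with the highest score

    # Forcing one-versus-one: 10 * 9 / 2 = 45 pairwise classifiers
    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
    ovo_clf.fit(X_train, y_train)
    print(len(ovo_clf.estimators_))                 # 45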

  5. Error Analysis

    Analyzing the confusion matrix can often give you insights on ways to improve your classifier.


First, you need to divide each value in the confusion matrix by the number of images in the corresponding class, so you can compare error rates instead of absolute number of errors (which would make abundant classes look unfairly bad):

Now let’s fill the diagonal with zeros to keep only the errors, and let’s plot the result:

Remember that rows represent actual classes, while columns represent predicted classes. Now you can clearly see the kinds of errors the classifier makes.
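A sketch of the two steps just described, computing the multiclass confusion matrix and keeping only the normalized errors:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)

    # Divide each row by its class size to compare error rates, not counts
    row_sums = conf_mx.sum(axis=1, keepdims=True)
    norm_conf_mx = conf_mx / row_sums

    # Zero the diagonal so only the errors remain, then plot
    np.fill_diagonal(norm_conf_mx, 0)
    plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
    plt.show()
    # Bright off-diagonal cells (rows = actual, columns = predicted)
    # point at the confusions worth investigating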


   6. Multilabel Classification

  Functions: KNeighborsClassifier()

 In some cases you may want your classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? Of course it should attach one label per person it recognizes. Such a classification system that outputs multiple binary labels is called a multilabel classification system.
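A sketch of the chapter's multilabel example: each digit gets two binary labels, "large" (7, 8, or 9) and "odd", and KNeighborsClassifier fits the pair directly:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Two binary labels per instance
    y_train_large = (y_train >= 7)
    y_train_odd = (y_train % 2 == 1)
    y_multilabel = np.c_[y_train_large, y_train_odd]

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train, y_multilabel)

    print(knn_clf.predict([some_digit]))
    # e.g. [[False  True]] for a 5: not large, but odd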

   7. Multioutput Classification

  Functions: randint()

  It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values). Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255).
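A sketch of the chapter's noise-removal example: the noisy images are the inputs, the clean originals are the targets, so the system emits 784 labels per image, each with 256 possible values:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Corrupt the images with random pixel noise; the clean images become targets
    noise_train = np.random.randint(0, 100, (len(X_train), 784))
    noise_test = np.random.randint(0, 100, (len(X_test), 784))
    X_train_mod = X_train + noise_train
    X_test_mod = X_test + noise_test
    y_train_mod = X_train
    y_test_mod = X_test

    # One multiclass label (an intensity 0-255) per pixel
    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train_mod, y_train_mod)
    clean_digit = knn_clf.predict([X_test_mod[0]])  # a denoised 784-pixel vector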

 





