Scikit-Learn and TensorFlow (Aurélien Géron, 2017) Study Notes, Chapter 3

Classification


1. MNIST

Functions: fetch_mldata(), permutation(), SGDClassifier(), fit(), predict(), StratifiedKFold(), cross_val_score(), cross_val_predict()
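A minimal loading sketch following the book's code. Note that fetch_mldata() was removed in later Scikit-Learn releases; on versions 0.20 and up, fetch_openml('mnist_784', version=1) is the replacement:

    import numpy as np
    from sklearn.datasets import fetch_mldata

    # Load the 70,000 MNIST digits (784 pixels each)
    mnist = fetch_mldata('MNIST original')
    X, y = mnist["data"], mnist["target"]

    # The book's split: first 60,000 images for training, last 10,000 for testing
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    # MNIST is ordered by digit, so shuffle the training set
    # to make cross-validation folds representative
    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]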

2. Training a Binary Classifier

Stochastic Gradient Descent (SGD) classifier: handles training instances independently, one at a time, which makes it well suited to very large datasets and to online learning.

Batch Gradient Descent: by contrast, uses the whole training set to compute the gradient at every step, which is slow on large datasets.

Question: after some_digit_image = some_digit.reshape(28, 28), why does the displayed image not change when the shape parameters are modified?
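A short sketch of the chapter's 5-detector, which the later performance-measure sections reuse (random_state is fixed only for reproducibility):

    from sklearn.linear_model import SGDClassifier

    # Binary targets: True for 5s, False for every other digit
    y_train_5 = (y_train == 5)
    y_test_5 = (y_test == 5)

    # Train a binary "is it a 5?" classifier with stochastic gradient descent
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train_5)

    some_digit = X_train[0]               # any 784-pixel image vector
    print(sgd_clf.predict([some_digit]))  # True if the model thinks it is a 5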

3. Performance Measures

   (1) Measuring Accuracy Using Cross-Validation

    This demonstrates why accuracy is generally not the preferred performance measure
    for classifiers, especially when you are dealing with skewed datasets (i.e., when some
    classes are much more frequent than others).
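A sketch of the demonstration: cross-validated accuracy looks impressive, yet a dummy model that never predicts 5 scores about 90% as well, because only about 10% of the images are 5s:

    import numpy as np
    from sklearn.base import BaseEstimator
    from sklearn.model_selection import cross_val_score

    # Accuracy of the 5-detector over 3 folds: typically around 95%
    print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))

    class Never5Classifier(BaseEstimator):
        """Always predicts 'not 5', to show how accuracy can mislead."""
        def fit(self, X, y=None):
            return self
        def predict(self, X):
            return np.zeros((len(X),), dtype=bool)

    # Still roughly 90% accuracy, since ~90% of digits are not 5s
    print(cross_val_score(Never5Classifier(), X_train, y_train_5,
                          cv=3, scoring="accuracy"))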

     (2) Confusion Matrix

     A much better way to evaluate the performance of a classifier is to look at the confusion matrix.


      Functions: precision_score(), recall_score(), f1_score()
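A sketch tying these functions together, following the book's flow; cross_val_predict() returns out-of-fold predictions, so each instance is scored by a model that never saw it during training:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import (confusion_matrix, precision_score,
                                 recall_score, f1_score)

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    print(confusion_matrix(y_train_5, y_train_pred))  # [[TN, FP], [FN, TP]]
    print(precision_score(y_train_5, y_train_pred))   # TP / (TP + FP)
    print(recall_score(y_train_5, y_train_pred))      # TP / (TP + FN)
    print(f1_score(y_train_5, y_train_pred))          # harmonic mean of the two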

  (3) Precision/Recall Tradeoff

     Functions: sgd_clf.decision_function(), cross_val_predict(), precision_recall_curve()
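A sketch of choosing a threshold by hand: decision_function() exposes raw scores instead of hard predictions, and precision_recall_curve() sweeps every possible threshold over them:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # Collect decision scores rather than class predictions
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")

    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

    # Raising the threshold raises precision but lowers recall
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend()
    plt.show()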

  (4) The ROC Curve (receiver operating characteristic curve), also called the sensitivity curve

  Functions: roc_curve(), roc_auc_score(), RandomForestClassifier(), cross_val_predict()

   The ROC curve plots the true positive rate (another name for recall) against the false positive rate.

Consider a binary classification problem, where each instance is classified as either positive or negative. Four outcomes are possible: if a positive instance is predicted positive, it is a true positive (TP); if a negative instance is predicted positive, it is a false positive (FP). Likewise, a negative instance predicted negative is a true negative (TN), and a positive instance predicted negative is a false negative (FN).

Two quantities follow from this contingency table. The true positive rate (TPR), TPR = TP / (TP + FN), is the proportion of all positive instances that the classifier correctly identifies. The false positive rate (FPR), FPR = FP / (FP + TN), is the proportion of all negative instances that the classifier mistakes for positives. A third quantity, the true negative rate (TNR), also called specificity, is TNR = TN / (FP + TN) = 1 - FPR.
In a binary model that outputs continuous scores, suppose a threshold has been fixed, say 0.6: instances scoring above it are classified as positive, the rest as negative. Lowering the threshold to 0.5 certainly identifies more positives, raising the proportion of positives detected (the TPR), but it also turns more negative instances into false positives, raising the FPR. The ROC curve exists to visualize exactly this trade-off.
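A sketch of plotting the ROC curve for the 5-detector and comparing it against a random forest, as the chapter does; RandomForestClassifier has no decision_function(), so predict_proba() supplies the scores:

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import roc_curve, roc_auc_score

    # ROC of the SGD 5-detector, reusing y_scores from the previous sketch
    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

    # For the forest, use the probability of the positive class as the score
    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                        method="predict_proba")
    y_scores_forest = y_probas_forest[:, 1]
    fpr_forest, tpr_forest, _ = roc_curve(y_train_5, y_scores_forest)

    plt.plot(fpr, tpr, "b:", label="SGD")
    plt.plot(fpr_forest, tpr_forest, label="Random Forest")
    plt.plot([0, 1], [0, 1], "k--")  # diagonal = a purely random classifier
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (Recall)")
    plt.legend(loc="lower right")
    plt.show()

    print(roc_auc_score(y_train_5, y_scores))         # AUC for SGD
    print(roc_auc_score(y_train_5, y_scores_forest))  # AUC for the forest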
(5) AUC (Area Under the Curve)
The AUC (Area Under the Curve) is defined as the area under the ROC curve, which obviously cannot exceed 1. Since a useful ROC curve lies above the line y = x, the AUC generally falls between 0.5 and 1. The AUC is used as an evaluation criterion because the ROC curves themselves often do not show clearly which classifier is better, whereas as a single number it does: the classifier with the larger AUC performs better.
Hopefully you now know how to train binary classifiers, choose the appropriate metric for your task, evaluate your classifiers using cross-validation, select the precision/recall tradeoff that fits your needs, and compare various models using ROC curves and ROC AUC scores.
4. Multiclass Classification (also called multinomial classification)
(1) One-versus-all (OvA) strategy (also called one-versus-the-rest)
Perform multiclass classification using multiple binary classifiers: when you want to classify an image, you get the decision score from each classifier for that image and select the class whose classifier outputs the highest score.

     (2) One-versus-one (OvO) strategy

      Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. With N classes this takes N × (N - 1) / 2 classifiers, so for the MNIST problem it means training 10 × 9 / 2 = 45 binary classifiers! When you want to classify an image, you have to run it through all 45 classifiers and see which class wins the most duels.

      The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

      For most binary classification algorithms, however, OvA is preferred. Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers, for which it uses OvO). Both cases appear in the sketch after the functions list below.

    Functions: decision_function(), OneVsOneClassifier(), predict_proba()
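A sketch of both strategies with the SGD classifier: fitting it directly on the 10-class labels silently runs OvA (ten binary classifiers, ten scores per image), while OneVsOneClassifier forces OvO:

    from sklearn.linear_model import SGDClassifier
    from sklearn.multiclass import OneVsOneClassifier

    # Scikit-Learn applies OvA automatically: 10 binary classifiers under the hood
    sgd_clf = SGDClassifier(random_state=42)
    sgd_clf.fit(X_train, y_train)
    print(sgd_clf.decision_function([some_digit]))  # 10 scores, one per class
    print(sgd_clf.predict([some_digit]))            # class with the highest score

    # Forcing one-versus-one: 10 * 9 / 2 = 45 pairwise classifiers
    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
    ovo_clf.fit(X_train, y_train)
    print(len(ovo_clf.estimators_))                 # 45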

  5. Error Analysis

    Analyzing the confusion matrix can often give you insights on ways to improve your classifier.


First, you need to divide each value in the confusion matrix by the number of images in the corresponding class, so you can compare error rates instead of absolute number of errors (which would make abundant classes look unfairly bad):

Now let’s fill the diagonal with zeros to keep only the errors, and let’s plot the result:

Remember that rows represent actual classes, while columns represent predicted classes. Now you can clearly see the kinds of errors the classifier makes.
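A sketch of the two steps just described, computing the multiclass confusion matrix and keeping only the normalized errors:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix

    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)

    # Divide each row by its class size to compare error rates, not counts
    row_sums = conf_mx.sum(axis=1, keepdims=True)
    norm_conf_mx = conf_mx / row_sums

    # Zero the diagonal so only the errors remain, then plot
    np.fill_diagonal(norm_conf_mx, 0)
    plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
    plt.show()
    # Bright off-diagonal cells (rows = actual, columns = predicted)
    # point at the confusions worth investigating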


   6. Multilabel Classification

  Functions: KNeighborsClassifier()

 In some cases you may want your classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? Of course it should attach one label per person it recognizes. Such a classification system that outputs multiple binary labels is called a multilabel classification system.
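A sketch of the chapter's multilabel example: each digit gets two binary labels, "large" (7, 8, or 9) and "odd", and KNeighborsClassifier fits the pair directly:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Two binary labels per instance
    y_train_large = (y_train >= 7)
    y_train_odd = (y_train % 2 == 1)
    y_multilabel = np.c_[y_train_large, y_train_odd]

    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train, y_multilabel)

    print(knn_clf.predict([some_digit]))
    # e.g. [[False  True]] for a 5: not large, but odd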

   7. Multioutput Classification

  Functions: randint()

  It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values). Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255).
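A sketch of the chapter's noise-removal example: the noisy images are the inputs, the clean originals are the targets, so the system emits 784 labels per image, each with 256 possible values:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Corrupt the images with random pixel noise; the clean images become targets
    noise_train = np.random.randint(0, 100, (len(X_train), 784))
    noise_test = np.random.randint(0, 100, (len(X_test), 784))
    X_train_mod = X_train + noise_train
    X_test_mod = X_test + noise_test
    y_train_mod = X_train
    y_test_mod = X_test

    # One multiclass label (an intensity 0-255) per pixel
    knn_clf = KNeighborsClassifier()
    knn_clf.fit(X_train_mod, y_train_mod)
    clean_digit = knn_clf.predict([X_test_mod[0]])  # a denoised 784-pixel vector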

 





