2.1 How to Evaluate a Model: Study Notes

I. Judging Whether a Model Is Good

1. The iris dataset and train/test splitting

The iris dataset is a classic dataset from the UCI repository. We can load it directly and do some initial exploration:

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape
(150, 4)
y.shape
(150,)
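
As a quick first look (an optional, minimal sketch using the plt already imported above), we can scatter the first two features colored by class:

# scatter sepal length vs. sepal width, one color/marker per class
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", marker="o")
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", marker="+")
plt.scatter(X[y == 2, 0], X[y == 2, 1], color="green", marker="x")
plt.show()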

We want to shuffle the dataset. But here the features and labels are stored separately, so shuffling each independently would destroy their correspondence. There are two ways to solve this:

  • Concatenate X and y into a single matrix, shuffle that matrix, then split it apart again
  • Shuffle the sample indices, then use the shuffled indices to select matching rows from X and y
# Method 1
# Use np.concatenate to join the arrays; the inputs must have compatible shapes,
# so reshape the labels with reshape(-1, 1): row count inferred, one column.
# axis=1 concatenates column-wise (side by side).
tempConcat = np.concatenate((X, y.reshape(-1, 1)), axis=1)
# Once joined, shuffle in place
np.random.shuffle(tempConcat)
# Then split the shuffled array back apart with np.split
shuffle_X, shuffle_y = np.split(tempConcat, [4], axis=1)
# Set the split ratio
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
X_train = shuffle_X[test_size:]
y_train = shuffle_y[test_size:]
X_test = shuffle_X[:test_size]
y_test = shuffle_y[:test_size]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120, 1)
(30, 1)
# Method 2
# Generate a permutation of len(X) integers: a new array whose elements are not the data itself but shuffled indices
shuffle_index = np.random.permutation(len(X))
# Specify the proportion of test data
test_ratio = 0.2
test_size = int(len(X) * test_ratio)
test_index = shuffle_index[:test_size]
train_index = shuffle_index[test_size:]
X_train = X[train_index]
X_test = X[test_index]
y_train = y[train_index]
y_test = y[test_index]
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
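
If the shuffle needs to be reproducible, seed NumPy's random generator first (666 here is just an arbitrary example seed):

# optional: seed the RNG so the permutation is repeatable across runs
np.random.seed(666)
shuffle_index = np.random.permutation(len(X))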

2. Writing our own train_test_split

Let's write our own train_test_split method (a sketch follows; the full implementation lives at https://github.com/japsonzbz/ML_Algorithms).
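
A minimal sketch of what such a train_test_split might look like, built from the index-shuffling idea of Method 2 above (illustrative; the repo's actual code may differ, and the seed parameter is an assumption):

import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    # X and y must describe the same samples
    assert X.shape[0] == y.shape[0], "the size of X must equal the size of y"
    assert 0.0 <= test_ratio <= 1.0, "test_ratio must be between 0 and 1"
    if seed:
        np.random.seed(seed)
    # shuffle indices, then slice off the test portion
    shuffle_index = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]
    return X[train_index], X[test_index], y[train_index], y[test_index]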

# call it
from model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)

As a quick sanity check, pass X_train and y_train to the algorithm via fit, then predict on X_test to get y_predict:

from kNN import kNNClassifier

my_kNNClassifier = kNNClassifier(k=3)
my_kNNClassifier.fit(X_train, y_train)

y_predict = my_kNNClassifier.predict(X_test)
y_predict
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2])
y_test
array([1, 0, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 0, 0, 1, 2, 1, 0, 0, 2, 0, 0,
       2, 0, 1, 2, 2, 1, 0, 0])
# Comparing the two vectors elementwise returns a boolean vector; summing it (True=1, False=0) counts the matches
sum(y_predict == y_test)
7
# accuracy
sum(y_predict == y_test)/len(y_test)
0.23333333333333334

3. train_test_split in sklearn

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(120, 4)
(30, 4)
(120,)
(30,)
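
sklearn's train_test_split also accepts a stratify argument that keeps the class proportions of y the same in both splits, which is useful for imbalanced data (optional here, since iris is balanced):

# preserve the class ratio of y in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=666, stratify=y)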

II. Classification accuracy

Because accuracy is clearly defined and simple to compute, it is used very often. But it is not always the best tool for evaluating a model: in some settings, metrics such as precision and recall measure performance better than accuracy does.
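
For instance (a toy illustration with made-up labels, not data from this notebook), on a heavily imbalanced problem a classifier that always predicts the majority class gets high accuracy yet finds none of the positives:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 negatives, 10 positives, and a classifier that always says "negative"
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks good
print(recall_score(y_true, y_pred))    # 0.0 -- but no positive is ever found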

1. Exploring the data

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# The handwritten digits dataset: a bundled object that can be treated like a dictionary
digits = datasets.load_digits()
# Use keys() to see what the dataset contains
digits.keys()
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

The dataset description provided by sklearn.datasets:

# The full UCI set has 5620 images (sklearn ships only a subset); each image has 64 features (an 8x8 grid of integer pixels), each in the range 0..16, and the label is one of 10 digits
print(digits.DESCR)
.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

.. topic:: References

  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.
# shape of the features
X = digits.data
X.shape
(1797, 64)
# shape of the labels
y = digits.target
y.shape
(1797,)
# the label classes
digits.target_names
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# Take out one specific sample and look at its features and label
some_digit = X[666]
some_digit
array([ 0.,  0.,  5., 15., 14.,  3.,  0.,  0.,  0.,  0., 13., 15.,  9.,
       15.,  2.,  0.,  0.,  4., 16., 12.,  0., 10.,  6.,  0.,  0.,  8.,
       16.,  9.,  0.,  8., 10.,  0.,  0.,  7., 15.,  5.,  0., 12., 11.,
        0.,  0.,  7., 13.,  0.,  5., 16.,  6.,  0.,  0.,  0., 16., 12.,
       15., 13.,  1.,  0.,  0.,  0.,  6., 16., 12.,  2.,  0.,  0.])
y[666]
0
# We can also visualize this sample
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.show()

[Figure: the handwritten digit at index 666 rendered as an 8x8 grayscale image (output_34_0.png)]

2. Implementing classification accuracy ourselves

Once classification is done, we can compute the algorithm's accuracy:

X_train, X_test, y_train, y_test = train_test_split(X, y)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
# compare y_predict against y_test elementwise
sum(y_predict == y_test) / len(y_test)
0.9844444444444445

Add a metrics.py file to the project to collect performance metrics as wrapped functions.
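
A minimal sketch of what accuracy_score in metrics.py might look like (illustrative; the repo's actual code may differ):

import numpy as np

def accuracy_score(y_true, y_predict):
    # the two vectors must have the same number of samples
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    # fraction of positions where prediction matches truth
    return np.sum(y_true == y_predict) / len(y_true)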

# sanity check: the two vectors have the same number of samples
y_test.reshape(-1,1)
y_predict.reshape(-1,1)
y_test.shape[0] == y_predict.shape[0]
True
# call it
from metrics import accuracy_score
accuracy_score(y_test, y_predict)
0.9844444444444445

Here we used the classifier to produce y_predict and compared it with the ground truth. But sometimes we don't care what y_predict actually is, only how accurate the model is, so the kNN model wraps this one step further in a score function.
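
A possible shape for that score method inside kNNClassifier (a sketch; the repo's code may differ):

def score(self, X_test, y_test):
    # predict first, then delegate to accuracy_score from metrics.py
    y_predict = self.predict(X_test)
    return accuracy_score(y_test, y_predict)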

knn_clf.score(X_test, y_test)
0.9844444444444445

3. Accuracy in sklearn

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
accuracy_score(y_test, y_predict)
0.9888888888888889

III. Hyperparameters

A hyperparameter is a parameter that must be specified before the learning algorithm runs (tuning a model means tuning its hyperparameters), such as k in kNN.

The contrasting concept is a model parameter: something the algorithm learns during training that belongs to the model itself (kNN has no model parameters; regression algorithms have many).

1. Finding a good k

# Track the best score (initialized to 0.0) and the best k (initialized to -1)
best_score = 0.0
best_k = -1
for k in range(1, 11):  # tentatively search k over 1..10
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k = ", best_k)
print("best_score = ", best_score)
best_k =  4
best_score =  0.9916666666666667

2. Weights

If the nearest neighbors of a sample should influence its prediction the most, we can use the inverse of the distance as a weight.

The KNeighborsClassifier constructor in sklearn.neighbors has a weights parameter. The default, 'uniform', ignores distance; 'distance' weights neighbors by their distance. The default metric is Euclidean; to use Manhattan distance instead, set the parameter p, the exponent of the Minkowski distance (sum_i |a_i - b_i|^p)^(1/p), where p=1 gives Manhattan and p=2 Euclidean. p is itself a hyperparameter.
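
A quick numeric check of that formula (a minimal sketch with a made-up point pair):

import numpy as np

def minkowski_distance(a, b, p):
    # (sum_i |a_i - b_i|^p)^(1/p); p=1 -> Manhattan, p=2 -> Euclidean
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski_distance(a, b, 1))  # 7.0 (Manhattan)
print(minkowski_distance(a, b, 2))  # 5.0 (Euclidean)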

# compare the two weighting schemes
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform","distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method, p=2)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method = ", method)
print("best_k = ", best_k)
print("best_score = ", best_score)
best_method =  distance
best_k =  4
best_score =  0.9916666666666667

3. Grid search over hyperparameters

sklearn wraps hyperparameter grid search in a dedicated tool, GridSearchCV.

Before searching, define the search space param_search: a list whose elements are dicts, each dict describing one grid of parameters to try together. The keys are parameter names and the values are lists of candidate values for each parameter.

param_search = [
    {
        "weights": ["uniform"],
        "n_neighbors": [i for i in range(1, 11)]
    },
    {
        "weights": ["distance"],
        "n_neighbors": [i for i in range(1, 11)],
        "p": [i for i in range(1, 6)]
    }
]

When weights='uniform' (distance ignored) we only search over k; when weights='distance' we also search over p to pick the distance formula. Now create the classifier the grid search will tune, and build the search object:

knn_clf = KNeighborsClassifier()
# import the grid search tool
from sklearn.model_selection import GridSearchCV
# Build the search object grid_search: the first constructor argument is the classifier to tune, the second is the parameter grid
grid_search = GridSearchCV(knn_clf, param_search)
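
Grid search can be slow, since it fits one model per parameter combination per CV fold. GridSearchCV also accepts n_jobs to parallelize and verbose to report progress, for example:

# optional: use all CPU cores and print progress while searching
grid_search = GridSearchCV(knn_clf, param_search, n_jobs=-1, verbose=2)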

Now fit grid_search on X_train and y_train to find the best hyperparameter combination in param_search:

%%time
grid_search.fit(X_train, y_train)
C:\Users\86139\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)

Wall time: 1min 44s

GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                            metric='minkowski',
                                            metric_params=None, n_jobs=None,
                                            n_neighbors=5, p=2,
                                            weights='uniform'),
             iid='warn', n_jobs=None,
             param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'weights': ['uniform']},
                         {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

Use the search object's attributes to retrieve the best classifier and its parameters:

# the best classifier found by the grid search, with its parameters
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=3,
                     weights='distance')

We can also check the best classifier's accuracy.

Note that best_estimator_ and best_score_ end with an underscore. This is a common sklearn naming convention: attributes ending in _ are not user-supplied parameters but values the estimator computed from what the user passed in. Also, best_score_ is the cross-validated score on the training data, so it can differ from the test-set score computed below.

grid_search.best_score_
0.9853862212943633
grid_search.best_params_
{'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
knn_clf = grid_search.best_estimator_
knn_clf.score(X_test, y_test)
0.9833333333333333
  • To assess how good a model is, split the dataset into a training set and a test set; we then predict on the test set and verify against its labels.

  • Once we have the predictions, accuracy is the number of correctly classified test points divided by the total number of test points.

  • We used kNN to classify handwritten digits. Of course, different evaluation metrics suit different scenarios and must not be used indiscriminately.

Finally, taking kNN as the example, we explored how different hyperparameters affect the model and used sklearn's built-in grid search to do basic tuning.
