2. Splitting the data into training and test sets
- import sklearn.model_selection as ms
  ms.train_test_split(
      inputs, outputs, test_size=test set fraction,
      random_state=random seed)
  -> train inputs, test inputs, train outputs, test outputs
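A minimal runnable sketch of this call (the toy data below is invented for illustration) makes the return order explicit, since mixing up the four return values is a common mistake:

```python
import numpy as np
import sklearn.model_selection as ms

# Toy dataset: 8 samples, 2 features, binary labels (values are made up)
x = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Return order: train inputs, test inputs, train outputs, test outputs
train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y, test_size=0.25, random_state=7)

print(train_x.shape, test_x.shape)  # 6 samples for training, 2 for testing
```

With test_size=0.25, a quarter of the 8 samples (2 of them) go to the test set; fixing random_state makes the split reproducible across runs.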
Code: split.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp

x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# Split into training and test sets
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=7)
# Naive Bayes classifier
model = nb.GaussianNB()
# Train the model on the training set
model.fit(train_x, train_y)
# Build a grid over the feature plane for plotting the decision boundary
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
# Evaluate the model on the test set
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
3. Cross validation
- ms.cross_val_score(model, inputs, outputs, cv=number of folds,
      scoring=metric name) -> array of metric values
  Metrics:
  - accuracy: correctly classified samples / total samples
  - precision_weighted: for each class, correctly predicted samples / samples predicted as that class
  - recall_weighted: for each class, correctly predicted samples / samples actually belonging to that class
  - f1_weighted:
    2 x precision x recall / (precision + recall)
  During cross validation, the precision, recall, or f1 score is computed for every class within each fold; the per-class values are then averaged, weighted by each class's share of the true samples, to give that fold's score. The scores of all folds are returned to the caller as an array.
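To make the weighted averaging concrete, here is a sketch (with hypothetical labels standing in for one fold) that reproduces sklearn's precision_weighted by hand; recall_weighted and f1_weighted follow the same scheme:

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for one fold, 3 classes
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])

# Per-class precision: correct predictions / all predictions of that class
per_class = precision_score(y_true, y_pred, average=None)

# 'weighted' averages the per-class values, weighted by each class's
# share of the true samples
weights = np.bincount(y_true) / y_true.size
manual = (per_class * weights).sum()

print(manual)                                               # 0.8
print(precision_score(y_true, y_pred, average='weighted'))  # 0.8
```

Here the per-class precisions are 2/3, 1/2, and 1, and the true-class shares are 0.3, 0.2, and 0.5, so the weighted score is 0.3 x 2/3 + 0.2 x 1/2 + 0.5 x 1 = 0.8.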
Code: cv.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp

x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# Split into training and test sets
train_x, test_x, train_y, test_y = \
    ms.train_test_split(
        x, y, test_size=0.25, random_state=7)
# Naive Bayes classifier
model = nb.GaussianNB()
# Cross validation
# Accuracy
ac = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='accuracy')
print(ac.mean())
# Precision
pw = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='precision_weighted')
print(pw.mean())
# Recall
rw = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='recall_weighted')
print(rw.mean())
# F1 score
fw = ms.cross_val_score(
    model, train_x, train_y, cv=5,
    scoring='f1_weighted')
print(fw.mean())
# Train the model on the training set
model.fit(train_x, train_y)
# Build a grid over the feature plane for plotting the decision boundary
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)