Preface
I am getting started with machine learning and recording my day-to-day study; corrections of any mistakes are welcome.
Reference: 机器学习及Python应用 (Machine Learning and Python Applications)
The datasets can be downloaded from Professor Chen Qiang's homepage.
I. Data Preprocessing
1. The Data
This case uses the Wisconsin breast cancer data to demonstrate the KNN classification algorithm. The dataset contains observations on 569 patients, with 30 feature variables related to breast cancer diagnosis plus the diagnosis itself.
2. Importing Modules and the Data
1) Import all the modules needed for this case
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
2) Load the data
cancer=load_breast_cancer()
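A quick way to see what the returned Bunch object contains (a minimal sketch; only attributes documented by scikit-learn are used):
print(cancer.target_names)#['malignant' 'benign'], so 0 = malignant and 1 = benign
print(cancer.feature_names[:3])#first three of the 30 feature names
print(cancer.data.shape)#(569, 30)
The mapping dictionary used in the next step could also be built directly from target_names, e.g. d=dict(enumerate(cancer.target_names)), instead of being hard-coded.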
3. Data Overview
df=pd.DataFrame(cancer.data,columns=cancer.feature_names)#convert to a DataFrame
df['diagnosis']=cancer.target
d={0:'malignant',1:'benign'}
df['diagnosis']=df['diagnosis'].map(d)#map the 0/1 codes to 'malignant'/'benign'
print(df.shape)
pd.options.display.max_columns=40
print(df.head(2))
print(df.iloc[:,:3].describe())
(569, 31)#shape of the DataFrame
#first 2 observations
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38           122.8     1001.0          0.11840
1        20.57         17.77           132.9     1326.0          0.08474
   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419
1           0.07864          0.0869              0.07017         0.1812
   mean fractal dimension  radius error  texture error  perimeter error  \
0                 0.07871        1.0950         0.9053            8.589
1                 0.05667        0.5435         0.7339            3.398
   area error  smoothness error  compactness error  concavity error  \
0      153.40          0.006399            0.04904          0.05373
1       74.08          0.005225            0.01308          0.01860
   concave points error  symmetry error  fractal dimension error  \
0               0.01587         0.03003                 0.006193
1               0.01340         0.01389                 0.003532
   worst radius  worst texture  worst perimeter  worst area  worst smoothness  \
0         25.38          17.33            184.6      2019.0            0.1622
1         24.99          23.41            158.8      1956.0            0.1238
   worst compactness  worst concavity  worst concave points  worst symmetry  \
0             0.6656           0.7119                0.2654          0.4601
1             0.1866           0.2416                0.1860          0.2750
   worst fractal dimension  diagnosis
0                  0.11890  malignant
1                  0.08902  malignant
#descriptive statistics of the first three features
       mean radius  mean texture  mean perimeter
count   569.000000    569.000000      569.000000
mean     14.127292     19.289649       91.969033
std       3.524049      4.301036       24.298981
min       6.981000      9.710000       43.790000
25%      11.700000     16.170000       75.170000
50%      13.370000     18.840000       86.240000
75%      15.780000     21.800000      104.100000
max      28.110000     39.280000      188.500000
Examine the distribution of the response variable diagnosis:
print(df.diagnosis.value_counts())#class counts
print(df.diagnosis.value_counts(normalize=True))#class proportions
#class counts
diagnosis
benign 357
malignant 212
Name: count, dtype: int64
#class proportions
diagnosis
benign 0.627417
malignant 0.372583
Name: proportion, dtype: float64
Draw a box plot of the first feature variable, 'mean radius', grouped by diagnosis:
sns.boxplot(x='diagnosis',y='mean radius',data=df)
plt.show()
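The same comparison can be drawn for other features. A minimal sketch using a 2x2 subplot grid (the choice of features here is only illustrative):
features=['mean radius','mean texture','mean perimeter','mean area']
fig,axes=plt.subplots(2,2,figsize=(10,8))
for ax,feature in zip(axes.ravel(),features):
    sns.boxplot(x='diagnosis',y=feature,data=df,ax=ax)#one box plot per feature, grouped by diagnosis
plt.tight_layout()
plt.show()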
4. Stratified Sampling
X,y=load_breast_cancer(return_X_y=True)
X_train,X_test,y_train,y_test=train_test_split(X,y,stratify=y,test_size=100,random_state=1)
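Because stratify=y is specified, the benign/malignant proportions should be roughly the same in both subsets. A quick check (a sketch, not part of the original workflow):
print(np.bincount(y_train)/len(y_train))#class proportions in the training set
print(np.bincount(y_test)/len(y_test))#class proportions in the 100-observation test set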
5. Standardizing the Variables
Because the feature variables differ greatly in scale, while the KNN algorithm requires the variables to vary over comparable ranges, the features must be standardized so that each has mean 0 and standard deviation 1.
scaler=StandardScaler()
scaler.fit(X_train)
X_train_s=scaler.transform(X_train)
X_test_s=scaler.transform(X_test)#standardize X_test using the mean and std of X_train
print(np.mean(X_train_s,axis=0))
print(np.std(X_train_s,axis=0))#mean and std of the standardized training set
print(np.mean(X_test_s,axis=0))
print(np.std(X_test_s,axis=0))#mean and std of the standardized test set
#training-set means
[-3.90637534e-15 -2.70643493e-15 -1.39807616e-15 1.06713974e-15
1.61349257e-15 2.05160456e-15 4.99245279e-16 2.26779031e-16
3.48098712e-16 -3.06838184e-15 -8.82497108e-16 7.28628245e-16
2.35774442e-16 1.00985318e-15 -2.70335756e-16 -1.16088139e-15
7.29220048e-16 1.78014438e-16 -9.01908257e-16 1.36114763e-16
3.95324616e-16 1.41748731e-15 -1.76215356e-15 -8.55984319e-16
5.17472821e-15 -6.13581680e-16 -1.19739565e-15 -5.04216427e-17
2.16907752e-15 -3.46560023e-16]
#training-set standard deviations
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
#test-set means
[ 0.14050556 -0.04267932 0.12302554 0.13954997 -0.15615241 -0.14618531
-0.09807264 -0.02216118 -0.18759101 -0.20940323 0.04900366 0.08401851
0.02506624 0.04613437 0.03621243 -0.10552296 -0.14824077 -0.15130022
-0.00907229 -0.07147309 0.10227099 -0.05743301 0.08490858 0.09985674
-0.16100375 -0.21253999 -0.19819604 -0.12526512 -0.23519823 -0.25375591]
#test-set standard deviations
[1.05397499 1.02257937 1.04087191 1.05488954 1.07166571 0.9320498
0.92732843 1.0454531 1.09354165 0.89024838 0.95281498 1.38625504
0.86942192 0.8548185 1.19578827 1.01041681 0.74346697 0.92648972
1.04119429 1.04874628 1.05140637 1.02940339 1.03090256 1.03471827
1.02203204 0.74304172 0.8088081 0.97303316 0.82810738 0.68655424]
The results show that the standardized training-set variables have mean 0 and standard deviation 1, while the standardized test-set variables do not, because X_test was standardized with the mean and standard deviation of X_train. Using test-set information during standardization would amount to leaking the test set and could bias the evaluation, so it should be avoided.
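To make this concrete, the transform can be reproduced by hand from the training-set statistics stored in the fitted scaler (a minimal sketch; scaler.mean_ and scaler.scale_ are the attributes StandardScaler exposes for the fitted means and standard deviations):
X_test_manual=(X_test-scaler.mean_)/scaler.scale_#apply (x - mean_train) / std_train column by column
print(np.allclose(X_test_manual,X_test_s))#True: identical to scaler.transform(X_test)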
II. KNN Estimation
model=KNeighborsClassifier(n_neighbors=5)#Euclidean distance by default; n_neighbors=5 means K=5
model.fit(X_train_s,y_train)
pred=model.predict(X_test_s)
print(pd.crosstab(y_test,pred,rownames=['Actual'],colnames=['Predicted']))
print(model.score(X_test_s,y_test))
#confusion matrix
Predicted   0   1
Actual
0          34   3
1           0  63
0.97#test-set prediction accuracy
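To see what the classifier is doing, the prediction for a single test observation can be reproduced by a brute-force majority vote among its 5 nearest training points. A sketch assuming the default Euclidean metric (distance ties, if any, could make it differ from the fitted model):
dist=np.sqrt(((X_train_s-X_test_s[0])**2).sum(axis=1))#Euclidean distance to every training point
nearest5=np.argsort(dist)[:5]#indices of the 5 nearest neighbors
vote=np.bincount(y_train[nearest5]).argmax()#majority vote among their labels
print(vote,model.predict(X_test_s[:1]))#should agree with the fitted model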
III. Choosing the Optimal K
1. Choosing the Optimal K with a for Loop
scores=[]
ks=range(1,51)
for k in ks:
    model=KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train_s,y_train)
    score=model.score(X_test_s,y_test)
    scores.append(score)
print(max(scores))
index_max=np.argmax(scores)#index of the maximum accuracy
print(f'Optimal K:{ks[index_max]}')
0.97#prediction accuracy at the optimal K
Optimal K:3
2. Accuracy vs. K
plt.plot(ks,scores,'o-')
plt.xlabel('K')
plt.axvline(ks[index_max],linewidth=1,linestyle='--',color='k')
plt.ylabel('Accuracy')
plt.title('KNN')
plt.tight_layout()
plt.show()
As the figure shows, there are several tied maxima at K = 3, 4, 5, 6, 7, 8.
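Since np.argmax returns only the first maximum, the tied K values can be listed explicitly (a small sketch; the variable name ties is just for illustration):
ties=[k for k,s in zip(ks,scores) if s==max(scores)]#every K whose accuracy equals the maximum
print(ties)#[3, 4, 5, 6, 7, 8], matching the figure above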
3. Error Rate vs. K
#misclassification (error) rate
errors=1-np.array(scores)
plt.plot(ks,errors,'o-')
plt.xlabel('K')
plt.axvline(ks[index_max],linewidth=1,linestyle='--',color='k')
plt.ylabel('Error Rate')
plt.title('KNN')
plt.tight_layout()
plt.show()
4. Error Rate vs. Model Complexity
For KNN, the smaller K is, the more complex the model and the more prone it is to overfitting, so 1/K can be used to measure the model complexity of KNN.
#use 1/K as the measure of model complexity
errors=1-np.array(scores)
ks_inverse=1/np.array(ks)
plt.plot(ks_inverse,errors,'o-')
plt.xlabel('1/K')
plt.ylabel('Error Rate')
plt.title('KNN')
plt.tight_layout()
plt.show()
5. Choosing the Optimal K with 10-Fold Cross-Validation
Above, the optimal hyperparameter K was chosen on the test set, which leaks test-set information in advance and may make the estimated test error optimistically small. Below, the optimal K is instead chosen by 10-fold cross-validation on the training set.
param_grid={'n_neighbors':range(1,51)}#define the hyperparameter grid as a dict
kfold=StratifiedKFold(n_splits=10,shuffle=True,random_state=1)
model=GridSearchCV(KNeighborsClassifier(),param_grid,cv=kfold)
model.fit(X_train_s,y_train)
print(model.best_params_)
print(model.score(X_test_s,y_test))
{'n_neighbors': 12}#optimal K
0.96 #test-set prediction accuracy at K=12
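What GridSearchCV does internally can also be written out with cross_val_score over the same grid. A sketch using the same kfold splitter (with random_state=1 the folds are identical, so the selected K should match best_params_):
from sklearn.model_selection import cross_val_score
cv_scores=[]
for k in range(1,51):
    acc=cross_val_score(KNeighborsClassifier(n_neighbors=k),X_train_s,y_train,cv=kfold).mean()#mean 10-fold accuracy for this K
    cv_scores.append(acc)
print(range(1,51)[np.argmax(cv_scores)])#should reproduce the GridSearchCV choice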