I. Preparation
1. Import packages
In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import warnings
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn import tree
plt.rcParams['font.sans-serif'] = ['SimHei'] # allow Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False # render minus signs correctly
warnings.filterwarnings("ignore") # suppress warnings
2. Read the data
In [9]:
cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\week3\\Week3_CancerDataset.xlsx')
3. A first look at the data
In [10]:
cancer.head(5)
Out[10]:
feature1 | feature2 | feature3 | feature4 | feature5 | feature6 | feature7 | feature8 | feature9 | feature10 | ... | feature22 | feature23 | feature24 | feature25 | feature26 | feature27 | feature28 | feature29 | feature30 | Label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
5 rows × 31 columns
In [11]:
cancer.shape
Out[11]:
(569, 31)
569 observations; 30 features, all continuous; 1 label, a 0/1 variable.
In [12]:
cancer.isnull().sum().sum() # no missing values
Out[12]:
0
II. Naive Bayes
1. Plotting the feature distributions
In [13]:
cancer_scaled = cancer.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x))) # min-max scale each column to [0, 1]
plt.figure()
for i in range(cancer.shape[1]-1): # skip the label column
    sns.kdeplot(cancer_scaled.iloc[:,i], alpha=.7, label="")
plt.title("kdeplot of 30 features")
plt.show()
Each feature's distribution looks roughly normal, so Gaussian naive Bayes is a natural choice.
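As a rough numerical complement to the plot, per-feature skewness can quantify how symmetric each scaled distribution is (values near 0 suggest normal-like shapes). A sketch, assuming sklearn's built-in `load_breast_cancer` copy of the Wisconsin dataset matches the Excel file used above:

```python
import numpy as np
from scipy.stats import skew
from sklearn.datasets import load_breast_cancer

# assumption: load_breast_cancer ships the same 569 x 30 data as the Excel file
X, y = load_breast_cancer(return_X_y=True)

# min-max scale each column to [0, 1], as in the plot above
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# skewness near 0 indicates a roughly symmetric, normal-like distribution
skews = skew(X_scaled, axis=0)
print("features with |skew| < 1:", int((np.abs(skews) < 1).sum()), "of", X.shape[1])
```

Scaling does not change skewness; it only puts all 30 curves on a common axis for the KDE plot.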
2. Train/test split
In [14]:
x_train, x_test, y_train, y_test = train_test_split(cancer.iloc[:,0:-1], cancer.iloc[:,-1], test_size=0.3, random_state=1)
3. Training the model on the training set
In [15]:
gnb = GaussianNB()
gnb.fit(x_train, y_train)
Out[15]:
GaussianNB(priors=None, var_smoothing=1e-09)
4. Making predictions on the test set
In [16]:
y_pred = gnb.predict(x_test)
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
Gaussian Naive Bayes model accuracy(in %): 94.73684210526315
III. Decision Tree
The criterion parameter of tree.DecisionTreeClassifier has two options: 'entropy' (information gain, the measure used by ID3) and 'gini' (the CART impurity measure); we try both. Note that sklearn always grows binary CART-style trees, so the criterion only changes the split-quality measure. max_depth takes an integer; with only 30 features the data is small, so we leave max_depth at its default of None.
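The two criteria measure node impurity differently. A small standalone sketch of both measures (helper functions for illustration, not sklearn API):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class-probability vector, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) contributes nothing
    return float(-(p * np.log2(p)).sum())

def gini(p):
    """Gini impurity: probability that two independent draws disagree."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - (p ** 2).sum())

# a 50/50 node is maximally impure for two classes
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # → 1.0 0.5
```

Both functions hit 0 on a pure node and their maximum on a uniform class mix, so in practice the two criteria usually pick similar splits.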
1. ID3
In [17]:
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)
y_pred_clf = clf.predict(x_test)
print("ID3 model accuracy(in %):", metrics.accuracy_score(y_test, y_pred_clf)*100)
ID3 model accuracy(in %): 90.05847953216374
2. CART
In [18]:
CART = tree.DecisionTreeClassifier(criterion='gini')
CART.fit(x_train, y_train)
y_pred_CART = CART.predict(x_test)
print("CART model accuracy(in %):", metrics.accuracy_score(y_test, y_pred_CART)*100)
CART model accuracy(in %): 93.56725146198829
Of the three models, CART and Gaussian naive Bayes achieve the highest accuracy; below we look at CART's results in more detail.
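Accuracy from a single 70/30 split depends on the random seed, so a more robust comparison would cross-validate all three models. A hedged sketch, again assuming sklearn's built-in copy of the dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "GaussianNB": GaussianNB(),
    "entropy tree": DecisionTreeClassifier(criterion="entropy", random_state=1),
    "gini tree (CART)": DecisionTreeClassifier(criterion="gini", random_state=1),
}
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Averaging over 5 folds smooths out split-to-split variance and gives a fairer ranking than one held-out set.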
In [19]:
print(metrics.classification_report(y_test, y_pred_CART, target_names=['died','survived'])) # classification_report expects (y_true, y_pred)
              precision    recall  f1-score   support

        died       0.95      0.87      0.91        63
    survived       0.93      0.97      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.92      0.93       171
weighted avg       0.94      0.94      0.94       171
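Every number in the report follows directly from the confusion matrix. A toy sketch of how precision and recall are derived:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# toy labels: 1 = positive class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found

assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
print(precision, recall)  # → 0.75 0.75
```

The f1-score is the harmonic mean of the two, and `support` counts how many true examples of each class appear in the test set.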