Python Example: Naive Bayes and Decision Trees (ID3, CART)

I. Preparation

1. Imports

In [8]:

import pandas as pd
import matplotlib.pyplot as plt
import warnings
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn import tree
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly in plots
warnings.filterwarnings("ignore")             # suppress warnings

2. Read the data

In [9]:

cancer = pd.read_excel('C:\\Users\\91333\\Documents\\semester6\\data science\\week3\\Week3_CancerDataset.xlsx') 

3. A first look at the data

In [10]:

cancer.head(5)

Out[10]:

   feature1  feature2  feature3  feature4  feature5  feature6  feature7  feature8  feature9  feature10
0     17.99     10.38    122.80    1001.0   0.11840   0.27760    0.3001   0.14710    0.2419    0.07871
1     20.57     17.77    132.90    1326.0   0.08474   0.07864    0.0869   0.07017    0.1812    0.05667
2     19.69     21.25    130.00    1203.0   0.10960   0.15990    0.1974   0.12790    0.2069    0.05999
3     11.42     20.38     77.58     386.1   0.14250   0.28390    0.2414   0.10520    0.2597    0.09744
4     20.29     14.34    135.10    1297.0   0.10030   0.13280    0.1980   0.10430    0.1809    0.05883

   ...  feature22  feature23  feature24  feature25  feature26  feature27  feature28  feature29  feature30  Label
0  ...      17.33     184.60     2019.0     0.1622     0.6656     0.7119     0.2654     0.4601    0.11890      0
1  ...      23.41     158.80     1956.0     0.1238     0.1866     0.2416     0.1860     0.2750    0.08902      0
2  ...      25.53     152.50     1709.0     0.1444     0.4245     0.4504     0.2430     0.3613    0.08758      0
3  ...      26.50      98.87      567.7     0.2098     0.8663     0.6869     0.2575     0.6638    0.17300      0
4  ...      16.67     152.20     1575.0     0.1374     0.2050     0.4000     0.1625     0.2364    0.07678      0

5 rows × 31 columns

In [11]:

cancer.shape

Out[11]:

(569, 31)

569 observations; 30 features, all continuous; 1 label, a 0/1 variable.
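A quick way to confirm this (a minimal sketch; the expected dtype breakdown is an assumption, not recorded output):

In []:

print(cancer.dtypes.value_counts())    # expect 30 continuous (float64) feature columns plus the integer Label
print(cancer['Label'].value_counts())  # how many 0s and 1s the label actually holds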

In [12]:

cancer.isnull().sum().sum()  # no missing values

Out[12]:

0

II. Naive Bayes

1. Visual inspection

In [13]:

# min-max scale every column to [0, 1] so all 30 feature densities fit on one axis
cancer_scaled = cancer.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
plt.figure()
for i in range(cancer.shape[1] - 1):   # every column except the Label
    sns.kdeplot(cancer_scaled.iloc[:, i], alpha=.7, label="")
plt.title("kdeplot of 30 features")
plt.show()

The distribution of each feature is roughly normal, which suggests Gaussian Naive Bayes.
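This matters because Gaussian Naive Bayes models each feature as a class-conditional normal distribution, estimating one mean and variance per feature and per class. A minimal sketch of that computation for a single feature (illustrative only; the query value x = 15 is arbitrary, and pandas' sample variance differs slightly from the biased estimate sklearn uses internally):

In []:

from scipy.stats import norm

f, y = cancer['feature1'], cancer['Label']
for c in (0, 1):
    mu, var = f[y == c].mean(), f[y == c].var()   # per-class Gaussian parameters
    p = norm.pdf(15, loc=mu, scale=var ** 0.5)    # class-conditional likelihood of x = 15
    print(f"class {c}: mu={mu:.3f}, var={var:.3f}, p(x=15 | class)={p:.4f}")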

2. Split into training and test sets

In [14]:

x_train, x_test, y_train, y_test = train_test_split(cancer.iloc[:,0:-1], cancer.iloc[:,-1], test_size=0.3, random_state=1)

3. Train the model on the training set

In [15]:

gnb = GaussianNB()
gnb.fit(x_train, y_train)

Out[15]:

GaussianNB(priors=None, var_smoothing=1e-09)

4. Make predictions on the test set

In [16]:

y_pred = gnb.predict(x_test)
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
Gaussian Naive Bayes model accuracy(in %): 94.73684210526315
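Accuracy alone does not distinguish error types; a confusion matrix on the same test split (a minimal sketch, its output not recorded here) shows how the misclassifications divide between the two classes:

In []:

from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class
print(cm)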

III. Decision Tree

tree.DecisionTreeClassifier's criterion parameter offers two options: 'entropy', which corresponds to the ID3 algorithm, and 'gini', which corresponds to CART; we try both below. max_depth takes an integer; with only 30 features the data is fairly small, so we leave max_depth at its default of None.
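The two criteria differ only in the impurity measure used to score candidate splits: entropy is -Σ p_k·log2(p_k) and Gini impurity is 1 - Σ p_k². A minimal sketch comparing them on a hypothetical class distribution:

In []:

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # treat 0·log(0) as 0
    return -(p * np.log2(p)).sum()

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()

probs = [0.7, 0.3]                      # hypothetical class proportions at a node
print(entropy(probs))                   # ≈ 0.881 bits
print(gini(probs))                      # 0.42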

1. ID3

In [17]:

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)
y_pred_clf = clf.predict(x_test)
print("ID3 model accuracy(in %):", metrics.accuracy_score(y_test, y_pred_clf)*100)
ID3 model accuracy(in %): 90.05847953216374

2. CART

In [18]:

CART = tree.DecisionTreeClassifier(criterion='gini')
CART.fit(x_train, y_train)
y_pred_CART = CART.predict(x_test)
print("CART model accuracy(in %):", metrics.accuracy_score(y_test, y_pred_CART)*100)
CART model accuracy(in %): 93.56725146198829

Comparing the three algorithms, CART and Gaussian Naive Bayes both achieve relatively high accuracy. Below are more details on CART's performance.

In [19]:

print(metrics.classification_report(y_test, y_pred_CART, target_names=['died', 'survived']))
              precision    recall  f1-score   support

        died       0.95      0.87      0.91        63
    survived       0.93      0.97      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.92      0.93       171
weighted avg       0.94      0.94      0.94       171
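
Finally, a single train/test split can be noisy. K-fold cross-validation gives a steadier comparison of the three models; a sketch reusing the estimators above (cv=5 is an arbitrary choice and the scores are not recorded in this post):

In []:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn import tree

X, y = cancer.iloc[:, :-1], cancer.iloc[:, -1]
for name, model in [("GaussianNB", GaussianNB()),
                    ("ID3", tree.DecisionTreeClassifier(criterion='entropy')),
                    ("CART", tree.DecisionTreeClassifier(criterion='gini'))]:
    scores = cross_val_score(model, X, y, cv=5)        # 5-fold accuracy
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")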