Personal Credit Risk Assessment with Naive Bayes

Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on Bayes' theorem with the "naive" assumption of conditional independence between every pair of features. Given a class variable $y$ and a dependent feature vector $x_1$ through $x_n$, Bayes' theorem states the following relationship:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$
Using the naive assumption that every pair of features is conditionally independent,
$$P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)$$
for all $i$, the relationship simplifies to:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$
Since $P(x_1, \dots, x_n)$ is constant for a given input, we can use the following classification rule:
$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \;\Rightarrow\; \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),$$
and we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
The various naive Bayes classifiers differ mainly in the assumptions they make about the distribution of $P(x_i \mid y)$.
In spite of their over-simplified assumptions, naive Bayes classifiers work quite well in many real-world situations, notably document classification and spam filtering. They only require a small training set to estimate the necessary parameters.
Naive Bayes learners and classifiers are very fast compared with more sophisticated methods. The decoupling of the class-conditional feature distributions means that each feature can be estimated independently as a one-dimensional distribution, which in turn helps alleviate problems stemming from the curse of dimensionality.
On the other hand, although naive Bayes is known as a decent classifier, it is a poor estimator, so the probabilities output by predict_proba should not be taken too seriously.
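As a toy illustration of the decision rule above (a minimal sketch with made-up priors and likelihoods, not derived from any real data), the MAP class is simply the one maximizing $P(y)\prod_i P(x_i \mid y)$:

import numpy as np

# hypothetical priors and per-feature likelihoods for one observed sample x = (x_1, x_2)
prior = {"good": 0.7, "bad": 0.3}
likelihood = {"good": [0.6, 0.2],   # P(x_1 | good), P(x_2 | good)
              "bad":  [0.3, 0.5]}   # P(x_1 | bad),  P(x_2 | bad)

scores = {y: prior[y] * np.prod(likelihood[y]) for y in prior}  # P(y) * prod_i P(x_i | y)
y_hat = max(scores, key=scores.get)                             # arg max over classes
print(scores, y_hat)   # expected: {'good': 0.084, 'bad': 0.045} -> 'good'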

Gaussian Naive Bayes

GaussianNB implements the Gaussian naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$
The parameters $\sigma_y$ and $\mu_y$ are estimated by maximum likelihood.
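A minimal sketch (on a synthetic two-class dataset, not the credit data) showing that GaussianNB stores the per-class feature means it estimates by maximum likelihood in its theta_ attribute and uses them for prediction:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),    # class 0 centered at 0
                   rng.normal(3, 1, size=(50, 2))])   # class 1 centered at 3
y_toy = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB().fit(X_toy, y_toy)
print(gnb.theta_)                               # per-class feature means (estimates of mu_y)
print(gnb.predict([[0.2, -0.1], [2.8, 3.1]]))   # expected: [0 1]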

Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word count vectors, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by a vector $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (for text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i \mid y)$ of feature $i$ appearing in a sample belonging to class $y$.
The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:
$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$
where $N_{yi} = \sum_{x \in T} x_i$ is the number of times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing prior $\alpha \ge 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations.
Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
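A minimal sketch (on a tiny made-up count matrix) checking the smoothed estimate $\hat{\theta}_{yi}$ above against the feature_log_prob_ attribute exposed by MultinomialNB, using the default $\alpha = 1$ (Laplace smoothing):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# two classes, three count-valued features
X_toy = np.array([[2, 1, 0],
                  [3, 0, 1],
                  [0, 2, 4],
                  [1, 1, 5]])
y_toy = np.array([0, 0, 1, 1])

mnb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# hand-computed theta_hat for class 0: (N_yi + alpha) / (N_y + alpha * n)
N_yi = X_toy[y_toy == 0].sum(axis=0)                              # [5, 1, 1]
theta_hat = (N_yi + 1.0) / (N_yi.sum() + 1.0 * X_toy.shape[1])    # [0.6, 0.2, 0.2]
print(np.allclose(np.log(theta_hat), mnb.feature_log_prob_[0]))   # expected: True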

Bernoulli Naive Bayes

BernoulliNB implements the naive Bayes training and classification algorithms for data distributed according to multivariate Bernoulli distributions; i.e. there may be multiple features, but each one is assumed to be a binary (Bernoulli, boolean) variable. This class therefore requires samples to be represented as binary-valued feature vectors; if given any other kind of data, a BernoulliNB instance will binarize its input (depending on the binarize parameter).

The decision rule for Bernoulli naive Bayes is based on:

$$P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i),$$

which differs from the multinomial rule in that it explicitly penalizes the non-occurrence of a feature $i$ that is an indicator for class $y$, whereas the multinomial variant simply ignores features that do not occur.

In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB may perform better on some datasets, especially those with shorter documents.
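A minimal sketch (with a few made-up one-line documents) of that setup: CountVectorizer(binary=True) builds word occurrence vectors, and BernoulliNB is trained on them directly (binarize=None, since the input is already 0/1):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["good loan repaid on time",
        "loan repaid early",
        "missed payment on loan",
        "payment missed again"]
labels = [1, 1, 2, 2]                        # made-up: 1 = good, 2 = default

vec = CountVectorizer(binary=True)           # word occurrence (0/1), not counts
X_occ = vec.fit_transform(docs)

bnb = BernoulliNB(binarize=None).fit(X_occ, labels)   # input is already binary
print(bnb.predict(vec.transform(["loan repaid", "missed payment"])))  # expected: [1 2]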

Personal Credit Risk Assessment with Naive Bayes

Data Source and Initial Inspection

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

credit = pd.read_csv("./input/credit.csv")
credit.head(5)
| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | ... | real estate | 67 | none | own | 2 | skilled employee | 1 | yes | yes | 1 |
| 1 | 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | ... | real estate | 22 | none | own | 1 | skilled employee | 1 | none | yes | 2 |
| 2 | unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | ... | real estate | 49 | none | own | 1 | unskilled resident | 2 | none | yes | 1 |
| 3 | < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | ... | building society savings | 45 | none | for free | 1 | skilled employee | 2 | none | yes | 1 |
| 4 | < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | ... | unknown/none | 53 | none | for free | 2 | skilled employee | 2 | none | yes | 2 |

5 rows × 21 columns

Data Preprocessing

The columns checking_balance, credit_history, purpose, savings_balance, employment_length, personal_status, other_debtors, property, installment_plan, housing, job, telephone and foreign_worker are string-valued categorical variables, so they need to be encoded as integers before modeling.

col_dicts = {}
cols = ['checking_balance','credit_history', 'purpose', 'savings_balance', 'employment_length', 'personal_status', 
        'other_debtors','property','installment_plan','housing','job','telephone','foreign_worker']

col_dicts = {'checking_balance': {'unknown': 0,
                                  '< 0 DM': 1,
                                  '1 - 200 DM': 2,
                                  '> 200 DM': 3
                                 },
 'credit_history': {'critical': 0,
                    'repaid': 1,
                    'delayed': 2,
                    'fully repaid': 3,
                    'fully repaid this bank': 4
                   },
 'employment_length': {'unemployed': 0,
                       '0 - 1 yrs': 1,
                       '1 - 4 yrs': 2,
                       '4 - 7 yrs': 3,
                       '> 7 yrs': 4
                      },
 'foreign_worker': {'yes': 0 ,'no': 1},
 'housing': {'own': 0, 'for free': 1,  'rent': 2},
 'installment_plan': {'none': 0, 'bank': 1, 'stores': 2},
 'job': {'unemployed non-resident': 0,
         'unskilled resident': 1,
         'skilled employee': 2,
         'mangement self-employed': 3
        },
 'other_debtors': {'none': 0, 
                   'guarantor': 1,
                   'co-applicant': 2 },
 'personal_status': {'single male': 0,
                     'female': 1,
                     'divorced male': 2,
                     'married male': 3
                    },
 'property': {'real estate': 0,
              'building society savings': 1,
               'unknown/none': 2,
              'other': 3
             },
 'purpose': {'radio/tv': 0,
             'education': 1,
             'furniture': 2,
             'car (new)': 3,
             'car (used)': 4,
             'business': 5,
             'domestic appliances': 6,
             'repairs': 7,
             'others': 8,
             'retraining': 9},
 'savings_balance': {'unknown': 0,
                     '< 100 DM': 1,
                     '101 - 500 DM': 2,
                     '501 - 1000 DM': 3,
                     '> 1000 DM': 4
                    },
 'telephone': {'none': 1, 'yes': 0}}

for col in cols:
    credit[col] = credit[col].map(col_dicts[col])
    

credit.head(5)
| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 6 | 0 | 0 | 1169 | 0 | 4 | 4 | 0 | 0 | ... | 0 | 67 | 0 | 0 | 2 | 2 | 1 | 0 | 0 | 1 |
| 1 | 2 | 48 | 1 | 0 | 5951 | 1 | 2 | 2 | 1 | 0 | ... | 0 | 22 | 0 | 0 | 1 | 2 | 1 | 1 | 0 | 2 |
| 2 | 0 | 12 | 0 | 1 | 2096 | 1 | 3 | 2 | 0 | 0 | ... | 0 | 49 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 1 |
| 3 | 1 | 42 | 1 | 2 | 7882 | 1 | 3 | 2 | 0 | 1 | ... | 1 | 45 | 0 | 1 | 1 | 2 | 2 | 1 | 0 | 1 |
| 4 | 1 | 24 | 2 | 3 | 4870 | 1 | 2 | 3 | 0 | 0 | ... | 2 | 53 | 0 | 1 | 2 | 2 | 2 | 1 | 0 | 2 |

5 rows × 21 columns

Feature Analysis

Computing the correlation matrix of the features lets us inspect the dependencies between the variables.

corrmat = credit.corr()  # correlation matrix of the (now fully numeric) dataframe
corrmat
| | checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | ... | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| checking_balance | 1.000000 | 0.035050 | 0.138210 | 0.017272 | 0.024561 | -0.005614 | -0.108536 | -0.057942 | 0.069946 | 0.041970 | ... | -0.005623 | -0.049058 | 0.033566 | 0.032925 | -0.093081 | -0.054255 | -0.040889 | 0.039209 | -0.000205 | 0.197788 |
| months_loan_duration | 0.035050 | 1.000000 | 0.142631 | 0.105305 | 0.624984 | -0.064526 | 0.057381 | 0.074749 | -0.116029 | 0.006711 | ... | 0.245655 | -0.036136 | 0.076992 | 0.011950 | -0.011284 | 0.210910 | -0.023834 | -0.164718 | -0.138196 | 0.214927 |
| credit_history | 0.138210 | 0.142631 | 1.000000 | 0.143938 | 0.113776 | 0.019657 | -0.097325 | -0.024740 | -0.005519 | -0.008955 | ... | 0.071606 | -0.070046 | 0.239431 | 0.077417 | -0.207960 | 0.001718 | 0.051849 | 0.018283 | -0.041784 | 0.232157 |
| purpose | 0.017272 | 0.105305 | 0.143938 | 1.000000 | 0.203234 | 0.005263 | -0.052126 | -0.092747 | -0.035918 | -0.020423 | ... | 0.027161 | 0.066020 | 0.049489 | 0.028464 | 0.071995 | 0.025409 | 0.077245 | -0.116031 | 0.035655 | 0.051311 |
| amount | 0.024561 | 0.624984 | 0.113776 | 0.203234 | 1.000000 | -0.107538 | -0.008367 | -0.271316 | -0.159434 | 0.037921 | ... | 0.224550 | 0.032716 | 0.045815 | 0.056119 | 0.020795 | 0.285385 | 0.017142 | -0.276995 | -0.050050 | 0.154739 |
| savings_balance | -0.005614 | -0.064526 | 0.019657 | 0.005263 | -0.107538 | 1.000000 | 0.014600 | -0.000805 | 0.062953 | -0.047575 | ... | -0.004121 | -0.017997 | 0.009373 | 0.003268 | -0.004176 | -0.040803 | -0.021302 | 0.037452 | 0.005318 | -0.033871 |
| employment_length | -0.108536 | 0.057381 | -0.097325 | -0.052126 | -0.008367 | 0.014600 | 1.000000 | 0.126161 | -0.181745 | -0.028758 | ... | 0.065533 | 0.256227 | -0.008676 | -0.044583 | 0.125791 | 0.101225 | 0.097192 | -0.060518 | -0.027232 | -0.116002 |
| installment_rate | -0.057942 | 0.074749 | -0.024740 | -0.092747 | -0.271316 | -0.000805 | 0.126161 | 1.000000 | -0.081121 | -0.014835 | ... | 0.039353 | 0.058266 | 0.034750 | -0.073955 | 0.021669 | 0.097755 | -0.071207 | -0.014413 | -0.090024 | 0.072404 |
| personal_status | 0.069946 | -0.116029 | -0.005519 | -0.035918 | -0.159434 | 0.062953 | -0.181745 | -0.081121 | 1.000000 | -0.011880 | ... | -0.099575 | -0.186563 | -0.065461 | 0.083146 | -0.089640 | -0.064335 | -0.238327 | 0.057207 | 0.009204 | 0.042643 |
| other_debtors | 0.041970 | 0.006711 | -0.008955 | -0.020423 | 0.037921 | -0.047575 | -0.028758 | -0.014835 | -0.011880 | 1.000000 | ... | -0.101378 | -0.028294 | -0.000955 | 0.036219 | -0.017662 | -0.021106 | -0.010990 | 0.050996 | 0.107639 | 0.028441 |
| residence_history | -0.059555 | 0.034067 | -0.027989 | 0.073651 | 0.028926 | -0.011772 | 0.245081 | 0.049302 | -0.106742 | -0.012690 | ... | 0.055260 | 0.266419 | -0.034517 | 0.255106 | 0.089625 | 0.012655 | 0.042643 | -0.095359 | -0.054097 | 0.002967 |
| property | -0.005623 | 0.245655 | 0.071606 | 0.027161 | 0.224550 | -0.004121 | 0.065533 | 0.039353 | -0.099575 | -0.101378 | ... | 1.000000 | -0.054186 | 0.041147 | 0.022420 | 0.001209 | 0.244946 | -0.041111 | -0.155051 | -0.138772 | 0.090146 |
| age | -0.049058 | -0.036136 | -0.070046 | 0.066020 | 0.032716 | -0.017997 | 0.256227 | 0.058266 | -0.186563 | -0.028294 | ... | -0.054186 | 1.000000 | 0.021858 | -0.108437 | 0.149254 | 0.015673 | 0.118201 | -0.145259 | -0.006151 | -0.091127 |
| installment_plan | 0.033566 | 0.076992 | 0.239431 | 0.049489 | 0.045815 | 0.009373 | -0.008676 | 0.034750 | -0.065461 | -0.000955 | ... | 0.041147 | 0.021858 | 1.000000 | -0.077624 | 0.046993 | 0.009872 | 0.057595 | -0.030704 | -0.036734 | 0.104885 |
| housing | 0.032925 | 0.011950 | 0.077417 | 0.028464 | 0.056119 | 0.003268 | -0.044583 | -0.073955 | 0.083146 | 0.036219 | ... | 0.022420 | -0.108437 | -0.077624 | 1.000000 | -0.052609 | 0.015201 | -0.015004 | 0.003307 | 0.005155 | 0.123815 |
| existing_credits | -0.093081 | -0.011284 | -0.207960 | 0.071995 | 0.020795 | -0.004176 | 0.125791 | 0.021669 | -0.089640 | -0.017662 | ... | 0.001209 | 0.149254 | 0.046993 | -0.052609 | 1.000000 | -0.026321 | 0.109667 | -0.065553 | -0.009717 | -0.045732 |
| job | -0.054255 | 0.210910 | 0.001718 | 0.025409 | 0.285385 | -0.040803 | 0.101225 | 0.097755 | -0.064335 | -0.021106 | ... | 0.244946 | 0.015673 | 0.009872 | 0.015201 | -0.026321 | 1.000000 | -0.093559 | -0.383022 | -0.100944 | 0.032735 |
| dependents | -0.040889 | -0.023834 | 0.051849 | 0.077245 | 0.017142 | -0.021302 | 0.097192 | -0.071207 | -0.238327 | -0.010990 | ... | -0.041111 | 0.118201 | 0.057595 | -0.015004 | 0.109667 | -0.093559 | 1.000000 | 0.014753 | 0.077071 | -0.003015 |
| telephone | 0.039209 | -0.164718 | 0.018283 | -0.116031 | -0.276995 | 0.037452 | -0.060518 | -0.014413 | 0.057207 | 0.050996 | ... | -0.155051 | -0.145259 | -0.030704 | 0.003307 | -0.065553 | -0.383022 | 0.014753 | 1.000000 | 0.107401 | 0.036466 |
| foreign_worker | -0.000205 | -0.138196 | -0.041784 | 0.035655 | -0.050050 | 0.005318 | -0.027232 | -0.090024 | 0.009204 | 0.107639 | ... | -0.138772 | -0.006151 | -0.036734 | 0.005155 | -0.009717 | -0.100944 | 0.077071 | 0.107401 | 1.000000 | -0.082079 |
| default | 0.197788 | 0.214927 | 0.232157 | 0.051311 | 0.154739 | -0.033871 | -0.116002 | 0.072404 | 0.042643 | 0.028441 | ... | 0.090146 | -0.091127 | 0.104885 | 0.123815 | -0.045732 | 0.032735 | -0.003015 | 0.036466 | -0.082079 | 1.000000 |

21 rows × 21 columns

Drawing the correlation matrix as a heatmap with the seaborn plotting library shows that the correlations between the variables are not high, so we can "naively" assume that every pair of features is independent.

import seaborn as sns

sns.set(font_scale=1.5)  # enlarge fonts for readability
plt.figure(figsize=(15, 15))
hm = sns.heatmap(corrmat, cbar=True, square=True,
                 yticklabels=credit.columns, xticklabels=credit.columns,
                 cmap="YlGnBu")
plt.show()

(Figure: correlation heatmap of the credit features, drawn with seaborn.)

Model Selection

First, split the data into a training set and a test set: the training set is used to build the naive Bayes model, and the test set is used to evaluate its performance.

from sklearn import model_selection
from sklearn import metrics

y = credit['default']
#del credit['default']
X = credit.loc[:,'checking_balance':'foreign_worker']

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=1)

Fit a multinomial naive Bayes model on the training set. predict_proba returns, for each row of X_test, the posterior probability of each of the two outcomes, and each sample is assigned to the class with the larger posterior probability.

from sklearn.naive_bayes import MultinomialNB
clf_multi = MultinomialNB()
clf_multi.fit(X_train,y_train)
y_pred = clf_multi.predict(X_test)

print(clf_multi.predict_proba(X_test))
[[7.08355206e-09 9.99999993e-01]
 [4.05129188e-26 1.00000000e+00]
 [9.79691264e-01 2.03087364e-02]
 [9.82619335e-01 1.73806654e-02]
 [4.94859998e-04 9.99505140e-01]
 [9.99998409e-01 1.59110968e-06]
 [8.85299264e-04 9.99114701e-01]
 [9.99912434e-01 8.75657002e-05]
 [9.99998245e-01 1.75510442e-06]
 [9.87150312e-01 1.28496881e-02]
 [9.99791955e-01 2.08045167e-04]
 [7.17413046e-02 9.28258695e-01]
 [1.42769922e-09 9.99999999e-01]
 [9.99997304e-01 2.69595250e-06]
 [5.61504493e-08 9.99999944e-01]
 [9.99995078e-01 4.92187029e-06]
 [8.54170929e-01 1.45829071e-01]
 [9.99988770e-01 1.12298436e-05]
 [9.99290453e-01 7.09546685e-04]
 [5.59657372e-14 1.00000000e+00]
 [9.99999766e-01 2.33633118e-07]
 [9.99991893e-01 8.10675856e-06]
 [9.27312798e-01 7.26872020e-02]
 [9.99999972e-01 2.81816265e-08]
 [9.99572578e-01 4.27421788e-04]
 [9.99554836e-01 4.45164490e-04]
 [7.54262725e-11 1.00000000e+00]
 [9.99583183e-01 4.16817416e-04]
 [9.99999892e-01 1.07557333e-07]
 [2.03671867e-20 1.00000000e+00]
 [9.30936726e-03 9.90690633e-01]
 [4.22137734e-01 5.77862266e-01]
 [3.07988013e-24 1.00000000e+00]
 [9.99892134e-01 1.07866386e-04]
 [9.98896579e-01 1.10342119e-03]
 [1.27516939e-19 1.00000000e+00]
 [3.21326686e-13 1.00000000e+00]
 [9.99978102e-01 2.18978094e-05]
 [9.99543666e-01 4.56334493e-04]
 [9.75120153e-01 2.48798467e-02]
 [9.99783249e-01 2.16751016e-04]
 [7.53225441e-01 2.46774559e-01]
 [1.00000000e+00 1.52390295e-10]
 [9.23327518e-01 7.66724815e-02]
 [9.99999996e-01 3.96981750e-09]
 [9.99201910e-01 7.98090310e-04]
 [7.42536445e-01 2.57463555e-01]
 [9.99779916e-01 2.20084349e-04]
 [9.97465543e-01 2.53445731e-03]
 [9.99522680e-01 4.77319741e-04]
 [9.99729186e-01 2.70814335e-04]
 [9.99736721e-01 2.63278865e-04]
 [5.64344942e-04 9.99435655e-01]
 [4.00026914e-05 9.99959997e-01]
 [9.53117144e-06 9.99990469e-01]
 [8.69420434e-01 1.30579566e-01]
 [9.99999997e-01 3.07306827e-09]
 [1.57057604e-06 9.99998429e-01]
 [9.99530449e-01 4.69551156e-04]
 [4.44235731e-07 9.99999556e-01]
 [9.99997479e-01 2.52079810e-06]
 [9.99985346e-01 1.46542234e-05]
 [3.09048564e-19 1.00000000e+00]
 [7.49952563e-01 2.50047437e-01]
 [9.98645456e-01 1.35454403e-03]
 [5.61016777e-15 1.00000000e+00]
 [9.99599379e-01 4.00620678e-04]
 [1.00000000e+00 4.94688483e-10]
 [9.93921444e-01 6.07855588e-03]
 [2.55830477e-12 1.00000000e+00]
 [9.98128166e-01 1.87183359e-03]
 [9.40583266e-01 5.94167339e-02]
 [9.99999664e-01 3.35665419e-07]
 [9.99965174e-01 3.48257351e-05]
 [7.96495634e-01 2.03504366e-01]
 [2.30045586e-01 7.69954414e-01]
 [3.43845989e-01 6.56154011e-01]
 [9.99999399e-01 6.00685647e-07]
 [9.99092122e-01 9.07878038e-04]
 [9.99932474e-01 6.75263730e-05]
 [3.18979875e-11 1.00000000e+00]
 [9.79179080e-01 2.08209204e-02]
 [1.00000000e+00 1.36448477e-11]
 [9.99993882e-01 6.11801758e-06]
 [9.65786631e-01 3.42133689e-02]
 [9.76111368e-01 2.38886319e-02]
 [9.99999941e-01 5.85847263e-08]
 [9.99999819e-01 1.80804466e-07]
 [9.99922821e-01 7.71793093e-05]
 [9.99993133e-01 6.86748630e-06]
 [9.99268149e-01 7.31851195e-04]
 [5.22545593e-01 4.77454407e-01]
 [1.97973112e-10 1.00000000e+00]
 [9.88164915e-01 1.18350852e-02]
 [9.99969019e-01 3.09809787e-05]
 [1.20452595e-05 9.99987955e-01]
 [9.95753421e-01 4.24657901e-03]
 [1.50879629e-01 8.49120371e-01]
 [3.69758990e-03 9.96302410e-01]
 [1.32728342e-07 9.99999867e-01]
 [9.67105940e-01 3.28940603e-02]
 [1.00000000e+00 2.51889396e-10]
 [4.20607438e-05 9.99957939e-01]
 [9.99999995e-01 4.50441505e-09]
 [9.98196982e-01 1.80301823e-03]
 [2.60895496e-01 7.39104504e-01]
 [9.99964355e-01 3.56449651e-05]
 [3.40718013e-08 9.99999966e-01]
 [6.68017653e-28 1.00000000e+00]
 [9.99982158e-01 1.78417815e-05]
 [9.99999723e-01 2.77421602e-07]
 [9.99797624e-01 2.02375590e-04]
 [8.59527536e-01 1.40472464e-01]
 [1.81821662e-08 9.99999982e-01]
 [1.89849679e-08 9.99999981e-01]
 [1.90552896e-07 9.99999809e-01]
 [9.99991452e-01 8.54833987e-06]
 [9.99999999e-01 8.90283237e-10]
 [9.86238948e-01 1.37610516e-02]
 [9.19056892e-01 8.09431077e-02]
 [5.20189940e-15 1.00000000e+00]
 [1.07176072e-06 9.99998928e-01]
 [1.96524734e-08 9.99999980e-01]
 [9.22312119e-02 9.07768788e-01]
 [9.99999561e-01 4.38736439e-07]
 [3.14262425e-02 9.68573758e-01]
 [9.99656508e-01 3.43491997e-04]
 [4.49764172e-01 5.50235828e-01]
 [9.99551027e-01 4.48973406e-04]
 [4.67593429e-01 5.32406571e-01]
 [9.99995983e-01 4.01738206e-06]
 [9.79193431e-01 2.08065690e-02]
 [4.33184050e-06 9.99995668e-01]
 [9.99993077e-01 6.92318920e-06]
 [9.99999812e-01 1.88485229e-07]
 [3.02205341e-06 9.99996978e-01]
 [9.07568653e-01 9.24313468e-02]
 [1.22717545e-01 8.77282455e-01]
 [9.99982544e-01 1.74558917e-05]
 [9.99999996e-01 3.78412084e-09]
 [5.61424124e-05 9.99943858e-01]
 [9.82620365e-01 1.73796352e-02]
 [1.94171276e-05 9.99980583e-01]
 [9.83179110e-01 1.68208896e-02]
 [9.99971324e-01 2.86761851e-05]
 [1.86978428e-09 9.99999998e-01]
 [7.06519691e-01 2.93480309e-01]
 [8.80296829e-01 1.19703171e-01]
 [9.87132431e-01 1.28675686e-02]
 [1.65934444e-08 9.99999983e-01]
 [9.99999995e-01 5.22521465e-09]
 [9.97170829e-01 2.82917130e-03]
 [9.99995505e-01 4.49460473e-06]
 [9.97536742e-01 2.46325758e-03]
 [1.17003871e-05 9.99988300e-01]
 [2.75965897e-01 7.24034103e-01]
 [4.72459215e-04 9.99527541e-01]
 [9.99603650e-01 3.96350493e-04]
 [9.99993266e-01 6.73357777e-06]
 [9.95930147e-01 4.06985314e-03]
 [9.98430108e-01 1.56989215e-03]
 [1.02950719e-14 1.00000000e+00]
 [9.95504721e-01 4.49527932e-03]
 [9.88755899e-01 1.12441013e-02]
 [5.76970096e-29 1.00000000e+00]
 [9.99994030e-01 5.96964214e-06]
 [8.04594587e-01 1.95405413e-01]
 [2.73498848e-02 9.72650115e-01]
 [9.98062495e-01 1.93750460e-03]
 [9.99976044e-01 2.39560643e-05]
 [5.51307112e-05 9.99944869e-01]
 [9.99999921e-01 7.91559779e-08]
 [9.99999870e-01 1.30319104e-07]
 [9.99999957e-01 4.27677987e-08]
 [8.68652943e-01 1.31347057e-01]
 [9.99878314e-01 1.21685501e-04]
 [9.97220154e-01 2.77984582e-03]
 [9.99998005e-01 1.99475475e-06]
 [1.12048195e-01 8.87951805e-01]
 [9.98556552e-01 1.44344822e-03]
 [4.74835052e-01 5.25164948e-01]
 [8.85006321e-01 1.14993679e-01]
 [9.99791624e-01 2.08376168e-04]
 [7.14924653e-14 1.00000000e+00]
 [9.11613644e-01 8.83863562e-02]
 [9.99168495e-01 8.31504595e-04]
 [9.99999999e-01 1.40036211e-09]
 [8.95294053e-01 1.04705947e-01]
 [9.94302903e-01 5.69709688e-03]
 [2.58934387e-12 1.00000000e+00]
 [9.99968031e-01 3.19694335e-05]
 [1.00483240e-01 8.99516760e-01]
 [9.99883869e-01 1.16131177e-04]
 [9.99998999e-01 1.00050972e-06]
 [9.41217448e-01 5.87825518e-02]
 [9.99999144e-01 8.56187393e-07]
 [3.04245716e-02 9.69575428e-01]
 [9.99950646e-01 4.93539798e-05]
 [9.94764447e-01 5.23555342e-03]
 [9.99725874e-01 2.74126170e-04]
 [9.99675935e-01 3.24064676e-04]
 [9.99988898e-01 1.11021676e-05]
 [4.96753226e-01 5.03246774e-01]
 [9.92510677e-01 7.48932261e-03]
 [3.71443861e-02 9.62855614e-01]
 [9.99880010e-01 1.19989622e-04]
 [5.43341069e-01 4.56658931e-01]
 [2.23740846e-02 9.77625915e-01]
 [2.16734441e-07 9.99999783e-01]
 [8.28053468e-03 9.91719465e-01]
 [6.42953055e-04 9.99357047e-01]
 [2.31551620e-05 9.99976845e-01]
 [9.99999954e-01 4.59849329e-08]
 [9.99999995e-01 4.68993640e-09]
 [9.99999996e-01 4.44118198e-09]
 [1.95727744e-13 1.00000000e+00]
 [9.97044910e-01 2.95509029e-03]
 [1.00000000e+00 4.70059652e-11]
 [4.91843429e-12 1.00000000e+00]
 [9.99151819e-01 8.48180737e-04]
 [9.99990247e-01 9.75330022e-06]
 [9.99900851e-01 9.91486749e-05]
 [5.19925707e-01 4.80074293e-01]
 [9.98595280e-01 1.40472032e-03]
 [9.99998228e-01 1.77218388e-06]
 [9.70866147e-01 2.91338528e-02]
 [7.48246743e-07 9.99999252e-01]
 [9.99973845e-01 2.61547528e-05]
 [1.20010195e-08 9.99999988e-01]
 [9.99989583e-01 1.04166523e-05]
 [3.28748185e-08 9.99999967e-01]
 [9.99999966e-01 3.38654087e-08]
 [5.05940992e-03 9.94940590e-01]
 [9.99980420e-01 1.95803984e-05]
 [9.99748556e-01 2.51444300e-04]
 [9.99882532e-01 1.17467621e-04]
 [9.98448749e-01 1.55125101e-03]
 [9.99999884e-01 1.16129977e-07]
 [1.53291657e-03 9.98467083e-01]
 [9.76220766e-01 2.37792338e-02]
 [8.44717999e-03 9.91552820e-01]
 [9.99886188e-01 1.13812358e-04]
 [3.90368809e-03 9.96096312e-01]
 [9.99999858e-01 1.42492789e-07]
 [9.99999698e-01 3.01588597e-07]
 [9.30141719e-01 6.98582809e-02]
 [9.99985365e-01 1.46346301e-05]
 [9.99920174e-01 7.98259279e-05]
 [9.99996587e-01 3.41321375e-06]
 [9.96987845e-01 3.01215503e-03]
 [9.99483133e-01 5.16866706e-04]
 [9.91463705e-01 8.53629509e-03]
 [3.15926974e-10 1.00000000e+00]
 [2.14783690e-03 9.97852163e-01]
 [7.50823010e-01 2.49176990e-01]
 [9.99999137e-01 8.62819165e-07]
 [6.99934104e-03 9.93000659e-01]
 [5.46894966e-15 1.00000000e+00]
 [2.13290238e-05 9.99978671e-01]
 [9.99793159e-01 2.06840609e-04]
 [8.74970112e-01 1.25029888e-01]
 [9.99867579e-01 1.32420724e-04]
 [1.00000000e+00 2.04284020e-10]
 [9.99997195e-01 2.80502361e-06]
 [9.99999157e-01 8.43225198e-07]
 [9.99999948e-01 5.17726713e-08]
 [2.40872121e-02 9.75912788e-01]
 [9.47413892e-01 5.25861076e-02]
 [6.54889629e-09 9.99999993e-01]
 [3.40621192e-06 9.99996594e-01]
 [4.30630753e-01 5.69369247e-01]
 [9.99993574e-01 6.42617921e-06]
 [9.98944910e-01 1.05508978e-03]
 [9.16489524e-01 8.35104757e-02]
 [9.99418072e-01 5.81928176e-04]
 [5.40842574e-04 9.99459157e-01]
 [9.99998061e-01 1.93903762e-06]
 [3.47609654e-02 9.65239035e-01]
 [9.99812899e-01 1.87101404e-04]
 [9.99746283e-01 2.53716520e-04]
 [2.81628794e-04 9.99718371e-01]
 [9.99828951e-01 1.71048982e-04]
 [9.99969589e-01 3.04105630e-05]
 [8.55977708e-02 9.14402229e-01]
 [3.44376719e-11 1.00000000e+00]
 [9.99999699e-01 3.01499794e-07]
 [9.99998707e-01 1.29347853e-06]
 [9.99958928e-01 4.10716828e-05]
 [9.93314781e-01 6.68521864e-03]
 [3.21042931e-09 9.99999997e-01]
 [9.86472042e-01 1.35279576e-02]
 [9.99904973e-01 9.50269676e-05]
 [2.83671375e-04 9.99716329e-01]
 [9.98868733e-01 1.13126730e-03]
 [2.44074089e-01 7.55925911e-01]
 [4.94339246e-04 9.99505661e-01]
 [3.39748678e-05 9.99966025e-01]
 [9.99999809e-01 1.91303777e-07]
 [1.16822919e-03 9.98831771e-01]
 [9.99997637e-01 2.36342666e-06]]
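As a quick sanity check (a small sketch, not part of the original notebook), the hard labels from predict should match taking the argmax over the two posterior columns printed above; clf_multi.classes_ gives the label order of those columns:

import numpy as np

proba = clf_multi.predict_proba(X_test)               # posterior P(y | x) per test row
pred_from_proba = clf_multi.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(pred_from_proba, y_pred))        # expected: True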

Model Evaluation

print('Multinomial naive Bayes results:')
print('Test-set score (accuracy):')
print(clf_multi.score(X_test, y_test))
print('Precision:')
print(metrics.precision_score(y_test, y_pred))
print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred, labels=list(set(y))))

print(metrics.classification_report(y_test,y_pred))
Multinomial naive Bayes results:
Test-set score (accuracy):
0.6233333333333333
Precision:
0.7537688442211056
Confusion matrix:
[[150  64]
 [ 49  37]]
              precision    recall  f1-score   support

           1       0.75      0.70      0.73       214
           2       0.37      0.43      0.40        86

    accuracy                           0.62       300
   macro avg       0.56      0.57      0.56       300
weighted avg       0.64      0.62      0.63       300

The model's precision (computed with the default pos_label=1, i.e. for the non-default class) is 75.38%, its accuracy on the held-out test set is 62.33%, and the recall for defaulting customers is only 43%. Multinomial naive Bayes therefore does not predict the outcome precisely enough.
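To score the risky class directly rather than relying on the default pos_label, the same metrics can be requested for label 2 explicitly; a minimal sketch (not in the original post):

from sklearn import metrics

# precision/recall for the default class (label 2) instead of the default pos_label=1
print(metrics.precision_score(y_test, y_pred, pos_label=2))
print(metrics.recall_score(y_test, y_pred, pos_label=2))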

Model Tuning

Next, consider fitting a Gaussian naive Bayes model on the same training data.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print('Gaussian naive Bayes results:')
print('Test-set score (accuracy):')
print(clf.score(X_test, y_test))
print('Precision:')
print(metrics.precision_score(y_test, y_pred))

print('Confusion matrix:')
print(metrics.confusion_matrix(y_true=y_test, y_pred=y_pred, labels=list(set(y))))

print(metrics.classification_report(y_test,y_pred))

Gaussian naive Bayes results:
Test-set score (accuracy):
0.7233333333333334
Precision:
0.8046511627906977
Confusion matrix:
[[173  41]
 [ 42  44]]
              precision    recall  f1-score   support

           1       0.80      0.81      0.81       214
           2       0.52      0.51      0.51        86

    accuracy                           0.72       300
   macro avg       0.66      0.66      0.66       300
weighted avg       0.72      0.72      0.72       300

Precision rises to 80.47%, test-set accuracy rises to 72.33%, and recall for the default class improves to 51%.
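As a further step in model comparison (a minimal sketch using the same variable names as above, not from the original post), cross-validation on the training set gives a view of the two models that depends less on a single train/test split:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

for name, model in [("MultinomialNB", MultinomialNB()), ("GaussianNB", GaussianNB())]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(name, scores.mean())  # mean cross-validated accuracy per model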
