Understanding the Naive Bayes Classification Algorithm, from the Ground Up

1. Basic probability

2. Naive Bayes classification

Load data

Data cleaning

Feature processing

Building the model

Comparing the model with a baseline

Confusion matrix

Predicted probabilities

Basic probability

  1. Conditional probability: the probability of an event within a particular restricted sample space:
    $p(B|S) = \frac{p(B \cap S)}{p(S)}$
  2. Multiplication rule:
    $p(AB) = p(A)\,p(B|A) = p(A)\,\frac{p(B \cap A)}{p(A)}$
    Here $p(A)\,p(B|A)$ is the probability of $B \cap A$ relative to the full sample space $\Omega$, while $p(B|A)$ is the probability of $B \cap A$ relative to the sample space $A$.
  3. Bayes' formula:
    $p(B_i|A) = \frac{p(A|B_i)\,p(B_i)}{\sum_{i=1}^{n} p(B_i)\,p(A|B_i)}$

An example

A test for a certain rare disease is 99% accurate, and 0.5% of the population has the disease. Xiao Zhang takes the test at a hospital and the result comes back positive. What is the probability that Xiao Zhang actually has the disease?

Let A denote "actually has the disease" and B denote "tests positive", so P(A) = 0.5%, and we want P(A|B):
$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(A^-)P(B|A^-)} = \frac{99\% \times 0.5\%}{99\% \times 0.5\% + (1-0.5\%)(1-99\%)} \approx 0.33$
If a second test also comes back positive, what is the probability then?
This time the prior is 0.33, the probability of disease among people who have already tested positive once, so
$P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(A^-)P(B|A^-)} = \frac{99\% \times 0.33}{99\% \times 0.33 + (1-0.33)(1-99\%)} \approx 0.98$
That's why, during the pandemic, a single positive test wasn't enough to get you taken away, but a second positive was.
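The same update is easy to verify numerically. A minimal sketch, assuming sensitivity and specificity are both 99%:

def posterior(prior, sensitivity=0.99, specificity=0.99):
    """P(disease | positive test) for a given prior P(disease)."""
    evidence = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / evidence

p1 = posterior(0.005)   # after the first positive test
p2 = posterior(p1)      # after the second positive test
print('after one positive:  {:.2f}'.format(p1))   # -> 0.33
print('after two positives: {:.2f}'.format(p2))   # -> 0.98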

Naive Bayes classification

How it works

Spam filtering: suppose we receive an email D and want to decide whether it is spam. That amounts to comparing the probability that it is spam, $p(h^+|D)$, with the probability that it is not, $p(h^-|D)$. By Bayes' formula, $p(h^+|D) = \frac{p(h^+)\,p(D|h^+)}{p(D)}$ and $p(h^-|D) = \frac{p(h^-)\,p(D|h^-)}{p(D)}$. The two denominators are identical, so we only need to compare the numerators. The priors $p(h^+)$ and $p(h^-)$ are easy to estimate from the training set. An email is made up of words; if it contains n words $d_1, d_2, \dots, d_n$, then by the chain rule $p(D|h^+) = p(d_1, d_2, \dots, d_n|h^+) = p(d_1|h^+)\,p(d_2|d_1, h^+)\,p(d_3|d_2, d_1, h^+)\cdots p(d_n|d_{n-1} \dots d_1, h^+)$. Naive Bayes assumes the features are independent of one another given the class, so $p(D|h^+)$ simplifies to $p(d_1|h^+)\,p(d_2|h^+)\,p(d_3|h^+)\cdots p(d_n|h^+)$, and each factor can be estimated simply by counting how often the word appears in spam emails.
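To make this concrete, here is a minimal word-count sketch (the emails and labels below are invented for illustration), using scikit-learn's CountVectorizer and MultinomialNB:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy corpus: 1 = spam, 0 = not spam
emails = ['win a free prize now', 'limited offer win money now',
          'meeting agenda for monday', 'lunch with the project team']
labels = [1, 1, 0, 0]

# bag-of-words counts supply the per-word statistics for p(d_i|h)
vec = CountVectorizer()
X = vec.fit_transform(emails)

# MultinomialNB multiplies the per-word likelihoods under the
# independence assumption above (with Laplace smoothing)
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(['free money offer'])))   # -> [1]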

Demo

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')

Load data

# UCI Adult census income dataset, distributed without a header row
df = pd.read_csv('.\\data\\adult.csv',header=None)
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names
df['native_country'].unique()
array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
       ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
       ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
       ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
       ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
       ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
       ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
       ' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
       ' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)

Data cleaning

# the dataset marks missing values with ' ?'; replace them with NaN
categorical = [col for col in df.columns if df[col].dtype=='O']
for col in categorical:
    df[col].replace(' ?',np.NaN,inplace=True)
numerical = [col for col in df.columns if df[col].dtype!='O']
df[numerical].isnull().sum()
age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
dtype: int64
df['income'].unique()
array([' <=50K', ' >50K'], dtype=object)
df['income'] = df['income'].map({' <=50K':0,' >50K':1})
X = df.drop(['income'],axis=1)
y= df['income']
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)

Feature processing

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop = True)
categorical.remove('income')   # income is the target, not a feature
X_test[categorical].isnull().mean()
workclass         0.057324
education         0.000000
marital_status    0.000000
occupation        0.057836
relationship      0.000000
race              0.000000
sex               0.000000
native_country    0.017300
dtype: float64
# impute missing categorical values with each column's mode
for df2 in [X_train,X_test]:
    df2['workclass'].fillna(df2['workclass'].mode()[0],inplace=True)
    df2['occupation'].fillna(df2['occupation'].mode()[0],inplace=True)
    df2['native_country'].fillna(df2['native_country'].mode()[0],inplace=True)
X_test[categorical].isnull().mean()
workclass         0.0
education         0.0
marital_status    0.0
occupation        0.0
relationship      0.0
race              0.0
sex               0.0
native_country    0.0
dtype: float64
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[categorical])
X_train_categorical = pd.DataFrame(enc.transform(X_train[categorical]).toarray())
# get_feature_names was removed in scikit-learn 1.2; use
# enc.get_feature_names(categorical) on older versions
column_name = enc.get_feature_names_out(categorical)
X_train_categorical.columns = column_name

# indexes were reset above, so a plain index join aligns the rows
X_train = X_train[numerical].join(X_train_categorical)
X_train.head()
(first five rows: the six numeric columns followed by the 99 one-hot columns, from workclass_ Federal-gov through native_country_ Yugoslavia)

5 rows × 105 columns

X_test_categorical = pd.DataFrame(enc.transform(X_test[categorical]).toarray())
column_name = enc.get_feature_names_out(categorical)  # get_feature_names on older versions
X_test_categorical.columns = column_name

X_test = X_test[numerical].join(X_test_categorical)
X_train.shape
(22792, 105)
cols = X_train.columns
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)

X_test = scaler.transform(X_test)

X_test.shape
(9769, 105)
X_train = pd.DataFrame(X_train, columns=cols)   # columns=[cols] would create a needless MultiIndex
X_test = pd.DataFrame(X_test, columns=cols)
X_train.head()
(first five rows after scaling; the numeric columns are now median-centered and scaled by the interquartile range)

5 rows × 105 columns

Building the model

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train,y_train)
GaussianNB()
y_pred = gnb.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_test,y_pred)))
Model accuracy score:0.8046
y_train_pred = gnb.predict(X_train)
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_train,y_train_pred)))
Model accuracy score:0.8067

The test and train accuracy scores are close, so the model is not overfitting.
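As a further sanity check, one can run 10-fold cross-validation on the training set; a short sketch, continuing the session above:

from sklearn.model_selection import cross_val_score

# scores close to the single-split accuracies above support the
# conclusion that the model is not overfitting
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('CV accuracy: {:.4f} +/- {:.4f}'.format(scores.mean(), scores.std()))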

Comparing the model with a baseline

For a classification task, the baseline model simply predicts the most frequent class for every sample in the test set.

y_train.value_counts()
 <=50K    17313
 >50K      5479
Name: income, dtype: int64

Predicting <=50K for every test sample, the accuracy is:

y_test.value_counts()
 <=50K    7407
 >50K     2362
Name: income, dtype: int64
print('baseline accuracy: {0:0.4f}'.format(7407/(7407+2362)))
baseline accuracy: 0.7582
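Equivalently, scikit-learn's DummyClassifier produces this most-frequent-class baseline directly; a small added sketch:

from sklearn.dummy import DummyClassifier

# strategy='most_frequent' always predicts the majority class (<=50K)
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('baseline accuracy: {:.4f}'.format(dummy.score(X_test, y_test)))   # -> 0.7582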

The Naive Bayes model's accuracy of 0.8046 beats the 0.7582 baseline, so the model does better than always predicting the majority class.

Confusion matrix

  • True Positive (TP): predicted positive, actually positive
  • False Positive (FP): predicted positive, actually negative (Type 1 error)
  • True Negative (TN): predicted negative, actually negative
  • False Negative (FN): predicted negative, actually positive (Type 2 error)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
# scikit-learn convention: rows are actual classes, columns are predicted;
# here <=50K (class 0) is treated as the positive class
print('Confusion matrix\n\n',cm)
print('\nTP = ',cm[0,0])
print('\nTN = ', cm[1,1])

print('\nFP = ', cm[1,0])

print('\nFN = ', cm[0,1])
Confusion matrix

 [[5954 1453]
 [ 456 1906]]

TP =  5954

TN =  1906

FP =  456

FN =  1453
# rows of the matrix are actual classes, columns are predicted classes
cm_matrix = pd.DataFrame(data=cm,index = ['Actual <=50K:0', 'Actual >50K:1'],columns = ['Predicted <=50K:0', 'Predicted >50K:1'])
sns.heatmap(cm_matrix,annot=True,fmt='d',cmap='YlGnBu')

[Figure: heatmap of the confusion matrix]

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

       <=50K       0.93      0.80      0.86      7407
        >50K       0.57      0.81      0.67      2362

    accuracy                           0.80      9769
   macro avg       0.75      0.81      0.76      9769
weighted avg       0.84      0.80      0.81      9769
TP = cm[0,0]
TN = cm[1,1]
FN = cm[0,1]   # actual <=50K, predicted >50K
FP = cm[1,0]   # actual >50K, predicted <=50K
# print precision score (for the <=50K class, matching the report above)

precision = TP / float(TP + FP)


print('Precision : {0:0.4f}'.format(precision))
Precision : 0.9289
recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))
Recall or Sensitivity : 0.8038
from sklearn.metrics import f1_score
# f1_score defaults to pos_label=1 (the >50K class), so this matches the 0.67
# for >50K in the classification report above
f1 = f1_score(y_test, y_pred)
print('f1 score : {:.4f}'.format(f1))
f1 score : 0.6663
# TPR (true positive rate): same as recall
true_positive_rate = TP / float(TP + FN)


print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))
True Positive Rate : 0.8038
# FPR (false positive rate)
false_positive_rate = FP / float(FP + TN)


print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))
False Positive Rate : 0.1931
# specificity: 1 - FPR
specificity = TN / (TN + FP)

print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.8069

Predicted probabilities

By default the decision threshold is 0.5: if the predicted probability of class 1 exceeds 0.5, the model predicts class 1, i.e. >50K.
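A quick check of that equivalence, continuing the session above (an added illustration):

# applying a manual 0.5 threshold to the class-1 probability
# reproduces the output of predict()
proba_class1 = gnb.predict_proba(X_test)[:, 1]
manual_pred = (proba_class1 > 0.5).astype(int)
print(np.array_equal(manual_pred, gnb.predict(X_test)))   # -> True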

y_pred_prob = gnb.predict_proba(X_test)
y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])
y_pred_prob_df.head()
| | Prob of - <=50K | Prob of - >50K |
|---|---|---|
| 0 | 9.999994e-01 | 5.741524e-07 |
| 1 | 9.996879e-01 | 3.120935e-04 |
| 2 | 1.544056e-01 | 8.455944e-01 |
| 3 | 1.736243e-04 | 9.998264e-01 |
| 4 | 8.201210e-09 | 1.000000e+00 |
y_pred1 = gnb.predict_proba(X_test)[:, 1]
plt.rcParams['font.size'] = 12


# plot a histogram with 10 bins
plt.hist(y_pred1, bins = 10)


# set the plot title
plt.title('Histogram of predicted probabilities of salaries >50K')


# set the x-axis limit
plt.xlim(0,1)


# set the axis labels
plt.xlabel('Predicted probabilities of salaries >50K')
plt.ylabel('Frequency')

[Figure: histogram of predicted probabilities of salaries >50K]

from sklearn.metrics import roc_curve
# y_test was mapped to 0/1 earlier, so the positive label is 1 (>50K),
# not the original string ' >50K'
fpr,tpr,threshold = roc_curve(y_test, y_pred1, pos_label = 1)
plt.figure(figsize=(15,8))
plt.plot(fpr,tpr,linewidth=2)
plt.plot([0,1],[0,1],'k--')
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Salaries')

plt.xlabel('False Positive Rate (1 - Specificity)')

plt.ylabel('True Positive Rate (Sensitivity)')

[Figure: ROC curve for the Gaussian Naive Bayes classifier]

from sklearn.metrics import roc_auc_score
ROC_AUC = roc_auc_score(y_test,y_pred1)
print('ROC AUC : {:.4f}'.format(ROC_AUC))
ROC AUC : 0.8932
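One common use of the roc_curve output is choosing an operating threshold; a small added sketch, picking the point that maximizes Youden's J statistic (TPR - FPR):

# index of the ROC point with the largest TPR - FPR gap
best = np.argmax(tpr - fpr)
print('best threshold: {:.4f} (TPR={:.4f}, FPR={:.4f})'.format(threshold[best], tpr[best], fpr[best]))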