Understanding the Naive Bayes Classification Algorithm, from the Ground Up
Basic Probability
- Conditional probability: the probability of an event within a given condition (a restricted sample space):
  $p(B \mid S) = \frac{p(B \cap S)}{p(S)}$
- Multiplication rule:
  $p(AB) = p(A)\,p(B \mid A) = p(A)\,\frac{p(B \cap A)}{p(A)}$
  where $p(A)\,p(B \mid A)$ is the probability of $B \cap A$ relative to the full sample space $\Omega$, while $p(B \mid A)$ is the probability of $B \cap A$ relative to the sample space $A$.
- Bayes' theorem:
  $p(B_i \mid A) = \frac{p(A \mid B_i)\,p(B_i)}{\sum_{j=1}^{n} p(B_j)\,p(A \mid B_j)}$
An example
A test for a rare disease is 99% accurate (for both the sick and the healthy), and 0.5% of the population has the disease. Xiao Zhang takes the test and the result comes back positive. What is the probability that Xiao Zhang actually has the disease?
Let A be the event "actually has the disease" and B the event "tests positive". Then P(A) = 0.5%, and we want P(A|B):
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(A^-)\,P(B \mid A^-)} = \frac{99\% \times 0.5\%}{99\% \times 0.5\% + (1 - 0.5\%)(1 - 99\%)} \approx 0.33$$
If Xiao Zhang retakes the test and it comes back positive again, what is the probability now?
The first result raised the probability of disease among those testing positive to 0.33, so using 0.33 as the new prior:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(A^-)\,P(B \mid A^-)} = \frac{99\% \times 0.33}{99\% \times 0.33 + (1 - 0.33)(1 - 99\%)} \approx 0.98$$
This is why, during the pandemic, a single positive test was not enough to get you quarantined; it took a second positive result.
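The same update can be run in a few lines of plain Python. A minimal sketch, assuming, as the example above does, that the 99% accuracy applies to both the sick and the healthy group (so the false positive rate is 1%):

def bayes_update(prior, sensitivity=0.99, false_positive_rate=0.01):
    """P(disease | positive test), given the prior P(disease)."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p1 = bayes_update(0.005)   # after the first positive test  -> ~0.33
p2 = bayes_update(p1)      # after the second positive test -> ~0.98
print(f'{p1:.2f}, {p2:.2f}')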
Naive Bayes Classification
Principle
Consider spam filtering: given an incoming email $D$, we want to decide whether it is spam. That amounts to comparing which is larger, the spam posterior $p(h^+ \mid D)$ or the normal-mail posterior $p(h^- \mid D)$. By Bayes' theorem, $p(h^+ \mid D) = \frac{p(h^+)\,p(D \mid h^+)}{p(D)}$ and $p(h^- \mid D) = \frac{p(h^-)\,p(D \mid h^-)}{p(D)}$. The two denominators are identical, so we only need to compare the numerators. The priors $p(h^+)$ and $p(h^-)$ are easy to estimate from the training set. An email consists of $n$ words $d_1, d_2, \dots, d_n$, so by the chain rule $p(D \mid h^+) = p(d_1, d_2, \dots, d_n \mid h^+) = p(d_1 \mid h^+)\,p(d_2 \mid d_1, h^+)\,p(d_3 \mid d_2, d_1, h^+) \cdots p(d_n \mid d_{n-1} \dots d_1, h^+)$. Naive Bayes assumes the features are conditionally independent, so $p(D \mid h^+)$ simplifies to $p(d_1 \mid h^+)\,p(d_2 \mid h^+)\,p(d_3 \mid h^+) \cdots p(d_n \mid h^+)$: all we need to do is count how often each word appears in spam.
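To make the word counting concrete, here is a minimal from-scratch sketch of that idea on a made-up toy corpus (the emails and words are invented for illustration, and add-one smoothing is included so unseen words don't zero out the product; the sklearn demo below uses Gaussian Naive Bayes instead):

from collections import Counter

# toy training corpus: (words, label); the emails are invented for illustration
train = [(['win', 'cash', 'now'], 'spam'),
         (['cheap', 'cash', 'offer'], 'spam'),
         (['meeting', 'at', 'noon'], 'ham'),
         (['project', 'meeting', 'notes'], 'ham')]

labels = [label for _, label in train]
prior = {c: labels.count(c) / len(labels) for c in set(labels)}   # p(h+), p(h-)
word_counts = {c: Counter() for c in prior}                       # word frequencies per class
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def score(words, c):
    # p(h) * prod_i p(d_i | h), with add-one smoothing so an unseen
    # word doesn't collapse the whole product to zero
    total = sum(word_counts[c].values())
    s = prior[c]
    for w in words:
        s *= (word_counts[c][w] + 1) / (total + len(vocab))
    return s

email = ['cash', 'offer', 'now']
print(max(prior, key=lambda c: score(email, c)))   # -> spam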
Demo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Load data
df = pd.read_csv('.\\data\\adult.csv',header=None)
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = col_names
df['native_country'].unique()
array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)
Data cleaning
# object-dtype columns are the categorical features
categorical = [col for col in df.columns if df[col].dtype=='O']
# missing values in this dataset are encoded as ' ?'
for col in categorical:
    df[col].replace(' ?', np.nan, inplace=True)
numerical = [col for col in df.columns if df[col].dtype!='O']
df[numerical].isnull().sum()
age 0
fnlwgt 0
education_num 0
capital_gain 0
capital_loss 0
hours_per_week 0
dtype: int64
df['income'].unique()
array([' <=50K', ' >50K'], dtype=object)
df['income'] = df['income'].map({' <=50K':0,' >50K':1})
X = df.drop(['income'],axis=1)
y = df['income']
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Feature engineering
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop = True)
categorical.remove('income')  # income is the target, not a feature
X_test[categorical].isnull().mean()
workclass 0.057324
education 0.000000
marital_status 0.000000
occupation 0.057836
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.017300
dtype: float64
# impute missing categorical values with each column's mode
for df2 in [X_train, X_test]:
    df2['workclass'].fillna(df2['workclass'].mode()[0], inplace=True)
    df2['occupation'].fillna(df2['occupation'].mode()[0], inplace=True)
    df2['native_country'].fillna(df2['native_country'].mode()[0], inplace=True)
X_test[categorical].isnull().mean()
workclass 0.0
education 0.0
marital_status 0.0
occupation 0.0
relationship 0.0
race 0.0
sex 0.0
native_country 0.0
dtype: float64
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[categorical])
X_train_categorical = pd.DataFrame(enc.transform(X_train[categorical]).toarray())
X_train_categorical.columns = enc.get_feature_names_out(categorical)
# indices were reset above, so the one-hot frame aligns row by row
X_train = X_train[numerical].join(X_train_categorical)
X_train.head()
|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|-----|--------|---------------|--------------|--------------|----------------|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 170871 | 9 | 7298 | 0 | 60 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 47 | 108890 | 9 | 1831 | 0 | 38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 48 | 187505 | 10 | 0 | 0 | 50 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 29 | 145592 | 9 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 23 | 203003 | 4 | 0 | 0 | 25 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 105 columns
X_test_categorical = pd.DataFrame(enc.transform(X_test[categorical]).toarray())
X_test_categorical.columns = enc.get_feature_names_out(categorical)
X_test = X_test[numerical].join(X_test_categorical)
X_train.shape
(22792, 105)
cols = X_train.columns
from sklearn.preprocessing import RobustScaler
# RobustScaler centers by the median and scales by the IQR, so outliers have less influence
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_test.shape
(9769, 105)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_train.head()
|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|-----|--------|---------------|--------------|--------------|----------------|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.40 | -0.058906 | -0.333333 | 7298.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.50 | -0.578076 | -0.333333 | 1831.0 | 0.0 | -0.4 | 0.0 | 0.0 | 0.0 | -1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.55 | 0.080425 | 0.000000 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.40 | -0.270650 | -0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 4 | -0.70 | 0.210240 | -2.000000 | 0.0 | 0.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
5 rows × 105 columns
Building the model
from sklearn.naive_bayes import GaussianNB
# Gaussian NB models each feature with a per-class normal distribution
gnb = GaussianNB()
gnb.fit(X_train,y_train)
GaussianNB()
y_pred = gnb.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_test,y_pred)))
Model accuracy score:0.8046
y_train_pred = gnb.predict(X_train)
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_train,y_train_pred)))
Model accuracy score:0.8067
The train and test accuracy scores are very close, so the model is not overfitting.
Comparison with a baseline
For a classification problem, the baseline model simply predicts the most frequent class for every test sample.
y_train.value_counts()
0    17313
1     5479
Name: income, dtype: int64
Predicting the majority class (0, i.e. <=50K) for every test sample, the accuracy is:
y_test.value_counts()
0    7407
1    2362
Name: income, dtype: int64
print('baseline accuracy: {0:0.4f}'.format(7407/(7407+2362)))
baseline accuracy: 0.7582
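The same baseline is available directly from sklearn. A sketch using DummyClassifier, assuming the X_train/X_test/y_train/y_test defined above:

from sklearn.dummy import DummyClassifier

# always predicts the training set's most frequent class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('baseline accuracy: {0:0.4f}'.format(dummy.score(X_test, y_test)))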
The Naive Bayes model's accuracy of 0.8046 beats the 0.7582 baseline, so it is doing better than always guessing the majority class.
Confusion matrix
- True Positive (TP): predicted positive, actually positive
- False Positive (FP): predicted positive, actually negative (Type 1 error)
- True Negative (TN): predicted negative, actually negative
- False Negative (FN): predicted negative, actually positive (Type 2 error)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# sklearn convention: cm[i, j] counts samples with true label i predicted as j,
# so with >50K (label 1) as the positive class:
print('\nTP = ', cm[1,1])
print('\nTN = ', cm[0,0])
print('\nFP = ', cm[0,1])
print('\nFN = ', cm[1,0])
Confusion matrix

 [[5954 1453]
 [ 456 1906]]

TP =  1906

TN =  5954

FP =  1453

FN =  456
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'], columns=['Predict Negative:0', 'Predict Positive:1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.93      0.80      0.86      7407
           1       0.57      0.81      0.67      2362

    accuracy                           0.80      9769
   macro avg       0.75      0.81      0.76      9769
weighted avg       0.84      0.80      0.81      9769
# with >50K (label 1) as the positive class
TP = cm[1,1]
TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
# print precision score
precision = TP / float(TP + FP)
print('Precision : {0:0.4f}'.format(precision))
Precision : 0.5674
recall = TP / float(TP + FN)
print('Recall or Sensitivity : {0:0.4f}'.format(recall))
Recall or Sensitivity : 0.8069
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)  # avoid shadowing the imported function
print('f1 score : {:.4f}'.format(f1))
f1 score : 0.6663
# TPR (true positive rate): same as recall
true_positive_rate = TP / float(TP + FN)
print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))
True Positive Rate : 0.8069
# FPR (false positive rate)
false_positive_rate = FP / float(FP + TN)
print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))
False Positive Rate : 0.1962
# specificity: 1 - FPR
specificity = TN / (TN + FP)
print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.8038
Predicted probabilities
By default the classifier uses a threshold of 0.5: if the predicted probability is above 0.5, the model assigns class 1, i.e. >50K. (A sketch of using a custom threshold follows the probability table below.)
y_pred_prob = gnb.predict_proba(X_test)
y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])
y_pred_prob_df.head()
|   | Prob of - <=50K | Prob of - >50K |
|---|-----------------|----------------|
| 0 | 9.999994e-01 | 5.741524e-07 |
| 1 | 9.996879e-01 | 3.120935e-04 |
| 2 | 1.544056e-01 | 8.455944e-01 |
| 3 | 1.736243e-04 | 9.998264e-01 |
| 4 | 8.201210e-09 | 1.000000e+00 |
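As mentioned above, 0.5 is only the default. A sketch of classifying with a custom threshold, where 0.3 is an arbitrary illustrative value chosen to catch more >50K earners at the cost of more false positives:

# treat anything above 0.3 as >50K
custom_threshold = 0.3
y_pred_custom = (gnb.predict_proba(X_test)[:, 1] > custom_threshold).astype(int)
print('accuracy at threshold {}: {:.4f}'.format(custom_threshold,
                                                accuracy_score(y_test, y_pred_custom)))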
y_pred1 = gnb.predict_proba(X_test)[:, 1]
plt.rcParams['font.size'] = 12
# plot histogram with 10 bins
plt.hist(y_pred1, bins = 10)
# set the title of predicted probabilities
plt.title('Histogram of predicted probabilities of salaries >50K')
# set the x-axis limit
plt.xlim(0,1)
# set the axis labels
plt.xlabel('Predicted probabilities of salaries >50K')
plt.ylabel('Frequency')
from sklearn.metrics import roc_curve
fpr, tpr, threshold = roc_curve(y_test, y_pred1, pos_label=1)  # y_test is already mapped to 0/1
plt.figure(figsize=(15,8))
plt.plot(fpr,tpr,linewidth=2)
plt.plot([0,1],[0,1],'k--')
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Salaries')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
from sklearn.metrics import roc_auc_score
ROC_AUC = roc_auc_score(y_test,y_pred1)
print('ROC AUC : {:.4f}'.format(ROC_AUC))
ROC AUC : 0.8932
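A single train/test split yields a single AUC number. To gauge how stable that estimate is, one could cross-validate on the training set; a sketch with 10 folds:

from sklearn.model_selection import cross_val_score

# 10-fold cross-validated ROC AUC on the training set
auc_scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='roc_auc')
print('mean ROC AUC : {:.4f} (+/- {:.4f})'.format(auc_scores.mean(), auc_scores.std()))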