Understanding the Naive Bayes Classification Algorithm, from the Ground Up
Basic Probability
- Conditional probability: the probability of an event within a given condition (a restricted sample space):
  $p(B \mid S) = \frac{p(B \cap S)}{p(S)}$
- Multiplication rule:
  $p(AB) = p(A)\,p(B \mid A) = p(A)\,\frac{p(B \cap A)}{p(A)}$
  where $p(A)\,p(B \mid A)$ is the probability of $B \cap A$ relative to the full sample space $\Omega$, while $p(B \mid A)$ is the probability of $B \cap A$ relative to the sample space $A$.
- Bayes' theorem:
  $p(B_i \mid A) = \frac{p(A \mid B_i)\,p(B_i)}{\sum_{j=1}^{n} p(B_j)\,p(A \mid B_j)}$
An example
A test for a rare disease is 99% accurate (for both the sick and the healthy), and 0.5% of the population has the disease. Xiao Zhang takes the test and the result comes back positive. What is the probability that Xiao Zhang actually has the disease?
Let A be the event "actually has the disease" and B the event "tests positive". Then P(A) = 0.5%, and we want P(A|B):
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(A^-)\,P(B \mid A^-)} = \frac{99\% \times 0.5\%}{99\% \times 0.5\% + (1 - 0.5\%)(1 - 99\%)} \approx 0.33$$
If Xiao Zhang retakes the test and it comes back positive again, what is the probability now?
The first result raised the probability of disease among those testing positive to 0.33, so using 0.33 as the new prior:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(A^-)\,P(B \mid A^-)} = \frac{99\% \times 0.33}{99\% \times 0.33 + (1 - 0.33)(1 - 99\%)} \approx 0.98$$
This is why, during the pandemic, a single positive test was not enough to get you quarantined; it took a second positive result.
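The same update can be run in a few lines of plain Python. A minimal sketch, assuming, as the example above does, that the 99% accuracy applies to both the sick and the healthy group (so the false positive rate is 1%):

def bayes_update(prior, sensitivity=0.99, false_positive_rate=0.01):
    """P(disease | positive test), given the prior P(disease)."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p1 = bayes_update(0.005)   # after the first positive test  -> ~0.33
p2 = bayes_update(p1)      # after the second positive test -> ~0.98
print(f'{p1:.2f}, {p2:.2f}')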
Naive Bayes Classification
Principle
Consider spam filtering: given an incoming email $D$, we want to decide whether it is spam. That amounts to comparing which is larger, the spam posterior $p(h^+ \mid D)$ or the normal-mail posterior $p(h^- \mid D)$. By Bayes' theorem, $p(h^+ \mid D) = \frac{p(h^+)\,p(D \mid h^+)}{p(D)}$ and $p(h^- \mid D) = \frac{p(h^-)\,p(D \mid h^-)}{p(D)}$. The two denominators are identical, so we only need to compare the numerators. The priors $p(h^+)$ and $p(h^-)$ are easy to estimate from the training set. An email consists of $n$ words $d_1, d_2, \dots, d_n$, so by the chain rule $p(D \mid h^+) = p(d_1, d_2, \dots, d_n \mid h^+) = p(d_1 \mid h^+)\,p(d_2 \mid d_1, h^+)\,p(d_3 \mid d_2, d_1, h^+) \cdots p(d_n \mid d_{n-1} \dots d_1, h^+)$. Naive Bayes assumes the features are conditionally independent, so $p(D \mid h^+)$ simplifies to $p(d_1 \mid h^+)\,p(d_2 \mid h^+)\,p(d_3 \mid h^+) \cdots p(d_n \mid h^+)$: all we need to do is count how often each word appears in spam.
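To make the word counting concrete, here is a minimal from-scratch sketch of that idea on a made-up toy corpus (the emails and words are invented for illustration, and add-one smoothing is included so unseen words don't zero out the product; the sklearn demo below uses Gaussian Naive Bayes instead):

from collections import Counter

# toy training corpus: (words, label); the emails are invented for illustration
train = [(['win', 'cash', 'now'], 'spam'),
         (['cheap', 'cash', 'offer'], 'spam'),
         (['meeting', 'at', 'noon'], 'ham'),
         (['project', 'meeting', 'notes'], 'ham')]

labels = [label for _, label in train]
prior = {c: labels.count(c) / len(labels) for c in set(labels)}   # p(h+), p(h-)
word_counts = {c: Counter() for c in prior}                       # word frequencies per class
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def score(words, c):
    # p(h) * prod_i p(d_i | h), with add-one smoothing so an unseen
    # word doesn't collapse the whole product to zero
    total = sum(word_counts[c].values())
    s = prior[c]
    for w in words:
        s *= (word_counts[c][w] + 1) / (total + len(vocab))
    return s

email = ['cash', 'offer', 'now']
print(max(prior, key=lambda c: score(email, c)))   # -> spam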
Demo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
Load data
df = pd.read_csv('.\\data\\adult.csv',header=None)
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = col_names
df['native_country'].unique()
array([' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico',
' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada',
' Germany', ' Iran', ' Philippines', ' Italy', ' Poland',
' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos',
' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic',
' El-Salvador', ' France', ' Guatemala', ' China', ' Japan',
' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland',
' Trinadad&Tobago', ' Greece', ' Nicaragua', ' Vietnam', ' Hong',
' Ireland', ' Hungary', ' Holand-Netherlands'], dtype=object)
Data cleaning
# object-dtype columns are the categorical features
categorical = [col for col in df.columns if df[col].dtype=='O']
# missing values in this dataset are encoded as ' ?'
for col in categorical:
    df[col].replace(' ?', np.nan, inplace=True)
numerical = [col for col in df.columns if df[col].dtype!='O']
df[numerical].isnull().sum()
age 0
fnlwgt 0
education_num 0
capital_gain 0
capital_loss 0
hours_per_week 0
dtype: int64
df['income'].unique()
array([' <=50K', ' >50K'], dtype=object)
df['income'] = df['income'].map({' <=50K':0,' >50K':1})
X = df.drop(['income'],axis=1)
y = df['income']
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
Feature engineering
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_test = y_test.reset_index(drop = True)
categorical.remove('income')  # income is the target, not a feature
X_test[categorical].isnull().mean()
workclass 0.057324
education 0.000000
marital_status 0.000000
occupation 0.057836
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.017300
dtype: float64
# impute missing categorical values with each column's mode
for df2 in [X_train, X_test]:
    df2['workclass'].fillna(df2['workclass'].mode()[0], inplace=True)
    df2['occupation'].fillna(df2['occupation'].mode()[0], inplace=True)
    df2['native_country'].fillna(df2['native_country'].mode()[0], inplace=True)
X_test[categorical].isnull().mean()
workclass 0.0
education 0.0
marital_status 0.0
occupation 0.0
relationship 0.0
race 0.0
sex 0.0
native_country 0.0
dtype: float64
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[categorical])
X_train_categorical = pd.DataFrame(enc.transform(X_train[categorical]).toarray())
X_train_categorical.columns = enc.get_feature_names_out(categorical)
# indices were reset above, so the one-hot frame aligns row by row
X_train = X_train[numerical].join(X_train_categorical)
X_train.head()
|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|-----|--------|---------------|--------------|--------------|----------------|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 170871 | 9 | 7298 | 0 | 60 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 47 | 108890 | 9 | 1831 | 0 | 38 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 48 | 187505 | 10 | 0 | 0 | 50 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 29 | 145592 | 9 | 0 | 0 | 40 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 23 | 203003 | 4 | 0 | 0 | 25 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 105 columns
X_test_categorical = pd.DataFrame(enc.transform(X_test[categorical]).toarray())
X_test_categorical.columns = enc.get_feature_names_out(categorical)
X_test = X_test[numerical].join(X_test_categorical)
X_train.shape
(22792, 105)
cols = X_train.columns
from sklearn.preprocessing import RobustScaler
# RobustScaler centers by the median and scales by the IQR, so outliers have less influence
scaler = RobustScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_test.shape
(9769, 105)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_train.head()
|   | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | ... | native_country_ Portugal | native_country_ Puerto-Rico | native_country_ Scotland | native_country_ South | native_country_ Taiwan | native_country_ Thailand | native_country_ Trinadad&Tobago | native_country_ United-States | native_country_ Vietnam | native_country_ Yugoslavia |
|---|-----|--------|---------------|--------------|--------------|----------------|---|---|---|---|-----|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.40 | -0.058906 | -0.333333 | 7298.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.50 | -0.578076 | -0.333333 | 1831.0 | 0.0 | -0.4 | 0.0 | 0.0 | 0.0 | -1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.55 | 0.080425 | 0.000000 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | -0.40 | -0.270650 | -0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
| 4 | -0.70 | 0.210240 | -2.000000 | 0.0 | 0.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 |
5 rows × 105 columns
Building the model
from sklearn.naive_bayes import GaussianNB
# Gaussian NB models each feature with a per-class normal distribution
gnb = GaussianNB()
gnb.fit(X_train,y_train)
GaussianNB()
y_pred = gnb.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_test,y_pred)))
Model accuracy score:0.8046
y_train_pred = gnb.predict(X_train)
print('Model accuracy score:{0:0.4f}'.format(accuracy_score(y_train,y_train_pred)))
Model accuracy score:0.8067
The train and test accuracy scores are very close, so the model is not overfitting.
Comparison with a baseline
For a classification problem, the baseline model simply predicts the most frequent class for every test sample.
y_train.value_counts()
0    17313
1     5479
Name: income, dtype: int64
Predicting the majority class (0, i.e. <=50K) for every test sample, the accuracy is:
y_test.value_counts()
0    7407
1    2362
Name: income, dtype: int64
print('baseline accuracy: {0:0.4f}'.format(7407/(7407+2362)))
baseline accuracy: 0.7582
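The same baseline is available directly from sklearn. A sketch using DummyClassifier, assuming the X_train/X_test/y_train/y_test defined above:

from sklearn.dummy import DummyClassifier

# always predicts the training set's most frequent class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('baseline accuracy: {0:0.4f}'.format(dummy.score(X_test, y_test)))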
The Naive Bayes model's accuracy of 0.8046 beats the 0.7582 baseline, so it is doing better than always guessing the majority class.
Confusion matrix
- True Positive (TP): predicted positive, actually positive
- False Positive (FP): predicted positive, actually negative (Type 1 error)
- True Negative (TN): predicted negative, actually negative
- False Negative (FN): predicted negative, actually positive (Type 2 error)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# sklearn convention: cm[i, j] counts samples with true label i predicted as j,
# so with >50K (label 1) as the positive class:
print('\nTP = ', cm[1,1])
print('\nTN = ', cm[0,0])
print('\nFP = ', cm[0,1])
print('\nFN = ', cm[1,0])
Confusion matrix

 [[5954 1453]
 [ 456 1906]]

TP =  1906

TN =  5954

FP =  1453

FN =  456
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'], columns=['Predict Negative:0', 'Predict Positive:1'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.93      0.80      0.86      7407
           1       0.57      0.81      0.67      2362

    accuracy                           0.80      9769
   macro avg       0.75      0.81      0.76      9769
weighted avg       0.84      0.80      0.81      9769
# with >50K (label 1) as the positive class
TP = cm[1,1]
TN = cm[0,0]
FP = cm[0,1]
FN = cm[1,0]
# print precision score
precision = TP / float(TP + FP)
print('Precision : {0:0.4f}'.format(precision))
Precision : 0.5674
recall = TP / float(TP + FN)
print('Recall or Sensitivity : {0:0.4f}'.format(recall))
Recall or Sensitivity : 0.8069
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)  # avoid shadowing the imported function
print('f1 score : {:.4f}'.format(f1))
f1 score : 0.6663
# TPR (true positive rate): same as recall
true_positive_rate = TP / float(TP + FN)
print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))
True Positive Rate : 0.8069
# FPR (false positive rate)
false_positive_rate = FP / float(FP + TN)
print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))
False Positive Rate : 0.1962
# specificity: 1 - FPR
specificity = TN / (TN + FP)
print('Specificity : {0:0.4f}'.format(specificity))
Specificity : 0.8038
Predicted probabilities
By default the classifier uses a threshold of 0.5: if the predicted probability is above 0.5, the model assigns class 1, i.e. >50K. (A sketch of using a custom threshold follows the probability table below.)
y_pred_prob = gnb.predict_proba(X_test)
y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of - <=50K', 'Prob of - >50K'])
y_pred_prob_df.head()
|   | Prob of - <=50K | Prob of - >50K |
|---|-----------------|----------------|
| 0 | 9.999994e-01 | 5.741524e-07 |
| 1 | 9.996879e-01 | 3.120935e-04 |
| 2 | 1.544056e-01 | 8.455944e-01 |
| 3 | 1.736243e-04 | 9.998264e-01 |
| 4 | 8.201210e-09 | 1.000000e+00 |
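As mentioned above, 0.5 is only the default. A sketch of classifying with a custom threshold, where 0.3 is an arbitrary illustrative value chosen to catch more >50K earners at the cost of more false positives:

# treat anything above 0.3 as >50K
custom_threshold = 0.3
y_pred_custom = (gnb.predict_proba(X_test)[:, 1] > custom_threshold).astype(int)
print('accuracy at threshold {}: {:.4f}'.format(custom_threshold,
                                                accuracy_score(y_test, y_pred_custom)))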
y_pred1 = gnb.predict_proba(X_test)[:, 1]
plt.rcParams['font.size'] = 12
# plot histogram with 10 bins
plt.hist(y_pred1, bins = 10)
# set the title of predicted probabilities
plt.title('Histogram of predicted probabilities of salaries >50K')
# set the x-axis limit
plt.xlim(0,1)
# set the axis labels
plt.xlabel('Predicted probabilities of salaries >50K')
plt.ylabel('Frequency')
from sklearn.metrics import roc_curve
fpr, tpr, threshold = roc_curve(y_test, y_pred1, pos_label=1)  # y_test is already mapped to 0/1
plt.figure(figsize=(15,8))
plt.plot(fpr,tpr,linewidth=2)
plt.plot([0,1],[0,1],'k--')
plt.rcParams['font.size'] = 12
plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Salaries')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
from sklearn.metrics import roc_auc_score
ROC_AUC = roc_auc_score(y_test,y_pred1)
print('ROC AUC : {:.4f}'.format(ROC_AUC))
ROC AUC : 0.8932
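A single train/test split yields a single AUC number. To gauge how stable that estimate is, one could cross-validate on the training set; a sketch with 10 folds:

from sklearn.model_selection import cross_val_score

# 10-fold cross-validated ROC AUC on the training set
auc_scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='roc_auc')
print('mean ROC AUC : {:.4f} (+/- {:.4f})'.format(auc_scores.mean(), auc_scores.std()))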