Prerequisites
- Conditional probability, the law of total probability, Bayes' theorem
- Maximum likelihood estimation, Bayesian estimation (https://zhuanlan.zhihu.com/p/61593112)
- Pen and paper
Bayesian Decision Theory (the Strategy)
In short, how to decide (Decision) depends on knowing all the relevant probabilities (Probability) and the misclassification losses (Loss).
The opposite of loss is reward, which makes this a bit like reinforcement learning, where decisions are driven by expected reward.
Driving Example
Consider:
Feature vector: $\mathbf x\in \mathbf R^n$; class label set: $y=\{c_1, c_2, \cdots, c_K\}$.
Let $\lambda_{ij}$ denote the loss incurred by misclassifying a sample of class $c_j$ as $c_i$, and let $R(c_i\vert \mathbf x)$ denote the expected loss of classifying sample $\mathbf x$ as $c_i$.
So how do we compute this loss? Consider the following:
$$R(c_i\vert \mathbf x)=\sum_{j=1}^K\lambda_{ij}\,p(c_j\vert \mathbf x)$$
To make this concrete, define:
$$\lambda_{ij}= \begin{cases} 0 & \text{if } i = j\\ 1 & \text{otherwise} \end{cases}$$
and take some sample $\mathbf x$ with $p(c_1\vert \mathbf x)=0.1$, $p(c_2\vert \mathbf x)=0.2$, $p(c_3\vert \mathbf x)=0.7$ (how these posteriors are computed is covered later in this post).
Substituting into the expected loss above gives:
$$R(c_1\vert \mathbf x)=0+1\times 0.2+1\times 0.7=0.9=1-p(c_1\vert \mathbf x)$$
$$R(c_2\vert \mathbf x)=1\times 0.1+0+1\times 0.7=0.8=1-p(c_2\vert \mathbf x)$$
$$R(c_3\vert \mathbf x)=1\times 0.1+1\times 0.2+0=0.3=1-p(c_3\vert \mathbf x)$$
Intuitively, we just pick $\arg\min_i R(c_i\vert \mathbf x)$, or equivalently $\arg\max_i p(c_i\vert \mathbf x)$, and the decision is made.
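Here is a minimal numeric sketch of this decision rule (the posteriors are the made-up numbers from the example above); it confirms that under 0-1 loss, minimizing the conditional risk and maximizing the posterior pick the same class:

import numpy as np

# posteriors p(c_j | x) from the example above
posterior = np.array([0.1, 0.2, 0.7])
# 0-1 loss matrix: lambda_ij = 0 if i == j, else 1
loss = 1 - np.eye(3)
# conditional risk R(c_i | x) = sum_j lambda_ij * p(c_j | x)
risk = loss @ posterior
print(risk)                # [0.9 0.8 0.3]
print(risk.argmin())       # 2, i.e. argmin_i R(c_i | x)
print(posterior.argmax())  # 2, the same decision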
Moving up from a single sample to the whole sample space, i.e. from the conditional risk to the overall risk $E_{\mathbf x}[R(c_i\vert \mathbf x)]$: our ultimate goal is to minimize this overall risk, and of course we need a classifier to do so, which gives:
$$h(\mathbf x)=\arg\min_i E_{\mathbf x}[R(c_i\vert \mathbf x)]$$
Combining this with the example above (under 0-1 loss, $R(c_i\vert \mathbf x)=1-p(c_i\vert \mathbf x)$, and minimizing the conditional risk at every single $\mathbf x$ also minimizes the overall risk), this goal is achieved by:
$$h(\mathbf x)=\arg\max_i p(c_i\vert \mathbf x)$$
Generative and Discriminative Models
In short, a generative model learns the joint probability distribution over the input-output space $\langle\mathcal X, \mathcal Y\rangle$ and derives the desired $p(c_i\vert \mathbf x)$ from it, whereas a discriminative model obtains $p(c_i\vert \mathbf x)$ by modeling it directly.
Common generative models: naive Bayes, hidden Markov models, etc.
Common discriminative models: linear regression, decision trees, SVMs, etc.
The Algorithm
Bayes' Theorem & Reformulating the Problem
From the analysis above, what we need to find is $p(c\vert \mathbf x)$ (dropping the subscript $i$ to simplify notation).
By the definition of conditional probability:
$$p(c\vert \mathbf x)=\frac{p(c, \mathbf x)}{p(\mathbf x)}$$
and since:
$$p(c,\mathbf x)=p(\mathbf x\vert c)\cdot p(c)$$
we get:
$$p(c\vert \mathbf x)=\frac{p(\mathbf x\vert c) \cdot p(c)}{p(\mathbf x)}$$
which is exactly Bayes' theorem.
Since $p(\mathbf x)$ is the same for every class, it can be dropped from the maximization, so the problem now becomes:
$$h(\mathbf x)=\arg\max_c p(\mathbf x\vert c)\cdot p(c)$$
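A quick worked example (all numbers made up): suppose two classes with priors $p(c_1)=0.6$, $p(c_2)=0.4$, and for some sample $\mathbf x$ the likelihoods are $p(\mathbf x\vert c_1)=0.2$, $p(\mathbf x\vert c_2)=0.5$. Then

$$p(\mathbf x\vert c_1)\,p(c_1)=0.2\times 0.6=0.12,\qquad p(\mathbf x\vert c_2)\,p(c_2)=0.5\times 0.4=0.20,$$

so $h(\mathbf x)=c_2$; the denominator $p(\mathbf x)=0.12+0.20=0.32$ never needs to be computed for the decision.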
Naive Bayes & the Conditional Independence Assumption
Note that $p(\mathbf x\vert c)$ has $K\cdot\prod\limits_{i=1}^{N}S_i$ parameters, where $S_i$ is the number of values feature $i$ can take. Even if every feature were merely binary, that would already be $K\cdot 2^N$ parameters, which nobody can afford.
So naive Bayes was born: assume the features are conditionally independent given the class, so that:
$$p(\mathbf x\vert c)=p(x_1\vert c)\cdot p(x_2\vert c)\cdot p(x_3\vert c)\cdots p(x_N\vert c)$$
The parameter count is now $K\cdot\sum\limits_{i=1}^N S_i$; if the features are again binary, that is just $K\cdot N\cdot 2$.
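A tiny sketch to make the two counts concrete; the value counts in `S` are an assumption matching the six watermelon features used in the hands-on section below (five ternary features and one binary one):

import math

K = 2                   # number of classes
S = [3, 3, 3, 3, 3, 2]  # number of values per feature (assumed)

full = K * math.prod(S)  # parameters for p(x | c) without independence
naive = K * sum(S)       # parameters under the naive Bayes assumption
print(full, naive)       # 972 34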
Computing All the Relevant Probabilities
Maximum Likelihood Estimation
First, the simple one, $p(c)$. With maximum likelihood estimation:
$$p(c_i)=\frac{n_i}{N}$$
where $n_i$ denotes the number of samples belonging to class $c_i$.
As for $p(\mathbf x\vert c)$, maximum likelihood estimation gives:
$$p(x_i^j\vert c_i)=\frac{m(x_i^j)}{n_i}$$
where $m(x_i^j)$ denotes the number of samples with $c=c_i$ whose feature $j$ takes value $i$.
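A self-contained sketch of these counting estimates with pandas (toy data, not the watermelon dataset):

import pandas as pd

y = pd.Series(["是", "是", "是", "否", "否"])
x = pd.Series(["清晰", "清晰", "稍糊", "稍糊", "模糊"])  # a single feature

# p(c_i) = n_i / N
print(y.value_counts(normalize=True))             # 是: 0.6, 否: 0.4
# p(x^j | c_i) = m / n_i, here for class c = "是"
print(x[y == "是"].value_counts(normalize=True))  # 清晰: 2/3, 稍糊: 1/3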
At this point we notice a problem: some value of feature $j$ may appear under $c_1$ but never under $c_2$, giving $p(\mathbf x\vert c_i)=\frac{0}{n_i}=0$, which zeroes out the whole product and hurts classification. This is why we turn to the Bayesian estimation below.
Bayesian Estimation
Smoothing factor $\lambda$; the general form:
$$p(c_i)=\frac{n_i+\lambda}{N+K\lambda}\qquad p(\mathbf x\vert c_i)=\frac{m(\mathbf x)+\lambda}{n_i+S\lambda}$$
where $K$ is the number of classes and $S$ is the number of values feature $j$ can take. Different choices of $\lambda$ give (see the sketch after this list):
- $\lambda = 0$: maximum likelihood estimation
- $\lambda \in (0, 1)$: Lidstone smoothing
- $\lambda = 1$: Laplace smoothing
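A minimal sketch of the smoothed estimate; the counts are made up, and `lambda_` plays the role of $\lambda$:

def smoothed_prob(count, total, num_values, lambda_):
    # Bayesian estimate: (count + lambda) / (total + num_values * lambda)
    return (count + lambda_) / (total + num_values * lambda_)

# a feature value never observed under class c_i: count = 0, n_i = 7, S = 3
print(smoothed_prob(0, 7, 3, 0))    # 0.0    -> MLE, the problematic zero
print(smoothed_prob(0, 7, 3, 0.5))  # ~0.059 -> Lidstone smoothing
print(smoothed_prob(0, 7, 3, 1))    # 0.1    -> Laplace smoothing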
Gaussian Model
All the examples above deal with discrete-valued features.
So how do we handle continuous-valued features?
By introducing an assumed model. Consider:
Suppose feature $j$ is height $h$. We can assume that, given the class, it follows a Gaussian distribution with mean $\mu=\bar{h}$ and variance $\sigma^2=E[(h-\mu)^2]$. These parameters then let us compute the corresponding probability density.
For details see: https://www.letiantian.me/2014-10-12-three-models-of-naive-nayes/
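A minimal sketch of the Gaussian likelihood for one continuous feature (the heights are made up):

import numpy as np

heights = np.array([1.70, 1.75, 1.68, 1.80])  # heights within one class
mu, sigma = heights.mean(), heights.std()     # population std (ddof=0)

def gaussian_pdf(h, mu, sigma):
    # density of N(mu, sigma^2) at h, used as p(h | c)
    return np.exp(-(h - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(gaussian_pdf(1.72, mu, sigma))  # plugs into the naive Bayes product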
Hands-On
With scikit-learn
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# datasets, X_train, X_test, y_train, y_test are defined in the
# from-scratch section below
# CategoricalNB needs the categories encoded as integers
encoder = OrdinalEncoder()
encoder.fit(datasets.iloc[:, :-1])
X_train, X_test = encoder.transform(X_train), encoder.transform(X_test)

bayes = CategoricalNB(alpha=1)
bayes.fit(X_train, y_train)
pred = bayes.predict(X_test)
for i in zip(pred, y_test):
    print("predict: " + i[0] + ", true: " + i[1])
# predict: 是, true: 否
# predict: 是, true: 否
# predict: 是, true: 是
# predict: 是, true: 是
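Note that `CategoricalNB`'s `alpha` parameter is exactly the smoothing factor $\lambda$ from the Bayesian-estimation section, so `alpha=1` corresponds to Laplace smoothing.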
From Scratch
import pandas as pd
import numpy as np
datasets = pd.DataFrame([
["青绿","蜷缩","浊响","清晰","凹陷","硬滑","是"],
["乌黑","蜷缩","沉闷","清晰","凹陷","硬滑","是"],
["乌黑","蜷缩","浊响","清晰","凹陷","硬滑","是"],
["青绿","蜷缩","沉闷","清晰","凹陷","硬滑","是"],
["浅白","蜷缩","浊响","清晰","凹陷","硬滑","是"],
["青绿","稍蜷","浊响","清晰","稍凹","软粘","是"],
["乌黑","稍蜷","浊响","稍糊","稍凹","软粘","是"],
["乌黑","稍蜷","浊响","清晰","稍凹","硬滑","是"],
["乌黑","稍蜷","沉闷","稍糊","稍凹","硬滑","否"],
["青绿","硬挺","清脆","清晰","平坦","软粘","否"],
["浅白","硬挺","清脆","模糊","平坦","硬滑","否"],
["浅白","蜷缩","浊响","模糊","平坦","软粘","否"],
["青绿","稍蜷","浊响","稍糊","凹陷","硬滑","否"],
["浅白","稍蜷","沉闷","稍糊","凹陷","硬滑","否"],
["乌黑","稍蜷","浊响","清晰","稍凹","软粘","否"],
["浅白","蜷缩","浊响","模糊","平坦","硬滑","否"],
["青绿","蜷缩","沉闷","稍糊","稍凹","硬滑","否"]], columns=["色泽","根蒂","敲声","纹理","脐部","触感","好瓜"])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(datasets.iloc[:, :-1], datasets.iloc[:, -1], test_size=0.2, random_state=77)
print(X_train)
print(y_test)
# 色泽 根蒂 敲声 纹理 脐部 触感
# 10 浅白 硬挺 清脆 模糊 平坦 硬滑
# 4 浅白 蜷缩 浊响 清晰 凹陷 硬滑
# 6 乌黑 稍蜷 浊响 稍糊 稍凹 软粘
# 12 青绿 稍蜷 浊响 稍糊 凹陷 硬滑
# 13 浅白 稍蜷 沉闷 稍糊 凹陷 硬滑
# 3 青绿 蜷缩 沉闷 清晰 凹陷 硬滑
# 15 浅白 蜷缩 浊响 模糊 平坦 硬滑
# 7 乌黑 稍蜷 浊响 清晰 稍凹 硬滑
# 9 青绿 硬挺 清脆 清晰 平坦 软粘
# 0 青绿 蜷缩 浊响 清晰 凹陷 硬滑
# 8 乌黑 稍蜷 沉闷 稍糊 稍凹 硬滑
# 5 青绿 稍蜷 浊响 清晰 稍凹 软粘
# 11 浅白 蜷缩 浊响 模糊 平坦 软粘
# 14 否
# 16 否
# 2 是
# 1 是
# Bayesian estimate of p(c) (Laplace smoothing)
def get_prior(y):
    lu_table = dict()  # lookup table: class -> prior
    num_classes = len(y.unique())
    num_samples = y.count()
    lambda_ = 1  # Laplace smoothing
    for index in y.value_counts().index:
        lu_table[index] = (y.value_counts()[index] + lambda_) / (num_samples + num_classes * lambda_)
    return lu_table
# Bayesian estimate of p(x|c)
def get_likelihood(X, y):
    lu_table = dict()
    # maps each feature name to the array of its possible values
    features_table = dict()
    lambda_ = 1
    # number of values each feature can take
    for feature_name in X:
        features_table[feature_name] = X[feature_name].unique()
    for class_ in y.unique():
        lu_table[class_] = dict()
        X_class = X[class_ == y]
        for feature_name in features_table.keys():
            x = X_class[feature_name]
            values = x.unique()
            value_counts = x.value_counts()
            num_values = len(features_table[feature_name])
            # keyed by value alone, which works here only because no two
            # features of this dataset share a value string
            for value in features_table[feature_name]:
                if value in values:
                    lu_table[class_][value] = (value_counts[value] + lambda_) / (y.value_counts()[class_] + num_values * lambda_)
                # a feature value that never occurs under this class
                else:
                    lu_table[class_][value] = (0 + lambda_) / (y.value_counts()[class_] + num_values * lambda_)
    return lu_table
print(get_prior(y_train))
print(get_likelihood(X_train, y_train))
# {'否': 0.5333333333333333, '是': 0.4666666666666667}
# {'否': {'浅白': 0.5, '乌黑': 0.2, '青绿': 0.3, '硬挺': 0.3, '蜷缩': 0.3, '稍蜷': 0.4, '清脆': 0.3, '浊响': 0.4, '沉闷': 0.3, '模糊': 0.4, '清晰': 0.2, '稍糊': 0.4, '平坦': 0.5, '凹陷': 0.3, '稍凹': 0.2, '硬滑': 0.6666666666666666, '软粘': 0.3333333333333333}, '是': {'浅白': 0.2222222222222222, '乌黑': 0.3333333333333333, '青绿': 0.4444444444444444, '硬挺': 0.1111111111111111, '蜷缩': 0.4444444444444444, '稍蜷': 0.4444444444444444, '清脆': 0.1111111111111111, '浊响': 0.6666666666666666, '沉闷': 0.2222222222222222, '模糊': 0.1111111111111111, '清晰': 0.6666666666666666, '稍糊': 0.2222222222222222, '平坦': 0.1111111111111111, '凹陷': 0.4444444444444444, '稍凹': 0.4444444444444444, '硬滑': 0.625, '软粘': 0.375}}
class_lu_table = get_prior(y_train)
feature_lu_table = get_likelihood(X_train, y_train)

def get_predict(sample, class_lu_table, feature_lu_table):
    class_proba = dict()
    for class_ in class_lu_table.keys():
        class_proba[class_] = class_lu_table[class_]
        for value in sample:
            class_proba[class_] *= feature_lu_table[class_][value]
    # pick the class with the largest unnormalized posterior
    # (max over the probabilities, not over the class labels)
    return max(class_proba.items(), key=lambda kv: kv[1])

for i in range(len(X_test)):
    print("predict: " + get_predict(X_test.iloc[i], class_lu_table, feature_lu_table)[0] + ", true: " + y_test.iloc[i])
# predict: 是, true: 否
# predict: 是, true: 否
# predict: 是, true: 是
# predict: 是, true: 是
# same predictions as sklearn
# accuracy is poor, most likely because the sample is far too small
Pros and cons: naive Bayes is simple, fast to train, and usable with very little data; the price is the conditional independence assumption, which rarely holds exactly, so strongly correlated features can hurt its accuracy.
If you are still completely lost, then turn to:
Reference books:
Li Hang, Statistical Learning Methods (李航《统计学习方法》)
Zhou Zhihua, Machine Learning (周志华《机器学习》)
Reference link:
https://github.com/datawhalechina/team-learning/tree/master/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AE%97%E6%B3%95%E5%9F%BA%E7%A1%80