Section 5: Naive Bayes, the Naive Junior Among Classifiers
Consider the following text classification model: $P(y_i, d_i)$ denotes the joint probability of a document and its label, where $d_i$ is the text of the $i$-th training example. Assume that every word in $d_i$ is a feature.
Conditional independence assumption: given a document's label, the features are distributed independently of one another. (That is, given the label $y_i$, the probability of $d_i$ equals the product of the occurrence probabilities of the individual words in the text.)
Applying the conditional probability formula:
$$P(y_i, d_i)=P(y=y_i)\,P(d_i \mid y=y_i)=P(y=y_i)\prod_{j=1}^{V}P(x_j \mid y=y_i)^{C_{d_i}(x_j)}$$
where $V$ is the size of the dictionary, $x_j$ is the $j$-th word in the dictionary, and $C_{d_i}(x_j)$ is the number of times word $x_j$ appears in document $d_i$.
Define: $P(y=y_i)=\pi_{y_i}, \quad P(x_j \mid y=y_i)=\theta_{y_i,\,x_j}$ $\leftarrow$ these two kinds of probabilities are the parameters we want to estimate.
$$\Rightarrow P(y_i, d_i)=\pi_{y_i}\prod_{j=1}^{V}\theta_{y_i,\,x_j}^{\,C_{d_i}(x_j)}$$
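As a minimal sketch of how this factorized joint probability is evaluated (the toy vocabulary, the parameter values, and the function name `joint_prob` are invented for illustration, not taken from the notes):

```python
from collections import Counter

# Hypothetical toy setup: a vocabulary of V = 3 words and K = 2 classes.
pi = {0: 0.5, 1: 0.5}                       # pi_k = P(y = k)
theta = {                                    # theta[k][x_j] = P(x_j | y = k)
    0: {"ball": 0.6, "game": 0.3, "election": 0.1},
    1: {"ball": 0.1, "game": 0.2, "election": 0.7},
}

def joint_prob(label, words):
    """P(y_i, d_i) = pi_{y_i} * prod_j theta_{y_i, x_j} ** C_{d_i}(x_j)."""
    counts = Counter(words)                  # C_{d_i}(x_j): per-word counts
    p = pi[label]
    for word, c in counts.items():
        p *= theta[label][word] ** c
    return p

doc = ["ball", "ball", "game"]               # d_i as a bag of words
print(joint_prob(0, doc))                    # 0.5 * 0.6**2 * 0.3 ≈ 0.054
```

In practice the product is computed in log space to avoid underflow on long documents, which is exactly why the derivation switches to the log-likelihood.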
(A tilde under a parameter denotes a parameter vector or matrix; $D$ is the total number of documents in the dataset.)
$$\mathrm{Likelihood}(\underset{\sim}{\pi},\,\underset{\sim}{\theta})=\prod_{i=1}^{D}P(y_i, d_i)=\prod_{i=1}^{D}\Big[\pi_{y_i}\prod_{j=1}^{V}\theta_{y_i,\,x_j}^{\,C_{d_i}(x_j)}\Big]$$
$$\log\big(\mathrm{Likelihood}(\underset{\sim}{\pi},\,\underset{\sim}{\theta})\big)=\sum_{i=1}^{D}\Big[\log\pi_{y_i}+\sum_{j=1}^{V}C_{d_i}(x_j)\log\theta_{y_i,\,x_j}\Big]$$
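The log-likelihood can be checked numerically on a toy dataset (the parameter values and documents below are hypothetical, chosen only to exercise the formula):

```python
import math
from collections import Counter

# Hypothetical parameters, same shapes as pi_k and theta_{k, x_j}.
pi = {0: 0.5, 1: 0.5}
theta = {
    0: {"ball": 0.6, "game": 0.3, "election": 0.1},
    1: {"ball": 0.1, "game": 0.2, "election": 0.7},
}

# Dataset of (label y_i, document d_i) pairs; here D = 2.
data = [
    (0, ["ball", "game", "ball"]),
    (1, ["election", "election"]),
]

def log_likelihood(pi, theta, data):
    """sum_i [ log pi_{y_i} + sum_j C_{d_i}(x_j) * log theta_{y_i, x_j} ]"""
    total = 0.0
    for y, doc in data:
        total += math.log(pi[y])
        for word, c in Counter(doc).items():
            total += c * math.log(theta[y][word])
    return total

print(log_likelihood(pi, theta, data))
```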
Constraints: $$\begin{cases} ①\ \ \displaystyle\sum_{k=1}^{K}\pi_k=1 \\ ②\ \ \text{for every } k,\ \displaystyle\sum_{j=1}^{V}\theta_{k,\,j}=1 \end{cases}$$
$$\log(L_{\pi_k})=\sum_{i=1}^{D}\Big[\log\pi_{y_i}+\sum_{j=1}^{V}C_{d_i}(x_j)\log\theta_{y_i,\,x_j}\Big]+\alpha\Big(\sum_{k=1}^{K}\pi_k-1\Big)$$ $\leftarrow$ to solve for the optimal $\pi_k$, introduce a Lagrange multiplier for constraint ①
$$\frac{\partial \log(L_{\pi_k})}{\partial \pi_k}=\sum_{i=1}^{D}\frac{1}{\pi_k}I_{y_i=k}+\alpha=0, \quad \text{for } k=1, 2, \ldots, K$$
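The notes stop at this stationarity condition; completing the step is standard. Rearranging gives $\sum_{i=1}^{D} I_{y_i=k} = -\alpha\,\pi_k$; summing over $k$ and applying constraint ① yields $-\alpha = D$, so

$$\pi_k=\frac{\sum_{i=1}^{D} I_{y_i=k}}{D}$$

i.e., the maximum-likelihood estimate of $\pi_k$ is simply the fraction of training documents whose label is $k$.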