Many probabilistic models have a set of visible variables $v$ and a set of latent variables $h$. Inference in such models is often difficult: it is hard to compute $p(h|v)$ or expectations under it, yet these operations are required by tasks such as maximum likelihood learning. One way out is to cast exact inference as an optimization problem and derive inference algorithms from that formulation. To set up the optimization problem, consider a probabilistic model with visible variables $v$ and latent variables $h$. Under maximum likelihood estimation we want the log-probability of the observed data, $\log p(v;\theta)$, which is
$$\log p(v;\theta)=\log\sum_{h}p(v,h;\theta) \tag{1}$$
Marginalizing out $h$ can be expensive or intractable (with $n$ binary latent variables, for example, the sum runs over $2^n$ configurations). As an alternative, we can compute an evidence lower bound $\mathcal{L}(v,\theta,q)$ on $\log p(v;\theta)$:
$$\begin{aligned}
\log p(v;\theta) &= \log\sum_{h}p(v,h;\theta)=\log\sum_h q(h|v)\frac{p(v,h;\theta)}{q(h|v)} \\
&\geq \sum_h q(h|v)\log\frac{p(v,h;\theta)}{q(h|v)}=\mathbb{E}_{h\sim q}\left[\log p(v,h;\theta)-\log q(h|v)\right] \\
&= \log p(v;\theta)-\mathbb{E}_{h\sim q}\left[\log q(h|v)-\log p(v,h;\theta)+\log p(v;\theta)\right] \\
&= \log p(v;\theta)-\mathbb{E}_{h\sim q}\log\frac{q(h|v)}{p(v,h;\theta)/p(v;\theta)}=\log p(v;\theta)-\mathbb{E}_{h\sim q}\log\frac{q(h|v)}{p(h|v)} \\
&= \log p(v;\theta)-D_{KL}\left(q(h|v)\,\|\,p(h|v;\theta)\right) \\
&= \mathbb{E}_{h\sim q}\left[\log p(v,h;\theta)\right]+H(q)=\mathcal{L}(v,\theta,q)
\end{aligned} \tag{2}$$
Here $q(h|v)$ is a distribution over the latent variables $h$ given the visible variables $v$ of a data point. For an appropriately chosen $q$, $\mathcal{L}$ is easy to compute, and for any choice of $q$, $\mathcal{L}$ is a lower bound on the log-likelihood. The better $q(h|v)$ approximates $p(h|v)$, the tighter the bound; when $q(h|v)=p(h|v)$ the approximation is exact, the KL term vanishes, and $\mathcal{L}(v,\theta,q)=\log p(v;\theta)$. Inference can therefore be viewed as the problem of finding the distribution $q$ that maximizes $\mathcal{L}$.
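As a quick sanity check on Eq. (2), the following sketch builds a tiny fully enumerable model (one discrete latent variable with three states; the toy numbers and variable names are ours, not from any library) and verifies numerically that the ELBO never exceeds $\log p(v)$ and is tight when $q=p(h|v)$:

% Toy model: h in {1,2,3}, fixed joint p(v,h) for one observed v (a sketch)
p_vh = [0.10 0.25 0.05];                 % p(v, h=k) for the observed v
log_p_v = log(sum(p_vh));                % log p(v) by exact marginalization
q = [0.5 0.3 0.2];                       % an arbitrary variational distribution
elbo = sum(q .* (log(p_vh) - log(q)));   % E_{h~q}[log p(v,h) - log q(h)]
fprintf('log p(v) = %.4f, ELBO = %.4f\n', log_p_v, elbo);  % ELBO <= log p(v)
q_star = p_vh / sum(p_vh);               % exact posterior p(h|v)
elbo_star = sum(q_star .* (log(p_vh) - log(q_star)));      % tight bound
fprintf('ELBO at q = p(h|v): %.4f\n', elbo_star);          % equals log p(v)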
In latent variable models, the EM algorithm is a very common training method; it can be viewed as an algorithm that learns by way of an approximate posterior. It alternates the following two steps (a minimal loop skeleton follows the list):
- E-step: Let $\theta^{(0)}$ denote the parameter values at the start of the step. For each training example $v^{(i)}$ we want to train on (this works for all of the data or for a minibatch), set $q(h|v^{(i)})=p(h|v^{(i)};\theta^{(0)})$. By this definition, $q$ is fixed at the current parameters $\theta^{(0)}$: if we subsequently change $\theta$, $p(h|v;\theta)$ changes accordingly, but $q(h|v)$ stays equal to $p(h|v;\theta^{(0)})$.
- M-step: Using the optimization algorithm of choice, completely or partially maximize
$$\sum_i\mathcal{L}(v^{(i)},\theta,q)=\sum_i\sum_h q(h|v^{(i)})\log\frac{p(v^{(i)},h;\theta)}{q(h|v^{(i)})}$$
with respect to $\theta$.
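Written as code, one EM pass is just these two steps in a loop. A minimal sketch; `e_step` and `m_step` are hypothetical helpers standing in for the two formulas above, and `theta0`, `T`, and `data` are assumed given:

% One EM loop matching the two steps above (a sketch, not from this post;
% e_step and m_step are hypothetical helpers implementing the E/M formulas)
theta = theta0;                   % initial parameters (assumed given)
for t = 1 : T                     % T iterations (assumed given)
    q = e_step(theta, data);      % E-step: q(h|v^(i)) = p(h|v^(i); theta)
    theta = m_step(q, data);      % M-step: maximize sum_i L(v^(i), theta, q)
end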
For a Gaussian mixture model, the distribution is
$$p(v;\theta)=\sum_{k=1}^K\pi_k\,\mathcal{N}(v|\mu_k,\Sigma_k) \tag{3}$$
where $\sum_{k=1}^K\pi_k=1$. Here $v$ is a data sample, and the latent variable $h=\{h_1,h_2,\cdots,h_K\}$ with $h_k\in\{0,1\}$ is a one-hot indicator of which of the $K$ Gaussian components the sample came from. In the E-step, applying Bayes' rule to the component indicator gives
$$q(h_k=1|v^{(i)})=\frac{\pi_k\mathcal{N}(v^{(i)}|\mu_k,\Sigma_k)}{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(v^{(i)}|\mu_{k'},\Sigma_{k'})}=\frac{\pi_k\mathcal{N}(v^{(i)}|\mu_k,\Sigma_k)}{p(v^{(i)};\theta)} \tag{4}$$
In the M-step, to solve for the model parameters $\pi_k,\mu_k,\Sigma_k$, we set the partial derivative with respect to each to zero. For $\pi_k$, the constraint $\sum_k\pi_k=1$ is handled with a Lagrange multiplier $\lambda$:
$$J_\pi=\sum_i\sum_k q(h_k|v^{(i)})\log(\pi_k)+\lambda\Big(\sum_k\pi_k-1\Big) \tag{5}$$
$$\frac{\partial J_\pi}{\partial\pi_k}=\sum_i q(h_k|v^{(i)})\frac{1}{\pi_k}+\lambda=0 \tag{6}$$
Let $m$ be the number of samples. Multiplying Eq. (6) by $\pi_k$ and summing over $k$, and using $\sum_k\pi_k=1$ together with $\sum_k q(h_k|v^{(i)})=1$, we get
$$-\lambda=\sum_i\sum_k q(h_k|v^{(i)})=\sum_i 1=m \tag{7}$$
$$\pi_k=\frac{\sum_i q(h_k|v^{(i)})}{m} \tag{8}$$
Differentiating in the same way with respect to $\mu_k$ and $\Sigma_k$ (the $\mu_k$ step is worked out after Eq. (10)) gives
$$\mu_k=\frac{\sum_i q(h_k|v^{(i)})\,v^{(i)}}{\sum_i q(h_k|v^{(i)})} \tag{9}$$
$$\Sigma_k=\frac{\sum_i q(h_k|v^{(i)})(v^{(i)}-\mu_k)(v^{(i)}-\mu_k)^T}{\sum_i q(h_k|v^{(i)})} \tag{10}$$
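For concreteness, here is the stationarity condition behind Eq. (9): only the Gaussian log-density depends on $\mu_k$, so

$$\frac{\partial}{\partial\mu_k}\sum_i q(h_k|v^{(i)})\log\mathcal{N}(v^{(i)}|\mu_k,\Sigma_k)=\sum_i q(h_k|v^{(i)})\,\Sigma_k^{-1}(v^{(i)}-\mu_k)=0$$

and multiplying through by $\Sigma_k$ and solving for $\mu_k$ yields Eq. (9).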
MATLAB implementation
A problem can arise in practice: at some GMM iteration, the variance of one component suddenly becomes very small, so samples that should belong to that component get assigned to other components because their density under it is too low. The component then receives few or even zero samples, and since the update divides by the per-component effective sample count, a division by zero can occur, producing NaN and crashing the program. To mitigate this, the first step runs the kmeans algorithm to initialize the mean and covariance of each component, so that the component means are spread evenly over the region occupied by the data.
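A complementary safeguard, not used in the class below and only a sketch (the floor value 1e-6 is an assumption to be tuned per dataset), is to add a small ridge to each covariance after the M-step so that mvnpdf never sees a near-singular matrix:

% Hypothetical covariance floor, applied after the M-step update of sigma
obj.sigma(:,:,k) = obj.sigma(:,:,k) + 1e-6 * eye(obj.pDim);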
classdef GMM
    properties
        numOfGMs;      % number of Gaussian components
        pDim;          % dimensionality of the visible variables
        pMix;          % mixing proportion of each component
        mu;            % mean parameters
        sigma;         % covariance parameters
        trainData_x;   % training data
        trainData_q_h; % q(h|v) encoding of the training data
        N_sample;      % total number of training samples
        batchSize;     % minibatch size
        IterMax;       % maximum number of iterations
        iter_train;    % current iteration count
        isInited;      % whether initialization has run
    end
    methods
        function obj = GMM(pDim,numOfGMs) % constructor
            obj.numOfGMs = numOfGMs;      % initialize object parameters
            obj.pDim = pDim;
            obj.pMix = 1 / numOfGMs * ones(1,numOfGMs);
            obj.mu = zeros(numOfGMs,pDim) + rand(numOfGMs,pDim) * 0.1;
            obj.sigma = zeros(pDim,pDim,numOfGMs);
            for k = 1 : obj.numOfGMs
                obj.sigma(:,:,k) = eye(obj.pDim);
            end
            obj.batchSize = 100;
            obj.iter_train = 0;
            obj.IterMax = 100;
            obj.isInited = 0;
        end
        function obj = train(obj,trainData_x,Iter) % training
            if nargin == 2
                Iter = obj.IterMax;
            end
            if obj.isInited == 0
                obj = obj.trainInit(trainData_x);
            end
            obj.N_sample = size(obj.trainData_x,1);
            batch_size = obj.batchSize;
            trainData_q_h_batch = zeros(batch_size,obj.numOfGMs);
            diagMat = eye(batch_size);
            findDiag = find(diagMat); % linear indices of the diagonal
            while obj.iter_train < Iter
                obj.iter_train = obj.iter_train + 1;
                trainData_x_batch = obj.trainData_x(randperm(obj.N_sample , batch_size) , :);
                % E-step: compute q(h|v), Eq. (4)
                for k = 1 : obj.numOfGMs
                    trainData_q_h_batch(:,k) = obj.pMix(k) * mvnpdf(trainData_x_batch,obj.mu(k,:),obj.sigma(:,:,k));
                end
                trainData_q_h_batch = trainData_q_h_batch ./ sum(trainData_q_h_batch,2);
                % M-step: update the model parameters, Eqs. (8)-(10)
                obj.pMix = mean(trainData_q_h_batch);
                sumTrainDataQH = sum(trainData_q_h_batch)';
                obj.mu = trainData_q_h_batch' * trainData_x_batch ./ sumTrainDataQH;
                for k = 1 : obj.numOfGMs
                    diagMat(findDiag) = trainData_q_h_batch(:,k); % responsibilities on the diagonal
                    obj.sigma(:,:,k) = (trainData_x_batch - obj.mu(k,:))' * diagMat * (trainData_x_batch - obj.mu(k,:)) / sumTrainDataQH(k);
                end
            end
        end
        function obj = predict(obj) % infer responsibilities for the training data
            batch_size = obj.N_sample;
            indexRandBatch = randperm(obj.N_sample , batch_size);
            testData_x_batch = obj.trainData_x(indexRandBatch , :);
            % compute q(h|v), Eq. (4)
            trainData_q_h_batch = zeros(batch_size,obj.numOfGMs);
            for k = 1 : obj.numOfGMs
                trainData_q_h_batch(:,k) = obj.pMix(k) * mvnpdf(testData_x_batch,obj.mu(k,:),obj.sigma(:,:,k));
            end
            trainData_q_h_batch = trainData_q_h_batch ./ sum(trainData_q_h_batch,2);
            obj.trainData_q_h(indexRandBatch,:) = trainData_q_h_batch;
        end
        function obj = trainInit(obj,trainData_x) % initialize with kmeans
            obj.isInited = 1;
            obj.trainData_x = trainData_x;
            labelMat = k_means(obj.trainData_x,obj.numOfGMs,100);
            labelCenter = obj.trainData_x' * labelMat ./ sum(labelMat); % per-cluster means
            obj.mu = labelCenter';
            for k = 1 : obj.numOfGMs
                obj.sigma(:,:,k) = cov(obj.trainData_x(labelMat(:,k)==1,:));
            end
        end
    end
end
function yy = k_means(x0 , k , iter)
% kmeans with the output changed to the one-hot label matrix yy, used to initialize the GMM
[n , p] = size(x0);
x = repmat(x0 , 1 , k);
%% randomly assign initial cluster centers
c = repmat( reshape( x( randperm(n , k) , 1:p)' , 1 , k * p) , n , 1);
%% summation matrix: sums squared differences within each block of p columns
sum_mat = kron(eye(k) , ones(p , 1));
%% iterate
for i = 1 : iter
    distance = ( x - c ).^2 * sum_mat;            % squared distance to each center
    yy = ( distance == min(distance , [] , 2) );  % one-hot assignment to nearest center
    c0 = x' * yy ./ repmat(sum(yy) , k * p , 1);  % recompute centers from assignments
    c = repmat( reshape(c0(1 : p , :) , 1 , k * p) , n , 1);
end
end
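A minimal usage sketch (the synthetic data and iteration count are our choices; it assumes GMM.m and k_means.m are on the MATLAB path and that the Statistics toolbox provides mvnpdf, as the class already requires):

% Two synthetic 2-D clusters
data = [randn(500,2)*0.5 + 2; randn(500,2)*0.8 - 1];
gmm = GMM(2, 2);            % pDim = 2 visible dimensions, 2 Gaussian components
gmm = gmm.train(data, 50);  % 50 minibatch EM iterations (kmeans init runs first)
gmm = gmm.predict();        % fills gmm.trainData_q_h with responsibilities q(h|v)
disp(gmm.pMix);             % learned mixing proportions, roughly [0.5 0.5]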