3. DBN
3.1 Generative Model
A deep belief network (DBN) is a generative model used to generate samples that follow a particular distribution. The hidden variables describe the higher-order correlations among the observable variables. Given training data drawn from a distribution p(v), a deep belief network can be obtained by training on it.
To generate a sample, first run sufficiently many steps of Gibbs sampling between the top two layers; once thermal equilibrium is reached, this yields a sample h^{(L−1)}. Then the distribution of the hidden variables in each lower layer is computed in turn. Since, given the values of the layer above, the variables in the layer below are conditionally independent, they can be sampled independently. In this way, sampling proceeds top-down layer by layer from layer L−1, finally producing a sample at the observable layer.
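The sampling procedure above can be sketched in a few lines of NumPy. The layer sizes and the random weights here are made-up placeholders standing in for a trained DBN, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy layer sizes, top to bottom: h^(2) -- h^(1) -- v.
n_top, n_mid, n_vis = 4, 3, 6
W_top = rng.normal(scale=0.1, size=(n_top, n_mid))  # top RBM: h^(2) <-> h^(1)
W_gen = rng.normal(scale=0.1, size=(n_mid, n_vis))  # generative weights h^(1) -> v
b_top = np.zeros(n_top)
b_mid = np.zeros(n_mid)
b_vis = np.zeros(n_vis)

# 1) Gibbs sampling in the top RBM until (approximate) thermal equilibrium,
#    producing a sample of the second-to-top layer.
h1 = rng.integers(0, 2, size=n_mid).astype(float)
for _ in range(100):
    h2 = (rng.random(n_top) < sigmoid(b_top + W_top @ h1)).astype(float)
    h1 = (rng.random(n_mid) < sigmoid(b_mid + h2 @ W_top)).astype(float)

# 2) One top-down pass: given the layer above, the units below are
#    conditionally independent, so each can be sampled on its own.
v = (rng.random(n_vis) < sigmoid(b_vis + h1 @ W_gen)).astype(float)
```

With more intermediate layers, step 2 is simply repeated once per layer until the observable layer is reached.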
3.2 Parameter Learning
The most direct way to train a deep belief network is to maximize the likelihood of the marginal distribution p(v) of the observable variables on the training set. However, the relationships among the hidden variables h in a DBN are very complex, and because of the credit assignment problem, direct learning is hard. Even for a simple single-layer sigmoid belief network:
p(v = 1 ∣ h) = σ(b + w^T h)
when the observable variables are given, the joint posterior p(h|v) of the hidden variables no longer factorizes into independent terms, so estimating the posterior of all hidden variables is hard. Early deep belief networks approximated this posterior with Monte Carlo or variational methods, but both are inefficient, which made parameter learning difficult.
To train a deep belief network effectively, each layer's sigmoid belief network is converted into a restricted Boltzmann machine (RBM). The advantage is that in an RBM the posterior of the hidden variables is factorized into independent terms, which makes sampling easy. A DBN can thus be viewed as a stack of RBMs from bottom to top, where the hidden layer of the l-th RBM serves as the visible layer of the (l+1)-th RBM. Moreover, the DBN can then be trained quickly in a layer-wise fashion: starting from the bottom, only one layer is trained at a time, up to the top layer.
The training of a deep belief network therefore consists of two stages: layer-wise pre-training and fine-tuning. Layer-wise pre-training first initializes the model parameters to good values, and traditional machine learning methods then fine-tune them.
3.2.1 Layer-wise Pre-training
With layer-wise training, training the deep belief network reduces to training multiple restricted Boltzmann machines one after another. Concretely, the RBM of each layer is trained bottom-up. Suppose the first l−1 RBMs have already been trained; the bottom-up conditional probabilities of the hidden variables can then be computed:
p(h^{(i)} ∣ h^{(i−1)}) = σ(b^{(i)} + W^{(i)} h^{(i−1)}), 1 ≤ i ≤ l−1
Following the order v = h^{(0)} → ⋯ → h^{(l−1)}, a set of samples of h^{(l−1)} can be generated, denoted H^{(l−1)} = {h^{(l−1,1)}, …, h^{(l−1,M)}}. Then h^{(l−1)} and h^{(l)} together form a restricted Boltzmann machine, and H^{(l−1)} is used as the training set to fully train the l-th RBM.
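The bottom-up generation of H^{(l−1)} followed by a training step of the next RBM can be sketched as follows. The layer sizes, the "already trained" W1, and the single CD-1 update are illustrative assumptions; a real run would iterate the update for many epochs:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stack: v (6 units) -> h^(1) (4 units) -> h^(2) (3 units).
# W1 stands in for an already-trained first RBM; W2 is the next one to train.
V = rng.integers(0, 2, size=(10, 6)).astype(float)  # training set, 10 samples
W1 = rng.normal(scale=0.1, size=(6, 4))
W2 = rng.normal(scale=0.1, size=(4, 3))

# Bottom-up sampling v = h^(0) -> h^(1): this yields H^(1), the
# training set for the second RBM.
H1 = (rng.random((10, 4)) < sigmoid(V @ W1)).astype(float)

# One CD-1 step on the second RBM, with H^(1) as its visible data.
ph = sigmoid(H1 @ W2)                        # p(h^(2) | h^(1))
h = (rng.random(ph.shape) < ph).astype(float)
nv = sigmoid(h @ W2.T)                       # reconstruction of h^(1)
nh = sigmoid(nv @ W2)
W2 += 0.1 * (H1.T @ ph - nv.T @ nh) / len(H1)
```

Each layer only ever sees samples produced by the layers below it, which is what makes the greedy layer-by-layer scheme possible.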
Extensive practice has shown that layer-wise pre-training produces very good initial parameter values, greatly reducing the difficulty of learning the model.
3.2.2 Fine-tuning
After pre-training, and depending on the specific task (supervised or unsupervised learning), a traditional global learning algorithm fine-tunes the network so that the model converges to a better local optimum.
Fine-tuning as a generative model: except for the top restricted Boltzmann machine, the weights between the remaining layers are split into upward recognition weights W′ and downward generative weights W. The recognition weights are used to compute posterior probabilities, while the generative weights define the model. The recognition weights are initialized as W′^{(l)} = (W^{(l)})^T.
Deep belief networks are generally fine-tuned with the contrastive wake-sleep algorithm:
1. Wake phase (recognition): using the external input (the observable variables) and the upward recognition weights, compute the posterior probability of the hidden variables at each layer and sample from it; then update the downward generative weights so as to maximize the posterior probability of the variables in the layer below.
2. Sleep phase (generation): starting from a sample at the top layer and using the downward generative weights, compute the posterior probability at each layer top-down and sample from it; then update the upward recognition weights so as to maximize the posterior probability of the variables in the layer above.
3. Alternate the wake and sleep phases until convergence.
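The two phases can be sketched for a minimal two-layer sigmoid belief network. The layer sizes, the uniform top-level prior, and the plain delta-rule updates are simplifying assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two-layer sigmoid belief net: hidden h (3 units) -> visible v (5 units).
# G holds the downward generative weights; R the upward recognition
# weights, initialized as G's transpose as described in the text.
G = rng.normal(scale=0.1, size=(3, 5))
R = G.T.copy()
lr = 0.05
V = rng.integers(0, 2, size=(20, 5)).astype(float)  # toy observations

for _ in range(10):
    # Wake phase: recognize h from the data with R, then adjust G so the
    # generative pass assigns higher probability to the observed layer.
    h = (rng.random((20, 3)) < sigmoid(V @ R)).astype(float)
    G += lr * h.T @ (V - sigmoid(h @ G)) / len(V)

    # Sleep phase: dream (h', v') from the generative model, then adjust R
    # so the recognition pass recovers the dreamed h'.
    h_d = rng.integers(0, 2, size=(20, 3)).astype(float)  # uniform top prior
    v_d = (rng.random((20, 5)) < sigmoid(h_d @ G)).astype(float)
    R += lr * v_d.T @ (h_d - sigmoid(v_d @ R)) / len(V)
```

The wake step trains the generative weights on real data paired with recognized hidden states, while the sleep step trains the recognition weights on fantasy data whose hidden causes are known exactly.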
Fine-tuning as a discriminative model: one application of a deep belief network is as a pre-training model for a deep neural network, providing the network's initial weights. In this case only the upward recognition weights are needed, and the network is used as a discriminative model.
4. Code Demonstration
import numpy

numpy.seterr(all='ignore')


def sigmoid(x):
    return 1. / (1 + numpy.exp(-x))


def softmax(x):
    e = numpy.exp(x - numpy.max(x))  # subtract the max to prevent overflow
    if e.ndim == 1:
        return e / numpy.sum(e, axis=0)
    else:
        return e / numpy.array([numpy.sum(e, axis=1)]).T  # ndim == 2
class DBN(object):
    def __init__(self, input=None, label=None,
                 n_ins=2, hidden_layer_sizes=[3, 3], n_outs=2,
                 numpy_rng=None):
        self.x = input
        self.y = label
        self.sigmoid_layers = []
        self.rbm_layers = []
        self.n_layers = len(hidden_layer_sizes)  # = len(self.rbm_layers)

        if numpy_rng is None:
            numpy_rng = numpy.random.RandomState(1234)

        assert self.n_layers > 0

        # construct the multi-layer stack
        for i in range(self.n_layers):
            # input size of this layer
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layer_sizes[i - 1]

            # input of this layer
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].sample_h_given_v()

            # construct the sigmoid layer
            sigmoid_layer = HiddenLayer(input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layer_sizes[i],
                                        numpy_rng=numpy_rng,
                                        activation=sigmoid)
            self.sigmoid_layers.append(sigmoid_layer)

            # construct the RBM layer; W and b are shared with the sigmoid layer
            rbm_layer = RBM(input=layer_input,
                            n_visible=input_size,
                            n_hidden=hidden_layer_sizes[i],
                            W=sigmoid_layer.W,
                            hbias=sigmoid_layer.b)
            self.rbm_layers.append(rbm_layer)

        # output layer using logistic regression
        self.log_layer = LogisticRegression(input=self.sigmoid_layers[-1].sample_h_given_v(),
                                            label=self.y,
                                            n_in=hidden_layer_sizes[-1],
                                            n_out=n_outs)

        # finetune cost: the negative log likelihood of the logistic regression layer
        self.finetune_cost = self.log_layer.negative_log_likelihood()
    def pretrain(self, lr=0.1, k=1, epochs=100):
        # pre-train layer by layer, bottom-up
        for i in range(self.n_layers):
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[i - 1].sample_h_given_v(layer_input)

            rbm = self.rbm_layers[i]
            for epoch in range(epochs):
                rbm.contrastive_divergence(lr=lr, k=k, input=layer_input)
    def finetune(self, lr=0.1, epochs=100):
        layer_input = self.sigmoid_layers[-1].sample_h_given_v()

        # train the logistic regression output layer
        epoch = 0
        done_looping = False
        while (epoch < epochs) and (not done_looping):
            self.log_layer.train(lr=lr, input=layer_input)
            lr *= 0.95  # decay the learning rate
            epoch += 1

    def predict(self, x):
        layer_input = x
        for i in range(self.n_layers):
            sigmoid_layer = self.sigmoid_layers[i]
            layer_input = sigmoid_layer.output(input=layer_input)
        out = self.log_layer.predict(layer_input)
        return out
class HiddenLayer(object):
    def __init__(self, input, n_in, n_out,
                 W=None, b=None, numpy_rng=None, activation=numpy.tanh):
        if numpy_rng is None:
            numpy_rng = numpy.random.RandomState(1234)

        if W is None:
            a = 1. / n_in
            W = numpy.array(numpy_rng.uniform(  # initialize W uniformly in [-a, a]
                low=-a,
                high=a,
                size=(n_in, n_out)))

        if b is None:
            b = numpy.zeros(n_out)  # initialize bias to 0

        self.numpy_rng = numpy_rng
        self.input = input
        self.W = W
        self.b = b
        self.activation = activation

    def output(self, input=None):
        if input is not None:
            self.input = input
        linear_output = numpy.dot(self.input, self.W) + self.b
        return (linear_output if self.activation is None
                else self.activation(linear_output))

    def sample_h_given_v(self, input=None):
        if input is not None:
            self.input = input
        v_mean = self.output()
        h_sample = self.numpy_rng.binomial(size=v_mean.shape,
                                           n=1,
                                           p=v_mean)
        return h_sample
class RBM(object):
    def __init__(self, input=None, n_visible=2, n_hidden=3,
                 W=None, hbias=None, vbias=None, numpy_rng=None):
        self.n_visible = n_visible  # number of units in the visible (input) layer
        self.n_hidden = n_hidden    # number of units in the hidden layer

        if numpy_rng is None:
            numpy_rng = numpy.random.RandomState(1234)  # random number generator

        if W is None:
            a = 1. / n_visible
            W = numpy.array(numpy_rng.uniform(  # initialize the weight matrix uniformly
                low=-a,
                high=a,
                size=(n_visible, n_hidden)))

        if hbias is None:
            hbias = numpy.zeros(n_hidden)  # initialize hidden bias to 0

        if vbias is None:
            vbias = numpy.zeros(n_visible)  # initialize visible bias to 0

        self.numpy_rng = numpy_rng
        self.input = input
        self.W = W
        self.hbias = hbias
        self.vbias = vbias
    def contrastive_divergence(self, lr=0.1, k=1, input=None):
        '''CD-k'''
        if input is not None:
            self.input = input

        ph_mean, ph_sample = self.sample_h_given_v(self.input)
        chain_start = ph_sample

        for step in range(k):
            if step == 0:
                nv_means, nv_samples, \
                    nh_means, nh_samples = self.gibbs_hvh(chain_start)
            else:
                nv_means, nv_samples, \
                    nh_means, nh_samples = self.gibbs_hvh(nh_samples)

        self.W += lr * (numpy.dot(self.input.T, ph_sample)
                        - numpy.dot(nv_samples.T, nh_means))
        self.vbias += lr * numpy.mean(self.input - nv_samples, axis=0)
        self.hbias += lr * numpy.mean(ph_sample - nh_means, axis=0)
    def sample_h_given_v(self, v0_sample):
        h1_mean = self.propup(v0_sample)
        h1_sample = self.numpy_rng.binomial(size=h1_mean.shape,  # discrete: binomial
                                            n=1,
                                            p=h1_mean)
        return [h1_mean, h1_sample]

    def sample_v_given_h(self, h0_sample):
        v1_mean = self.propdown(h0_sample)
        v1_sample = self.numpy_rng.binomial(size=v1_mean.shape,  # discrete: binomial
                                            n=1,
                                            p=v1_mean)
        return [v1_mean, v1_sample]

    def propup(self, v):
        # returns the activation probabilities of the hidden layer,
        # one probability per hidden unit
        pre_sigmoid_activation = numpy.dot(v, self.W) + self.hbias
        return sigmoid(pre_sigmoid_activation)

    def propdown(self, h):
        pre_sigmoid_activation = numpy.dot(h, self.W.T) + self.vbias
        return sigmoid(pre_sigmoid_activation)

    def gibbs_hvh(self, h0_sample):
        v1_mean, v1_sample = self.sample_v_given_h(h0_sample)
        h1_mean, h1_sample = self.sample_h_given_v(v1_sample)
        return [v1_mean, v1_sample, h1_mean, h1_sample]
    def get_reconstruction_cross_entropy(self):
        pre_sigmoid_activation_h = numpy.dot(self.input, self.W) + self.hbias
        sigmoid_activation_h = sigmoid(pre_sigmoid_activation_h)
        pre_sigmoid_activation_v = numpy.dot(sigmoid_activation_h, self.W.T) + self.vbias
        sigmoid_activation_v = sigmoid(pre_sigmoid_activation_v)
        cross_entropy = -numpy.mean(
            numpy.sum(self.input * numpy.log(sigmoid_activation_v) +
                      (1 - self.input) * numpy.log(1 - sigmoid_activation_v),
                      axis=1))
        return cross_entropy

    def reconstruct(self, v):
        h = sigmoid(numpy.dot(v, self.W) + self.hbias)
        reconstructed_v = sigmoid(numpy.dot(h, self.W.T) + self.vbias)
        return reconstructed_v
class LogisticRegression(object):
    def __init__(self, input, label, n_in, n_out):
        self.x = input
        self.y = label
        self.W = numpy.zeros((n_in, n_out))  # initialize W to 0
        self.b = numpy.zeros(n_out)          # initialize bias to 0

    def train(self, lr=0.1, input=None, L2_reg=0.00):
        if input is not None:
            self.x = input
        p_y_given_x = softmax(numpy.dot(self.x, self.W) + self.b)
        d_y = self.y - p_y_given_x
        self.W += lr * numpy.dot(self.x.T, d_y) - lr * L2_reg * self.W
        self.b += lr * numpy.mean(d_y, axis=0)

    def negative_log_likelihood(self):
        softmax_activation = softmax(numpy.dot(self.x, self.W) + self.b)
        cross_entropy = -numpy.mean(
            numpy.sum(self.y * numpy.log(softmax_activation) +
                      (1 - self.y) * numpy.log(1 - softmax_activation),
                      axis=1))
        return cross_entropy

    def predict(self, x):
        return softmax(numpy.dot(x, self.W) + self.b)
def test_dbn(pretrain_lr=0.1, pretraining_epochs=1000, k=1,
             finetune_lr=0.1, finetune_epochs=200):
    x = numpy.array([[1, 1, 1, 0, 0, 0],
                     [1, 0, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [0, 0, 1, 1, 1, 0],
                     [0, 0, 1, 1, 0, 0],
                     [0, 0, 1, 1, 1, 0]])
    y = numpy.array([[1, 0],
                     [1, 0],
                     [1, 0],
                     [0, 1],
                     [0, 1],
                     [0, 1]])
    rng = numpy.random.RandomState(123)

    # construct the DBN
    dbn = DBN(input=x, label=y, n_ins=6, hidden_layer_sizes=[3, 3], n_outs=2, numpy_rng=rng)

    # pre-training (unsupervised); k is the number of Gibbs steps in CD-k
    dbn.pretrain(lr=pretrain_lr, k=k, epochs=pretraining_epochs)

    # fine-tuning (supervised)
    dbn.finetune(lr=finetune_lr, epochs=finetune_epochs)

    # test on an unseen sample
    x = numpy.array([1, 1, 0, 0, 0, 0])
    print(dbn.predict(x))


if __name__ == "__main__":
    test_dbn()
Output: [0.72344411 0.27655589]