1 Review of Neural Networks
This chapter is mainly a summary of the previous book, Deep Learning from Scratch; for a refresher see https://blog.csdn.net/qq_45461377/article/details/124915144?spm=1001.2014.3001.5501
2 Natural Language and Distributed Representations of Words
2.1 What Is Natural Language Processing?
Natural language processing (NLP) is the technology that lets computers process natural language. Natural language is "soft" and flexible, while computer (programming) languages are rigid. Techniques for getting computers to understand and process natural language currently fall into three main categories: (1) thesaurus-based methods; (2) count-based methods; (3) inference-based methods (word2vec). This chapter introduces the first two.
2.2 Thesaurus
A thesaurus is a manually built dictionary of relationships between words: words with similar meanings are grouped close together, and relationships between words, such as hypernym-hyponym (superior-subordinate) relations, are recorded as well.
The best-known thesaurus is WordNet.
Problems with thesauruses: (1) they are hard to keep up to date as language changes; (2) they are costly to build by hand; (3) they cannot express subtle differences in meaning between words.
2.3 Count-Based Methods
Count-based methods use a corpus to process natural language.
2.3.1 Preprocessing a Corpus with Python
text='You say goodbye and I say hello.'
text=text.lower()
text=text.replace('.',' .')
print(text)
Output: you say goodbye and i say hello .
words=text.split(' ')
print(words)
Output: ['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']
word_to_id={}
id_to_word={}
for word in words:
if word not in word_to_id:
new_id=len(word_to_id)
word_to_id[word]=new_id
id_to_word[new_id]=word
print(id_to_word)
print(word_to_id)
Output: {0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}
print(id_to_word[1])
print(word_to_id['hello'])
Output: say
5
import numpy as np
corpus=[word_to_id[w] for w in words]
corpus=np.array(corpus)
print(corpus)
Output: [0 1 2 3 4 1 5 6]
Combining the steps above into a single preprocess() function:
def preprocess(text):
text=text.lower()
text=text.replace('.',' .')
words=text.split(' ')  # split the text into a list of words
word_to_id={}
id_to_word={}
for word in words:
if word not in word_to_id:
new_id=len(word_to_id)
word_to_id[word]=new_id
id_to_word[new_id]=word
corpus=[word_to_id[w] for w in words]
return corpus, word_to_id, id_to_word
text='You say goodbye and I say hello.'
corpus, word_to_id, id_to_word=preprocess(text)
2.3.2 Distributed Representation of Words
By analogy with colors, which can be represented as vectors (e.g. RGB), words can also be represented as vectors.
2.3.3 The Distributional Hypothesis
The distributional hypothesis says that the meaning of a word is determined by the words around it.
The 'context' of a word is the set of words surrounding it.
The 'window size' is how many surrounding words are taken on each side.
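As a tiny illustration (a sketch, not from the book; `get_context` is a hypothetical helper): with window_size=1, the context of the first 'say' is one word on each side.
```python
# Hypothetical helper for illustration: collect the context words around position idx.
def get_context(words, idx, window_size=1):
    left = words[max(0, idx - window_size):idx]
    right = words[idx + 1:idx + 1 + window_size]
    return left + right

words = 'you say goodbye and i say hello .'.split(' ')
print(get_context(words, 1))  # ['you', 'goodbye']
```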
2.3.4 Co-occurrence Matrix
print(corpus)
print(id_to_word)
Output: [0, 1, 2, 3, 4, 1, 5, 6]
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
def create_co_matrix(corpus,vocab_size,window_size=1):
corpus_size=len(corpus)
co_matrix=np.zeros((vocab_size,vocab_size),dtype=np.int32)
for idx, word_id in enumerate(corpus):
for i in range(1,window_size+1):
left_idx=idx-i
right_idx=idx+i
if left_idx>=0:
left_word_id=corpus[left_idx]
co_matrix[word_id,left_word_id]+=1
if right_idx<corpus_size:
right_word_id=corpus[right_idx]
co_matrix[word_id,right_word_id]+=1
return co_matrix
print(create_co_matrix(corpus,vocab_size=len(word_to_id)))
Output: [[0 1 0 0 0 0 0]
[1 0 1 0 1 1 0]
[0 1 0 1 0 0 0]
[0 0 1 0 1 0 0]
[0 1 0 1 0 0 0]
[0 1 0 0 0 0 1]
[0 0 0 0 0 1 0]]
2.3.5 Similarity Between Vectors
Cosine similarity measures the extent to which two vectors point in the same direction.
The L2 norm is the square root of the sum of the squares of a vector's elements.
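In formulas, for vectors x and y:
$$ \mathrm{similarity}(\mathbf{x},\mathbf{y})=\frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}=\frac{x_1y_1+\cdots+x_ny_n}{\sqrt{x_1^2+\cdots+x_n^2}\;\sqrt{y_1^2+\cdots+y_n^2}} $$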
def cos_similarity(x,y):
nx=x/np.sqrt(np.sum(x**2))
ny=y/np.sqrt(np.sum(y**2))
return np.dot(nx,ny)
The version below adds a small eps so that a zero vector does not cause division by zero:
def cos_similarity(x,y,eps=1e-8):
nx=x/np.sqrt(np.sum(x**2)+eps)
ny=y/np.sqrt(np.sum(y**2)+eps)
return np.dot(nx,ny)
C=create_co_matrix(corpus,vocab_size=len(word_to_id))
c0=C[word_to_id['you']]
c1=C[word_to_id['i']]
print(cos_similarity(c0,c1))
Output: 0.7071067758832467
2.3.6 Ranking Similar Words
Define a function that, given a query word, displays the most similar words in descending order of similarity.
def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
# retrieve the query word
if query not in word_to_id:
print('%s is not found' % query)
return
print('\n[query] ' + query)
query_id = word_to_id[query]
query_vec = word_matrix[query_id]
# compute the cosine similarity with every word
vocab_size = len(id_to_word)
similarity = np.zeros(vocab_size)
for i in range(vocab_size):
similarity[i] = cos_similarity(word_matrix[i], query_vec)
# output the values in descending order of cosine similarity
count = 0
for i in (-1 * similarity).argsort():
if id_to_word[i] == query:
continue
print(' %s: %s' % (id_to_word[i], similarity[i]))
count += 1
if count >= top:
return
x=np.array([100,-20,2])
x.argsort()
(-x).argsort()
Output (of (-x).argsort()): array([0, 2, 1])
most_similar('you', word_to_id, id_to_word, C)
Output:
[query] you
goodbye: 0.7071067758832467
i: 0.7071067758832467
hello: 0.7071067758832467
say: 0.0
and: 0.0
2.4 Improving the Count-Based Method
This section improves on the count-based method introduced above.
2.4.1 Pointwise Mutual Information
Pointwise mutual information (PMI) measures how related two words are.
Positive PMI (PPMI) handles the case where two words never co-occur by clamping negative PMI values to zero.
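For reference, with C(x) and C(y) the word counts, C(x, y) the co-occurrence count, and N the total number of co-occurrences:
$$ \mathrm{PMI}(x,y)=\log_2\frac{P(x,y)}{P(x)P(y)}=\log_2\frac{C(x,y)\cdot N}{C(x)\,C(y)},\qquad \mathrm{PPMI}(x,y)=\max\bigl(0,\ \mathrm{PMI}(x,y)\bigr) $$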
def ppmi(C,verbose=False,eps=1e-8):
M=np.zeros_like(C,dtype=np.float32)
N=np.sum(C)
S=np.sum(C,axis=0)
total=C.shape[0]*C.shape[1]
cnt=0
for i in range(C.shape[0]):
for j in range(C.shape[1]):
pmi=np.log2(C[i,j]*N/(S[j]*S[i])+eps)  # PMI = log2( C(i,j)*N / (C(i)*C(j)) )
M[i,j]=max(0,pmi)
if verbose:
cnt+=1
if cnt%(total//100+1)==0:
print('%.1f%% done'%(100*cnt/total))
return M
W=ppmi(C)
print(W)
Output: [[0. 1.8073549 0. 0. 0. 0. 0. ]
[5.807355 0. 4.807355 0. 4.807355 4.807355 0. ]
[0. 2.807355 0. 3.807355 0. 0. 0. ]
[0. 0. 3.807355 0. 3.807355 0. 0. ]
[0. 2.807355 0. 3.807355 0. 0. 0. ]
[0. 2.807355 0. 0. 0. 0. 4.807355 ]
[0. 0. 0. 0. 0. 2.807355 0. ]]
The problem with the PPMI matrix: as the vocabulary of the corpus grows, so does the dimensionality of every word vector.
2.4.2 Dimensionality Reduction
Dimensionality reduction means lowering the dimensionality of the vectors while keeping the important information.
It can turn a sparse matrix into a dense one.
The book uses singular value decomposition (SVD) as its example; SVD can decompose any matrix into the product of three matrices.
2.4.3 Dimensionality Reduction with SVD
U,S,V=np.linalg.svd(W)
print(C[0]) # co-occurrence matrix
print(W[0]) # PPMI matrix
print(U[0]) # word vector after SVD
print(U[0,:2]) # reduced to 2 dimensions
Output: [0 1 0 0 0 0 0]
[0. 1.8073549 0. 0. 0. 0. 0. ]
[ 0.0000000e+00 1.7478172e-01 3.8205594e-02 -1.1102230e-16
-1.1102230e-16 -9.8386568e-01 -1.2717195e-16]
[0. 0.17478172]
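As a quick sanity check of the "product of three matrices" claim above (a minimal sketch; note that NumPy's svd already returns V in transposed, row form):
```python
# U, S, V come from np.linalg.svd(W) above; the reconstruction is U @ diag(S) @ V.
# U * S scales the columns of U by the singular values, i.e. U @ np.diag(S).
print(np.allclose(np.dot(U * S, V), W, atol=1e-5))  # expected: True
```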
import matplotlib.pyplot as plt
for word, word_id in word_to_id.items():
plt.annotate(word, (U[word_id,0], U[word_id,1]))
plt.scatter(U[:,0],U[:,1],alpha=0.5)
plt.show()
2.4.4 The PTB Dataset
from dataset import ptb
corpus,word_to_id,id_to_word=ptb.load_data('train')
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])
Output: corpus size: 929589
corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]
id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz
word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426
2.4.5 Evaluation on the PTB Dataset
window_size = 2
wordvec_size = 100
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)
print('calculating SVD ...')
try:
# truncated SVD (fast!)
from sklearn.utils.extmath import randomized_svd
U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
random_state=None)
except ImportError:
# SVD (slow)
U, S, V = np.linalg.svd(W)
word_vecs = U[:, :wordvec_size]
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
Output (truncated):
counting co-occurrence …
calculating PPMI …
1.0% done
2.0% done
3.0% done
4.0% done
5.0% done
6.0% done
7.0% done
8.0% done
9.0% done
10.0% done
11.0% done
12.0% done
13.0% done
14.0% done
15.0% done
16.0% done
17.0% done
18.0% done
19.0% done
20.0% done
21.0% done
22.0% done
23.0% done
…
motor: 0.7355839014053345
squibb: 0.6746079921722412
transmission: 0.6652272343635559
eli: 0.6599057912826538
2.5 Summary
In this chapter we traced the development of vector representations of words: from preprocessing, to the co-occurrence matrix, to the PPMI matrix that expresses relationships between words, and finally to the dimensionality-reduced matrix obtained with SVD. Learning to work with word vectors is what the code in this part emphasizes.
3 word2vec
3.1 Inference-Based Methods and Neural Networks
The previous chapter mentioned two popular families of methods for word vectors: count-based and inference-based. Both rest on the distributional hypothesis. This chapter covers the inference-based approach.
3.1.1 Problems with Count-Based Methods
They consume a large amount of computational resources and time.
Count-based methods process the entire training data in one go, whereas inference-based methods learn step by step from mini-batches of the data.
3.1.2 Overview of Inference-Based Methods
Distributional hypothesis: the meaning of a word is formed by the words around it.
3.1.3 Processing Words in a Neural Network
Words are represented as one-hot vectors and then fed into a MatMul (fully connected) layer.
import numpy as np
c=np.array([[1,0,0,0,0,0,0]]) # input (one-hot vector)
W=np.random.randn(7,3) # weights
h=np.dot(c,W) # hidden node
print(h)
Output: [[-0.70770067 -0.12952097 0.2498067 ]]
class MatMul:
def __init__(self, W):
self.params = [W]
self.grads = [np.zeros_like(W)]
self.x = None
def forward(self, x):
W, = self.params
out = np.dot(x, W)
self.x = x
return out
def backward(self, dout):
W, = self.params
dx = np.dot(dout, W.T)
dW = np.dot(self.x.T, dout)
self.grads[0][...] = dW
return dx
layer=MatMul(W)
h=layer.forward(c)
print(h)
Output: [[-0.70770067 -0.12952097 0.2498067 ]]
Both approaches give the same result, but for building the models that follow we use the second approach, the MatMul layer.
3.2 A Simple word2vec
This chapter uses the CBOW model as the running example.
3.2.1 Inference with the CBOW Model
The CBOW model is a neural network that predicts a target word from its context; the flow is input layer (encoding) → hidden layer → (decoding) → output layer.
# context data for the sample
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])
# initialize the weights
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)
# create the layers
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)
# forward pass
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h)
print(s)
Output: [[-1.65562246 1.56105405 -0.34916877 1.04647459 2.04680116 -1.805466
0.90635499]]
3.2.2 Training the CBOW Model
The model is a multi-class classifier, so it is trained with the softmax function and the cross-entropy error.
3.2.3 word2vec Weights and Distributed Representations
word2vec uses the input-side weight matrix W_in as the final distributed representation of the words.
3.3 Preparing the Training Data
3.3.1 Contexts and Targets
When a context is fed into the network, we want the probability of the corresponding target word to be high.
def create_contexts_target(corpus, window_size=1):
target = corpus[window_size:-window_size]
contexts = []
for idx in range(window_size, len(corpus)-window_size):
cs = []
for t in range(-window_size, window_size + 1):
if t == 0:
continue
cs.append(corpus[idx + t])
contexts.append(cs)
return np.array(contexts), np.array(target)
contexts,target=create_contexts_target(corpus, window_size=1)
print(contexts)
print(target)
Output: [[0 2]
[1 3]
[2 4]
[3 1]
[4 5]
[1 6]]
[1 2 3 4 1 5]
3.3.2 Converting to One-Hot Representations
The convert_one_hot() function converts the contexts and targets into one-hot representations.
def convert_one_hot(corpus, vocab_size):
N = corpus.shape[0]
if corpus.ndim == 1:
one_hot = np.zeros((N, vocab_size), dtype=np.int32)
for idx, word_id in enumerate(corpus):
one_hot[idx, word_id] = 1
elif corpus.ndim == 2:
C = corpus.shape[1]
one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
for idx_0, word_ids in enumerate(corpus):
for idx_1, word_id in enumerate(word_ids):
one_hot[idx_0, idx_1, word_id] = 1
return one_hot
vocab_size=len(word_to_id)
contexts=convert_one_hot(contexts,vocab_size)
target=convert_one_hot(target,vocab_size)
3.4 Implementing the CBOW Model
def softmax(x):
if x.ndim == 2:
x = x - x.max(axis=1, keepdims=True)
x = np.exp(x)
x /= x.sum(axis=1, keepdims=True)
elif x.ndim == 1:
x = x - np.max(x)
x = np.exp(x) / np.sum(np.exp(x))
return x
def cross_entropy_error(y, t):
if y.ndim == 1:
t = t.reshape(1, t.size)
y = y.reshape(1, y.size)
if t.size == y.size:
t = t.argmax(axis=1)
batch_size = y.shape[0]
return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
class SoftmaxWithLoss:
def __init__(self):
self.params, self.grads = [], []
self.y = None
self.t = None
def forward(self, x, t):
self.t = t
self.y = softmax(x)
if self.t.size == self.y.size:
self.t = self.t.argmax(axis=1)
loss = cross_entropy_error(self.y, self.t)
return loss
def backward(self, dout=1):
batch_size = self.t.shape[0]
dx = self.y.copy()
dx[np.arange(batch_size), self.t] -= 1
dx *= dout
dx = dx / batch_size
return dx
class SimpleCBOW:
def __init__(self, vocab_size, hidden_size):
V, H = vocab_size, hidden_size
# initialize the weights
W_in = 0.01 * np.random.randn(V, H).astype('f')
W_out = 0.01 * np.random.randn(H, V).astype('f')
# create the layers
self.in_layer0 = MatMul(W_in)
self.in_layer1 = MatMul(W_in)
self.out_layer = MatMul(W_out)
self.loss_layer = SoftmaxWithLoss()
# collect all weights and gradients into lists
layers = [self.in_layer0, self.in_layer1, self.out_layer]
self.params, self.grads = [], []
for layer in layers:
self.params += layer.params
self.grads += layer.grads
# keep the distributed word representations as a member variable
self.word_vecs = W_in
def forward(self, contexts, target):
h0 = self.in_layer0.forward(contexts[:, 0])
h1 = self.in_layer1.forward(contexts[:, 1])
h = (h0 + h1) * 0.5
score = self.out_layer.forward(h)
loss = self.loss_layer.forward(score, target)
return loss
def backward(self, dout=1):
ds = self.loss_layer.backward(dout)
da = self.out_layer.backward(ds)
da *= 0.5
self.in_layer1.backward(da)
self.in_layer0.backward(da)
return None
class Adam:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.iter = 0
self.m = None
self.v = None
def update(self, params, grads):
if self.m is None:
self.m, self.v = [], []
for param in params:
self.m.append(np.zeros_like(param))
self.v.append(np.zeros_like(param))
self.iter += 1
lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)
for i in range(len(params)):
self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
import numpy
import time
import matplotlib.pyplot as plt
import numpy as np
def clip_grads(grads, max_norm):
total_norm = 0
for grad in grads:
total_norm += np.sum(grad ** 2)
total_norm = np.sqrt(total_norm)
rate = max_norm / (total_norm + 1e-6)
if rate < 1:
for grad in grads:
grad *= rate
def remove_duplicate(params, grads):
params, grads = params[:], grads[:] # copy list
while True:
find_flg = False
L = len(params)
for i in range(0, L - 1):
for j in range(i + 1, L):
if params[i] is params[j]:
grads[i] += grads[j]
find_flg = True
params.pop(j)
grads.pop(j)
elif params[i].ndim == 2 and params[j].ndim == 2 and \
params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
grads[i] += grads[j].T
find_flg = True
params.pop(j)
grads.pop(j)
if find_flg: break
if find_flg: break
if not find_flg: break
return params, grads
class Trainer:
def __init__(self, model, optimizer):
self.model = model
self.optimizer = optimizer
self.loss_list = []
self.eval_interval = None
self.current_epoch = 0
def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
data_size = len(x)
max_iters = data_size // batch_size
self.eval_interval = eval_interval
model, optimizer = self.model, self.optimizer
total_loss = 0
loss_count = 0
start_time = time.time()
for epoch in range(max_epoch):
idx = numpy.random.permutation(numpy.arange(data_size))
x = x[idx]
t = t[idx]
for iters in range(max_iters):
batch_x = x[iters*batch_size:(iters+1)*batch_size]
batch_t = t[iters*batch_size:(iters+1)*batch_size]
loss = model.forward(batch_x, batch_t)
model.backward()
params, grads = remove_duplicate(model.params, model.grads)
if max_grad is not None:
clip_grads(grads, max_grad)
optimizer.update(params, grads)
total_loss += loss
loss_count += 1
if (eval_interval is not None) and (iters % eval_interval) == 0:
avg_loss = total_loss / loss_count
elapsed_time = time.time() - start_time
print('| epoch %d | iter %d / %d | time %d[s] | loss %.2f'
% (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
self.loss_list.append(float(avg_loss))
total_loss, loss_count = 0, 0
self.current_epoch += 1
def plot(self, ylim=None):
x = numpy.arange(len(self.loss_list))
if ylim is not None:
plt.ylim(*ylim)
plt.plot(x, self.loss_list, label='train')
plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
plt.ylabel('loss')
plt.show()
window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000
text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)
vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)
model = SimpleCBOW(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()
word_vecs = model.word_vecs
for word_id, word in id_to_word.items():
print(word, word_vecs[word_id])
Output (truncated):
| epoch 1 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 2 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 3 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 4 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 5 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 6 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 7 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 8 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 9 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 10 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 11 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 12 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 13 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 14 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 15 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 16 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 17 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 18 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 19 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 20 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 21 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 22 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 23 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 24 | iter 1 / 2 | time 0[s] | loss 1.94
| epoch 25 | iter 1 / 2 | time 0[s] | loss 1.94
…
and [-0.95682913 1.5921552 -0.9855902 -1.0069962 -1.570275 ]
i [0.82757306 0.49116576 0.8163651 0.8544265 0.8013826 ]
hello [0.9561777 1.4145606 0.95824456 0.97314036 1.1887612 ]
. [-1.0302204 -1.3671738 -0.98632777 -1.0349017 1.4386774 ]
3.5 Additional Notes on word2vec
3.5.1 The CBOW Model and Probability
The posterior probability here is simply the conditional probability familiar from mathematics.
Pay attention to how the CBOW model's cross-entropy loss is derived; the training objective is to make this loss as small as possible.
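For a corpus w_1, …, w_T and a window size of 1, the loss minimized here can be written in the standard form:
$$ L=-\frac{1}{T}\sum_{t=1}^{T}\log P\bigl(w_t \mid w_{t-1},\,w_{t+1}\bigr) $$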
3.5.2 The skip-gram Model
It predicts the context from the target word.
Pay attention to how the skip-gram model's cross-entropy loss is derived.
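With the same notation, the skip-gram loss (window size 1) is:
$$ L=-\frac{1}{T}\sum_{t=1}^{T}\Bigl(\log P\bigl(w_{t-1}\mid w_t\bigr)+\log P\bigl(w_{t+1}\mid w_t\bigr)\Bigr) $$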
skip-gram usually yields better word vectors; CBOW trains faster.
3.5.3 Count-Based vs. Inference-Based
Count-based methods rely on word similarity; inference-based methods capture similarity as well as more complex patterns between words. In practice their results are hard to rank definitively, and the two approaches are closely related.
3.6 Summary
In this chapter we studied the inference-based approach to word vectors, word2vec, working through the code for CBOW, which predicts the middle word from its context. We also covered the idea behind skip-gram, which predicts the context from the middle word. Compared with count-based methods, the inference-based approach is better at recognizing and representing patterns in the data, but in its current form it is still very slow; the next chapter discusses the techniques that address this.
4 Speeding Up word2vec
The problem with the CBOW model of the previous chapter is that as the vocabulary grows, so does the amount of computation, making training very slow. This chapter introduces two speed-ups: (1) an Embedding layer; (2) a new loss function based on negative sampling.
4.1 Improving word2vec, Part 1
The product of the one-hot input and the weight matrix W_in -> replaced by an Embedding layer.
The product of the hidden layer and W_out plus the Softmax computation -> replaced by negative sampling.
This section introduces the Embedding layer.
4.1.1 The Embedding Layer
Replace the MatMul layer with an Embedding layer; a word's dense vector representation is called a word embedding.
4.1.2 Implementing the Embedding Layer
import numpy as np
W=np.arange(21).reshape(7,3)
print(W)
Output: [[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[15 16 17]
[18 19 20]]
print(W[2])
# Output: [6 7 8]
print(W[5])
# Output: [15 16 17]
idx=np.array([1,0,3,0])
print(W[idx])
"""
Output: [[ 3 4 5]
[ 0 1 2]
[ 9 10 11]
[ 0 1 2]]
"""
class Embedding:
def __init__(self,W):
self.params=[W]
self.grads=[np.zeros_like(W)]
self.idx=None
def forward(self,idx):
W,=self.params
self.idx=idx
out=W[idx]
return out
def backward(self,dout):
dW,=self.grads
dW[...]=0
for i,word_id in enumerate(self.idx):
dW[word_id]+=dout[i]  # accumulate gradients (the same word id may appear more than once)
return None
4.2 Improving word2vec, Part 2
Replacing Softmax with negative sampling keeps the amount of computation low and roughly constant no matter how large the vocabulary becomes.
4.2.1 The Bottleneck After the Hidden Layer
Two problems:
1. The product of the hidden-layer neurons and the weight matrix (W_out).
2. The computation of the Softmax layer.
4.2.2 From Multi-Class to Binary Classification
The central idea of negative sampling is to approximate multi-class classification with binary classification.
Example: ask the network, "when the context is you and goodbye, is the target word say?"
To answer this, it is enough to extract the column of W_out corresponding to say (its word vector) and take its inner product with the hidden-layer neurons.
4.2.3 The Sigmoid Function and Cross-Entropy Error
Multi-class classification: Softmax function + cross-entropy error.
Binary classification: Sigmoid function + cross-entropy error.
The binary and multi-class cross-entropy losses are essentially the same thing.
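For reference (standard formulas; they match the SigmoidWithLoss layer implemented later): with y = σ(x) the predicted probability and t ∈ {0, 1} the label,
$$ y=\sigma(x)=\frac{1}{1+e^{-x}},\qquad L=-\bigl(t\log y+(1-t)\log(1-y)\bigr),\qquad \frac{\partial L}{\partial x}=y-t $$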
4.2.4 Implementing the Multi-Class-to-Binary Conversion
1. In the input layer, Embedding replaces MatMul.
2. In the output layer, Embedding (plus a dot product) replaces MatMul, and sigmoid replaces softmax.
class EmbeddingDot:
def __init__(self, W):
self.embed = Embedding(W)
self.params = self.embed.params
self.grads = self.embed.grads
self.cache = None
def forward(self, h, idx):
target_W = self.embed.forward(idx)
out = np.sum(target_W * h, axis=1)
self.cache = (h, target_W)
return out
def backward(self, dout):
h, target_W = self.cache
dout = dout.reshape(dout.shape[0], 1)
dtarget_W = dout * h
self.embed.backward(dtarget_W)
dh = dout * target_W
return dh
4.2.5 Negative Sampling
So far the network has only learned the positive example, say.
This section deals with negative examples. Is every negative example worth learning from? No; using only a small number of sampled negative examples is exactly what "negative sampling" means.
Of course, the losses on these negative examples are added to the total loss as well.
4.2.6 The Sampling Method for Negative Sampling
To keep the computation manageable, the number of negative examples is limited to a small value (e.g. 5 or 10).
import numpy as np
np.random.choice(10)
# Output: 3
words=['you','say','goodbye','I','hellow','.']
np.random.choice(words)
# Output: 'say'
np.random.choice(words,size=5)
# Output: array(['hellow', '.', 'hellow', 'I', 'I'], dtype='<U7')
np.random.choice(words,size=5,replace=False)
# Output: array(['.', 'say', 'goodbye', 'hellow', 'you'], dtype='<U7')
p=[0.5,0.1,0.05,0.2,0.05,0.1]
np.random.choice(words,p=p)
# Output: 'you'
p=[0.7,0.29,0.01]
new_p=np.power(p,0.75)
new_p/=np.sum(new_p)
print(new_p)
# Output: [0.64196878 0.33150408 0.02652714]
import collections
class UnigramSampler:
def __init__(self, corpus, power, sample_size):
self.sample_size = sample_size
self.vocab_size = None
self.word_p = None
counts = collections.Counter()
for word_id in corpus:
counts[word_id] += 1
vocab_size = len(counts)
self.vocab_size = vocab_size
self.word_p = np.zeros(vocab_size)
for i in range(vocab_size):
self.word_p[i] = counts[i]
self.word_p = np.power(self.word_p, power)
self.word_p /= np.sum(self.word_p)
def get_negative_sample(self, target):
batch_size = target.shape[0]
negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)
for i in range(batch_size):
p = self.word_p.copy()
target_idx = target[i]
p[target_idx] = 0
p /= p.sum()
negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)
return negative_sample
corpus=np.array([0,1,2,3,4,1,2,3])
power=0.75
sample_size=2
sampler=UnigramSampler(corpus,power,sample_size)
target=np.array([1,3,0])
negative_sample=sampler.get_negative_sample(target)
print(negative_sample)
Output: [[4 3]
[2 1]
[4 3]]
4.2.7 Implementing Negative Sampling
class SigmoidWithLoss:
def __init__(self):
self.params, self.grads = [], []
self.loss = None
self.y = None
self.t = None
def forward(self, x, t):
self.t = t
self.y = 1 / (1 + np.exp(-x))
self.loss = cross_entropy_error(np.c_[1 - self.y, self.y], self.t)
return self.loss
def backward(self, dout=1):
batch_size = self.t.shape[0]
dx = (self.y - self.t) * dout / batch_size
return dx
class NegativeSamplingLoss:
def __init__(self, W, corpus, power=0.75, sample_size=5):
self.sample_size = sample_size
self.sampler = UnigramSampler(corpus, power, sample_size)
self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]
self.params, self.grads = [], []
for layer in self.embed_dot_layers:
self.params += layer.params
self.grads += layer.grads
def forward(self, h, target):
batch_size = target.shape[0]
negative_sample = self.sampler.get_negative_sample(target)
score = self.embed_dot_layers[0].forward(h, target)
correct_label = np.ones(batch_size, dtype=np.int32)
loss = self.loss_layers[0].forward(score, correct_label)
negative_label = np.zeros(batch_size, dtype=np.int32)
for i in range(self.sample_size):
negative_target = negative_sample[:, i]
score = self.embed_dot_layers[1 + i].forward(h, negative_target)
loss += self.loss_layers[1 + i].forward(score, negative_label)
return loss
def backward(self, dout=1):
dh = 0
for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
dscore = l0.backward(dout)
dh += l1.backward(dscore)
return dh
4.3 Training the Improved word2vec
4.3.1 Implementing the CBOW Model
class CBOW:
def __init__(self, vocab_size, hidden_size, window_size, corpus):
V, H = vocab_size, hidden_size
W_in = 0.01 * np.random.randn(V, H).astype('f')
W_out = 0.01 * np.random.randn(V, H).astype('f')
self.in_layers = []
for i in range(2 * window_size):
layer = Embedding(W_in)
self.in_layers.append(layer)
self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)
layers = self.in_layers + [self.ns_loss]
self.params, self.grads = [], []
for layer in layers:
self.params += layer.params
self.grads += layer.grads
self.word_vecs = W_in
def forward(self, contexts, target):
h = 0
for i, layer in enumerate(self.in_layers):
h += layer.forward(contexts[:, i])
h *= 1 / len(self.in_layers)
loss = self.ns_loss.forward(h, target)
return loss
def backward(self, dout=1):
dout = self.ns_loss.backward(dout)
dout *= 1 / len(self.in_layers)
for layer in self.in_layers:
layer.backward(dout)
return None
4.3.2 Training Code for the CBOW Model
import pickle
window_size = 3
hidden_size = 100
batch_size = 100
max_epoch = 10
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
model = CBOW(vocab_size, hidden_size, window_size, corpus)
# model = SkipGram(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)
trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()
word_vecs = model.word_vecs
params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl' # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
pickle.dump(params, f, -1)
4.3.3 Evaluating the CBOW Model
def normalize(x):
if x.ndim == 2:
s = np.sqrt((x * x).sum(1))
x /= s.reshape((s.shape[0], 1))
elif x.ndim == 1:
s = np.sqrt((x * x).sum())
x /= s
return x
def analogy(a, b, c, word_to_id, id_to_word, word_matrix, top=5, answer=None):
for word in (a, b, c):
if word not in word_to_id:
print('%s is not found' % word)
return
print('\n[analogy] ' + a + ':' + b + ' = ' + c + ':?')
a_vec, b_vec, c_vec = word_matrix[word_to_id[a]], word_matrix[word_to_id[b]], word_matrix[word_to_id[c]]
query_vec = b_vec - a_vec + c_vec
query_vec = normalize(query_vec)
similarity = np.dot(word_matrix, query_vec)
if answer is not None:
print("==>" + answer + ":" + str(np.dot(word_matrix[word_to_id[answer]], query_vec)))
count = 0
for i in (-1 * similarity).argsort():
if np.isnan(similarity[i]):
continue
if id_to_word[i] in (a, b, c):
continue
print(' {0}: {1}'.format(id_to_word[i], similarity[i]))
count += 1
if count >= top:
return
pkl_file = 'cbow_params.pkl'
# pkl_file = 'skipgram_params.pkl'
with open(pkl_file, 'rb') as f:
params = pickle.load(f)
word_vecs = params['word_vecs']
word_to_id = params['word_to_id']
id_to_word = params['id_to_word']
# most similar task
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
most_similar(query, word_to_id, id_to_word, word_vecs, top=5)
# analogy task
print('-'*50)
analogy('king', 'man', 'queen', word_to_id, id_to_word, word_vecs)
analogy('take', 'took', 'go', word_to_id, id_to_word, word_vecs)
analogy('car', 'cars', 'child', word_to_id, id_to_word, word_vecs)
analogy('good', 'better', 'bad', word_to_id, id_to_word, word_vecs)
4.4 Other Topics Related to word2vec
4.4.1 Applications of word2vec
Distributed word representations can be used for transfer learning, enabling NLP tasks such as text classification, text clustering, part-of-speech tagging, and sentiment analysis.
Another advantage of distributed representations is that they turn words into fixed-length vectors.
4.4.2 Evaluating Word Vectors
The usual evaluation tasks are word similarity and analogy problems.
Relevant findings:
1. Different models give different accuracy (choose the model that best fits the corpus).
2. The larger the corpus, the better the results (big data always helps).
3. The dimensionality of the word vectors should be moderate, neither too small nor too large.
4.5 Summary
This chapter optimized the input and output sides of CBOW for speed.
On the input side, an Embedding layer was introduced.
On the output side, the negative sampling loss was introduced.
The overall idea is extraction, computing only what is actually needed, which is what makes things fast.
5 RNN
RNNs were created to overcome the limitations of feedforward networks.
5.1 Probability and Language Models
5.1.1 word2vec from a Probabilistic Viewpoint
This part reviews the CBOW model (predicting the middle word from its context), derives the probability and loss when only the left-hand context window is used, and from there introduces language models.
5.1.2 Language Models
A language model gives the probability that a sequence of words occurs; applications include machine translation, speech recognition, and generating new sentences.
The probability of several events occurring together is their joint probability, and a joint probability can be decomposed into a product of conditional probabilities.
The posterior probability here is the probability of the target word conditioned on all the words to its left.
Figure 5-3 in the book shows these posterior probabilities: when the t-th word is the target, all words to its left form the context (the condition). For this reason a language model is also called a conditional language model.
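Written out, the decomposition referred to above is just the chain rule of probability:
$$ P(w_1,\dots,w_m)=\prod_{t=1}^{m}P\bigl(w_t\mid w_1,\dots,w_{t-1}\bigr) $$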
5.1.3 Can the CBOW Model Be Used as a Language Model?
The Markov property means that the future state depends only on the current state.
Using CBOW as a language model raises two problems: (1) the order of the context words is ignored; (2) if the context vectors are concatenated instead, the number of parameters grows in proportion to the context length.
RNNs solve both problems: they can handle time-series data of arbitrary length.
5.2 RNN
5.2.1 A Recurrent Neural Network
"Recurrent" means repeating and continuing.
An RNN remembers past data while updating itself with the newest data.
An RNN layer contains a loop: one branch of its output becomes its own input.
5.2.2 Unrolling the Loop
The unrolled RNN layers are all the same layer; at each time step the RNN layer receives the input for that step together with the output of the previous RNN step.
An RNN keeps feeding its own output back in as input (updating its state).
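Concretely, the state update computed at every step (and implemented in the RNN layer below) is:
$$ h_t=\tanh\bigl(h_{t-1}W_h+x_tW_x+b\bigr) $$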
5.2.3 Backpropagation Through Time
BPTT is backpropagation applied to the network unrolled along the time axis.
5.2.4 Truncated BPTT
In truncated BPTT, only the backward connections of the network are cut, and learning proceeds in units of these truncated blocks.
In other words, the forward pass is not truncated; only the backward pass is cut when computing the error.
5.3 Implementing an RNN
The layer that processes a single time step is called the RNN layer; the layer that processes T steps at once is called the Time RNN layer.
5.3.1 Implementing the RNN Layer
import numpy as np
class RNN:
def __init__(self, Wx, Wh, b):
self.params = [Wx, Wh, b]
self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
self.cache = None
def forward(self, x, h_prev):
Wx, Wh, b = self.params
t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
h_next = np.tanh(t)
self.cache = (x, h_prev, h_next)
return h_next
def backward(self, dh_next):
Wx, Wh, b = self.params
x, h_prev, h_next = self.cache
dt = dh_next * (1 - h_next ** 2)
db = np.sum(dt, axis=0)
dWh = np.dot(h_prev.T, dt)
dh_prev = np.dot(dt, Wh.T)
dWx = np.dot(x.T, dt)
dx = np.dot(dt, Wx.T)
self.grads[0][...] = dWx
self.grads[1][...] = dWh
self.grads[2][...] = db
return dx, dh_prev
5.3.2 Implementing the Time RNN Layer
class TimeRNN:
def __init__(self, Wx, Wh, b, stateful=False):
self.params = [Wx, Wh, b]
self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
self.layers = None
self.h, self.dh = None, None
self.stateful = stateful
def forward(self, xs):
Wx, Wh, b = self.params
N, T, D = xs.shape
D, H = Wx.shape
self.layers = []
hs = np.empty((N, T, H), dtype='f')
if not self.stateful or self.h is None:
self.h = np.zeros((N, H), dtype='f')
for t in range(T):
layer = RNN(*self.params)
self.h = layer.forward(xs[:, t, :], self.h)
hs[:, t, :] = self.h
self.layers.append(layer)
return hs
def backward(self, dhs):
Wx, Wh, b = self.params
N, T, H = dhs.shape
D, H = Wx.shape
dxs = np.empty((N, T, D), dtype='f')
dh = 0
grads = [0, 0, 0]
for t in reversed(range(T)):
layer = self.layers[t]
dx, dh = layer.backward(dhs[:, t, :] + dh)
dxs[:, t, :] = dx
for i, grad in enumerate(layer.grads):
grads[i] += grad
for i, grad in enumerate(grads):
self.grads[i][...] = grad
self.dh = dh
return dxs
def set_state(self, h):
self.h = h
def reset_state(self):
self.h = None
5.4 Implementing Layers for Time-Series Data
A language model based on an RNN is called an RNNLM.
5.4.1 Overview of the RNNLM
An RNNLM can "remember" the words seen so far and predict the next word on that basis.
5.4.2 Implementing the Time Layers
A layer that handles T time steps of data at once is called a "Time XX" layer.
The loss of the Time (SoftmaxWithLoss) layer is the average of the per-step losses.
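That is, if L_t denotes the loss at time step t:
$$ L=\frac{1}{T}\bigl(L_0+L_1+\cdots+L_{T-1}\bigr) $$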
5.5 Training and Evaluating the RNNLM
5.5.1 Implementing the RNNLM
class TimeEmbedding:
def __init__(self, W):
self.params = [W]
self.grads = [np.zeros_like(W)]
self.layers = None
self.W = W
def forward(self, xs):
N, T = xs.shape
V, D = self.W.shape
out = np.empty((N, T, D), dtype='f')
self.layers = []
for t in range(T):
layer = Embedding(self.W)
out[:, t, :] = layer.forward(xs[:, t])
self.layers.append(layer)
return out
def backward(self, dout):
N, T, D = dout.shape
grad = 0
for t in range(T):
layer = self.layers[t]
layer.backward(dout[:, t, :])
grad += layer.grads[0]
self.grads[0][...] = grad
return None
class TimeAffine:
def __init__(self, W, b):
self.params = [W, b]
self.grads = [np.zeros_like(W), np.zeros_like(b)]
self.x = None
def forward(self, x):
N, T, D = x.shape
W, b = self.params
rx = x.reshape(N*T, -1)
out = np.dot(rx, W) + b
self.x = x
return out.reshape(N, T, -1)
def backward(self, dout):
x = self.x
N, T, D = x.shape
W, b = self.params
dout = dout.reshape(N*T, -1)
rx = x.reshape(N*T, -1)
db = np.sum(dout, axis=0)
dW = np.dot(rx.T, dout)
dx = np.dot(dout, W.T)
dx = dx.reshape(*x.shape)
self.grads[0][...] = dW
self.grads[1][...] = db
return dx
class TimeSoftmaxWithLoss:
def __init__(self):
self.params, self.grads = [], []
self.cache = None
self.ignore_label = -1
def forward(self, xs, ts):
N, T, V = xs.shape
if ts.ndim == 3:
ts = ts.argmax(axis=2)
mask = (ts != self.ignore_label)
xs = xs.reshape(N * T, V)
ts = ts.reshape(N * T)
mask = mask.reshape(N * T)
ys = softmax(xs)
ls = np.log(ys[np.arange(N * T), ts])
ls *= mask
loss = -np.sum(ls)
loss /= mask.sum()
self.cache = (ts, ys, mask, (N, T, V))
return loss
def backward(self, dout=1):
ts, ys, mask, (N, T, V) = self.cache
dx = ys
dx[np.arange(N * T), ts] -= 1
dx *= dout
dx /= mask.sum()
dx *= mask[:, np.newaxis]
dx = dx.reshape((N, T, V))
return dx
class SimpleRnnlm:
def __init__(self, vocab_size, wordvec_size, hidden_size):
V, D, H = vocab_size, wordvec_size, hidden_size
rn = np.random.randn
embed_W = (rn(V, D) / 100).astype('f')
rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
rnn_b = np.zeros(H).astype('f')
affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
affine_b = np.zeros(V).astype('f')
self.layers = [
TimeEmbedding(embed_W),
TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
TimeAffine(affine_W, affine_b)
]
self.loss_layer = TimeSoftmaxWithLoss()
self.rnn_layer = self.layers[1]
self.params, self.grads = [], []
for layer in self.layers:
self.params += layer.params
self.grads += layer.grads
def forward(self, xs, ts):
for layer in self.layers:
xs = layer.forward(xs)
loss = self.loss_layer.forward(xs, ts)
return loss
def backward(self, dout=1):
dout = self.loss_layer.backward(dout)
for layer in reversed(self.layers):
dout = layer.backward(dout)
return dout
def reset_state(self):
self.rnn_layer.reset_state()
5.5.2 Evaluating Language Models
Perplexity is, intuitively, the reciprocal of the probability assigned to the data; the smaller the perplexity, the better.
The "branching factor" is the number of candidates for the next word.
When evaluating a language model, a smaller perplexity means a smaller branching factor, i.e. a better model.
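This is what the training code below computes: with L the cross-entropy loss averaged over the data,
$$ \text{perplexity}=e^{L} $$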
5.5.3 Training Code for the RNNLM
class SGD:
def __init__(self, lr=0.01):
self.lr = lr
def update(self, params, grads):
for i in range(len(params)):
params[i] -= self.lr * grads[i]
batch_size = 10
wordvec_size = 100
hidden_size = 100
time_size = 5
lr = 0.1
max_epoch = 100
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)
xs = corpus[:-1]
ts = corpus[1:]
data_size = len(xs)
print('corpus size: %d, vocabulary size: %d' % (corpus_size, vocab_size))
max_iters = data_size // (batch_size * time_size)
time_idx = 0
total_loss = 0
loss_count = 0
ppl_list = []
model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
jump = (corpus_size - 1) // batch_size
offsets = [i * jump for i in range(batch_size)]
for epoch in range(max_epoch):
for iter in range(max_iters):
batch_x = np.empty((batch_size, time_size), dtype='i')
batch_t = np.empty((batch_size, time_size), dtype='i')
for t in range(time_size):
for i, offset in enumerate(offsets):
batch_x[i, t] = xs[(offset + time_idx) % data_size]
batch_t[i, t] = ts[(offset + time_idx) % data_size]
time_idx += 1
loss = model.forward(batch_x, batch_t)
model.backward()
optimizer.update(model.params, model.grads)
total_loss += loss
loss_count += 1
ppl = np.exp(total_loss / loss_count)
print('| epoch %d | perplexity %.2f'
% (epoch+1, ppl))
ppl_list.append(float(ppl))
total_loss, loss_count = 0, 0
5.5.4 A Trainer Class for the RNNLM
class RnnlmTrainer:
def __init__(self, model, optimizer):
self.model = model
self.optimizer = optimizer
self.time_idx = None
self.ppl_list = None
self.eval_interval = None
self.current_epoch = 0
def get_batch(self, x, t, batch_size, time_size):
batch_x = np.empty((batch_size, time_size), dtype='i')
batch_t = np.empty((batch_size, time_size), dtype='i')
data_size = len(x)
jump = data_size // batch_size
offsets = [i * jump for i in range(batch_size)]
for time in range(time_size):
for i, offset in enumerate(offsets):
batch_x[i, time] = x[(offset + self.time_idx) % data_size]
batch_t[i, time] = t[(offset + self.time_idx) % data_size]
self.time_idx += 1
return batch_x, batch_t
def fit(self, xs, ts, max_epoch=10, batch_size=20, time_size=35,
max_grad=None, eval_interval=20):
data_size = len(xs)
max_iters = data_size // (batch_size * time_size)
self.time_idx = 0
self.ppl_list = []
self.eval_interval = eval_interval
model, optimizer = self.model, self.optimizer
total_loss = 0
loss_count = 0
start_time = time.time()
for epoch in range(max_epoch):
for iters in range(max_iters):
batch_x, batch_t = self.get_batch(xs, ts, batch_size, time_size)
loss = model.forward(batch_x, batch_t)
model.backward()
params, grads = remove_duplicate(model.params, model.grads)
if max_grad is not None:
clip_grads(grads, max_grad)
optimizer.update(params, grads)
total_loss += loss
loss_count += 1
if (eval_interval is not None) and (iters % eval_interval) == 0:
ppl = np.exp(total_loss / loss_count)
elapsed_time = time.time() - start_time
print('| epoch %d | iter %d / %d | time %d[s] | perplexity %.2f'
% (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, ppl))
self.ppl_list.append(float(ppl))
total_loss, loss_count = 0, 0
self.current_epoch += 1
batch_size = 10
wordvec_size = 100
hidden_size = 100
time_size = 5
lr = 0.1
max_epoch = 100
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)
xs = corpus[:-1]
ts = corpus[1:]
model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)
trainer.fit(xs, ts, max_epoch, batch_size, time_size)
5.6 Summary
This chapter introduced the RNN layer, the Time RNN layer, and the RNNLM. The key points are how an RNN works and how to understand its recurrent structure.
6 Gated RNNs
A plain RNN cannot learn long-range dependencies in time-series data well.
The gate structures of LSTM and GRU make it possible to learn such long-range dependencies.
6.1 Problems with RNNs
BPTT suffers from vanishing and exploding gradients.
6.1.1 Review of RNNs
A review of the RNN structure diagram and its computational graph.
6.1.2 Vanishing and Exploding Gradients
As we backpropagate further back in time, a simple RNN cannot keep the gradient from shrinking or growing.
6.1.3 Causes of Vanishing and Exploding Gradients
One cause of vanishing gradients is the tanh activation (its derivative is always at most 1); replacing it with ReLU can mitigate this kind of vanishing.
In the experiment below, the size of the gradient dh is measured by its L2 norm divided by the mini-batch size N.
N = 2
H = 3
T = 20
dh = np.ones((N, H))
np.random.seed(3)
Wh = np.random.randn(H, H)
#Wh = np.random.randn(H, H) * 0.5
norm_list = []
for t in range(T):
dh = np.dot(dh, Wh.T)
norm = np.sqrt(np.sum(dh**2)) / N
norm_list.append(norm)
print(norm_list)
Output: [2.4684068094579303, 3.3357049741610365, 4.783279375373182, 6.279587332087612, 8.080776465019053, 10.251163032292936, 12.936063506609896, 16.276861327786712, 20.45482961834598, 25.688972842084684, 32.25315718048336, 40.48895641683869, 50.8244073070191, 63.79612654485427, 80.07737014308985, 100.5129892205125, 126.16331847536823, 158.35920648258823, 198.7710796761195, 249.495615421267]
The gradient norm grows exponentially with the number of time steps, i.e. the gradient explodes.
N = 2
H = 3
T = 20
dh = np.ones((N, H))
np.random.seed(3)
#Wh = np.random.randn(H, H)
Wh = np.random.randn(H, H) * 0.5
norm_list = []
for t in range(T):
dh = np.dot(dh, Wh.T)
norm = np.sqrt(np.sum(dh**2)) / N
norm_list.append(norm)
print(norm_list)
Output: [1.2342034047289652, 0.8339262435402591, 0.5979099219216477, 0.39247420825547574, 0.2525242645318454, 0.16017442237957713, 0.10106299614538981, 0.06358148956166684, 0.03995083909833199, 0.025086887541098325, 0.015748611904532892, 0.009884999125204758, 0.006204151282595105, 0.003893806551809953, 0.002443767399386287, 0.0015337065005571365, 0.0009625497320203265, 0.0006040924319556741, 0.00037912574706291106, 0.00023793756048323344]
Here the gradient norm shrinks exponentially with the number of time steps, i.e. the gradient vanishes.
The matrix Wh is multiplied in T times. Its singular values measure how much it stretches the data: if the largest singular value is greater than 1 the gradient tends to explode; if it is smaller than 1 the gradient vanishes.
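A quick check of this (a sketch; it assumes the Wh defined in the snippets above is still in scope):
```python
# Singular values of Wh in descending order; the largest one decides the regime.
singular_values = np.linalg.svd(Wh, compute_uv=False)
print(singular_values[0])  # > 1 in the exploding case, < 1 after scaling Wh by 0.5
```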
6.1.4 Countermeasure Against Exploding Gradients
Set a threshold and clip the gradients against it (gradient clipping).
import numpy as np
dW1 = np.random.rand(3, 3) * 10
dW2 = np.random.rand(3, 3) * 10
grads = [dW1, dW2]
max_norm = 5.0
def clip_grads(grads, max_norm):
total_norm = 0
for grad in grads:
total_norm += np.sum(grad ** 2)
total_norm = np.sqrt(total_norm)
rate = max_norm / (total_norm + 1e-6)
if rate < 1:
for grad in grads:
grad *= rate
print('before:', dW1.flatten())
clip_grads(grads, max_norm)
print('after:', dW1.flatten())
Output: before: [6.49144048 2.78487283 6.76254902 5.90862817 0.23981882 5.58854088
2.59252447 4.15101197 2.83525082]
after: [1.49503731 0.64138134 1.55747605 1.36081038 0.05523244 1.28709139
0.59708178 0.95601551 0.65298384]
Given the gradients and the threshold, the function outputs the clipped gradients.
Gradient clipping addresses the exploding-gradient problem.
6.2 Vanishing Gradients and LSTM
6.2.1 The LSTM Interface
The difference between an LSTM and a plain RNN is the extra memory cell c, which is passed along only inside the LSTM layer and is not output to other layers.
6.2.2 The Structure of the LSTM Layer
The hidden state is still output at every time step; the memory cell is only handed from one time step to the next and is never output.
c_t and h_t have the same number of elements.
How far each gate opens is itself learned from the data.
6.2.3 The Output Gate
Here a gate is applied to tanh(c_t): for each element of tanh(c_t), the gate adjusts how important it is as the next hidden state. This gate is called the output gate. Its openness is computed from the input x_t and the previous state h_{t-1}.
tanh produces values in (-1, 1), representing the strength of some encoded information; sigmoid produces values in (0, 1), representing the proportion of data that is allowed to flow through.
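In formulas (matching the slicing of A in the forward code below):
$$ o=\sigma\bigl(x_tW_x^{(o)}+h_{t-1}W_h^{(o)}+b^{(o)}\bigr),\qquad h_t=o\odot\tanh(c_t) $$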
6.2.4 The Forget Gate
A gate added on the memory cell c_{t-1} that forgets unneeded memories is the forget gate.
6.2.5 New Memory
To add new information worth remembering to the memory cell, a new tanh node g is added.
6.2.6 The Input Gate
The gate applied to g is called the input gate.
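Putting the forget gate f, the new memory g, and the input gate i together, the memory cell update implemented in the forward pass below is:
$$ c_t=f\odot c_{t-1}+i\odot g $$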
6.2.7 Gradient Flow in the LSTM
The reason the LSTM's memory-cell path does not cause vanishing (or exploding) gradients: along that path, backpropagation involves element-wise (Hadamard) products rather than repeated matrix multiplications.
LSTM stands for long short-term memory: a short-term memory that can be maintained for a long time.
6.3 Implementing the LSTM
The difference between the LSTM and RNN implementations is the extra memory cell:
def sigmoid(x):
return 1 / (1 + np.exp(-x))
class LSTM:
def __init__(self, Wx, Wh, b):
self.params = [Wx, Wh, b]
self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
self.cache = None
def forward(self, x, h_prev, c_prev):
Wx, Wh, b = self.params
N, H = h_prev.shape
A = np.dot(x, Wx) + np.dot(h_prev, Wh) + b
f = A[:, :H]
g = A[:, H:2*H]
i = A[:, 2*H:3*H]
o = A[:, 3*H:]
f = sigmoid(f)
g = np.tanh(g)
i = sigmoid(i)
o = sigmoid(o)
c_next = f * c_prev + g * i
h_next = o * np.tanh(c_next)
self.cache = (x, h_prev, c_prev, i, f, g, o, c_next)
return h_next, c_next
def backward(self, dh_next, dc_next):
Wx, Wh, b = self.params
x, h_prev, c_prev, i, f, g, o, c_next = self.cache
tanh_c_next = np.tanh(c_next)
ds = dc_next + (dh_next * o) * (1 - tanh_c_next ** 2)
dc_prev = ds * f
di = ds * g
df = ds * c_prev
do = dh_next * tanh_c_next
dg = ds * i
di *= i * (1 - i)
df *= f * (1 - f)
do *= o * (1 - o)
dg *= (1 - g ** 2)
dA = np.hstack((df, dg, di, do))
dWh = np.dot(h_prev.T, dA)
dWx = np.dot(x.T, dA)
db = dA.sum(axis=0)
self.grads[0][...] = dWx
self.grads[1][...] = dWh
self.grads[2][...] = db
dx = np.dot(dA, Wx.T)
dh_prev = np.dot(dA, Wh.T)
return dx, dh_prev, dc_prev
TimeLSTM consists of T LSTM layers:
class TimeLSTM:
def __init__(self, Wx, Wh, b, stateful=False):
self.params = [Wx, Wh, b]
self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
self.layers = None
self.h, self.c = None, None
self.dh = None
self.stateful = stateful
def forward(self, xs):
Wx, Wh, b = self.params
N, T, D = xs.shape
H = Wh.shape[0]
self.layers = []
hs = np.empty((N, T, H), dtype='f')
if not self.stateful or self.h is None:
self.h = np.zeros((N, H), dtype='f')
if not self.stateful or self.c is None:
self.c = np.zeros((N, H), dtype='f')
for t in range(T):
layer = LSTM(*self.params)
self.h, self.c = layer.forward(xs[:, t, :], self.h, self.c)
hs[:, t, :] = self.h
self.layers.append(layer)
return hs
def backward(self, dhs):
Wx, Wh, b = self.params
N, T, H = dhs.shape
D = Wx.shape[0]
dxs = np.empty((N, T, D), dtype='f')
dh, dc = 0, 0
grads = [0, 0, 0]
for t in reversed(range(T)):
layer = self.layers[t]
dx, dh, dc = layer.backward(dhs[:, t, :] + dh, dc)
dxs[:, t, :] = dx
for i, grad in enumerate(layer.grads):
grads[i] += grad
for i, grad in enumerate(grads):
self.grads[i][...] = grad
self.dh = dh
return dxs
def set_state(self, h, c=None):
self.h, self.c = h, c
def reset_state(self):
self.h, self.c = None, None
6.4 A Language Model Using LSTM
import os
def to_cpu(x):
import numpy
if type(x) == numpy.ndarray:
return x
return np.asnumpy(x)
def to_gpu(x):
import cupy
if type(x) == cupy.ndarray:
return x
return cupy.asarray(x)
class BaseModel:
def __init__(self):
self.params, self.grads = None, None
def forward(self, *args):
raise NotImplementedError
def backward(self, *args):
raise NotImplementedError
def save_params(self, file_name=None):
if file_name is None:
file_name = self.__class__.__name__ + '.pkl'
params = [p.astype(np.float16) for p in self.params]
if GPU:
params = [to_cpu(p) for p in params]
with open(file_name, 'wb') as f:
pickle.dump(params, f)
def load_params(self, file_name=None):
if file_name is None:
file_name = self.__class__.__name__ + '.pkl'
if '/' in file_name:
file_name = file_name.replace('/', os.sep)
if not os.path.exists(file_name):
raise IOError('No file: ' + file_name)
with open(file_name, 'rb') as f:
params = pickle.load(f)
params = [p.astype('f') for p in params]
if GPU:
params = [to_gpu(p) for p in params]
for i, param in enumerate(self.params):
param[...] = params[i]
class Rnnlm(BaseModel):
def __init__(self, vocab_size=10000, wordvec_size=100, hidden_size=100):
V, D, H = vocab_size, wordvec_size, hidden_size
rn = np.random.randn
embed_W = (rn(V, D) / 100).astype('f')
lstm_Wx = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
lstm_b = np.zeros(4 * H).astype('f')
affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
affine_b = np.zeros(V).astype('f')
self.layers = [
TimeEmbedding(embed_W),
TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True),
TimeAffine(affine_W, affine_b)
]
self.loss_layer = TimeSoftmaxWithLoss()
self.lstm_layer = self.layers[1]
self.params, self.grads = [], []
for layer in self.layers:
self.params += layer.params
self.grads += layer.grads
def predict(self, xs):
for layer in self.layers:
xs = layer.forward(xs)
return xs
def forward(self, xs, ts):
score = self.predict(xs)
loss = self.loss_layer.forward(score, ts)
return loss
def backward(self, dout=1):
dout = self.loss_layer.backward(dout)
for layer in reversed(self.layers):
dout = layer.backward(dout)
return dout
def reset_state(self):
self.lstm_layer.reset_state()
import sys
def eval_perplexity(model, corpus, batch_size=10, time_size=35):
print('evaluating perplexity ...')
corpus_size = len(corpus)
total_loss = 0
max_iters = (corpus_size - 1) // (batch_size * time_size)
jump = (corpus_size - 1) // batch_size
for iters in range(max_iters):
xs = np.zeros((batch_size, time_size), dtype=np.int32)
ts = np.zeros((batch_size, time_size), dtype=np.int32)
time_offset = iters * time_size
offsets = [time_offset + (i * jump) for i in range(batch_size)]
for t in range(time_size):
for i, offset in enumerate(offsets):
xs[i, t] = corpus[(offset + t) % corpus_size]
ts[i, t] = corpus[(offset + t + 1) % corpus_size]
try:
loss = model.forward(xs, ts, train_flg=False)
except TypeError:
loss = model.forward(xs, ts)
total_loss += loss
sys.stdout.write('\r%d / %d' % (iters, max_iters))
sys.stdout.flush()
print('')
ppl = np.exp(total_loss / max_iters)
return ppl
batch_size = 20
wordvec_size = 100
hidden_size = 100
time_size = 35
lr = 20.0
max_epoch = 4
max_grad = 0.25
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_test, _, _ = ptb.load_data('test')
vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]
model = Rnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)
trainer.fit(xs, ts, max_epoch, batch_size, time_size, max_grad,
eval_interval=20)
trainer.plot(ylim=(0, 500))
model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)
model.save_params()
eval_perplexity evaluates the perplexity on a corpus; as training proceeds, the perplexity goes down.
6.5 Further Improving the RNNLM
6.5.1 Stacking LSTM Layers
When a high-accuracy model is needed, deepening the network by stacking LSTM layers is effective, although the number of layers is itself a hyperparameter.
6.5.2 Suppressing Overfitting with Dropout
RNNs overfit more easily than ordinary feedforward networks.
Three ways to suppress overfitting: (1) increase the training data; (2) reduce the model's complexity; (3) regularization, which penalizes model complexity.
Dropout belongs to the third category: it randomly selects some neurons and ignores them, which improves the network's ability to generalize.
For an LSTM, dropout should be inserted along the depth direction, not along the time direction.
The "mask" is the random boolean array that decides whether each value is passed through.
6.5.3 Weight Tying
To tie the weights, simply use the transpose of the Embedding layer's weight matrix as the Affine layer's weight.
Weight tying reduces the number of parameters that have to be learned.
6.5.4 Implementing a Better RNNLM
class TimeDropout:
def __init__(self, dropout_ratio=0.5):
self.params, self.grads = [], []
self.dropout_ratio = dropout_ratio
self.mask = None
self.train_flg = True
def forward(self, xs):
if self.train_flg:
flg = np.random.rand(*xs.shape) > self.dropout_ratio
scale = 1 / (1.0 - self.dropout_ratio)
self.mask = flg.astype(np.float32) * scale
return xs * self.mask
else:
return xs
def backward(self, dout):
return dout * self.mask
class BetterRnnlm(BaseModel):
def __init__(self, vocab_size=10000, wordvec_size=650,
hidden_size=650, dropout_ratio=0.5):
V, D, H = vocab_size, wordvec_size, hidden_size
rn = np.random.randn
embed_W = (rn(V, D) / 100).astype('f')
lstm_Wx1 = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
lstm_Wh1 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
lstm_b1 = np.zeros(4 * H).astype('f')
lstm_Wx2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
lstm_Wh2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
lstm_b2 = np.zeros(4 * H).astype('f')
affine_b = np.zeros(V).astype('f')
self.layers = [
TimeEmbedding(embed_W),
TimeDropout(dropout_ratio),
TimeLSTM(lstm_Wx1, lstm_Wh1, lstm_b1, stateful=True),
TimeDropout(dropout_ratio),
TimeLSTM(lstm_Wx2, lstm_Wh2, lstm_b2, stateful=True),
TimeDropout(dropout_ratio),
TimeAffine(embed_W.T, affine_b) # weight tying!!
]
self.loss_layer = TimeSoftmaxWithLoss()
self.lstm_layers = [self.layers[2], self.layers[4]]
self.drop_layers = [self.layers[1], self.layers[3], self.layers[5]]
self.params, self.grads = [], []
for layer in self.layers:
self.params += layer.params
self.grads += layer.grads
def predict(self, xs, train_flg=False):
for layer in self.drop_layers:
layer.train_flg = train_flg
for layer in self.layers:
xs = layer.forward(xs)
return xs
def forward(self, xs, ts, train_flg=True):
score = self.predict(xs, train_flg)
loss = self.loss_layer.forward(score, ts)
return loss
def backward(self, dout=1):
dout = self.loss_layer.backward(dout)
for layer in reversed(self.layers):
dout = layer.backward(dout)
return dout
def reset_state(self):
for layer in self.lstm_layers:
layer.reset_state()
batch_size = 20
wordvec_size = 650
hidden_size = 650
time_size = 35
lr = 20.0
max_epoch = 40
max_grad = 0.25
dropout = 0.5
corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_val, _, _ = ptb.load_data('val')
corpus_test, _, _ = ptb.load_data('test')
if config.GPU:
corpus = to_gpu(corpus)
corpus_val = to_gpu(corpus_val)
corpus_test = to_gpu(corpus_test)
vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]
model = BetterRnnlm(vocab_size, wordvec_size, hidden_size, dropout)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)
best_ppl = float('inf')
for epoch in range(max_epoch):
trainer.fit(xs, ts, max_epoch=1, batch_size=batch_size,
time_size=time_size, max_grad=max_grad)
model.reset_state()
ppl = eval_perplexity(model, corpus_val)
print('valid perplexity: ', ppl)
if best_ppl > ppl:
best_ppl = ppl
model.save_params()
else:
lr /= 4.0
optimizer.lr = lr
model.reset_state()
print('-' * 50)
model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)
The code above implements the three techniques (deeper layers, dropout, and weight tying) and applies them to the PTB dataset.
6.6 Summary
Plain RNNs often suffer from vanishing and exploding gradients.
Gradient clipping handles exploding gradients; LSTM and GRU handle vanishing gradients.
Compared with a plain RNN, the LSTM adds gate mechanisms. An LSTM has three gates: the input gate, the forget gate, and the output gate, and each gate has its own dedicated weights.
The three techniques for a better language model (deeper layers, dropout, and weight tying) were implemented.