Reading Notes: Deep Learning Advanced - Natural Language Processing (a.k.a. the second "fish book")

Preface

I still remember the enthusiasm I had half a year ago when I started reading the first fish book and taking notes on it. Now I want to go further, and the author and translator have been impressively productive: the second fish book, Deep Learning Advanced - Natural Language Processing, is already out. It has eight chapters, and my goal is to finish one chapter and its notes per week. Deep learning, here we go!
While working through the first book I read each chapter first and summarized afterwards. This time I will summarize as I read, and will probably go back for a second pass after finishing. Hopefully reading and reflecting at the same time will be more efficient.


Chapter 1: Review of Neural Networks

This chapter is mainly a recap of the previous book. For a refresher, see https://blog.csdn.net/qq_45461377/article/details/124915144?spm=1001.2014.3001.5501

Chapter 2: Natural Language and Distributed Representations of Words

2.1 What Is Natural Language Processing

Natural language processing (NLP) is the technology that lets computers handle natural language. Natural language is "soft" and ambiguous, whereas programming languages are rigid. The main approaches for making computers understand natural language fall into three groups: 1. thesaurus-based methods; 2. count-based methods; 3. inference-based methods (word2vec). This chapter covers the first two.

2.2 Thesaurus

A thesaurus is a hand-crafted dictionary of word relations: words with similar meanings are grouped together, and hierarchical relations such as broader/narrower terms are also encoded.
The best-known thesaurus is WordNet.
Problems with thesauruses: 1. they struggle to keep up with changes in language; 2. building them is labor-intensive; 3. they cannot express subtle differences between words.

2.3 Count-Based Methods

Count-based methods use a corpus to process natural language.

2.3.1 Preprocessing the Corpus with Python

text='You say goodbye and I say hello.'
text=text.lower()
text=text.replace('.',' .')
print(text)

Output: you say goodbye and i say hello .

words=text.split(' ')
print(words)

Output: ['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

word_to_id={}
id_to_word={}

for word in words:
    if word not in word_to_id:
        new_id=len(word_to_id)
        word_to_id[word]=new_id
        id_to_word[new_id]=word

print(id_to_word)
print(word_to_id)

Output: {0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}

print(id_to_word[1])
print(word_to_id['hello'])

Output: say
5

import numpy as np
corpus=[word_to_id[w] for w in words]
corpus=np.array(corpus)
print(corpus)

Output: [0 1 2 3 4 1 5 6]
Combining the steps above into a single function gives:

def preprocess(text):
    text=text.lower()
    text=text.replace('.',' .')
    words=text.split(' ')
    word_to_id={}
    id_to_word={}

    for word in words:
        if word not in word_to_id:
            new_id=len(word_to_id)
            word_to_id[word]=new_id
            id_to_word[new_id]=word
    
    corpus=[word_to_id[w] for w in words]

    return corpus, word_to_id, id_to_word

text='You say goodbye and I say hello.'
corpus, word_to_id, id_to_word=preprocess(text)

2.3.2 Distributed Representation of Words

Just as a color can be represented by a vector (e.g. RGB values), a word can also be represented by a vector.

2.3.3 The Distributional Hypothesis

The distributional hypothesis states that the meaning of a word is determined by the words around it.
The context of a word means the words surrounding it.
The window size is how many surrounding words are included in the context.

2.3.4 Co-occurrence Matrix

print(corpus)
print(id_to_word)

Output: [0, 1, 2, 3, 4, 1, 5, 6]
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}

def create_co_matrix(corpus,vocab_size,window_size=1):
    corpus_size=len(corpus)
    co_matrix=np.zeros((vocab_size,vocab_size),dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1,window_size+1):
            left_idx=idx-i
            right_idx=idx+i

            if left_idx>=0:
                left_word_id=corpus[left_idx]
                co_matrix[word_id,left_word_id]+=1

            if right_idx<corpus_size:
                right_word_id=corpus[right_idx]
                co_matrix[word_id,right_word_id]+=1
    return co_matrix
print(create_co_matrix(corpus,vocab_size=len(word_to_id)))

Output: [[0 1 0 0 0 0 0]
[1 0 1 0 1 1 0]
[0 1 0 1 0 0 0]
[0 0 1 0 1 0 0]
[0 1 0 1 0 0 0]
[0 1 0 0 0 0 1]
[0 0 0 0 0 1 0]]

2.3.5 Similarity Between Vectors

Cosine similarity measures the extent to which two vectors point in the same direction.
The L2 norm of a vector is the square root of the sum of the squares of its elements.
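In formula form, for word vectors x and y:
$$ \mathrm{similarity}(\boldsymbol{x},\boldsymbol{y})=\frac{\boldsymbol{x}\cdot\boldsymbol{y}}{\lVert\boldsymbol{x}\rVert\,\lVert\boldsymbol{y}\rVert}=\frac{x_1y_1+\cdots+x_ny_n}{\sqrt{x_1^2+\cdots+x_n^2}\,\sqrt{y_1^2+\cdots+y_n^2}} $$
The code below implements exactly this; the second version adds a tiny eps so that an all-zero vector does not cause division by zero.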

def cos_similarity(x,y):
    nx=x/np.sqrt(np.sum(x**2))
    ny=y/np.sqrt(np.sum(y**2))
    return np.dot(nx,ny)
def cos_similarity(x,y,eps=1e-8):  # improved version: eps avoids division by zero for zero vectors
    nx=x/np.sqrt(np.sum(x**2)+eps)
    ny=y/np.sqrt(np.sum(y**2)+eps)
    return np.dot(nx,ny)
C=create_co_matrix(corpus,vocab_size=len(word_to_id))

c0=C[word_to_id['you']]
c1=C[word_to_id['i']]
print(cos_similarity(c0,c1))

Output: 0.7071067758832467

2.3.6 Ranking Similar Words

Define a function that, given a query word, lists the most similar words in descending order of similarity.

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    # look up the query word
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    # compute the cosine similarity with every word in the vocabulary
    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)
    
    # print the words in descending order of cosine similarity
    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
x=np.array([100,-20,2])
x.argsort()
(-x).argsort()

Output: array([0, 2, 1])

most_similar('you', word_to_id, id_to_word, C)

Output:
[query] you
goodbye: 0.7071067758832467
i: 0.7071067758832467
hello: 0.7071067758832467
say: 0.0
and: 0.0

2.4 Improving the Count-Based Method

This section improves on the raw co-occurrence counts from the previous section.

2.4.1 Pointwise Mutual Information

Pointwise mutual information (PMI) measures how strongly two words are related, correcting for the fact that very frequent words co-occur with everything.
Positive PMI (PPMI) clips PMI at zero, which also avoids the problem of taking the log when two words never co-occur.
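In formula form, with C(x, y) the co-occurrence count, C(x) and C(y) the individual counts, and N the total number of counts in the corpus:
$$ \mathrm{PMI}(x,y)=\log_2\frac{P(x,y)}{P(x)P(y)}=\log_2\frac{C(x,y)\cdot N}{C(x)\,C(y)}, \qquad \mathrm{PPMI}(x,y)=\max\bigl(0,\ \mathrm{PMI}(x,y)\bigr) $$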

def ppmi(C,verbose=False,eps=1e-8):
    M=np.zeros_like(C,dtype=np.float32)
    N=np.sum(C)
    S=np.sum(C,axis=0)
    total=C.shape[0]*C.shape[1]
    cnt=0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            pmi=np.log2(C[i,j]*N/(S[j]*S[i])+eps)
            M[i,j]=max(0,pmi)

            if verbose:
                cnt+=1
                if cnt%(total//100+1)==0:
                    print('%.1f%% done'%(100*cnt/total))
    
    return M
W=ppmi(C)
print(W)

Output:
[[0.        1.8073549 0.        0.        0.        0.        0.       ]
 [1.8073549 0.        0.8073549 0.        0.8073549 0.8073549 0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.        1.8073549 0.        1.8073549 0.        0.       ]
 [0.        0.8073549 0.        1.8073549 0.        0.        0.       ]
 [0.        0.8073549 0.        0.        0.        0.        2.807355 ]
 [0.        0.        0.        0.        0.        2.807355  0.       ]]
The problem with the PPMI matrix: as the vocabulary grows, so does the dimensionality of each word vector, and most of its entries are zero (the matrix is sparse).

2.4.2 Dimensionality Reduction

Dimensionality reduction keeps the important information while lowering the number of dimensions of the vectors.
It turns a sparse matrix into a dense one.
The book uses singular value decomposition (SVD), which can factorize any matrix into the product of three matrices.
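SVD in formula form: any matrix X is factorized as
$$ \boldsymbol{X}=\boldsymbol{U}\boldsymbol{S}\boldsymbol{V}^{\mathrm{T}} $$
where U and V are orthogonal matrices and S is a diagonal matrix whose singular values are sorted in descending order. Keeping only the first k columns of U gives k-dimensional dense word vectors.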

2.4.3 Dimensionality Reduction with SVD

U,S,V=np.linalg.svd(W)
print(C[0]) # co-occurrence matrix
print(W[0]) # PPMI matrix
print(U[0]) # word vector after SVD
print(U[0,:2]) # reduced to two dimensions

Output: [0 1 0 0 0 0 0]
[0. 1.8073549 0. 0. 0. 0. 0. ]
[ 0.0000000e+00 1.7478172e-01 3.8205594e-02 -1.1102230e-16
-1.1102230e-16 -9.8386568e-01 -1.2717195e-16]
[0. 0.17478172]

import matplotlib.pyplot as plt
for word, word_id in word_to_id.items():
    plt.annotate(word, (U[word_id,0], U[word_id,1]))

plt.scatter(U[:,0],U[:,1],alpha=0.5)
plt.show()

2.4.4 The PTB Dataset

from dataset import ptb

corpus,word_to_id,id_to_word=ptb.load_data('train')
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])

Output: corpus size: 929589
corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]

id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz

word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426

2.4.5 Evaluation on the PTB Dataset

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting  co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    # truncated SVD (fast!)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    # SVD (slow)
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

Output (truncated):
counting co-occurrence ...
calculating PPMI ...
1.0% done
2.0% done
3.0% done
...
23.0% done
...
motor: 0.7355839014053345
squibb: 0.6746079921722412
transmission: 0.6652272343635559
eli: 0.6599057912826538

2.5 Summary

In this chapter we traced the development of word vector representations: from preprocessing, to the co-occurrence matrix, to the PPMI matrix that captures word relations, and finally to the dimensionality-reduced vectors obtained with SVD. Learning how to handle word vectors in code is the key point of this part.

Chapter 3: word2vec

3.1 Inference-Based Methods and Neural Networks

The previous chapter introduced the two popular ways to obtain word vectors: count-based methods and inference-based methods. Both rest on the distributional hypothesis. This chapter covers the inference-based approach.

3.1.1 Problems with Count-Based Methods

They consume a huge amount of computation and time on a large corpus.
Count-based methods process the entire training data in one pass, whereas inference-based methods learn incrementally from parts of the data (mini-batches).

3.1.2 Overview of Inference-Based Methods

Distributional hypothesis: the meaning of a word is formed by the words around it; the model infers a target word from its context.

3.1.3 Processing Words in a Neural Network

Words are represented as one-hot vectors, which are then fed into a MatMul (fully connected) layer.

import numpy as np

c=np.array([[1,0,0,0,0,0,0]]) # input (one-hot vector)
W=np.random.randn(7,3) # weight matrix
h=np.dot(c,W) # hidden-layer node

print(h)

Output: [[-0.70770067 -0.12952097 0.2498067 ]]

class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)
        self.x = x
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        self.grads[0][...] = dW
        return dx

layer=MatMul(W)
h=layer.forward(c)
print(h)

Output: [[-0.70770067 -0.12952097 0.2498067 ]]
Both approaches give the same result, but we use the MatMul layer because it fits into the model we build later.

3.2 A Simple word2vec

This chapter uses the CBOW model as the running example.

3.2.1 Inference with the CBOW Model

The CBOW model is a neural network that predicts a target word from its context. The pipeline is: input layer (encoding) -> hidden layer -> output layer (decoding).

# context data for the samples
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])

# initialize weights
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)

# create layers
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# forward propagation
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h)

print(s)

Output: [[-1.65562246 1.56105405 -0.34916877 1.04647459 2.04680116 -1.805466
0.90635499]]

3.2.2 Training the CBOW Model

The model is a multi-class classification network, so it is trained with the softmax function and the cross-entropy error.

3.2.3 word2vec Weights and Distributed Representations

word2vec uses the input-side weight matrix W_in as the final distributed representation of the words.

3.3 Preparing the Training Data

3.3.1 Contexts and Targets

Given a context as input, the network is trained so that the probability of the target word (the word that actually appears between the context words) becomes high.

def create_contexts_target(corpus, window_size=1):

    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus)-window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)
contexts,target=create_contexts_target(corpus, window_size=1)

print(contexts)
print(target)

Output: [[0 2]
[1 3]
[2 4]
[3 1]
[4 5]
[1 6]]
[1 2 3 4 1 5]

3.3.2 Converting to One-Hot Representations

The convert_one_hot() function converts the contexts and targets into one-hot vectors.

def convert_one_hot(corpus, vocab_size):

    N = corpus.shape[0]

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot
vocab_size=len(word_to_id)
contexts=convert_one_hot(contexts,vocab_size)
target=convert_one_hot(target,vocab_size)

3.4 Implementing the CBOW Model

def softmax(x):
    if x.ndim == 2:
        x = x - x.max(axis=1, keepdims=True)
        x = np.exp(x)
        x /= x.sum(axis=1, keepdims=True)
    elif x.ndim == 1:
        x = x - np.max(x)
        x = np.exp(x) / np.sum(np.exp(x))

    return x


def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
        
    if t.size == y.size:
        t = t.argmax(axis=1)
             
    batch_size = y.shape[0]

    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

    
class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  
        self.t = None  

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)

        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)

        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size

        return dx
    
class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size

        # initialize weights
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # create layers
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()

        # collect all weights and gradients into lists
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # keep the word vectors (distributed representation) as a member variable
        self.word_vecs = W_in

    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None
class Adam:
    '''
    Adam (http://arxiv.org/abs/1412.6980v8)
    '''
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [], []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))
        
        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for i in range(len(params)):
            self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
            self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
            
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)
import numpy
import time
import matplotlib.pyplot as plt
import numpy as np

def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate
def remove_duplicate(params, grads):

    params, grads = params[:], grads[:]  # copy list

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):

                if params[i] is params[j]:
                    grads[i] += grads[j]  
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)

                if find_flg: break
            if find_flg: break

        if not find_flg: break

    return params, grads

class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []
        self.eval_interval = None
        self.current_epoch = 0

    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            
            idx = numpy.random.permutation(numpy.arange(data_size))
            x = x[idx]
            t = t[idx]

            for iters in range(max_iters):
                batch_x = x[iters*batch_size:(iters+1)*batch_size]
                batch_t = t[iters*batch_size:(iters+1)*batch_size]

                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads) 
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | loss %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = numpy.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()
window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleCBOW(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

word_vecs = model.word_vecs
for word_id, word in id_to_word.items():
    print(word, word_vecs[word_id])

Output (truncated):
| epoch 1 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 2 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 3 | iter 1 / 2 | time 0[s] | loss 1.95
...
| epoch 25 | iter 1 / 2 | time 0[s] | loss 1.94

and [-0.95682913 1.5921552 -0.9855902 -1.0069962 -1.570275 ]
i [0.82757306 0.49116576 0.8163651 0.8544265 0.8013826 ]
hello [0.9561777 1.4145606 0.95824456 0.97314036 1.1887612 ]
. [-1.0302204 -1.3671738 -0.98632777 -1.0349017 1.4386774 ]

3.5 Additional Notes on word2vec

3.5.1 The CBOW Model and Probability

The posterior probability here is simply a conditional probability.
Note the derivation of the CBOW model's cross-entropy loss: training means making this loss as small as possible.
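With a window size of 1, the loss the CBOW model minimizes over a corpus w_1, ..., w_T is the negative log-likelihood:
$$ L=-\frac{1}{T}\sum_{t=1}^{T}\log P(w_t\mid w_{t-1},\,w_{t+1}) $$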

3.5.2 The skip-gram Model

The skip-gram model predicts the context from the target word.
Note the derivation of the skip-gram model's cross-entropy loss as well.
skip-gram generally gives better word vectors; CBOW trains faster.
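The corresponding skip-gram loss (again with window size 1) sums the two context predictions:
$$ L=-\frac{1}{T}\sum_{t=1}^{T}\bigl(\log P(w_{t-1}\mid w_t)+\log P(w_{t+1}\mid w_t)\bigr) $$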

3.5.3 Count-Based vs. Inference-Based

Count-based methods capture word similarity; inference-based methods capture similarity and also more complex patterns between words. In practice their accuracy is comparable, and the two approaches are closely related.

3.6 Summary

This chapter introduced word2vec, an inference-based way of representing words, using the CBOW model (predicting the target word from its context) as the worked example in code. We also looked at the principle of the skip-gram model, which predicts the context from the target word. Compared with count-based methods, the inference-based approach can capture richer patterns in the data, but in its current form it is computationally expensive; the next chapter discusses techniques for speeding it up.

Chapter 4: Speeding Up word2vec

The problem with the CBOW model from the previous chapter is that the amount of computation grows with the vocabulary size, which makes training slow. This chapter speeds it up in two ways: 1. introducing an Embedding layer; 2. introducing a new loss based on negative sampling.

4.1 Improving word2vec, Part 1

The product of the one-hot input and the weight matrix W_in -> replaced by an Embedding layer.
The product of the hidden layer and the weight matrix W_out, plus the Softmax computation -> replaced by negative sampling.
This section covers the Embedding layer.

4.1.1 The Embedding Layer

The input-side MatMul layer is replaced with an Embedding layer; the dense vector representation of a word is called a word embedding.

4.1.2 Implementing the Embedding Layer

import numpy as np
W=np.arange(21).reshape(7,3)
print(W)

Output: [[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[15 16 17]
[18 19 20]]

print(W[2])

Output: [6 7 8]

print(W[5])

Output: [15 16 17]

idx=np.array([1,0,3,0])
print(W[idx])

Output: [[ 3 4 5]
[ 0 1 2]
[ 9 10 11]
[ 0 1 2]]

class Embedding:
    def __init__(self,W):
        self.params=[W]
        self.grads=[np.zeros_like(W)]
        self.idx=None

    def forward(self,idx):
        W,=self.params
        self.idx=idx
        out=W[idx]
        return out
    
    def backward(self,dout):
        dW,=self.grads
        dW[...]=0

        # add (rather than assign) so that repeated indices accumulate correctly
        for i,word_id in enumerate(self.idx):
            dW[word_id]+=dout[i]
        return None

4.2 Improving word2vec, Part 2

Replacing Softmax with negative sampling keeps the amount of computation low and roughly constant no matter how large the vocabulary is.

4.2.1 The Bottleneck after the Hidden Layer

Two problems:
1. the product of the hidden-layer neurons and the weight matrix W_out;
2. the Softmax computation over the whole vocabulary.

4.2.2 From Multi-Class Classification to Binary Classification

The central idea of negative sampling is to approximate the multi-class problem with binary classification.
Example: ask the network "when the context is 'you' and 'goodbye', is the target word 'say'?"
To answer that, it is enough to extract the column of W_out corresponding to 'say' and take its dot product with the hidden-layer vector.

4.2.3 The Sigmoid Function and Cross-Entropy Error

Multi-class classification: Softmax function + cross-entropy error.
Binary classification: Sigmoid function + cross-entropy error.
The binary and multi-class cross-entropy losses are essentially the same thing.
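Concretely, for a score x, a label t in {0, 1} and a prediction y:
$$ y=\sigma(x)=\frac{1}{1+e^{-x}},\qquad L=-\bigl(t\log y+(1-t)\log(1-y)\bigr) $$
A convenient property is that the gradient flowing back through this pair is simply y - t, which is what the SigmoidWithLoss layer later in this chapter computes.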

4.2.4 Implementing the Switch to Binary Classification

1. On the input side, Embedding replaces MatMul.
2. On the output side, an Embedding-dot layer replaces MatMul, and Sigmoid replaces Softmax.

class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)

        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)

        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh

4.2.5 Negative Sampling

So far the network has only learned from the positive example ('say').
This section deals with negative examples. Should we learn from all of them? No: learning from only a small number of sampled negative examples is exactly what "negative sampling" means.
The losses of the sampled negative examples are added to the loss of the positive example.

4.2.6 How to Sample Negative Examples

To keep the computation manageable, the number of negative examples is limited to a small value (say 5 or 10), and, as in the code below, they are sampled according to word frequency raised to the power 0.75 so that rare words are not ignored completely.

import numpy as np
np.random.choice(10)

Output: 3

words=['you','say','goodbye','I','hellow','.']
np.random.choice(words)

Output: 'say'

np.random.choice(words,size=5)

Output: array(['hellow', '.', 'hellow', 'I', 'I'], dtype='<U7')

np.random.choice(words,size=5,replace=False)

Output: array(['.', 'say', 'goodbye', 'hellow', 'you'], dtype='<U7')

p=[0.5,0.1,0.05,0.2,0.05,0.1]
np.random.choice(words,p=p)

Output: 'you'

p=[0.7,0.29,0.01]
new_p=np.power(p,0.75)
new_p/=np.sum(new_p)
print(new_p)

Output: [0.64196878 0.33150408 0.02652714]

import collections
class UnigramSampler:
    def __init__(self, corpus, power, sample_size):
        self.sample_size = sample_size
        self.vocab_size = None
        self.word_p = None

        counts = collections.Counter()
        for word_id in corpus:
            counts[word_id] += 1

        vocab_size = len(counts)
        self.vocab_size = vocab_size

        self.word_p = np.zeros(vocab_size)
        for i in range(vocab_size):
            self.word_p[i] = counts[i]

        self.word_p = np.power(self.word_p, power)
        self.word_p /= np.sum(self.word_p)

    def get_negative_sample(self, target):
        batch_size = target.shape[0]


        negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)

        for i in range(batch_size):
            p = self.word_p.copy()
            target_idx = target[i]
            p[target_idx] = 0
            p /= p.sum()
            negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)

        return negative_sample
corpus=np.array([0,1,2,3,4,1,2,3])
power=0.75
sample_size=2

sampler=UnigramSampler(corpus,power,sample_size)
target=np.array([1,3,0])
negative_sample=sampler.get_negative_sample(target)
print(negative_sample)

Output: [[4 3]
[2 1]
[4 3]]

4.2.7 Implementing Negative Sampling

class SigmoidWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.loss = None
        self.y = None  
        self.t = None  

    def forward(self, x, t):
        self.t = t
        self.y = 1 / (1 + np.exp(-x))

        self.loss = cross_entropy_error(np.c_[1 - self.y, self.y], self.t)

        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = (self.y - self.t) * dout / batch_size
        return dx

class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]

        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)

        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)

        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)

        return loss

    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)

        return dh

4.3 Training the Improved word2vec

4.3.1 Implementing the CBOW Model

class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        self.word_vecs = W_in

    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])
        h *= 1 / len(self.in_layers)
        loss = self.ns_loss.forward(h, target)
        return loss

    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None

4.3.2 Training Code for the CBOW Model

import pickle
window_size = 3
hidden_size = 100
batch_size = 100
max_epoch = 10

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

contexts, target = create_contexts_target(corpus, window_size)

model = CBOW(vocab_size, hidden_size, window_size, corpus)
# model = SkipGram(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

word_vecs = model.word_vecs

params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)

4.3.3 Evaluating the CBOW Model

def normalize(x):
    if x.ndim == 2:
        s = np.sqrt((x * x).sum(1))
        x /= s.reshape((s.shape[0], 1))
    elif x.ndim == 1:
        s = np.sqrt((x * x).sum())
        x /= s
    return x

def analogy(a, b, c, word_to_id, id_to_word, word_matrix, top=5, answer=None):
    for word in (a, b, c):
        if word not in word_to_id:
            print('%s is not found' % word)
            return

    print('\n[analogy] ' + a + ':' + b + ' = ' + c + ':?')
    a_vec, b_vec, c_vec = word_matrix[word_to_id[a]], word_matrix[word_to_id[b]], word_matrix[word_to_id[c]]
    query_vec = b_vec - a_vec + c_vec
    query_vec = normalize(query_vec)

    similarity = np.dot(word_matrix, query_vec)

    if answer is not None:
        print("==>" + answer + ":" + str(np.dot(word_matrix[word_to_id[answer]], query_vec)))

    count = 0
    for i in (-1 * similarity).argsort():
        if np.isnan(similarity[i]):
            continue
        if id_to_word[i] in (a, b, c):
            continue
        print(' {0}: {1}'.format(id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return
pkl_file = 'cbow_params.pkl'
# pkl_file = 'skipgram_params.pkl'

with open(pkl_file, 'rb') as f:
    params = pickle.load(f)
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

# most similar task
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

# analogy task
print('-'*50)
analogy('king', 'man', 'queen',  word_to_id, id_to_word, word_vecs)
analogy('take', 'took', 'go',  word_to_id, id_to_word, word_vecs)
analogy('car', 'cars', 'child',  word_to_id, id_to_word, word_vecs)
analogy('good', 'better', 'bad',  word_to_id, id_to_word, word_vecs)

4.4 Other Topics Related to word2vec

4.4.1 Applications of word2vec

Word embeddings can be used for transfer learning, which enables downstream NLP tasks such as text classification, text clustering, part-of-speech tagging, and sentiment analysis.
A key advantage of word embeddings is that they turn a word into a fixed-length vector.

4.4.2 Evaluating Word Vectors

Common evaluation tasks are word similarity and word analogy.
Findings reported in the book:
1. Accuracy depends on the model (choose the best model for the corpus).
2. A larger corpus gives better results (big data always helps).
3. The dimensionality of the word vectors must be moderate (neither too small nor too large).

4.5 Summary

This chapter sped up the input and output sides of the CBOW model.
On the input side, an Embedding layer was introduced.
On the output side, a negative-sampling loss was introduced.
The overall idea in both cases is to extract only what is needed instead of computing everything, which is what makes the model fast.

Chapter 5: RNN

RNNs were introduced to overcome the limitations of feed-forward networks.

5.1 Probability and Language Models

5.1.1 word2vec from a Probabilistic Viewpoint

This part reviews the CBOW model (predicting the middle word from its context), then considers the probability and loss when the context consists only of the words to the left, which leads to the idea of a language model.

5.1.2 Language Models

A language model gives the probability of a sequence of words occurring; applications include machine translation, speech recognition, and generating new sentences.
The probability that several events happen together is the joint probability, and a joint probability can be decomposed into a product of conditional probabilities.
Here the posterior probability is the probability of the target word conditioned on all the words to its left.
Figure 5-3 in the book illustrates this: when the t-th word is the target, all the words to its left form the context (the condition), which is why such a model is also called a conditional language model.
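In formula form, the joint probability of the sequence factorizes into conditional (posterior) probabilities:
$$ P(w_1,\dots,w_m)=\prod_{t=1}^{m}P(w_t\mid w_1,\dots,w_{t-1}) $$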

5.1.3 Can the CBOW Model Be Used as a Language Model?

The Markov property means that the future state depends only on the current state.
Using CBOW as a language model causes two problems: 1. the order of the context words is ignored; 2. if the contexts are concatenated instead, the number of parameters grows in proportion to the context length.
RNNs solve both problems: they can handle time-series data of arbitrary length.

5.2 RNN

5.2.1 A Recurrent Neural Network

"Recurrent" means repeating and continuing.
An RNN keeps a memory of past data while updating it with the newest data.
The RNN layer has a loop: one branch of its output is fed back in as its own input.
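The update the RNN layer performs at each time step (implemented in section 5.3.1 below) is:
$$ \boldsymbol{h}_t=\tanh(\boldsymbol{h}_{t-1}\boldsymbol{W}_h+\boldsymbol{x}_t\boldsymbol{W}_x+\boldsymbol{b}) $$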

5.2.2 Unrolling the Loop

The unrolled RNN layers are all the same layer: at each time step the layer receives the input for that step together with the output of the previous step.
In other words, the RNN keeps producing an output and feeding it back in (updating its hidden state).

5.2.3 Backpropagation Through Time

BPTT is error backpropagation applied to the network unrolled along the time axis.

5.2.4 Truncated BPTT

In truncated BPTT only the backward connections are cut, and learning proceeds in units of these truncated blocks.
That is, forward propagation is not truncated; only the backward pass used to compute the error gradients is.

5.2.5 Mini-Batch Learning with Truncated BPTT

When using mini-batches, note two things:
1. feed the data in order; 2. shift (offset) the starting position of each sample in the batch.

5.3 Implementing an RNN

The layer that performs a single time step is called the RNN layer; the layer that processes T steps at once is called the Time RNN layer.

5.3.1 Implementing the RNN Layer

import numpy as np

class RNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
        h_next = np.tanh(t)

        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache

        dt = dh_next * (1 - h_next ** 2)
        db = np.sum(dt, axis=0)
        dWh = np.dot(h_prev.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dWx = np.dot(x.T, dt)
        dx = np.dot(dt, Wx.T)

        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db

        return dx, dh_prev

5.3.2 Implementing the Time RNN Layer

class TimeRNN:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None

        self.h, self.dh = None, None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        D, H = Wx.shape

        self.layers = []
        hs = np.empty((N, T, H), dtype='f')

        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')

        for t in range(T):
            layer = RNN(*self.params)
            self.h = layer.forward(xs[:, t, :], self.h)
            hs[:, t, :] = self.h
            self.layers.append(layer)

        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D, H = Wx.shape

        dxs = np.empty((N, T, D), dtype='f')
        dh = 0
        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh = layer.backward(dhs[:, t, :] + dh)
            dxs[:, t, :] = dx

            for i, grad in enumerate(layer.grads):
                grads[i] += grad

        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh

        return dxs

    def set_state(self, h):
        self.h = h

    def reset_state(self):
        self.h = None

5.4 Implementing Layers for Time-Series Data

A language model based on an RNN is called an RNNLM.

5.4.1 The Full Picture of an RNNLM

An RNNLM can "remember" the words it has seen so far and predict the next word on that basis.

5.4.2 Implementing the Time Layers

A layer that handles T time steps of data at once is called a "Time XX" layer.
The Time loss layer outputs the average of the per-step losses.
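That is, with per-step losses L_0, ..., L_{T-1}:
$$ L=\frac{1}{T}\bigl(L_0+L_1+\cdots+L_{T-1}\bigr) $$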

5.5 Training and Evaluating the RNNLM

5.5.1 Implementing the RNNLM

class TimeEmbedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.layers = None
        self.W = W

    def forward(self, xs):
        N, T = xs.shape
        V, D = self.W.shape

        out = np.empty((N, T, D), dtype='f')
        self.layers = []

        for t in range(T):
            layer = Embedding(self.W)
            out[:, t, :] = layer.forward(xs[:, t])
            self.layers.append(layer)

        return out

    def backward(self, dout):
        N, T, D = dout.shape

        grad = 0
        for t in range(T):
            layer = self.layers[t]
            layer.backward(dout[:, t, :])
            grad += layer.grads[0]

        self.grads[0][...] = grad
        return None


class TimeAffine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        N, T, D = x.shape
        W, b = self.params

        rx = x.reshape(N*T, -1)
        out = np.dot(rx, W) + b
        self.x = x
        return out.reshape(N, T, -1)

    def backward(self, dout):
        x = self.x
        N, T, D = x.shape
        W, b = self.params

        dout = dout.reshape(N*T, -1)
        rx = x.reshape(N*T, -1)

        db = np.sum(dout, axis=0)
        dW = np.dot(rx.T, dout)
        dx = np.dot(dout, W.T)
        dx = dx.reshape(*x.shape)

        self.grads[0][...] = dW
        self.grads[1][...] = db

        return dx


class TimeSoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None
        self.ignore_label = -1

    def forward(self, xs, ts):
        N, T, V = xs.shape

        if ts.ndim == 3:  
            ts = ts.argmax(axis=2)

        mask = (ts != self.ignore_label)

        xs = xs.reshape(N * T, V)
        ts = ts.reshape(N * T)
        mask = mask.reshape(N * T)

        ys = softmax(xs)
        ls = np.log(ys[np.arange(N * T), ts])
        ls *= mask
        loss = -np.sum(ls)
        loss /= mask.sum()

        self.cache = (ts, ys, mask, (N, T, V))
        return loss

    def backward(self, dout=1):
        ts, ys, mask, (N, T, V) = self.cache

        dx = ys
        dx[np.arange(N * T), ts] -= 1
        dx *= dout
        dx /= mask.sum()
        dx *= mask[:, np.newaxis]

        dx = dx.reshape((N, T, V))

        return dx
class SimpleRnnlm:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
        rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
        rnn_b = np.zeros(H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.rnn_layer = self.layers[1]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, ts):
        for layer in self.layers:
            xs = layer.forward(xs)
        loss = self.loss_layer.forward(xs, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.rnn_layer.reset_state()

5.5.2 Evaluating Language Models

Perplexity is the reciprocal of the probability the model assigns (for a single prediction); the smaller, the better.
It can be interpreted as a branching factor: the number of candidates for the next word.
So when evaluating a language model, a smaller perplexity (a smaller branching factor) means a better model.
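In terms of the average cross-entropy loss L over the data, perplexity is:
$$ \mathrm{perplexity}=e^{L} $$
For example, if the model assigns probability 0.2 to the correct next word, the perplexity is 1/0.2 = 5, i.e. the model is effectively choosing among about 5 candidates.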

5.5.3 Training Code for the RNNLM

class SGD:

    def __init__(self, lr=0.01):
        self.lr = lr
        
    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]
batch_size = 10
wordvec_size = 100
hidden_size = 100
time_size = 5 
lr = 0.1
max_epoch = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)

xs = corpus[:-1]  
ts = corpus[1:]  
data_size = len(xs)
print('corpus size: %d, vocabulary size: %d' % (corpus_size, vocab_size))

max_iters = data_size // (batch_size * time_size)
time_idx = 0
total_loss = 0
loss_count = 0
ppl_list = []

model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)

jump = (corpus_size - 1) // batch_size
offsets = [i * jump for i in range(batch_size)]

for epoch in range(max_epoch):
    for iter in range(max_iters):

        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, t] = xs[(offset + time_idx) % data_size]
                batch_t[i, t] = ts[(offset + time_idx) % data_size]
            time_idx += 1

        loss = model.forward(batch_x, batch_t)
        model.backward()
        optimizer.update(model.params, model.grads)
        total_loss += loss
        loss_count += 1

    ppl = np.exp(total_loss / loss_count)
    print('| epoch %d | perplexity %.2f'
          % (epoch+1, ppl))
    ppl_list.append(float(ppl))
    total_loss, loss_count = 0, 0

5.5.4 A Trainer Class for the RNNLM

class RnnlmTrainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.time_idx = None
        self.ppl_list = None
        self.eval_interval = None
        self.current_epoch = 0

    def get_batch(self, x, t, batch_size, time_size):
        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')

        data_size = len(x)
        jump = data_size // batch_size
        offsets = [i * jump for i in range(batch_size)] 

        for time in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, time] = x[(offset + self.time_idx) % data_size]
                batch_t[i, time] = t[(offset + self.time_idx) % data_size]
            self.time_idx += 1
        return batch_x, batch_t

    def fit(self, xs, ts, max_epoch=10, batch_size=20, time_size=35,
            max_grad=None, eval_interval=20):
        data_size = len(xs)
        max_iters = data_size // (batch_size * time_size)
        self.time_idx = 0
        self.ppl_list = []
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            for iters in range(max_iters):
                batch_x, batch_t = self.get_batch(xs, ts, batch_size, time_size)

                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads)
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    ppl = np.exp(total_loss / loss_count)
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | perplexity %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, ppl))
                    self.ppl_list.append(float(ppl))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1
batch_size = 10
wordvec_size = 100
hidden_size = 100 
time_size = 5 
lr = 0.1
max_epoch = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000 
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)
xs = corpus[:-1]
ts = corpus[1:] 

model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

trainer.fit(xs, ts, max_epoch, batch_size, time_size)

5.6 Summary

This chapter introduced the RNN layer, the Time RNN layer, and the RNNLM. The key points are the principle of the RNN and understanding its recurrent structure.

Chapter 6: Gated RNNs

A plain RNN cannot learn long-term dependencies in time-series data well.
The gate mechanisms in LSTM and GRU make it possible to learn such long-term dependencies.

6.1 Problems with RNNs

With BPTT, gradients tend to vanish or explode.

6.1.1 Review of RNNs

This part reviews the structure and computational graph of the RNN.

6.1.2 Vanishing and Exploding Gradients

As we backpropagate further back in time, a simple RNN cannot keep the gradient from shrinking toward zero or blowing up.

6.1.3 Causes of Vanishing and Exploding Gradients

One cause of vanishing gradients is the tanh activation; replacing it with ReLU can mitigate this.
In the experiment below we track the L2 norm of the gradient dh (the square root of the sum of its squared elements), averaged over the batch size N.
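On the tanh point above: the derivative of tanh is
$$ \frac{\partial \tanh(x)}{\partial x}=1-\tanh^2(x)\le 1 $$
so every backward pass through a tanh node scales the gradient by a factor below 1 (except exactly at x = 0), and these factors compound over time steps.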

N = 2  
H = 3  
T = 20  

dh = np.ones((N, H))

np.random.seed(3)

Wh = np.random.randn(H, H)
#Wh = np.random.randn(H, H) * 0.5

norm_list = []
for t in range(T):
    dh = np.dot(dh, Wh.T)
    norm = np.sqrt(np.sum(dh**2)) / N
    norm_list.append(norm)

print(norm_list)

Output: [2.4684068094579303, 3.3357049741610365, 4.783279375373182, 6.279587332087612, 8.080776465019053, 10.251163032292936, 12.936063506609896, 16.276861327786712, 20.45482961834598, 25.688972842084684, 32.25315718048336, 40.48895641683869, 50.8244073070191, 63.79612654485427, 80.07737014308985, 100.5129892205125, 126.16331847536823, 158.35920648258823, 198.7710796761195, 249.495615421267]
Here the gradient norm grows exponentially with the number of time steps: this is the exploding-gradient problem.

N = 2  
H = 3  
T = 20  

dh = np.ones((N, H))

np.random.seed(3)

#Wh = np.random.randn(H, H)
Wh = np.random.randn(H, H) * 0.5

norm_list = []
for t in range(T):
    dh = np.dot(dh, Wh.T)
    norm = np.sqrt(np.sum(dh**2)) / N
    norm_list.append(norm)

print(norm_list)

Output: [1.2342034047289652, 0.8339262435402591, 0.5979099219216477, 0.39247420825547574, 0.2525242645318454, 0.16017442237957713, 0.10106299614538981, 0.06358148956166684, 0.03995083909833199, 0.025086887541098325, 0.015748611904532892, 0.009884999125204758, 0.006204151282595105, 0.003893806551809953, 0.002443767399386287, 0.0015337065005571365, 0.0009625497320203265, 0.0006040924319556741, 0.00037912574706291106, 0.00023793756048323344]
Here the gradient norm shrinks exponentially with the number of time steps: this is the vanishing-gradient problem.
The matrix Wh is multiplied in T times, and its singular values (a measure of how much the matrix stretches the data) decide what happens: roughly, if the largest singular value is greater than 1 the gradient tends to explode, and if it is less than 1 the gradient tends to vanish.

6.1.4 Countermeasures against Exploding Gradients

Set a threshold on the gradient norm and clip the gradients against it (gradient clipping):
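The clipping rule, where g-hat denotes all the parameter gradients concatenated into one vector, is:
$$ \text{if } \lVert\hat{\boldsymbol{g}}\rVert\ge \text{threshold}:\quad \hat{\boldsymbol{g}}\leftarrow\frac{\text{threshold}}{\lVert\hat{\boldsymbol{g}}\rVert}\,\hat{\boldsymbol{g}} $$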

import numpy as np


dW1 = np.random.rand(3, 3) * 10
dW2 = np.random.rand(3, 3) * 10
grads = [dW1, dW2]
max_norm = 5.0


def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate


print('before:', dW1.flatten())
clip_grads(grads, max_norm)
print('after:', dW1.flatten())

Output: before: [6.49144048 2.78487283 6.76254902 5.90862817 0.23981882 5.58854088
2.59252447 4.15101197 2.83525082]
after: [1.49503731 0.64138134 1.55747605 1.36081038 0.05523244 1.28709139
0.59708178 0.95601551 0.65298384]
Given the gradients and the threshold, the function outputs the clipped gradients.
Gradient clipping is the standard countermeasure against exploding gradients.

6.2 Vanishing Gradients and LSTM

6.2.1 The Interface of an LSTM

The difference between an LSTM and a plain RNN is the extra memory cell c, which receives and passes data only inside the LSTM layer.

6.2.2 The Structure of the LSTM Layer

The hidden state h is still output at every time step; the memory cell c flows only inside the LSTM layer from one time step to the next and is not exposed to other layers.
c_t and h_t have the same number of elements.
How far each gate opens is itself learned from the data.

6.2.3 The Output Gate

Here a gate is applied to tanh(c_t): element by element, it adjusts how important each value of tanh(c_t) is for the next hidden state. This gate is called the output gate, and its openness is computed from the input x_t and the previous state h_{t-1}.
tanh outputs values in (-1, 1) and represents the strength of the encoded information; sigmoid outputs values in (0, 1) and represents the proportion of data let through.

6.2.4 The Forget Gate

The forget gate is the gate applied to the memory cell c_{t-1} that lets the network forget what is no longer needed.

6.2.5 The New Memory Cell

A new tanh node is added to write new information that should be remembered into the memory cell.

6.2.6 The Input Gate

The gate applied to g is called the input gate; it weighs how valuable the newly added information is. The full set of equations is summarized below.
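For reference, the full LSTM forward computation (matching the LSTM class implemented in section 6.3) is:
$$ \boldsymbol{f}=\sigma(\boldsymbol{x}_t\boldsymbol{W}_x^{(f)}+\boldsymbol{h}_{t-1}\boldsymbol{W}_h^{(f)}+\boldsymbol{b}^{(f)}),\qquad \boldsymbol{g}=\tanh(\boldsymbol{x}_t\boldsymbol{W}_x^{(g)}+\boldsymbol{h}_{t-1}\boldsymbol{W}_h^{(g)}+\boldsymbol{b}^{(g)}) $$
$$ \boldsymbol{i}=\sigma(\boldsymbol{x}_t\boldsymbol{W}_x^{(i)}+\boldsymbol{h}_{t-1}\boldsymbol{W}_h^{(i)}+\boldsymbol{b}^{(i)}),\qquad \boldsymbol{o}=\sigma(\boldsymbol{x}_t\boldsymbol{W}_x^{(o)}+\boldsymbol{h}_{t-1}\boldsymbol{W}_h^{(o)}+\boldsymbol{b}^{(o)}) $$
$$ \boldsymbol{c}_t=\boldsymbol{f}\odot\boldsymbol{c}_{t-1}+\boldsymbol{g}\odot\boldsymbol{i},\qquad \boldsymbol{h}_t=\boldsymbol{o}\odot\tanh(\boldsymbol{c}_t) $$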

6.2.7 How Gradients Flow through an LSTM

Why the LSTM does not suffer (as badly) from vanishing or exploding gradients: along the memory-cell path, backpropagation involves only element-wise products (Hadamard products, controlled by the forget gate), not repeated matrix products.
"LSTM" stands for Long Short-Term Memory: a short-term memory that can be maintained for a long time.

6.3 Implementing the LSTM

The difference from the RNN implementation is the extra memory cell:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class LSTM:
    def __init__(self, Wx, Wh, b):

        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev, c_prev):
        Wx, Wh, b = self.params
        N, H = h_prev.shape

        A = np.dot(x, Wx) + np.dot(h_prev, Wh) + b

        f = A[:, :H]
        g = A[:, H:2*H]
        i = A[:, 2*H:3*H]
        o = A[:, 3*H:]

        f = sigmoid(f)
        g = np.tanh(g)
        i = sigmoid(i)
        o = sigmoid(o)

        c_next = f * c_prev + g * i
        h_next = o * np.tanh(c_next)

        self.cache = (x, h_prev, c_prev, i, f, g, o, c_next)
        return h_next, c_next

    def backward(self, dh_next, dc_next):
        Wx, Wh, b = self.params
        x, h_prev, c_prev, i, f, g, o, c_next = self.cache

        tanh_c_next = np.tanh(c_next)

        ds = dc_next + (dh_next * o) * (1 - tanh_c_next ** 2)

        dc_prev = ds * f

        di = ds * g
        df = ds * c_prev
        do = dh_next * tanh_c_next
        dg = ds * i

        di *= i * (1 - i)
        df *= f * (1 - f)
        do *= o * (1 - o)
        dg *= (1 - g ** 2)

        dA = np.hstack((df, dg, di, do))

        dWh = np.dot(h_prev.T, dA)
        dWx = np.dot(x.T, dA)
        db = dA.sum(axis=0)

        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db

        dx = np.dot(dA, Wx.T)
        dh_prev = np.dot(dA, Wh.T)

        return dx, dh_prev, dc_prev

TimeLSTM consists of T LSTM layers, one per time step:

class TimeLSTM:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None

        self.h, self.c = None, None
        self.dh = None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        H = Wh.shape[0]

        self.layers = []
        hs = np.empty((N, T, H), dtype='f')

        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')
        if not self.stateful or self.c is None:
            self.c = np.zeros((N, H), dtype='f')

        for t in range(T):
            layer = LSTM(*self.params)
            self.h, self.c = layer.forward(xs[:, t, :], self.h, self.c)
            hs[:, t, :] = self.h

            self.layers.append(layer)

        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D = Wx.shape[0]

        dxs = np.empty((N, T, D), dtype='f')
        dh, dc = 0, 0

        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh, dc = layer.backward(dhs[:, t, :] + dh, dc)
            dxs[:, t, :] = dx
            for i, grad in enumerate(layer.grads):
                grads[i] += grad

        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh
        return dxs

    def set_state(self, h, c=None):
        self.h, self.c = h, c

    def reset_state(self):
        self.h, self.c = None, None

6.4 A Language Model Using LSTM

import os

GPU = False  # flag used by BaseModel below; in the book's repo this comes from common.config
def to_cpu(x):
    import numpy
    if type(x) == numpy.ndarray:
        return x
    return np.asnumpy(x)


def to_gpu(x):
    import cupy
    if type(x) == cupy.ndarray:
        return x
    return cupy.asarray(x)


class BaseModel:
    def __init__(self):
        self.params, self.grads = None, None

    def forward(self, *args):
        raise NotImplementedError

    def backward(self, *args):
        raise NotImplementedError

    def save_params(self, file_name=None):
        if file_name is None:
            file_name = self.__class__.__name__ + '.pkl'

        params = [p.astype(np.float16) for p in self.params]
        if GPU:
            params = [to_cpu(p) for p in params]

        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name=None):
        if file_name is None:
            file_name = self.__class__.__name__ + '.pkl'

        if '/' in file_name:
            file_name = file_name.replace('/', os.sep)

        if not os.path.exists(file_name):
            raise IOError('No file: ' + file_name)

        with open(file_name, 'rb') as f:
            params = pickle.load(f)

        params = [p.astype('f') for p in params]
        if GPU:
            params = [to_gpu(p) for p in params]

        for i, param in enumerate(self.params):
            param[...] = params[i]

class Rnnlm(BaseModel):
    def __init__(self, vocab_size=10000, wordvec_size=100, hidden_size=100):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.lstm_layer = self.layers[1]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def predict(self, xs):
        for layer in self.layers:
            xs = layer.forward(xs)
        return xs

    def forward(self, xs, ts):
        score = self.predict(xs)
        loss = self.loss_layer.forward(score, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.lstm_layer.reset_state()
import sys
def eval_perplexity(model, corpus, batch_size=10, time_size=35):
    print('evaluating perplexity ...')
    corpus_size = len(corpus)
    total_loss = 0
    max_iters = (corpus_size - 1) // (batch_size * time_size)
    jump = (corpus_size - 1) // batch_size

    for iters in range(max_iters):
        xs = np.zeros((batch_size, time_size), dtype=np.int32)
        ts = np.zeros((batch_size, time_size), dtype=np.int32)
        time_offset = iters * time_size
        offsets = [time_offset + (i * jump) for i in range(batch_size)]
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                xs[i, t] = corpus[(offset + t) % corpus_size]
                ts[i, t] = corpus[(offset + t + 1) % corpus_size]

        try:
            loss = model.forward(xs, ts, train_flg=False)
        except TypeError:
            loss = model.forward(xs, ts)
        total_loss += loss

        sys.stdout.write('\r%d / %d' % (iters, max_iters))
        sys.stdout.flush()

    print('')
    ppl = np.exp(total_loss / max_iters)
    return ppl
batch_size = 20
wordvec_size = 100
hidden_size = 100  
time_size = 35  
lr = 20.0
max_epoch = 4
max_grad = 0.25

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_test, _, _ = ptb.load_data('test')
vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

model = Rnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

trainer.fit(xs, ts, max_epoch, batch_size, time_size, max_grad,
            eval_interval=20)
trainer.plot(ylim=(0, 500))

model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)

model.save_params()

eval_perplexity computes the perplexity on a corpus; as training progresses over the epochs, the perplexity goes down.

6.5 Further Improving the RNNLM

6.5.1 Stacking LSTM Layers

When a high-accuracy model is needed, stacking multiple LSTM layers is effective; the number of layers is a hyperparameter.

6.5.2 Suppressing Overfitting with Dropout

RNNs overfit more easily than ordinary feed-forward networks.
Three ways to suppress overfitting: 1. more training data; 2. a simpler model; 3. regularization, which penalizes model complexity.
Dropout belongs to the third group: it randomly selects neurons and ignores them during training, which improves the network's generalization.
For an LSTM, dropout should not be inserted along the time axis but along the depth axis (between stacked layers).
The mask is the random boolean array that decides whether each value is passed through.

6.5.3 Weight Tying

Weight tying simply sets the transpose of the Embedding layer's weight as the Affine layer's weight.
Sharing the weights reduces the number of parameters that have to be learned.

6.5.4 Implementing a Better RNNLM

class TimeDropout:
    def __init__(self, dropout_ratio=0.5):
        self.params, self.grads = [], []
        self.dropout_ratio = dropout_ratio
        self.mask = None
        self.train_flg = True

    def forward(self, xs):
        if self.train_flg:
            flg = np.random.rand(*xs.shape) > self.dropout_ratio
            scale = 1 / (1.0 - self.dropout_ratio)
            self.mask = flg.astype(np.float32) * scale

            return xs * self.mask
        else:
            return xs

    def backward(self, dout):
        return dout * self.mask
class BetterRnnlm(BaseModel):

    def __init__(self, vocab_size=10000, wordvec_size=650,
                 hidden_size=650, dropout_ratio=0.5):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx1 = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh1 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b1 = np.zeros(4 * H).astype('f')
        lstm_Wx2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_Wh2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b2 = np.zeros(4 * H).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeDropout(dropout_ratio),
            TimeLSTM(lstm_Wx1, lstm_Wh1, lstm_b1, stateful=True),
            TimeDropout(dropout_ratio),
            TimeLSTM(lstm_Wx2, lstm_Wh2, lstm_b2, stateful=True),
            TimeDropout(dropout_ratio),
            TimeAffine(embed_W.T, affine_b)  # weight tying!!
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.lstm_layers = [self.layers[2], self.layers[4]]
        self.drop_layers = [self.layers[1], self.layers[3], self.layers[5]]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def predict(self, xs, train_flg=False):
        for layer in self.drop_layers:
            layer.train_flg = train_flg

        for layer in self.layers:
            xs = layer.forward(xs)
        return xs

    def forward(self, xs, ts, train_flg=True):
        score = self.predict(xs, train_flg)
        loss = self.loss_layer.forward(score, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        for layer in self.lstm_layers:
            layer.reset_state()
batch_size = 20
wordvec_size = 650
hidden_size = 650
time_size = 35
lr = 20.0
max_epoch = 40
max_grad = 0.25
dropout = 0.5

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_val, _, _ = ptb.load_data('val')
corpus_test, _, _ = ptb.load_data('test')

if config.GPU:  # assumes `from common import config` (the book repo's config module); False on CPU
    corpus = to_gpu(corpus)
    corpus_val = to_gpu(corpus_val)
    corpus_test = to_gpu(corpus_test)

vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

model = BetterRnnlm(vocab_size, wordvec_size, hidden_size, dropout)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

best_ppl = float('inf')
for epoch in range(max_epoch):
    trainer.fit(xs, ts, max_epoch=1, batch_size=batch_size,
                time_size=time_size, max_grad=max_grad)

    model.reset_state()
    ppl = eval_perplexity(model, corpus_val)
    print('valid perplexity: ', ppl)

    if best_ppl > ppl:
        best_ppl = ppl
        model.save_params()
    else:
        lr /= 4.0
        optimizer.lr = lr

    model.reset_state()
    print('-' * 50)


model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)

The code above implements the three techniques (deeper layers, dropout, and weight tying) and applies them to the PTB dataset.

6.5.5 Recent Research

Variants of the dropout idea are worth studying further.

6.6 Summary

Plain RNNs often suffer from vanishing and exploding gradients.
Gradient clipping handles exploding gradients; LSTM and GRU handle vanishing gradients.
Compared with a plain RNN, the LSTM adds a gate mechanism with three gates (the input gate, the forget gate, and the output gate), each with its own weights.
Three techniques for a better language model (deeper layers, dropout, and weight tying) were implemented.

Chapter 7: Generating Text with RNNs

This chapter uses the language model to generate text and introduces a new architecture, seq2seq.

7.1 Generating Text with a Language Model

7.1.1 Steps for Generating Text with an RNN

One option is to always generate the word with the highest probability; that choice is deterministic.
Sampling from the probability distribution instead makes the result different on every run; this choice is probabilistic, so variation is possible.
The language model has not memorized the training data; it has learned the patterns in which words appear. If it has learned those patterns well, the generated text will be natural and meaningful.
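As a small illustration of the difference (a toy sketch with a made-up probability distribution, not the model's real softmax output):

import numpy as np

# hypothetical next-word probabilities over a 5-word vocabulary
p = np.array([0.05, 0.6, 0.1, 0.05, 0.2])

greedy_id = int(np.argmax(p))                    # deterministic: always the same word
sampled_id = int(np.random.choice(len(p), p=p))  # probabilistic: can differ between runs

print(greedy_id, sampled_id)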

7.1.2 Implementing Text Generation

Class inheritance means creating a new class based on an existing one (here RnnlmGen inherits from Rnnlm).

class RnnlmGen(Rnnlm):
    def generate(self, start_id, skip_ids=None, sample_size=100):
        word_ids = [start_id]

        x = start_id
        while len(word_ids) < sample_size:
            x = np.array(x).reshape(1, 1)
            score = self.predict(x)
            p = softmax(score.flatten())

            sampled = np.random.choice(len(p), size=1, p=p)
            if (skip_ids is None) or (sampled not in skip_ids):
                x = sampled
                word_ids.append(int(x))

        return word_ids

    def get_state(self):
        return self.lstm_layer.h, self.lstm_layer.c

    def set_state(self, state):
        self.lstm_layer.set_state(*state)
corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
corpus_size = len(corpus)

model = RnnlmGen()
model.load_params('./ch06/Rnnlm.pkl')

start_word = 'you'
start_id = word_to_id[start_word]
skip_words = ['N', '<unk>', '$']
skip_ids = [word_to_id[w] for w in skip_words]

word_ids = model.generate(start_id, skip_ids)
txt = ' '.join([id_to_word[i] for i in word_ids])
txt = txt.replace(' <eos>', '.\n')
print(txt)

7.1.3 Better text generation

Better text generation requires a better RNNLM, so here generation is built on the BetterRnnlm model trained above:

class BetterRnnlmGen(BetterRnnlm):
    def generate(self, start_id, skip_ids=None, sample_size=100):
        word_ids = [start_id]

        x = start_id
        while len(word_ids) < sample_size:
            x = np.array(x).reshape(1, 1)
            score = self.predict(x).flatten()
            p = softmax(score).flatten()

            sampled = np.random.choice(len(p), size=1, p=p)
            if (skip_ids is None) or (sampled not in skip_ids):
                x = sampled
                word_ids.append(int(x))

        return word_ids

    def get_state(self):
        states = []
        for layer in self.lstm_layers:
            states.append((layer.h, layer.c))
        return states

    def set_state(self, states):
        for layer, state in zip(self.lstm_layers, states):
            layer.set_state(*state)
model = BetterRnnlmGen()                 # assumes BetterRnnlm's __init__ provides defaults matching the training above
model.load_params('./BetterRnnlm.pkl')   # assumed path to the weights saved by the training script; adjust as needed
model.reset_state()

start_words = 'the meaning of life is'
start_ids = [word_to_id[w] for w in start_words.split(' ')]

for x in start_ids[:-1]:
    x = np.array(x).reshape(1, 1)
    model.predict(x)

word_ids = model.generate(start_ids[-1], skip_ids)
word_ids = start_ids[:-1] + word_ids
txt = ' '.join([id_to_word[i] for i in word_ids])
txt = txt.replace(' <eos>', '.\n')
print('-' * 50)
print(txt)

7.2 The seq2seq model

seq2seq (sequence to sequence) is a model that converts one time series into another.

7.2.1 How seq2seq works

seq2seq consists of two modules: an encoder (Encoder) and a decoder (Decoder).
Machine translation is a convenient task for understanding the two: the encoder encodes the source sentence into a representation, and the decoder generates the target sentence from that representation.
Encoding means converting text of arbitrary length into a fixed-length vector.
A delimiter (separator symbol) tells the decoder where to start and where to stop generating.

7.2.2 A simple attempt at time-series conversion

As a toy problem, seq2seq is made to learn addition (for example, '57+5' should be converted to '62'). For this task the input is split into characters rather than words, so '57+5' is treated as the character list ['5', '7', '+', '5'].

7.2.3 Variable-length time-series data

Within a mini-batch, every sample must have the same shape. The simplest way to put variable-length sequences into a mini-batch is padding: by padding the data to a common length, variable-length time-series data can be handled (a toy sketch follows below).
Padding needs special treatment in both modules: when padded positions are fed to the decoder, their loss should not be counted (for example by masking them in the Softmax with Loss layer); when padded positions are fed to the encoder, the LSTM layer should simply pass the previous time step's hidden state through unchanged.
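A toy illustration of padding (not the book's code; the sequence loader used in the next section already pads the addition data for us):

def pad_right(seqs, pad_char=' '):
    # right-pad every string with spaces so all samples share one length
    max_len = max(len(s) for s in seqs)
    return [s + pad_char * (max_len - len(s)) for s in seqs]

print(pad_right(['57+5', '628+521', '1+8']))
# ['57+5   ', '628+521', '1+8    ']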

7.2.4 The addition dataset

In the addition dataset, questions and answers are padded with space characters so that all samples have the same length.
The code below loads the dataset and retrieves the character-to-ID mappings:

from dataset import sequence


(x_train, t_train), (x_test, t_test) = \
    sequence.load_data('addition.txt', seed=1984)
char_to_id, id_to_char = sequence.get_vocab()

print(x_train.shape, t_train.shape)
print(x_test.shape, t_test.shape)
# (45000, 7) (45000, 5)
# (5000, 7) (5000, 5)

print(x_train[0])
print(t_train[0])
# [ 3  0  2  0  0 11  5]
# [ 6  0 11  7  5]

print(''.join([id_to_char[c] for c in x_train[0]]))
print(''.join([id_to_char[c] for c in t_train[0]]))
# 71+118
# _189

7.3 Implementing seq2seq

The implementation has two parts, an Encoder and a Decoder, which are then combined into a Seq2seq class.

7.3.1 The Encoder class

The Encoder class receives a character string (as a sequence of character IDs) and converts it into a vector h.
The encoder passes only the LSTM's hidden state of the last time step to the decoder.

class Encoder:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')

        self.embed = TimeEmbedding(embed_W)
        self.lstm = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=False)

        self.params = self.embed.params + self.lstm.params
        self.grads = self.embed.grads + self.lstm.grads
        self.hs = None

    def forward(self, xs):
        xs = self.embed.forward(xs)
        hs = self.lstm.forward(xs)
        self.hs = hs
        return hs[:, -1, :]

    def backward(self, dh):
        dhs = np.zeros_like(self.hs)
        dhs[:, -1, :] = dh

        dout = self.lstm.backward(dhs)
        dout = self.embed.backward(dout)
        return dout

Here stateful is set to False because each problem is processed independently, so the hidden state is reset for every forward pass; stateful=True is needed only when a long sequence is split into chunks and the hidden state must be carried across them, as in the language models of the previous chapters.

7.3.2 The Decoder class

The Decoder class receives the h produced by the Encoder and outputs the target character string.
During generation, an argmax node picks the index of the largest value in the Affine layer's output (the scores), which turns the probabilistic answer into a deterministic one; see the generate method below.

class Decoder:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.embed = TimeEmbedding(embed_W)
        self.lstm = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True)
        self.affine = TimeAffine(affine_W, affine_b)

        self.params, self.grads = [], []
        for layer in (self.embed, self.lstm, self.affine):
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, h):
        self.lstm.set_state(h)

        out = self.embed.forward(xs)
        out = self.lstm.forward(out)
        score = self.affine.forward(out)
        return score

    def backward(self, dscore):
        dout = self.affine.backward(dscore)
        dout = self.lstm.backward(dout)
        dout = self.embed.backward(dout)
        dh = self.lstm.dh
        return dh

    def generate(self, h, start_id, sample_size):
        sampled = []
        sample_id = start_id
        self.lstm.set_state(h)

        for _ in range(sample_size):
            x = np.array(sample_id).reshape((1, 1))
            out = self.embed.forward(x)
            out = self.lstm.forward(out)
            score = self.affine.forward(out)

            sample_id = np.argmax(score.flatten())
            sampled.append(int(sample_id))

        return sampled

7.3.3 The Seq2seq class

class Seq2seq(BaseModel):
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        self.encoder = Encoder(V, D, H)
        self.decoder = Decoder(V, D, H)
        self.softmax = TimeSoftmaxWithLoss()

        self.params = self.encoder.params + self.decoder.params
        self.grads = self.encoder.grads + self.decoder.grads

    def forward(self, xs, ts):
        decoder_xs, decoder_ts = ts[:, :-1], ts[:, 1:]

        h = self.encoder.forward(xs)
        score = self.decoder.forward(decoder_xs, h)
        loss = self.softmax.forward(score, decoder_ts)
        return loss

    def backward(self, dout=1):
        dout = self.softmax.backward(dout)
        dh = self.decoder.backward(dout)
        dout = self.encoder.backward(dh)
        return dout

    def generate(self, xs, start_id, sample_size):
        h = self.encoder.forward(xs)
        sampled = self.decoder.generate(h, start_id, sample_size)
        return sampled
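As a concrete check of the decoder_xs / decoder_ts split in forward, using t_train[0] from 7.2.4 (where, judging by the printed output there, id 6 is '_', ids 0, 11, 7 are '1', '8', '9', and id 5 is the padding space):

import numpy as np

ts = np.array([[6, 0, 11, 7, 5]])              # t_train[0], i.e. '_189' plus a trailing space
decoder_xs, decoder_ts = ts[:, :-1], ts[:, 1:]
print(decoder_xs)   # [[ 6  0 11  7]]  -> the decoder is fed '_189'
print(decoder_ts)   # [[ 0 11  7  5]]  -> and is trained to predict '189 '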

7.3.4 Evaluating seq2seq

The training flow is the standard one: 1) pick a mini-batch from the training data; 2) compute gradients on that mini-batch; 3) update the weights with the gradients. After every epoch the model is evaluated by counting how many test questions it answers correctly, using the helper below.

import os

def eval_seq2seq(model, question, correct, id_to_char,
                 verbose=False, is_reverse=False):
    correct = correct.flatten()

    start_id = correct[0]
    correct = correct[1:]
    guess = model.generate(question, start_id, len(correct))

    question = ''.join([id_to_char[int(c)] for c in question.flatten()])
    correct = ''.join([id_to_char[int(c)] for c in correct])
    guess = ''.join([id_to_char[int(c)] for c in guess])

    if verbose:
        if is_reverse:
            question = question[::-1]

        colors = {'ok': '\033[92m', 'fail': '\033[91m', 'close': '\033[0m'}
        print('Q', question)
        print('T', correct)

        is_windows = os.name == 'nt'

        if correct == guess:
            mark = colors['ok'] + '☑' + colors['close']
            if is_windows:
                mark = 'O'
            print(mark + ' ' + guess)
        else:
            mark = colors['fail'] + '☒' + colors['close']
            if is_windows:
                mark = 'X'
            print(mark + ' ' + guess)
        print('---')

    return 1 if guess == correct else 0
(x_train, t_train), (x_test, t_test) = sequence.load_data('addition.txt')
char_to_id, id_to_char = sequence.get_vocab()

is_reverse = False  # True
if is_reverse:
    x_train, x_test = x_train[:, ::-1], x_test[:, ::-1]
# ================================================================

vocab_size = len(char_to_id)
wordvec_size = 16
hidden_size = 128
batch_size = 128
max_epoch = 25
max_grad = 5.0

# Normal or Peeky? ==============================================
model = Seq2seq(vocab_size, wordvec_size, hidden_size)
# model = PeekySeq2seq(vocab_size, wordvec_size, hidden_size)
# ================================================================
optimizer = Adam()
trainer = Trainer(model, optimizer)

acc_list = []
for epoch in range(max_epoch):
    trainer.fit(x_train, t_train, max_epoch=1,
                batch_size=batch_size, max_grad=max_grad)

    correct_num = 0
    for i in range(len(x_test)):
        question, correct = x_test[[i]], t_test[[i]]
        verbose = i < 10
        correct_num += eval_seq2seq(model, question, correct,
                                    id_to_char, verbose, is_reverse)

    acc = float(correct_num) / len(x_test)
    acc_list.append(acc)
    print('val acc %.3f%%' % (acc * 100))

As training accumulates over the epochs, the answer accuracy gradually improves.

7.4 Improving seq2seq

7.4.1 Reversing the input data (Reverse)

The first trick is simply to reverse the order of the input sequence.
With reversed inputs, training progresses faster and the final accuracy also improves.
The arrays are reversed along the time axis with x_train[:, ::-1] (a quick check follows below).
After reversal, gradients propagate more smoothly: the beginning of the input sentence ends up closer to the output words it corresponds to, so gradients reach those positions more directly and learning becomes more efficient. (On average the distance between corresponding words does not change, but some pairs become much closer.)
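A quick check of the reversal, using x_train[0] from 7.2.4 ('71+118' followed by a padding space):

import numpy as np

x = np.array([[3, 0, 2, 0, 0, 11, 5]])   # one padded question, '71+118 '
print(x[:, ::-1])
# [[ 5 11  0  0  2  0  3]]  -> ' 811+17', i.e. the same characters in reverse order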

7.4.2 Peeking (Peeky)

The second trick, Peeky, takes the encoder's output h, which concentrates all the important information, and feeds it not only to the decoder's initial LSTM state but also to the decoder's other layers, namely the LSTM input and the Affine layer at every time step.

class PeekyDecoder:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx = (rn(H + D, 4 * H) / np.sqrt(H + D)).astype('f')
        lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')
        affine_W = (rn(H + H, V) / np.sqrt(H + H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.embed = TimeEmbedding(embed_W)
        self.lstm = TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True)
        self.affine = TimeAffine(affine_W, affine_b)

        self.params, self.grads = [], []
        for layer in (self.embed, self.lstm, self.affine):
            self.params += layer.params
            self.grads += layer.grads
        self.cache = None

    def forward(self, xs, h):
        N, T = xs.shape
        N, H = h.shape

        self.lstm.set_state(h)

        out = self.embed.forward(xs)
        hs = np.repeat(h, T, axis=0).reshape(N, T, H)
        out = np.concatenate((hs, out), axis=2)

        out = self.lstm.forward(out)
        out = np.concatenate((hs, out), axis=2)

        score = self.affine.forward(out)
        self.cache = H
        return score

    def backward(self, dscore):
        H = self.cache

        dout = self.affine.backward(dscore)
        dout, dhs0 = dout[:, :, H:], dout[:, :, :H]
        dout = self.lstm.backward(dout)
        dembed, dhs1 = dout[:, :, H:], dout[:, :, :H]
        self.embed.backward(dembed)

        dhs = dhs0 + dhs1
        dh = self.lstm.dh + np.sum(dhs, axis=1)
        return dh

    def generate(self, h, start_id, sample_size):
        sampled = []
        char_id = start_id
        self.lstm.set_state(h)

        H = h.shape[1]
        peeky_h = h.reshape(1, 1, H)
        for _ in range(sample_size):
            x = np.array([char_id]).reshape((1, 1))
            out = self.embed.forward(x)

            out = np.concatenate((peeky_h, out), axis=2)
            out = self.lstm.forward(out)
            out = np.concatenate((peeky_h, out), axis=2)
            score = self.affine.forward(out)

            char_id = np.argmax(score.flatten())
            sampled.append(char_id)

        return sampled

In the forward method above, the encoder's h is concatenated with the embedding output (before the LSTM) and again with the LSTM output (before the Affine layer). The weight matrices grow accordingly: lstm_Wx has shape (H + D, 4 * H) and affine_W has shape (H + H, V), as set up in __init__.
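A small shape check of that concatenation (my own illustration, using the wordvec_size and hidden_size values from the training script below):

import numpy as np

N, T, D, H = 2, 4, 16, 128                      # batch size, time steps, wordvec_size, hidden_size
h = np.zeros((N, H), dtype='f')                 # encoder output
emb_out = np.zeros((N, T, D), dtype='f')        # TimeEmbedding output

hs = np.repeat(h, T, axis=0).reshape(N, T, H)   # copy h to every time step, as in PeekyDecoder.forward
lstm_in = np.concatenate((hs, emb_out), axis=2)
print(lstm_in.shape)   # (2, 4, 144) -> 144 = H + D, hence lstm_Wx must have shape (H + D, 4 * H)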

class PeekySeq2seq(Seq2seq):
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        self.encoder = Encoder(V, D, H)
        self.decoder = PeekyDecoder(V, D, H)
        self.softmax = TimeSoftmaxWithLoss()

        self.params = self.encoder.params + self.decoder.params
        self.grads = self.encoder.grads + self.decoder.grads
(x_train, t_train), (x_test, t_test) = sequence.load_data('addition.txt')
char_to_id, id_to_char = sequence.get_vocab()

is_reverse = True
if is_reverse:
    x_train, x_test = x_train[:, ::-1], x_test[:, ::-1]
# ================================================================

vocab_size = len(char_to_id)
wordvec_size = 16
hidden_size = 128
batch_size = 128
max_epoch = 25
max_grad = 5.0

# Normal or Peeky? ==============================================
# model = Seq2seq(vocab_size, wordvec_size, hidden_size)
model = PeekySeq2seq(vocab_size, wordvec_size, hidden_size)
# ================================================================
optimizer = Adam()
trainer = Trainer(model, optimizer)

acc_list = []
for epoch in range(max_epoch):
    trainer.fit(x_train, t_train, max_epoch=1,
                batch_size=batch_size, max_grad=max_grad)

    correct_num = 0
    for i in range(len(x_test)):
        question, correct = x_test[[i]], t_test[[i]]
        verbose = i < 10
        correct_num += eval_seq2seq(model, question, correct,
                                    id_to_char, verbose, is_reverse)

    acc = float(correct_num) / len(x_test)
    acc_list.append(acc)
    print('val acc %.3f%%' % (acc * 100))

Peeky increases the amount of computation, and the accuracy of seq2seq swings considerably with the hyperparameter settings, so the improvement is not always stable.

7.5 Applications of seq2seq

7.5.1 Chatbots

Text in, text out, in the style of ChatGPT: the model converts the conversation so far into a reply.

7.5.2 Learning algorithms

The input is an algorithm (source code) given as a character sequence, and the output is the result of running it.

7.5.3 Automatic image captioning

The input is an image and the output is a sentence describing it; in this case the encoder encodes the image (for example with a CNN) instead of text.

7.6 Summary

This chapter covered text generation with a language model and the structure of seq2seq, which chains an encoder and a decoder: a simple architecture that combines two RNNs.
Two improvements to seq2seq were also implemented: Reverse and Peeky.


八、Attention

Attention is one of the most important techniques in deep learning.

8.1 The structure of Attention

This chapter introduces the attention mechanism, which strengthens the seq2seq model of the previous chapter.

8.1.1 The problem with seq2seq

The encoder's output is a fixed-length vector: no matter how long the input sentence is, it is squeezed into a vector of the same size, so information is inevitably lost for long inputs.
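A toy check of this limitation, reusing the Encoder class from 7.3.1 (the sizes here are assumptions for illustration only):

import numpy as np

encoder = Encoder(vocab_size=13, wordvec_size=16, hidden_size=128)
short_xs = np.zeros((1, 4), dtype=np.int32)    # a 4-character input
long_xs = np.zeros((1, 40), dtype=np.int32)    # a 40-character input

print(encoder.forward(short_xs).shape)   # (1, 128)
print(encoder.forward(long_xs).shape)    # (1, 128) -> same fixed size, however long the input is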

8.1.2 Improving the encoder

The amount of information the encoder outputs should grow with the length of the input text.
By keeping the hidden state of every time step (every word), we obtain as many vectors as there are input words; the matrix hs output by the encoder can then be viewed as the collection of vectors corresponding to the individual words.
Taking out the hidden states of all time steps lets the encoder encode an amount of information proportional to the length of the input sentence (a sketch follows below).
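A minimal sketch of an encoder modified along these lines, reusing the Encoder class from 7.3.1 (the book's attention chapter implements essentially the same idea); the only change is that all hidden states are returned instead of just the last one:

class AttentionEncoder(Encoder):
    def forward(self, xs):
        xs = self.embed.forward(xs)
        hs = self.lstm.forward(xs)   # shape (N, T, H): one hidden state per input word
        return hs                    # hand the whole matrix to the decoder

    def backward(self, dhs):
        dout = self.lstm.backward(dhs)   # dhs covers every time step, so no zero-padding is needed
        dout = self.embed.backward(dout)
        return dout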

8.1.3 Improving the decoder (1)


8.1.4 Improving the decoder (2)


8.1.5 Improving the decoder (3)


8.2 Implementing seq2seq with Attention

8.2.1 Implementing the encoder


8.2.2 Implementing the decoder


8.2.3 Implementing seq2seq


8.3 Evaluating Attention

8.3.1 The date format conversion problem


8.3.2 Training seq2seq with Attention


8.3.3 Visualizing Attention


8.4 Other topics on Attention

8.4.1 Bidirectional RNN


8.4.2 How to use the Attention layer


8.4.3 Deepening seq2seq and skip connections


8.5 Applications of Attention

8.5.1 GNMT


8.5.2 Transformer


8.5.3 NTM




Summary
