《深度学习进阶:自然语言处理》

一 神经网络的复习

这一章主要是对上一本书《深度学习入门》内容的总结,复习可戳https://blog.csdn.net/qq_45461377/article/details/124915144?spm=1001.2014.3001.5501

二 自然语言和单词的分布式表示

2.1什么是自然语言处理

自然语言处理(NLP)是让计算机理解并处理我们日常使用的自然语言的技术。自然语言是"柔软的"(含义灵活、随语境变化),而编程语言是"硬的"(语法规则固定)。让计算机读懂并处理自然语言的方法目前主要分为三类:1、基于同义词词典的方法;2、基于计数的方法;3、基于推理的方法(word2vec)。本章先介绍前两种。

2.2同义词词典

同义词词典是人为定义单词之间关系的词典:它把意思相近的单词归为同一组,并记录单词之间的上下位等层次关系。
最著名的同义词词典是WordNet。
同义词词典存在的问题:1、难以顺应时代变化;2、人工成本高;3、无法表示单词的微妙差异。

2.3基于计数的方法

基于计数的方法使用语料库来处理自然语言。

2.3.1基于python的语料库的预处理

text='You say goodbye and I say hello.'
text=text.lower()
text=text.replace('.',' .')
print(text)

输出:you say goodbye and i say hello .

words=text.split(' ')
print(words)

输出:['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

word_to_id={}
id_to_word={}
for word in words:
    if word not in word_to_id:
        new_id=len(word_to_id)
        word_to_id[word]=new_id
        id_to_word[new_id]=word
print(id_to_word)
print(word_to_id)

输出:{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}

print(id_to_word[1])
print(word_to_id['hello'])

输出:say
5

import numpy as np
corpus=[word_to_id[w] for w in words]
corpus=np.array(corpus)
print(corpus)

输出:[0 1 2 3 4 1 5 6]
把上面的部分合成一个函数得:

def preprocess(text):
    text=text.lower()
    text=text.replace('.',' .')
    words=text.split(' ')   # 先切分出单词列表

    word_to_id={}
    id_to_word={}

    for word in words:
        if word not in word_to_id:
            new_id=len(word_to_id)
            word_to_id[word]=new_id
            id_to_word[new_id]=word

    corpus=[word_to_id[w] for w in words]

    return corpus, word_to_id, id_to_word

text='You say goodbye and I say hello.'
corpus, word_to_id, id_to_word=preprocess(text)

2.3.2单词的分布式表示

就像颜色既可以用名字也可以用RGB三维向量精确地表示一样,单词也可以用向量来表示;这种向量形式的表示称为单词的分布式表示。

2.3.3分布式假设

分布式假设指某个单词的含义由它周围的单词(上下文)决定。
上下文指某个单词周围的单词。
窗口大小指上下文包含左右各多少个单词。例如对"you say goodbye and i say hello ."来说,窗口大小为1时,goodbye的上下文就是say和and。

2.3.4共现矩阵

print(corpus)
print(id_to_word)

输出:[0, 1, 2, 3, 4, 1, 5, 6]
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}

def create_co_matrix(corpus,vocab_size,window_size=1):
    corpus_size=len(corpus)
    co_matrix=np.zeros((vocab_size,vocab_size),dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1,window_size+1):
            left_idx=idx-i
            right_idx=idx+i

            if left_idx>=0:
                left_word_id=corpus[left_idx]
                co_matrix[word_id,left_word_id]+=1

            if right_idx<corpus_size:
                right_word_id=corpus[right_idx]
                co_matrix[word_id,right_word_id]+=1
    return co_matrix
print(create_co_matrix(corpus,vocab_size=len(word_to_id)))

输出:[[0 1 0 0 0 0 0]
[1 0 1 0 1 1 0]
[0 1 0 1 0 0 0]
[0 0 1 0 1 0 0]
[0 1 0 1 0 0 0]
[0 1 0 0 0 0 1]
[0 0 0 0 0 1 0]]

2.3.5向量间的相似度

余弦相似度表示两个向量在多大程度上指向同一方向。
L2范数指向量各元素的平方和的平方根。
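用公式表示,两个向量x和y的余弦相似度为:

similarity(x, y) = x·y / (||x|| ||y||) = Σ x_i y_i / ( √(Σ x_i²) · √(Σ y_i²) )

即先把两个向量分别归一化,再求内积。例如'you'的向量c0=[0,1,0,0,0,0,0]与'i'的向量c1=[0,1,0,1,0,0,0]的余弦相似度为1/√2≈0.7071,与后面的输出一致。下面的两个实现版本都对应这个公式。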

def cos_similarity(x,y):
    nx=x/np.sqrt(np.sum(x**2))  # x的归一化
    ny=y/np.sqrt(np.sum(y**2))  # y的归一化
    return np.dot(nx,ny)

上面的版本在传入零向量时会发生除以0的错误,因此加入微小值eps加以改进:

def cos_similarity(x,y,eps=1e-8):
    nx=x/np.sqrt(np.sum(x**2)+eps)
    ny=y/np.sqrt(np.sum(y**2)+eps)
    return np.dot(nx,ny)

C=create_co_matrix(corpus,vocab_size=len(word_to_id))

c0=C[word_to_id['you']]
c1=C[word_to_id['i']]
print(cos_similarity(c0,c1))

输出:0.7071067758832467

2.3.6相似单词的排序

定义一个函数,把与查询词(query)相似的单词按相似度降序显示出来。

def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
    # 取出查询词
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('\n[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    # 计算余弦相似度
    vocab_size = len(id_to_word)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)
    
    # 基于余弦相似度,按降序输出值
    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

x=np.array([100,-20,2])
x.argsort()      # 升序排列的索引:array([1, 2, 0])
(-x).argsort()   # 取负后再升序,即按原值降序排列的索引

输出:array([0, 2, 1])

most_similar('you', word_to_id, id_to_word, C)

输出为:
[query] you
goodbye: 0.7071067758832467
i: 0.7071067758832467
hello: 0.7071067758832467
say: 0.0
and: 0.0

2.4基于计数的方法的改进

本节对上一节基于计数的方法做两点改进:用点互信息替代原始的共现次数,并通过降维得到低维的密集词向量。

2.4.1点互信息

点互信息(PMI)用共现信息来衡量两个单词之间的相关性,相比原始的共现次数,它修正了高频词(如the)带来的偏差。
当两个单词的共现次数为0时,PMI为负无穷大,因此实际使用正的点互信息(PPMI),即把负值截断为0。
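用公式表示(设N为共现总次数,C(x)为单词x的出现次数,C(x, y)为x和y的共现次数):

PMI(x, y) = log2( P(x, y) / (P(x)·P(y)) ) = log2( C(x, y)·N / (C(x)·C(y)) )
PPMI(x, y) = max(0, PMI(x, y))

以前面的共现矩阵为例:C('you','say')=1,N=14,C('you')=1,C('say')=4,则PMI=log2(1×14/(1×4))=log2(3.5)≈1.807,与后面ppmi函数输出中的1.8073549一致。下面的ppmi函数就是按这个公式实现的。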

def ppmi(C,verbose=False,eps=1e-8):
    M=np.zeros_like(C,dtype=np.float32)
    N=np.sum(C)          # 共现总次数
    S=np.sum(C,axis=0)   # 各单词的出现次数
    total=C.shape[0]*C.shape[1]
    cnt=0

    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            # 分母是 S[j]*S[i],注意要加括号
            pmi=np.log2(C[i,j]*N/(S[j]*S[i])+eps)
            M[i,j]=max(0,pmi)

            if verbose:
                cnt+=1
                if cnt%(total//100+1)==0:
                    print('%.1f%% done'%(100*cnt/total))
    
    return M

W=ppmi(C)
print(W)

输出:[[0. 1.8073549 0. 0. 0. 0. 0. ]
[5.807355 0. 4.807355 0. 4.807355 4.807355 0. ]
[0. 2.807355 0. 3.807355 0. 0. 0. ]
[0. 0. 3.807355 0. 3.807355 0. 0. ]
[0. 2.807355 0. 3.807355 0. 0. 0. ]
[0. 2.807355 0. 0. 0. 0. 4.807355 ]
[0. 0. 0. 0. 0. 2.807355 0. ]]
PPMI矩阵的问题:随着语料库词汇量的增加,各单词向量的维度也会随之增加,而且矩阵中大部分元素为0,非常稀疏。

2.4.2降维

降维指保留重要信息,降低向量维度。
降维可以将稀疏矩阵转化为密集矩阵。
本书举例奇异值分解法(SVD),其可将任意矩阵转化为3个矩阵的乘积。
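SVD把任意矩阵X分解为三个矩阵的乘积:

X = U·S·V^T

其中U和V是正交矩阵,S是对角矩阵,对角线上按降序排列着奇异值(可以看作对应轴的重要程度)。只取U的前几列,就得到了降维后的密集词向量。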

2.4.3基于SVD的降维

U,S,V=np.linalg.svd(W)
print(C[0]) #共现矩阵
print(W[0]) #PPMI矩阵
print(U[0]) #SVD矩阵
print(U[0,:2]) #降维为2维

输出:[0 1 0 0 0 0 0]
[0. 1.8073549 0. 0. 0. 0. 0. ]
[ 0.0000000e+00 1.7478172e-01 3.8205594e-02 -1.1102230e-16
-1.1102230e-16 -9.8386568e-01 -1.2717195e-16]
[0. 0.17478172]

import matplotlib.pyplot as plt
for word, word_id in word_to_id.items():
    plt.annotate(word, (U[word_id,0], U[word_id,1]))

plt.scatter(U[:,0],U[:,1],alpha=0.5)
plt.show()

2.4.4PTB数据集

from dataset import ptb  # ptb.py 来自原书《深度学习进阶:自然语言处理》源码中的 dataset 目录

corpus,word_to_id,id_to_word=ptb.load_data('train')
print('corpus size:', len(corpus))
print('corpus[:30]:', corpus[:30])
print()
print('id_to_word[0]:', id_to_word[0])
print('id_to_word[1]:', id_to_word[1])
print('id_to_word[2]:', id_to_word[2])
print()
print("word_to_id['car']:", word_to_id['car'])
print("word_to_id['happy']:", word_to_id['happy'])
print("word_to_id['lexus']:", word_to_id['lexus'])

输出:corpus size: 929589
corpus[:30]: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29]

id_to_word[0]: aer
id_to_word[1]: banknote
id_to_word[2]: berlitz

word_to_id['car']: 3856
word_to_id['happy']: 4428
word_to_id['lexus']: 7426

2.4.5基于PTB数据集的评价

window_size = 2
wordvec_size = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)
print('counting  co-occurrence ...')
C = create_co_matrix(corpus, vocab_size, window_size)
print('calculating PPMI ...')
W = ppmi(C, verbose=True)

print('calculating SVD ...')
try:
    # truncated SVD (fast!)
    from sklearn.utils.extmath import randomized_svd
    U, S, V = randomized_svd(W, n_components=wordvec_size, n_iter=5,
                             random_state=None)
except ImportError:
    # SVD (slow)
    U, S, V = np.linalg.svd(W)

word_vecs = U[:, :wordvec_size]

querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

输出(进度与查询结果较长,以下为节选):
counting co-occurrence …
calculating PPMI …
1.0% done
2.0% done
3.0% done
……(中间输出省略)

motor: 0.7355839014053345
squibb: 0.6746079921722412
transmission: 0.6652272343635559
eli: 0.6599057912826538

2.5小结

本章我们重点学习了单词向量表示的发展:从语料库的预处理到共现矩阵,再到表示单词间相关性的PPMI矩阵,最后通过SVD降维得到密集的词向量。学会处理词向量是这一部分代码的重点。

三 word2vec

3.1基于推理的方法和神经网络

上一章我们说到表示词向量流行的方法有两种:基于计数的方法和基于推理的方法,两者都是基于分布式假设,这一章就来介绍基于推理的方法。

3.1.1基于计数的方法的问题

消耗大量的计算资源和时间。
基于计数的方法一次性处理全部学习数据,而基于推理的方法使用部分学习数据逐步学习。

3.1.2基于推理的方法概要

分布式假设:某个单词的含义由它周围的单词(上下文)决定。基于推理的方法在该假设下,通过"由上下文预测单词"这一任务来学习单词的分布式表示。

3.1.3神经网络中单词的处理方法

将单词转换成one-hot向量(固定长度的向量)后,就可以输入神经网络的MatMul层(全连接层)进行处理。

import numpy as np

c=np.array([[1,0,0,0,0,0,0]]) #输入
W=np.random.randn(7,3) #权重
h=np.dot(c,W) #中间节点

print(h)

输出:[[-0.70770067 -0.12952097 0.2498067 ]]

class MatMul:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.x = None

    def forward(self, x):
        W, = self.params
        out = np.dot(x, W)
        self.x = x
        return out

    def backward(self, dout):
        W, = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        self.grads[0][...] = dW
        return dx

layer=MatMul(W)
h=layer.forward(c)
print(h)

输出:[[-0.70770067 -0.12952097 0.2498067 ]]

两种方式得到的结果一样,但为了后面模型的构建,采用第二种MatMul层的方法。

3.2简单的word2vec

以CBOW模型为例进行本章的学习。

3.2.1CBOW模型的推理

CBOW模型是根据上下文预测目标词的神经网络,处理流程是:输入层 →(编码)→ 中间层 →(解码)→ 输出层。

# 样本的上下文数据
c0 = np.array([[1, 0, 0, 0, 0, 0, 0]])
c1 = np.array([[0, 0, 1, 0, 0, 0, 0]])

# 权重初始化
W_in = np.random.randn(7, 3)
W_out = np.random.randn(3, 7)

# 生成层
in_layer0 = MatMul(W_in)
in_layer1 = MatMul(W_in)
out_layer = MatMul(W_out)

# 正向传播
h0 = in_layer0.forward(c0)
h1 = in_layer1.forward(c1)
h = 0.5 * (h0 + h1)
s = out_layer.forward(h)

print(s)

输出为:[[-1.65562246 1.56105405 -0.34916877 1.04647459 2.04680116 -1.805466
0.90635499]]

3.2.2CBOW模型的学习

CBOW模型是一个进行多类别分类的神经网络,学习时用Softmax函数把得分转化为概率,再用交叉熵误差作为损失函数。

3.2.3word2vec的权重和分布式表示

word2vec用输入层权重Win作为最终的单词分布式表示。

3.3学习数据的准备

3.3.1上下文和目标词

学习的目标是:当向神经网络输入上下文时,让目标词出现的概率尽可能高。为此需要先从语料库中生成上下文(contexts)和目标词(target)。

def create_contexts_target(corpus, window_size=1):

    target = corpus[window_size:-window_size]
    contexts = []

    for idx in range(window_size, len(corpus)-window_size):
        cs = []
        for t in range(-window_size, window_size + 1):
            if t == 0:
                continue
            cs.append(corpus[idx + t])
        contexts.append(cs)

    return np.array(contexts), np.array(target)

contexts,target=create_contexts_target(corpus, window_size=1)

print(contexts)
print(target)

输出为:[[0 2]
[1 3]
[2 4]
[3 1]
[4 5]
[1 6]]
[1 2 3 4 1 5]

3.3.2转化为one-hot表示

convert_one_hot()函数将上下文和目标词转化成了one-hot。

def convert_one_hot(corpus, vocab_size):

    N = corpus.shape[0]

    if corpus.ndim == 1:
        one_hot = np.zeros((N, vocab_size), dtype=np.int32)
        for idx, word_id in enumerate(corpus):
            one_hot[idx, word_id] = 1

    elif corpus.ndim == 2:
        C = corpus.shape[1]
        one_hot = np.zeros((N, C, vocab_size), dtype=np.int32)
        for idx_0, word_ids in enumerate(corpus):
            for idx_1, word_id in enumerate(word_ids):
                one_hot[idx_0, idx_1, word_id] = 1

    return one_hot

vocab_size=len(word_to_id)
contexts=convert_one_hot(contexts,vocab_size)
target=convert_one_hot(target,vocab_size)

3.4CBOW模型的实现

def softmax(x):
    if x.ndim == 2:
        x = x - x.max(axis=1, keepdims=True)
        x = np.exp(x)
        x /= x.sum(axis=1, keepdims=True)
    elif x.ndim == 1:
        x = x - np.max(x)
        x = np.exp(x) / np.sum(np.exp(x))

    return x


def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
        
    if t.size == y.size:
        t = t.argmax(axis=1)
             
    batch_size = y.shape[0]

    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

    
class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  
        self.t = None  

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)

        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)

        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1
        dx *= dout
        dx = dx / batch_size

        return dx
    
class SimpleCBOW:
    def __init__(self, vocab_size, hidden_size):
        V, H = vocab_size, hidden_size

        # 初始化权重
        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(H, V).astype('f')

        # 生成层
        self.in_layer0 = MatMul(W_in)
        self.in_layer1 = MatMul(W_in)
        self.out_layer = MatMul(W_out)
        self.loss_layer = SoftmaxWithLoss()

        # 将所有的权重和梯度整理到列表中
        layers = [self.in_layer0, self.in_layer1, self.out_layer]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        # 将单词的分布式表示设置为成员变量
        self.word_vecs = W_in

    def forward(self, contexts, target):
        h0 = self.in_layer0.forward(contexts[:, 0])
        h1 = self.in_layer1.forward(contexts[:, 1])
        h = (h0 + h1) * 0.5
        score = self.out_layer.forward(h)
        loss = self.loss_layer.forward(score, target)
        return loss

    def backward(self, dout=1):
        ds = self.loss_layer.backward(dout)
        da = self.out_layer.backward(ds)
        da *= 0.5
        self.in_layer1.backward(da)
        self.in_layer0.backward(da)
        return None

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
        
    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = [], []
            for param in params:
                self.m.append(np.zeros_like(param))
                self.v.append(np.zeros_like(param))
        
        self.iter += 1
        lr_t = self.lr * np.sqrt(1.0 - self.beta2**self.iter) / (1.0 - self.beta1**self.iter)

        for i in range(len(params)):
            self.m[i] += (1 - self.beta1) * (grads[i] - self.m[i])
            self.v[i] += (1 - self.beta2) * (grads[i]**2 - self.v[i])
            
            params[i] -= lr_t * self.m[i] / (np.sqrt(self.v[i]) + 1e-7)

import numpy
import time
import matplotlib.pyplot as plt
import numpy as np

def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate

def remove_duplicate(params, grads):

    params, grads = params[:], grads[:]  # copy list

    while True:
        find_flg = False
        L = len(params)

        for i in range(0, L - 1):
            for j in range(i + 1, L):

                if params[i] is params[j]:
                    grads[i] += grads[j]  
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)
                
                elif params[i].ndim == 2 and params[j].ndim == 2 and \
                     params[i].T.shape == params[j].shape and np.all(params[i].T == params[j]):
                    grads[i] += grads[j].T
                    find_flg = True
                    params.pop(j)
                    grads.pop(j)

                if find_flg: break
            if find_flg: break

        if not find_flg: break

    return params, grads

class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.loss_list = []
        self.eval_interval = None
        self.current_epoch = 0

    def fit(self, x, t, max_epoch=10, batch_size=32, max_grad=None, eval_interval=20):
        data_size = len(x)
        max_iters = data_size // batch_size
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            
            idx = numpy.random.permutation(numpy.arange(data_size))
            x = x[idx]
            t = t[idx]

            for iters in range(max_iters):
                batch_x = x[iters*batch_size:(iters+1)*batch_size]
                batch_t = t[iters*batch_size:(iters+1)*batch_size]

                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads) 
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    avg_loss = total_loss / loss_count
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | loss %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, avg_loss))
                    self.loss_list.append(float(avg_loss))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        x = numpy.arange(len(self.loss_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.loss_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('loss')
        plt.show()

window_size = 1
hidden_size = 5
batch_size = 3
max_epoch = 1000

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

vocab_size = len(word_to_id)
contexts, target = create_contexts_target(corpus, window_size)
target = convert_one_hot(target, vocab_size)
contexts = convert_one_hot(contexts, vocab_size)

model = SimpleCBOW(vocab_size, hidden_size)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

word_vecs = model.word_vecs
for word_id, word in id_to_word.items():
    print(word, word_vecs[word_id])

输出为(训练日志较长,以下为节选):
| epoch 1 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 2 | iter 1 / 2 | time 0[s] | loss 1.95
| epoch 3 | iter 1 / 2 | time 0[s] | loss 1.95
……(中间输出省略)

and [-0.95682913 1.5921552 -0.9855902 -1.0069962 -1.570275 ]
i [0.82757306 0.49116576 0.8163651 0.8544265 0.8013826 ]
hello [0.9561777 1.4145606 0.95824456 0.97314036 1.1887612 ]
. [-1.0302204 -1.3671738 -0.98632777 -1.0349017 1.4386774 ]

3.5word2vec的补充说明

3.5.1CBOW模型和概率

后验概率就是数学中所学的条件概率。
注意CBOW模型中交叉熵损失的公式推导,学习任务是使之尽可能小。
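以窗口大小为1(上下文为w_{t-1}和w_{t+1})为例,CBOW模型建模的概率及对应的交叉熵损失为:

P(w_t | w_{t-1}, w_{t+1})
L = -log P(w_t | w_{t-1}, w_{t+1})

对整个语料库取平均即为最终的损失函数,学习的目标就是让它尽可能小。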

3.5.2skip-gram模型

用目标词推上下文。
注意skip-gram模型中交叉熵损失的公式推导。
skip-gram的结果更好,CBOW更快。
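把skip-gram的损失也写成公式(窗口大小为1,并假定上下文单词之间条件独立):

P(w_{t-1}, w_{t+1} | w_t) = P(w_{t-1} | w_t) · P(w_{t+1} | w_t)
L = -( log P(w_{t-1} | w_t) + log P(w_{t+1} | w_t) )

可以看到一个目标词要分别预测多个上下文单词,损失是各个预测损失之和,这也是skip-gram计算量比CBOW大的原因。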

3.5.3基于计数与基于推理

基于计数,靠相似性;基于推理,靠相似性和模式;两者结果难分上下,互通。

3.6小结

这一章我们学习了基于推理的词向量表示方法——word2vec,并以根据上下文推目标词的CBOW模型为例进行了代码实现;对于由目标词推上下文的skip-gram模型,也做了原理上的了解。与基于计数的方法相比,基于推理的方法可以增量学习,并能捕捉更复杂的单词间模式,但目前在计算速度上还存在很大问题,下一章将针对这一点进行改进。

四 word2vec的高速化

上一章中CBOW模型的问题是:随着词汇量的增加,计算量也会急剧增加,时间成本很高。本章从两个方面介绍加速的方法:1、引入Embedding层;2、引入负采样(Negative Sampling)这一新的损失函数。

4.1word2vec的改进1

word2vec中计算量大的地方有两处:输入层的one-hot表示与权重矩阵Win的乘积 -> 用Embedding层解决;中间层与权重矩阵Wout的乘积以及Softmax层的计算 -> 用Negative Sampling解决。
本节先介绍Embedding层的引入。

4.1.1Embedding层

由于输入是one-hot向量,它与Win的乘积实际上只是取出Win中对应的行,因此可以把MatMul层换成只做"取行"操作的Embedding层。单词的密集向量表示称为词嵌入(Embedding)。

4.1.2Embedding层的实现

import numpy as np
W=np.arange(21).reshape(7,3)
print(W)

输出为:[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[15 16 17]
[18 19 20]]

print(W[2])
# 输出为:[6 7 8]
print(W[5])
# 输出为:[15 16 17]
idx=np.array([1,0,3,0])
print(W[idx])
"""
输出为:[[ 3 4 5]
[ 0 1 2]
[ 9 10 11]
[ 0 1 2]]
"""
class Embedding:
    def __init__(self,W):
        self.params=[W]
        self.grads=[np.zeros_like(W)]
        self.idx=None

    def forward(self,idx):
        W,=self.params
        self.idx=idx
        out=W[idx]
        return out
    
    def backward(self,dout):
        dW,=self.grads       # 取出梯度数组(注意逗号,解包出ndarray)
        dW[...]=0

        for i,word_id in enumerate(self.idx):
            dW[word_id]+=dout[i]   # 同一个单词ID出现多次时,对应行的梯度要累加
        return None

4.2word2vec的改进2

用负采样(Negative Sampling)替代Softmax,可以使得无论词汇量多大,计算量保持较低或恒定。

4.2.1中间层之后的计算问题

两个问题:
1、中间层的神经元和权重矩阵(Wout)的乘积
2、Softmax层的计算

4.2.2从多分类到二分类

负采样的中心思想是用二分类拟合多分类。
举例:让神经网络回答当上下文是you和goodbye时,目标词是say吗?
回答只需要提取say对应的列(单词向量),并用它与中间层的神经元计算内积即可。

4.2.3sigmoid函数和交叉熵误差

多分类:Softmax函数+交叉熵误差
二分类:Sigmoid函数+交叉熵误差
二分类和多分类交叉熵损失本质一样。
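二分类使用的sigmoid函数和交叉熵误差为(t为正确标签,取1或0):

y = 1 / (1 + exp(-x))
L = -( t·log y + (1 - t)·log(1 - y) )

这一组合在反向传播时的梯度恰好是 y - t,与"Softmax+交叉熵"的形式一致,后面SigmoidWithLoss层的backward利用的正是这一点。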

4.2.4从多分类到二分类的实现

1、输入层:用Embedding层替换MatMul层;
2、输出层:用Embedding层(EmbeddingDot)替换MatMul层,用Sigmoid替换Softmax。

class EmbeddingDot:
    def __init__(self, W):
        self.embed = Embedding(W)
        self.params = self.embed.params
        self.grads = self.embed.grads
        self.cache = None

    def forward(self, h, idx):
        target_W = self.embed.forward(idx)
        out = np.sum(target_W * h, axis=1)

        self.cache = (h, target_W)
        return out

    def backward(self, dout):
        h, target_W = self.cache
        dout = dout.reshape(dout.shape[0], 1)

        dtarget_W = dout * h
        self.embed.backward(dtarget_W)
        dh = dout * target_W
        return dh

4.2.5负采样

到目前为止,网络只学习了正例(say),还需要让它学习负例。
是否所有负例都要学习?答案是否定的:只随机采样少数负例参与学习,这正是"负采样"的含义。
最终的损失是正例的损失加上被采样的若干负例的损失。

4.2.6负采样的采样方法

考虑到计算复杂度,有必要将负例限定在较小范围内(5个或者10个)。

import numpy as np
np.random.choice(10)
# 输出为:3(随机抽样,每次运行结果不同,下同)
words=['you','say','goodbye','I','hellow','.']
np.random.choice(words)
# 输出为:'say'

np.random.choice(words,size=5)
# 输出为:array(['hellow', '.', 'hellow', 'I', 'I'], dtype='<U7')

np.random.choice(words,size=5,replace=False)
# 输出为:array(['.', 'say', 'goodbye', 'hellow', 'you'], dtype='<U7')

p=[0.5,0.1,0.05,0.2,0.05,0.1]
np.random.choice(words,p=p)
# 输出为:'you'

p=[0.7,0.29,0.01]
new_p=np.power(p,0.75)   # 取0.75次方,稍微抬高低频词被采样到的概率
new_p/=np.sum(new_p)
print(new_p)
# 输出为:[0.64196878 0.33150408 0.02652714]

import collections
class UnigramSampler:
    def __init__(self, corpus, power, sample_size):
        self.sample_size = sample_size
        self.vocab_size = None
        self.word_p = None

        counts = collections.Counter()
        for word_id in corpus:
            counts[word_id] += 1

        vocab_size = len(counts)
        self.vocab_size = vocab_size

        self.word_p = np.zeros(vocab_size)
        for i in range(vocab_size):
            self.word_p[i] = counts[i]

        self.word_p = np.power(self.word_p, power)
        self.word_p /= np.sum(self.word_p)

    def get_negative_sample(self, target):
        batch_size = target.shape[0]


        negative_sample = np.zeros((batch_size, self.sample_size), dtype=np.int32)

        for i in range(batch_size):
            p = self.word_p.copy()
            target_idx = target[i]
            p[target_idx] = 0
            p /= p.sum()
            negative_sample[i, :] = np.random.choice(self.vocab_size, size=self.sample_size, replace=False, p=p)

        return negative_sample

corpus=np.array([0,1,2,3,4,1,2,3])
power=0.75
sample_size=2

sampler=UnigramSampler(corpus,power,sample_size)
target=np.array([1,3,0])
negative_sample=sampler.get_negative_sample(target)
print(negative_sample)

输出为:[[4 3]
[2 1]
[4 3]]

4.2.7负采样的实现

class SigmoidWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.loss = None
        self.y = None  
        self.t = None  

    def forward(self, x, t):
        self.t = t
        self.y = 1 / (1 + np.exp(-x))

        self.loss = cross_entropy_error(np.c_[1 - self.y, self.y], self.t)

        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        dx = (self.y - self.t) * dout / batch_size
        return dx

class NegativeSamplingLoss:
    def __init__(self, W, corpus, power=0.75, sample_size=5):
        self.sample_size = sample_size
        self.sampler = UnigramSampler(corpus, power, sample_size)
        self.loss_layers = [SigmoidWithLoss() for _ in range(sample_size + 1)]
        self.embed_dot_layers = [EmbeddingDot(W) for _ in range(sample_size + 1)]

        self.params, self.grads = [], []
        for layer in self.embed_dot_layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, h, target):
        batch_size = target.shape[0]
        negative_sample = self.sampler.get_negative_sample(target)

        score = self.embed_dot_layers[0].forward(h, target)
        correct_label = np.ones(batch_size, dtype=np.int32)
        loss = self.loss_layers[0].forward(score, correct_label)

        negative_label = np.zeros(batch_size, dtype=np.int32)
        for i in range(self.sample_size):
            negative_target = negative_sample[:, i]
            score = self.embed_dot_layers[1 + i].forward(h, negative_target)
            loss += self.loss_layers[1 + i].forward(score, negative_label)

        return loss

    def backward(self, dout=1):
        dh = 0
        for l0, l1 in zip(self.loss_layers, self.embed_dot_layers):
            dscore = l0.backward(dout)
            dh += l1.backward(dscore)

        return dh

4.3改进版word2vec的学习

4.3.1CBOW模型的实现

class CBOW:
    def __init__(self, vocab_size, hidden_size, window_size, corpus):
        V, H = vocab_size, hidden_size

        W_in = 0.01 * np.random.randn(V, H).astype('f')
        W_out = 0.01 * np.random.randn(V, H).astype('f')

        self.in_layers = []
        for i in range(2 * window_size):
            layer = Embedding(W_in)  
            self.in_layers.append(layer)
        self.ns_loss = NegativeSamplingLoss(W_out, corpus, power=0.75, sample_size=5)

        layers = self.in_layers + [self.ns_loss]
        self.params, self.grads = [], []
        for layer in layers:
            self.params += layer.params
            self.grads += layer.grads

        self.word_vecs = W_in

    def forward(self, contexts, target):
        h = 0
        for i, layer in enumerate(self.in_layers):
            h += layer.forward(contexts[:, i])
        h *= 1 / len(self.in_layers)
        loss = self.ns_loss.forward(h, target)
        return loss

    def backward(self, dout=1):
        dout = self.ns_loss.backward(dout)
        dout *= 1 / len(self.in_layers)
        for layer in self.in_layers:
            layer.backward(dout)
        return None

4.3.2CBOW模型的学习代码

import pickle
window_size = 3
hidden_size = 100
batch_size = 100
max_epoch = 10

corpus, word_to_id, id_to_word = ptb.load_data('train')
vocab_size = len(word_to_id)

contexts, target = create_contexts_target(corpus, window_size)

model = CBOW(vocab_size, hidden_size, window_size, corpus)
# model = SkipGram(vocab_size, hidden_size, window_size, corpus)
optimizer = Adam()
trainer = Trainer(model, optimizer)

trainer.fit(contexts, target, max_epoch, batch_size)
trainer.plot()

word_vecs = model.word_vecs

params = {}
params['word_vecs'] = word_vecs.astype(np.float16)
params['word_to_id'] = word_to_id
params['id_to_word'] = id_to_word
pkl_file = 'cbow_params.pkl'  # or 'skipgram_params.pkl'
with open(pkl_file, 'wb') as f:
    pickle.dump(params, f, -1)

4.3.3CBOW模型的评价

def normalize(x):
    if x.ndim == 2:
        s = np.sqrt((x * x).sum(1))
        x /= s.reshape((s.shape[0], 1))
    elif x.ndim == 1:
        s = np.sqrt((x * x).sum())
        x /= s
    return x

def analogy(a, b, c, word_to_id, id_to_word, word_matrix, top=5, answer=None):
    for word in (a, b, c):
        if word not in word_to_id:
            print('%s is not found' % word)
            return

    print('\n[analogy] ' + a + ':' + b + ' = ' + c + ':?')
    a_vec, b_vec, c_vec = word_matrix[word_to_id[a]], word_matrix[word_to_id[b]], word_matrix[word_to_id[c]]
    query_vec = b_vec - a_vec + c_vec
    query_vec = normalize(query_vec)

    similarity = np.dot(word_matrix, query_vec)

    if answer is not None:
        print("==>" + answer + ":" + str(np.dot(word_matrix[word_to_id[answer]], query_vec)))

    count = 0
    for i in (-1 * similarity).argsort():
        if np.isnan(similarity[i]):
            continue
        if id_to_word[i] in (a, b, c):
            continue
        print(' {0}: {1}'.format(id_to_word[i], similarity[i]))

        count += 1
        if count >= top:
            return

pkl_file = 'cbow_params.pkl'
# pkl_file = 'skipgram_params.pkl'

with open(pkl_file, 'rb') as f:
    params = pickle.load(f)
    word_vecs = params['word_vecs']
    word_to_id = params['word_to_id']
    id_to_word = params['id_to_word']

# most similar task
querys = ['you', 'year', 'car', 'toyota']
for query in querys:
    most_similar(query, word_to_id, id_to_word, word_vecs, top=5)

# analogy task
print('-'*50)
analogy('king', 'man', 'queen',  word_to_id, id_to_word, word_vecs)
analogy('take', 'took', 'go',  word_to_id, id_to_word, word_vecs)
analogy('car', 'cars', 'child',  word_to_id, id_to_word, word_vecs)
analogy('good', 'better', 'bad',  word_to_id, id_to_word, word_vecs)

4.4word2vec相关的其他话题

4.4.1word2vec的应用例

单词的分布式表示可以用于迁移学习:先在大规模语料库上学习词向量,再把它用于文本分类、文本聚类、词性标注和情感分析等自然语言处理任务。
单词的分布式表示的另一个优点是可以把单词转化为固定长度的向量,便于后续的机器学习处理。

4.4.2单词向量的评价方法

评价方法主要有两种:单词相似度评价和类推(analogy)问题评价。
相关结论:
1、模型不同,精度不同(根据语料库选择最佳的模型)。
2、语料库越大,结果越好(始终需要大数据)。
3、单词向量的维度必须适中。

4.5小结

本章主要针对CBOW的输入输出的高速化进行了优化。
针对输入部分,采用了embedding层。
针对输出部分,采用了负采样损失。
总体的思想是抽取,以实现高速化。

五 RNN

前馈网络无法很好地处理时序数据,RNN(循环神经网络)正是为弥补这一不足而生的。

5.1概率和语言模型

5.1.1概率视角下的word2vec

这个部分复习了CBOW模型(由上下文预测中间词),引出了仅有左侧窗口上下文的概率和损失函数,进而引出语言模型。

5.1.2语言模型

语言模型给出了单词序列发生的概率,例子包括机器翻译、语音识别和生成新的句子。
多个事件一起发生的概率称为联合概率,语言模型把句子的联合概率分解为多个条件概率的乘积。
这里的后验概率是以目标词左侧的全部单词为上下文(条件)时,该单词出现的概率。
图5-3表明了语言模型中的后验概率:若以第t个单词为目标词,则第t个单词左侧的全部单词构成上下文(条件),因此这类模型也被称作条件语言模型。
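用公式表示,语言模型给出的联合概率可以分解为后验概率(条件概率)的连乘:

P(w_1, …, w_m) = Π_{t=1..m} P(w_t | w_1, …, w_{t-1})

即每个单词的出现概率都以它左侧的全部单词为条件,再把这些条件概率相乘。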

5.1.3将CBOW模型用作语言模型?

马尔可夫性指未来的状态仅依赖于当前的状态。
若用CBOW模型当语言模型,会带来两个问题:1、上下文(窗口)大小固定,且忽略了上下文中单词的顺序;2、若为保留顺序而采用拼接中间层的方式,参数量会随上下文长度成比例增加。
而RNN能很好地解决这两个问题。它可以处理任意长度的时序数据。

5.2RNN

5.2.1循环的神经网络

循环是“反复并持续”的意思。
RNN一边记住过去的数据,一边更新到最新的数据。
RNN层拥有环路,输出中的一个分叉将成为其自身的输入。

5.2.2展开循环

多个RNN层都是同一层,各个时刻的RNN层接受传给该层的输入和前一个RNN的输出。
RNN是不断地进行输出和输入的(更新)。
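用公式表示,时刻t的RNN层根据当前输入x_t和上一时刻的输出h_{t-1}计算当前输出:

h_t = tanh( h_{t-1}·W_h + x_t·W_x + b )

其中W_x是输入到隐藏状态的权重,W_h是上一个隐藏状态到当前隐藏状态的权重,b是偏置。5.3.1节的RNN层代码实现的就是这个式子。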

5.2.3Backpropagation Through Time

BPTT是按时间顺序展开的神经网络的误差反向传播法。

5.2.4Truncated BPTT

Truncated BPTT指按适当长度截断反向传播的网络连接,以截出的小段网络为单位进行学习。
注意只截断反向传播:正向传播的连接(隐藏状态)仍然保持,不被截断。

5.3RNN的实现

单步处理的层称为RNN层,一次处理T步的层称为Time RNN层。

5.3.1RNN层的实现

import numpy as np

class RNN:
    def __init__(self, Wx, Wh, b):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev):
        Wx, Wh, b = self.params
        t = np.dot(h_prev, Wh) + np.dot(x, Wx) + b
        h_next = np.tanh(t)

        self.cache = (x, h_prev, h_next)
        return h_next

    def backward(self, dh_next):
        Wx, Wh, b = self.params
        x, h_prev, h_next = self.cache

        dt = dh_next * (1 - h_next ** 2)
        db = np.sum(dt, axis=0)
        dWh = np.dot(h_prev.T, dt)
        dh_prev = np.dot(dt, Wh.T)
        dWx = np.dot(x.T, dt)
        dx = np.dot(dt, Wx.T)

        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db

        return dx, dh_prev

5.3.2Time RNN层的实现

class TimeRNN:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None

        self.h, self.dh = None, None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        D, H = Wx.shape

        self.layers = []
        hs = np.empty((N, T, H), dtype='f')

        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')

        for t in range(T):
            layer = RNN(*self.params)
            self.h = layer.forward(xs[:, t, :], self.h)
            hs[:, t, :] = self.h
            self.layers.append(layer)

        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D, H = Wx.shape

        dxs = np.empty((N, T, D), dtype='f')
        dh = 0
        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh = layer.backward(dhs[:, t, :] + dh)
            dxs[:, t, :] = dx

            for i, grad in enumerate(layer.grads):
                grads[i] += grad

        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh

        return dxs

    def set_state(self, h):
        self.h = h

    def reset_state(self):
        self.h = None

5.4处理时序数据的层的实现

基于RNN的语言模型称为RNNLM。

5.4.1RNNLM的全貌图

RNNLM可以“记忆”目前为止输入的单词,并以此为基础预测接下来会出现的单词。

5.4.2Time层的实现

整体处理含有T个时序数据的层称为"Time XX"层(如Time Embedding、Time Affine、Time Softmax with Loss)。
Time Softmax with Loss层的损失是T个时刻损失的平均值。
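用公式表示,设各时刻的损失为L_0, L_1, …, L_{T-1},则Time Softmax with Loss层输出的损失为:

L = (1/T)·( L_0 + L_1 + … + L_{T-1} )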

5.5RNNLM的学习和评价

5.5.1RNNLM的实现

class TimeEmbedding:
    def __init__(self, W):
        self.params = [W]
        self.grads = [np.zeros_like(W)]
        self.layers = None
        self.W = W

    def forward(self, xs):
        N, T = xs.shape
        V, D = self.W.shape

        out = np.empty((N, T, D), dtype='f')
        self.layers = []

        for t in range(T):
            layer = Embedding(self.W)
            out[:, t, :] = layer.forward(xs[:, t])
            self.layers.append(layer)

        return out

    def backward(self, dout):
        N, T, D = dout.shape

        grad = 0
        for t in range(T):
            layer = self.layers[t]
            layer.backward(dout[:, t, :])
            grad += layer.grads[0]

        self.grads[0][...] = grad
        return None


class TimeAffine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        N, T, D = x.shape
        W, b = self.params

        rx = x.reshape(N*T, -1)
        out = np.dot(rx, W) + b
        self.x = x
        return out.reshape(N, T, -1)

    def backward(self, dout):
        x = self.x
        N, T, D = x.shape
        W, b = self.params

        dout = dout.reshape(N*T, -1)
        rx = x.reshape(N*T, -1)

        db = np.sum(dout, axis=0)
        dW = np.dot(rx.T, dout)
        dx = np.dot(dout, W.T)
        dx = dx.reshape(*x.shape)

        self.grads[0][...] = dW
        self.grads[1][...] = db

        return dx


class TimeSoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None
        self.ignore_label = -1

    def forward(self, xs, ts):
        N, T, V = xs.shape

        if ts.ndim == 3:  
            ts = ts.argmax(axis=2)

        mask = (ts != self.ignore_label)

        xs = xs.reshape(N * T, V)
        ts = ts.reshape(N * T)
        mask = mask.reshape(N * T)

        ys = softmax(xs)
        ls = np.log(ys[np.arange(N * T), ts])
        ls *= mask
        loss = -np.sum(ls)
        loss /= mask.sum()

        self.cache = (ts, ys, mask, (N, T, V))
        return loss

    def backward(self, dout=1):
        ts, ys, mask, (N, T, V) = self.cache

        dx = ys
        dx[np.arange(N * T), ts] -= 1
        dx *= dout
        dx /= mask.sum()
        dx *= mask[:, np.newaxis]

        dx = dx.reshape((N, T, V))

        return dx


class SimpleRnnlm:
    def __init__(self, vocab_size, wordvec_size, hidden_size):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        rnn_Wx = (rn(D, H) / np.sqrt(D)).astype('f')
        rnn_Wh = (rn(H, H) / np.sqrt(H)).astype('f')
        rnn_b = np.zeros(H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeRNN(rnn_Wx, rnn_Wh, rnn_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.rnn_layer = self.layers[1]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def forward(self, xs, ts):
        for layer in self.layers:
            xs = layer.forward(xs)
        loss = self.loss_layer.forward(xs, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.rnn_layer.reset_state()

5.5.2语言模型的评价

困惑度(perplexity)可以理解为概率的倒数(数据量为1时),困惑度越小越好。
它可以直观地解释为"分叉度",即下一个可以选择的单词候选的数量。
评价语言模型时,困惑度越小、分叉度越小,说明模型对下一个单词的预测越确定,模型越好。
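用公式表示,设L为模型在数据上的平均交叉熵损失,则困惑度为 e^L;数据量为1时它就等于正确单词概率的倒数1/P(w)。后面学习代码中的np.exp(total_loss / loss_count)计算的正是这个值。下面是一个用假设数值演示的小例子:

import numpy as np
probs = np.array([0.8, 0.2])   # 各时刻正确单词的预测概率(假设值)
L = -np.mean(np.log(probs))    # 平均交叉熵损失
print(np.exp(L))               # 困惑度 = e^L ≈ 2.5,相当于候选约剩2.5个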

5.5.3RNNLM的学习代码

class SGD:

    def __init__(self, lr=0.01):
        self.lr = lr
        
    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]

batch_size = 10
wordvec_size = 100
hidden_size = 100
time_size = 5 
lr = 0.1
max_epoch = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)

xs = corpus[:-1]  
ts = corpus[1:]  
data_size = len(xs)
print('corpus size: %d, vocabulary size: %d' % (corpus_size, vocab_size))

max_iters = data_size // (batch_size * time_size)
time_idx = 0
total_loss = 0
loss_count = 0
ppl_list = []

model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)

jump = (corpus_size - 1) // batch_size
offsets = [i * jump for i in range(batch_size)]

for epoch in range(max_epoch):
    for iter in range(max_iters):

        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, t] = xs[(offset + time_idx) % data_size]
                batch_t[i, t] = ts[(offset + time_idx) % data_size]
            time_idx += 1

        loss = model.forward(batch_x, batch_t)
        model.backward()
        optimizer.update(model.params, model.grads)
        total_loss += loss
        loss_count += 1

    ppl = np.exp(total_loss / loss_count)
    print('| epoch %d | perplexity %.2f'
          % (epoch+1, ppl))
    ppl_list.append(float(ppl))
    total_loss, loss_count = 0, 0

5.5.4RNNLM的Trainer类

class RnnlmTrainer:
    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer
        self.time_idx = None
        self.ppl_list = None
        self.eval_interval = None
        self.current_epoch = 0

    def get_batch(self, x, t, batch_size, time_size):
        batch_x = np.empty((batch_size, time_size), dtype='i')
        batch_t = np.empty((batch_size, time_size), dtype='i')

        data_size = len(x)
        jump = data_size // batch_size
        offsets = [i * jump for i in range(batch_size)] 

        for time in range(time_size):
            for i, offset in enumerate(offsets):
                batch_x[i, time] = x[(offset + self.time_idx) % data_size]
                batch_t[i, time] = t[(offset + self.time_idx) % data_size]
            self.time_idx += 1
        return batch_x, batch_t

    def fit(self, xs, ts, max_epoch=10, batch_size=20, time_size=35,
            max_grad=None, eval_interval=20):
        data_size = len(xs)
        max_iters = data_size // (batch_size * time_size)
        self.time_idx = 0
        self.ppl_list = []
        self.eval_interval = eval_interval
        model, optimizer = self.model, self.optimizer
        total_loss = 0
        loss_count = 0

        start_time = time.time()
        for epoch in range(max_epoch):
            for iters in range(max_iters):
                batch_x, batch_t = self.get_batch(xs, ts, batch_size, time_size)

                loss = model.forward(batch_x, batch_t)
                model.backward()
                params, grads = remove_duplicate(model.params, model.grads)
                if max_grad is not None:
                    clip_grads(grads, max_grad)
                optimizer.update(params, grads)
                total_loss += loss
                loss_count += 1

                if (eval_interval is not None) and (iters % eval_interval) == 0:
                    ppl = np.exp(total_loss / loss_count)
                    elapsed_time = time.time() - start_time
                    print('| epoch %d |  iter %d / %d | time %d[s] | perplexity %.2f'
                          % (self.current_epoch + 1, iters + 1, max_iters, elapsed_time, ppl))
                    self.ppl_list.append(float(ppl))
                    total_loss, loss_count = 0, 0

            self.current_epoch += 1

    def plot(self, ylim=None):
        # 绘制困惑度随学习推进的变化曲线(后文会调用 trainer.plot(),写法与前面的 Trainer.plot 相同)
        x = numpy.arange(len(self.ppl_list))
        if ylim is not None:
            plt.ylim(*ylim)
        plt.plot(x, self.ppl_list, label='train')
        plt.xlabel('iterations (x' + str(self.eval_interval) + ')')
        plt.ylabel('perplexity')
        plt.show()

batch_size = 10
wordvec_size = 100
hidden_size = 100 
time_size = 5 
lr = 0.1
max_epoch = 100

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_size = 1000 
corpus = corpus[:corpus_size]
vocab_size = int(max(corpus) + 1)
xs = corpus[:-1]
ts = corpus[1:] 

model = SimpleRnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

trainer.fit(xs, ts, max_epoch, batch_size, time_size)

5.6 小结

本章介绍了RNN、Time RNN和RNNLM。学习重点是RNN的原理及对循环结构的理解。

六 Gated RNN

简单RNN无法很好地学习到时序数据的长期依赖关系。
LSTM和GRU的门结构可以学习到时序数据的长期依赖关系。

6.1 RNN的问题

BPTT会发生梯度消失和梯度爆炸的问题。

6.1.1 RNN的复习

复习了RNN的结构图和计算图。

6.1.2 梯度消失和梯度爆炸

随着时间步的回溯,简单RNN的梯度会不断变小(梯度消失)或不断变大(梯度爆炸),无法避免。

6.1.3 梯度消失和梯度爆炸的原因

产生梯度消失的一个原因是激活函数tanh:其导数为 1 − tanh²(x),取值不超过1,且随着x远离0而趋近0,反向传播每经过一次tanh,梯度就会被缩小;把tanh换成ReLU可以抑制这种梯度消失。
下面的实验用梯度的L2范数(各元素平方和的平方根,再除以mini-batch大小N取平均)来观察梯度大小随时间步回溯的变化。

N = 2  
H = 3  
T = 20  

dh = np.ones((N, H))

np.random.seed(3)

Wh = np.random.randn(H, H)
#Wh = np.random.randn(H, H) * 0.5

norm_list = []
for t in range(T):
    dh = np.dot(dh, Wh.T)
    norm = np.sqrt(np.sum(dh**2)) / N
    norm_list.append(norm)

print(norm_list)

输出为:[2.4684068094579303, 3.3357049741610365, 4.783279375373182, 6.279587332087612, 8.080776465019053, 10.251163032292936, 12.936063506609896, 16.276861327786712, 20.45482961834598, 25.688972842084684, 32.25315718048336, 40.48895641683869, 50.8244073070191, 63.79612654485427, 80.07737014308985, 100.5129892205125, 126.16331847536823, 158.35920648258823, 198.7710796761195, 249.495615421267]
这里输出的梯度随时间呈指数级增加,因此产生梯度爆炸。

N = 2  
H = 3  
T = 20  

dh = np.ones((N, H))

np.random.seed(3)

#Wh = np.random.randn(H, H)
Wh = np.random.randn(H, H) * 0.5

norm_list = []
for t in range(T):
    dh = np.dot(dh, Wh.T)
    norm = np.sqrt(np.sum(dh**2)) / N
    norm_list.append(norm)

print(norm_list)

输出为:[1.2342034047289652, 0.8339262435402591, 0.5979099219216477, 0.39247420825547574, 0.2525242645318454, 0.16017442237957713, 0.10106299614538981, 0.06358148956166684, 0.03995083909833199, 0.025086887541098325, 0.015748611904532892, 0.009884999125204758, 0.006204151282595105, 0.003893806551809953, 0.002443767399386287, 0.0015337065005571365, 0.0009625497320203265, 0.0006040924319556741, 0.00037912574706291106, 0.00023793756048323344]
这里输出的梯度随时间呈指数级减小,因此产生梯度消失。
矩阵Wh在反向传播中被反复乘了T次。奇异值表示数据的离散程度:当Wh的最大奇异值大于1时,梯度很可能呈指数级增大(爆炸);小于1时,梯度会呈指数级减小(消失)。

6.1.4 梯度爆炸的对策

常用的对策是设定一个阈值(threshold):当所有梯度的L2范数之和超过阈值时,按比例缩小梯度,这称为梯度裁剪(gradients clipping)。

import numpy as np


dW1 = np.random.rand(3, 3) * 10
dW2 = np.random.rand(3, 3) * 10
grads = [dW1, dW2]
max_norm = 5.0


def clip_grads(grads, max_norm):
    total_norm = 0
    for grad in grads:
        total_norm += np.sum(grad ** 2)
    total_norm = np.sqrt(total_norm)

    rate = max_norm / (total_norm + 1e-6)
    if rate < 1:
        for grad in grads:
            grad *= rate

print('before:', dW1.flatten())
clip_grads(grads, max_norm)
print('after:', dW1.flatten())

输出为:before: [6.49144048 2.78487283 6.76254902 5.90862817 0.23981882 5.58854088
2.59252447 4.15101197 2.83525082]
after: [1.49503731 0.64138134 1.55747605 1.36081038 0.05523244 1.28709139
0.59708178 0.95601551 0.65298384]
输入梯度和阈值后,输出为裁剪了的梯度。
梯度裁剪能解决梯度爆炸的问题。

6.2 梯度消失和LSTM

6.2.1 LSTM的接口

LSTM与RNN在接口上的不同在于LSTM多了记忆单元c:记忆单元只在LSTM层内部(沿时间方向)接收和传递数据,不向其他层输出。

6.2.2 LSTM层的结构

LSTM仍在每个时刻输出隐藏状态ht;记忆单元ct只在时间方向上的LSTM层之间传递,不向上层输出。
记忆单元ct和隐藏状态ht的元素个数相同。
各个门的开合程度(0.0~1.0之间的实数)也是从数据中学习得到的,每个门都有自己专门的权重。

6.2.3 输出门

这里考虑对tanh(ct)施加门。针对tanh(ct)的各个元素,调整它们作为下一时刻的隐藏状态的重要程度。这个门称为输出门。输出门的开合程度根据输入xt和上一个状态ht-1求出。
tanh的值域在-1到1,表示某种被编码信息的强弱。sigmoid的值域在0到1,表示数据流出的比例。

6.2.4 遗忘门

在记忆单元ct-1上添加一个忘记不必要记忆的门,就是遗忘门。

6.2.5 新的记忆单元

向记忆单元添加一些应当记住的新信息,添加新的tanh节点。

6.2.6 输入门

给g添加的门称为输入门,它判断新增信息g的各个元素的价值有多大。
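把6.2.3~6.2.6节的内容汇总成公式(σ为sigmoid函数,⊙表示对应元素的乘积):

f = σ( x_t·W_x(f) + h_{t-1}·W_h(f) + b(f) )      # 遗忘门
g = tanh( x_t·W_x(g) + h_{t-1}·W_h(g) + b(g) )   # 新的记忆
i = σ( x_t·W_x(i) + h_{t-1}·W_h(i) + b(i) )      # 输入门
o = σ( x_t·W_x(o) + h_{t-1}·W_h(o) + b(o) )      # 输出门
c_t = f ⊙ c_{t-1} + g ⊙ i
h_t = o ⊙ tanh(c_t)

6.3节的LSTM层代码把四组权重横向拼接成Wx、Wh、b,用一次仿射变换同时算出f、g、i、o。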

6.2.7 LSTM的梯度的流动

LSTM不容易引起梯度消失的原因在于记忆单元c的反向传播路径:它只经过加法节点和遗忘门的对应元素乘积(阿达玛积),不进行矩阵乘积运算,而且每个时刻乘的遗忘门值都不同,不会反复乘以同一个系数。
LSTM(Long Short-Term Memory)的意思是可以长时间维持的短期记忆。

6.3 LSTM的实现

LSTM和RNN实现的不同是多了记忆单元:

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class LSTM:
    def __init__(self, Wx, Wh, b):

        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.cache = None

    def forward(self, x, h_prev, c_prev):
        Wx, Wh, b = self.params
        N, H = h_prev.shape

        A = np.dot(x, Wx) + np.dot(h_prev, Wh) + b

        f = A[:, :H]
        g = A[:, H:2*H]
        i = A[:, 2*H:3*H]
        o = A[:, 3*H:]

        f = sigmoid(f)
        g = np.tanh(g)
        i = sigmoid(i)
        o = sigmoid(o)

        c_next = f * c_prev + g * i
        h_next = o * np.tanh(c_next)

        self.cache = (x, h_prev, c_prev, i, f, g, o, c_next)
        return h_next, c_next

    def backward(self, dh_next, dc_next):
        Wx, Wh, b = self.params
        x, h_prev, c_prev, i, f, g, o, c_next = self.cache

        tanh_c_next = np.tanh(c_next)

        ds = dc_next + (dh_next * o) * (1 - tanh_c_next ** 2)

        dc_prev = ds * f

        di = ds * g
        df = ds * c_prev
        do = dh_next * tanh_c_next
        dg = ds * i

        di *= i * (1 - i)
        df *= f * (1 - f)
        do *= o * (1 - o)
        dg *= (1 - g ** 2)

        dA = np.hstack((df, dg, di, do))

        dWh = np.dot(h_prev.T, dA)
        dWx = np.dot(x.T, dA)
        db = dA.sum(axis=0)

        self.grads[0][...] = dWx
        self.grads[1][...] = dWh
        self.grads[2][...] = db

        dx = np.dot(dA, Wx.T)
        dh_prev = np.dot(dA, Wh.T)

        return dx, dh_prev, dc_prev

Time LSTM层由T个LSTM层组成:

class TimeLSTM:
    def __init__(self, Wx, Wh, b, stateful=False):
        self.params = [Wx, Wh, b]
        self.grads = [np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)]
        self.layers = None

        self.h, self.c = None, None
        self.dh = None
        self.stateful = stateful

    def forward(self, xs):
        Wx, Wh, b = self.params
        N, T, D = xs.shape
        H = Wh.shape[0]

        self.layers = []
        hs = np.empty((N, T, H), dtype='f')

        if not self.stateful or self.h is None:
            self.h = np.zeros((N, H), dtype='f')
        if not self.stateful or self.c is None:
            self.c = np.zeros((N, H), dtype='f')

        for t in range(T):
            layer = LSTM(*self.params)
            self.h, self.c = layer.forward(xs[:, t, :], self.h, self.c)
            hs[:, t, :] = self.h

            self.layers.append(layer)

        return hs

    def backward(self, dhs):
        Wx, Wh, b = self.params
        N, T, H = dhs.shape
        D = Wx.shape[0]

        dxs = np.empty((N, T, D), dtype='f')
        dh, dc = 0, 0

        grads = [0, 0, 0]
        for t in reversed(range(T)):
            layer = self.layers[t]
            dx, dh, dc = layer.backward(dhs[:, t, :] + dh, dc)
            dxs[:, t, :] = dx
            for i, grad in enumerate(layer.grads):
                grads[i] += grad

        for i, grad in enumerate(grads):
            self.grads[i][...] = grad
        self.dh = dh
        return dxs

    def set_state(self, h, c=None):
        self.h, self.c = h, c

    def reset_state(self):
        self.h, self.c = None, None

6.4 使用LSTM的语言模型

import os

GPU = False  # 是否使用GPU的开关(原书中由 common.config 提供),这里直接置为 False
def to_cpu(x):
    import numpy
    if type(x) == numpy.ndarray:
        return x
    return np.asnumpy(x)


def to_gpu(x):
    import cupy
    if type(x) == cupy.ndarray:
        return x
    return cupy.asarray(x)


class BaseModel:
    def __init__(self):
        self.params, self.grads = None, None

    def forward(self, *args):
        raise NotImplementedError

    def backward(self, *args):
        raise NotImplementedError

    def save_params(self, file_name=None):
        if file_name is None:
            file_name = self.__class__.__name__ + '.pkl'

        params = [p.astype(np.float16) for p in self.params]
        if GPU:
            params = [to_cpu(p) for p in params]

        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name=None):
        if file_name is None:
            file_name = self.__class__.__name__ + '.pkl'

        if '/' in file_name:
            file_name = file_name.replace('/', os.sep)

        if not os.path.exists(file_name):
            raise IOError('No file: ' + file_name)

        with open(file_name, 'rb') as f:
            params = pickle.load(f)

        params = [p.astype('f') for p in params]
        if GPU:
            params = [to_gpu(p) for p in params]

        for i, param in enumerate(self.params):
            param[...] = params[i]

class Rnnlm(BaseModel):
    def __init__(self, vocab_size=10000, wordvec_size=100, hidden_size=100):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b = np.zeros(4 * H).astype('f')
        affine_W = (rn(H, V) / np.sqrt(H)).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeLSTM(lstm_Wx, lstm_Wh, lstm_b, stateful=True),
            TimeAffine(affine_W, affine_b)
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.lstm_layer = self.layers[1]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def predict(self, xs):
        for layer in self.layers:
            xs = layer.forward(xs)
        return xs

    def forward(self, xs, ts):
        score = self.predict(xs)
        loss = self.loss_layer.forward(score, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        self.lstm_layer.reset_state()

import sys
def eval_perplexity(model, corpus, batch_size=10, time_size=35):
    print('evaluating perplexity ...')
    corpus_size = len(corpus)
    total_loss = 0
    max_iters = (corpus_size - 1) // (batch_size * time_size)
    jump = (corpus_size - 1) // batch_size

    for iters in range(max_iters):
        xs = np.zeros((batch_size, time_size), dtype=np.int32)
        ts = np.zeros((batch_size, time_size), dtype=np.int32)
        time_offset = iters * time_size
        offsets = [time_offset + (i * jump) for i in range(batch_size)]
        for t in range(time_size):
            for i, offset in enumerate(offsets):
                xs[i, t] = corpus[(offset + t) % corpus_size]
                ts[i, t] = corpus[(offset + t + 1) % corpus_size]

        try:
            loss = model.forward(xs, ts, train_flg=False)
        except TypeError:
            loss = model.forward(xs, ts)
        total_loss += loss

        sys.stdout.write('\r%d / %d' % (iters, max_iters))
        sys.stdout.flush()

    print('')
    ppl = np.exp(total_loss / max_iters)
    return ppl


batch_size = 20
wordvec_size = 100
hidden_size = 100  
time_size = 35  
lr = 20.0
max_epoch = 4
max_grad = 0.25

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_test, _, _ = ptb.load_data('test')
vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

model = Rnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

trainer.fit(xs, ts, max_epoch, batch_size, time_size, max_grad,
            eval_interval=20)
trainer.plot(ylim=(0, 500))

model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)

model.save_params()

eval_perplexity函数用于在给定语料上评价模型的困惑度;随着训练轮次的增加,困惑度不断下降。

6.5 进一步改进RNNLM

6.5.1 LSTM层的多层化

需要构建高精度模型的时候,加深LSTM层是有效方法,不过层数也是超参数。

6.5.2 基于Dropout抑制过拟合

RNN比常规的前馈神经网络更容易发生过拟合。
抑制过拟合的三种常用方式:1、增加训练数据;2、降低模型复杂度;3、对模型复杂度给予惩罚的正则化。
dropout可以视为第三种方式的一员:训练时随机选择一部分神经元并忽略它们,从而提高神经网络的泛化能力。
对于LSTM,dropout层不应插在时序方向上(噪声会随时间累积),而应插在深度方向(层与层之间)上。
mask指的是决定是否传递数据的随机布尔值(数组)。

6.5.3 权重共享

使用权重共享时,只需把Embedding层权重的转置设置为Affine层的权重(见下面代码中的TimeAffine(embed_W.T, affine_b))。
共享权重可以减少需要学习的参数量,还能提高精度。

6.5.4 更好的RNNLM的实现

class TimeDropout:
    def __init__(self, dropout_ratio=0.5):
        self.params, self.grads = [], []
        self.dropout_ratio = dropout_ratio
        self.mask = None
        self.train_flg = True

    def forward(self, xs):
        if self.train_flg:
            flg = np.random.rand(*xs.shape) > self.dropout_ratio
            scale = 1 / (1.0 - self.dropout_ratio)
            self.mask = flg.astype(np.float32) * scale

            return xs * self.mask
        else:
            return xs

    def backward(self, dout):
        return dout * self.mask

class BetterRnnlm(BaseModel):

    def __init__(self, vocab_size=10000, wordvec_size=650,
                 hidden_size=650, dropout_ratio=0.5):
        V, D, H = vocab_size, wordvec_size, hidden_size
        rn = np.random.randn

        embed_W = (rn(V, D) / 100).astype('f')
        lstm_Wx1 = (rn(D, 4 * H) / np.sqrt(D)).astype('f')
        lstm_Wh1 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b1 = np.zeros(4 * H).astype('f')
        lstm_Wx2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_Wh2 = (rn(H, 4 * H) / np.sqrt(H)).astype('f')
        lstm_b2 = np.zeros(4 * H).astype('f')
        affine_b = np.zeros(V).astype('f')

        self.layers = [
            TimeEmbedding(embed_W),
            TimeDropout(dropout_ratio),
            TimeLSTM(lstm_Wx1, lstm_Wh1, lstm_b1, stateful=True),
            TimeDropout(dropout_ratio),
            TimeLSTM(lstm_Wx2, lstm_Wh2, lstm_b2, stateful=True),
            TimeDropout(dropout_ratio),
            TimeAffine(embed_W.T, affine_b)  # weight tying!!
        ]
        self.loss_layer = TimeSoftmaxWithLoss()
        self.lstm_layers = [self.layers[2], self.layers[4]]
        self.drop_layers = [self.layers[1], self.layers[3], self.layers[5]]

        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads

    def predict(self, xs, train_flg=False):
        for layer in self.drop_layers:
            layer.train_flg = train_flg

        for layer in self.layers:
            xs = layer.forward(xs)
        return xs

    def forward(self, xs, ts, train_flg=True):
        score = self.predict(xs, train_flg)
        loss = self.loss_layer.forward(score, ts)
        return loss

    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

    def reset_state(self):
        for layer in self.lstm_layers:
            layer.reset_state()

batch_size = 20
wordvec_size = 650
hidden_size = 650
time_size = 35
lr = 20.0
max_epoch = 40
max_grad = 0.25
dropout = 0.5

corpus, word_to_id, id_to_word = ptb.load_data('train')
corpus_val, _, _ = ptb.load_data('val')
corpus_test, _, _ = ptb.load_data('test')

if GPU:  # 原书中为 config.GPU,这里沿用前面定义的 GPU 开关
    corpus = to_gpu(corpus)
    corpus_val = to_gpu(corpus_val)
    corpus_test = to_gpu(corpus_test)

vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

model = BetterRnnlm(vocab_size, wordvec_size, hidden_size, dropout)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

best_ppl = float('inf')
for epoch in range(max_epoch):
    trainer.fit(xs, ts, max_epoch=1, batch_size=batch_size,
                time_size=time_size, max_grad=max_grad)

    model.reset_state()
    ppl = eval_perplexity(model, corpus_val)
    print('valid perplexity: ', ppl)

    if best_ppl > ppl:
        best_ppl = ppl
        model.save_params()
    else:
        lr /= 4.0
        optimizer.lr = lr

    model.reset_state()
    print('-' * 50)


model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)

以上代码实现了上述三个技巧(加深LSTM层、dropout、权重共享),并在PTB数据集上进行了训练和评价。

6.6 小结

RNN常存在梯度消失和爆炸的问题。
梯度裁剪可以应对梯度爆炸。LSTM、GRU可以应对梯度消失。
相较于RNN,LSTM多了门的机制。LSTM中有三个门:输入门、遗忘门和输出门。门有专门的权重。
更好的语言模型的三个技巧(加深层,dropout和权重共享)得到了实现。
