Processing Word Embeddings with an RNN

yybbxx

Summary

This article covers the following:

Word embeddings
Recurrent neural network (RNN) architectures
Context windows

Code and references

Code

Code download

References

Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio. Investigation of Recurrent-Neural-Network Architectures and Learning Methods for Spoken Language Understanding. Interspeech, 2013.
Gokhan Tur, Dilek Hakkani-Tur and Larry Heck. What is left to be understood in ATIS?
Christian Raymond and Giuseppe Riccardi. Generative and discriminative algorithms for spoken language understanding. Interspeech, 2007.
Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard and Yoshua Bengio. Theano: new features and speed improvements. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

Task

This is a classification task: assign a label describing its meaning to each word of a sentence.

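Dataset

ATIS (Airline Travel Information System) dataset collected by DARPA.
The ATIS dataset contains 4978/893 sentences (train/test), for a total of 56590/9198 words. The word labels are given in Inside-Outside-Beginning (IOB) format.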

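RNN model

Raw input encoding

Each token corresponds to a word. ATIS maps every word to its index in the vocabulary, so each sentence is an array of int32 indexes. For example: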

>>> sentence
array([383, 189,  13, 193, 208, 307, 195, 502, 260, 539,
        7,  60,  72, 8, 350, 384], dtype=int32)
>>> [index2word[x] for x in sentence]
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
        'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock", 'pm']

The labels are associated with the input in the same way:

>>> labels
array([126, 126, 126, 126, 126,  48,  50, 126,  78, 123,  81, 126,  15,
        14,  89,  89], dtype=int32)
>>> [index2label[x] for x in labels]
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'B-fromloc.state_name',
        'O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name',
        'O', 'B-arrive_time.time_relative', 'B-arrive_time.time',
        'I-arrive_time.time', 'I-arrive_time.time']
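
To make the IOB labels above easier to read, here is a minimal sketch that groups tagged words back into slots. The helper name `iob_to_slots` is hypothetical and not part of the original code; the sketch assumes that a `B-` tag opens a slot and that an `I-` tag with the same name continues it.

```python
def iob_to_slots(words, labels):
    """Group IOB-tagged words into (slot_name, text) pairs."""
    slots = []
    current_name, current_words = None, []
    for word, label in zip(words, labels):
        if label.startswith('B-'):
            # A B- tag always opens a new slot (closing any open one).
            if current_name is not None:
                slots.append((current_name, ' '.join(current_words)))
            current_name, current_words = label[2:], [word]
        elif label.startswith('I-') and current_name == label[2:]:
            # An I- tag continues the slot that is currently open.
            current_words.append(word)
        else:
            # 'O' (or an I- tag without a matching open slot) closes the slot.
            if current_name is not None:
                slots.append((current_name, ' '.join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, ' '.join(current_words)))
    return slots


words = ['please', 'find', 'a', 'flight', 'from', 'miami', 'florida',
         'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT',
         "o'clock", 'pm']
labels = ['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name',
          'B-fromloc.state_name', 'O', 'B-toloc.city_name',
          'I-toloc.city_name', 'B-toloc.state_name', 'O',
          'B-arrive_time.time_relative', 'B-arrive_time.time',
          'I-arrive_time.time', 'I-arrive_time.time']

slots = iob_to_slots(words, labels)
for name, text in slots:
    print(name, '->', text)
```

For the example sentence this yields one slot per labelled span, for instance the two words "las vegas" end up in a single `toloc.city_name` slot.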

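Context window

A context window turns each word of a sentence into a fixed-length sequence of indexes: the word itself plus its neighbors. It is implemented as follows: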

def contextwin(l, win):
    '''
    win :: int corresponding to the size of the window
    given a list of indexes composing a sentence

    l :: array containing the word indexes

    it will return a list of list of indexes corresponding
    to context windows surrounding each word in the sentence
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out

Here -1 is the PADDING index, used to fill the window at the beginning and end of the sentence. The processed data looks like this:

>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
 [ 0, 1, 2],
 [ 1, 2, 3],
 [ 2, 3, 4],
 [ 3, 4,-1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1,  0, 1, 2, 3, 4],
 [-1,  0,  1, 2, 3, 4,-1],
 [ 0,  1,  2, 3, 4,-1,-1],
 [ 1,  2,  3, 4,-1,-1,-1]]

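Word embeddings

After the context-window step, each sentence has become a matrix of indexes. Each index then has to be associated with its embedding. The implementation is as follows: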

import theano, numpy
from theano import tensor as T

# nv :: size of our vocabulary
# de :: dimension of the embedding space
# cs :: context window size
nv, de, cs = 1000, 50, 5

# one extra row is added at the end for PADDING (index -1 selects it)
embeddings = theano.shared(0.2 * numpy.random.uniform(-1.0, 1.0,
    (nv+1, de)).astype(theano.config.floatX))

# idxs has as many columns as words in the context window
# and as many rows as words in the sentence
idxs = T.imatrix()
x    = embeddings[idxs].reshape((idxs.shape[0], de*cs))
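
The lookup and reshape can be checked with plain NumPy, whose integer-array indexing behaves the same way here. This is only an illustrative sketch: the toy sizes and the fixed `idxs` matrix are assumptions chosen for the example, not values from the original experiment.

```python
import numpy as np

# Toy sizes, chosen only for this illustration.
nv, de, cs = 10, 4, 3

rng = np.random.RandomState(0)
# One extra row at the end holds the PADDING embedding.
embeddings = 0.2 * rng.uniform(-1.0, 1.0, (nv + 1, de))

# Context windows of size 3 for the sentence [0, 1, 2, 3, 4].
idxs = np.array([[-1, 0, 1],
                 [0, 1, 2],
                 [1, 2, 3],
                 [2, 3, 4],
                 [3, 4, -1]], dtype='int32')

# embeddings[idxs] has shape (n_words, cs, de); the reshape concatenates
# the cs embeddings of each window into a single row of length de * cs.
x = embeddings[idxs].reshape((idxs.shape[0], de * cs))

# Index -1 selects the last row, which is why the matrix has nv + 1 rows.
assert np.array_equal(x[0, :de], embeddings[nv])
print(x.shape)
```

Each row of `x` is therefore the input vector for one word: the embeddings of the word and of its neighbors, laid side by side.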

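Elman recurrent neural network (E-RNN)

The preceding steps turn the raw input into a temporal sequence. The E-RNN takes the current input (time t) and the previous hidden state (time t-1) as input, and iterates over the sequence.
The parameters the E-RNN has to learn are:
- the word embeddings
- the initial hidden state
- the matrices for the linear projection of the input and of the previous hidden state
- the biases (optional)
- the softmax classification layer on top

The hyperparameters that define the RNN architecture are:
- the dimension of the word embeddings
- the size of the vocabulary
- the number of hidden units
- the number of classes
- the random seed and the way the model is initialized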
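
To see how these hyperparameters translate into model size, the sketch below counts the learned values per parameter. The shapes follow the RNNSLU constructor in this section; the helper `count_parameters` and the numeric values passed to it are hypothetical and only serve as an illustration.

```python
def count_parameters(nh, nc, ne, de, cs):
    """Return the number of learned values per parameter of the E-RNN."""
    shapes = {
        'emb': (ne + 1, de),   # word embeddings, plus one row for padding
        'wx': (de * cs, nh),   # projection of the input window
        'wh': (nh, nh),        # projection of the previous hidden state
        'w': (nh, nc),         # softmax layer weights
        'bh': (nh,),           # hidden bias
        'b': (nc,),            # softmax bias
        'h0': (nh,),           # initial hidden state
    }
    counts = {}
    for name, shape in shapes.items():
        size = 1
        for dim in shape:
            size *= dim
        counts[name] = size
    return counts


# Illustrative values only.
counts = count_parameters(nh=100, nc=127, ne=572, de=50, cs=7)
total = sum(counts.values())
for name, size in counts.items():
    print(name, size)
print('total', total)
```

With a wide context window the input projection `wx` grows linearly in `cs`, which is worth keeping in mind when choosing the window size.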

The implementation is as follows:

from collections import OrderedDict

import numpy
import theano
from theano import tensor as T


class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                 (ne+1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                               (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                               dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))

        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w,
                       self.bh, self.b, self.h0]

Next, we build the input vectors from the word embeddings:

        idxs = T.imatrix()
        x = self.emb[idxs].reshape((idxs.shape[0], de*cs))
        y_sentence = T.ivector('y_sentence')  # labels

We use the theano.scan function to build the recursion:

        def recurrence(x_t, h_tm1):
            h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                                 + T.dot(h_tm1, self.wh) + self.bh)
            s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
            return [h_t, s_t]

        [h, s], _ = theano.scan(fn=recurrence,
                                sequences=x,
                                outputs_info=[self.h0, None],
                                n_steps=x.shape[0])

        p_y_given_x_sentence = s[:, 0, :]
        y_pred = T.argmax(p_y_given_x_sentence, axis=1)
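
For readers without Theano, the same recursion can be written as an explicit loop in NumPy. This is a sketch under assumptions: the function `elman_forward` and the toy sizes are hypothetical, and the softmax here returns a plain vector, so no `s[:, 0, :]` step is needed (in the Theano code that indexing presumably drops a singleton axis added by `T.nnet.softmax`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


def elman_forward(x, wx, wh, w, bh, b, h0):
    """Run the Elman recursion over the rows of x, one word per step."""
    h_tm1 = h0
    probs = []
    for x_t in x:
        # The hidden state mixes the current input and the previous state.
        h_t = sigmoid(x_t.dot(wx) + h_tm1.dot(wh) + bh)
        # Class probabilities for the current word.
        probs.append(softmax(h_t.dot(w) + b))
        h_tm1 = h_t
    return np.array(probs)


# Toy sizes, chosen only for this illustration.
n_words, n_in, nh, nc = 5, 12, 8, 3
rng = np.random.RandomState(0)
x = rng.uniform(-1.0, 1.0, (n_words, n_in))
wx = 0.2 * rng.uniform(-1.0, 1.0, (n_in, nh))
wh = 0.2 * rng.uniform(-1.0, 1.0, (nh, nh))
w = 0.2 * rng.uniform(-1.0, 1.0, (nh, nc))
bh, b, h0 = np.zeros(nh), np.zeros(nc), np.zeros(nh)

p_y_given_x = elman_forward(x, wx, wh, w, bh, b, h0)
y_pred = p_y_given_x.argmax(axis=1)
print(p_y_given_x.shape, y_pred)
```

The result has one row of class probabilities per word, and `y_pred` picks the most probable class for each word, as `T.argmax(..., axis=1)` does above.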

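Theano then computes the gradients of all parameters automatically, in order to maximize the log-likelihood (that is, to minimize the negative log-likelihood sentence_nll):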

        lr = T.scalar('lr')

        sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                               [T.arange(x.shape[0]), y_sentence])
        sentence_gradients = T.grad(sentence_nll, self.params)
        sentence_updates = OrderedDict((p, p - lr*g)
                                       for p, g in
                                       zip(self.params, sentence_gradients))
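
The loss and the update rule can be illustrated without Theano. In the sketch below the probabilities and gradients are made-up numbers, and the helper names `sentence_nll` and `sgd_step` are hypothetical; Theano derives the real gradients symbolically with `T.grad`.

```python
import math


def sentence_nll(probs, labels):
    """Mean negative log-probability assigned to the correct labels."""
    # Pick, for every word, the probability of its true class.
    picked = [row[y] for row, y in zip(probs, labels)]
    return -sum(math.log(p) for p in picked) / len(picked)


def sgd_step(params, grads, lr):
    """One plain SGD update: p <- p - lr * g for every parameter."""
    return [p - lr * g for p, g in zip(params, grads)]


# Made-up class probabilities for a three-word sentence.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
nll = sentence_nll(probs, labels)
print(nll)

# Made-up scalar parameters and gradients, for illustration only.
new_params = sgd_step([1.0, 2.0], [0.5, -0.5], lr=0.1)
print(new_params)
```

The loss is zero only when every correct label receives probability 1, and it grows as the model puts less probability on the true labels.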

Next, we compile all of this into Theano functions:

        self.classify = theano.function(inputs=[idxs], outputs=y_pred)
        self.sentence_train = theano.function(inputs=[idxs, y_sentence, lr],
                                              outputs=sentence_nll,
                                              updates=sentence_updates)

After each update, the word embeddings are normalized so that they stay on the unit sphere:

        self.normalize = theano.function(inputs=[],
                                         updates={self.emb:
                                                  self.emb /
                                                  T.sqrt((self.emb**2)
                                                  .sum(axis=1))
                                                  .dimshuffle(0, 'x')})
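
The normalization divides every embedding row by its Euclidean norm. A NumPy sketch of the same arithmetic, with a made-up matrix, looks like this; `norms[:, np.newaxis]` plays the role of `.dimshuffle(0, 'x')`, turning the vector of norms into a column that broadcasts across each row.

```python
import numpy as np

# Made-up embedding matrix: three words, two dimensions.
emb = np.array([[3.0, 4.0],
                [1.0, 0.0],
                [0.5, 0.5]])

# Euclidean norm of each row.
norms = np.sqrt((emb ** 2).sum(axis=1))

# Divide each row by its own norm.
emb_normalized = emb / norms[:, np.newaxis]

row_norms = np.sqrt((emb_normalized ** 2).sum(axis=1))
print(emb_normalized)
print(row_norms)
```

Note that a row of all zeros would cause a division by zero; with the random initialization used above this is not expected to happen in practice.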

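Evaluation

Evaluation measures how well the predicted labels match the ground-truth labels. The training logs below report it as an F1 score.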
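
This article does not show how the F1 score is computed, so the following is an assumption rather than the original evaluation code: a minimal sketch of a slot-level F1, where a predicted slot counts as correct only if its name and its exact span match the ground truth. The helper names are hypothetical.

```python
def extract_chunks(labels):
    """Return the set of (slot_name, start, end) spans in an IOB sequence."""
    chunks = set()
    name, start = None, None
    # The trailing 'O' is a sentinel that closes a slot ending the sentence.
    for i, label in enumerate(list(labels) + ['O']):
        continues = label.startswith('I-') and label[2:] == name
        if name is not None and not continues:
            chunks.add((name, start, i))
            name, start = None, None
        if label.startswith('B-'):
            name, start = label[2:], i
    return chunks


def f1_score(gold_labels, pred_labels):
    """Slot-level F1 between a gold and a predicted IOB sequence."""
    gold = extract_chunks(gold_labels)
    pred = extract_chunks(pred_labels)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ['O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name']
pred = ['O', 'B-toloc.city_name', 'O', 'B-toloc.state_name']
score = f1_score(gold, pred)
print(score)
```

In this example the city slot is predicted with the wrong span, so only one of the two slots is counted as correct.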

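Training

Updates

This article uses mini-batch stochastic gradient descent (SGD): as the sentence_train function above shows, each update is computed on one sentence.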

Stopping criterion

Part of the data is held out as a validation set, and the model that performs best on it is always kept.
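
The "keep the best model" rule can be sketched as a small loop. Everything here is hypothetical scaffolding: `train_one_epoch` and `validation_score` stand in for the real training and evaluation code, and the toy model and scores exist only to make the example runnable.

```python
import copy


def train_with_validation(model, train_one_epoch, validation_score, n_epochs):
    """Train for n_epochs and return a copy of the best model on validation."""
    best_score, best_model, best_epoch = float('-inf'), None, None
    for epoch in range(n_epochs):
        train_one_epoch(model)
        score = validation_score(model)
        if score > best_score:
            # Snapshot the model so later epochs cannot overwrite it.
            best_score, best_epoch = score, epoch
            best_model = copy.deepcopy(model)
    return best_model, best_score, best_epoch


# Toy stand-ins: the "model" is a counter, the scores are scripted.
model = {'w': 0}
scripted_scores = iter([0.90, 0.95, 0.93, 0.97, 0.96])


def toy_epoch(m):
    m['w'] += 1


def toy_score(m):
    return next(scripted_scores)


best_model, best_score, best_epoch = train_with_validation(
    model, toy_epoch, toy_score, n_epochs=5)
print(best_epoch, best_score, best_model)
```

The snapshot is what gets reported at the end, even though training continued for further epochs after the best validation score.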

Hyperparameter selection

learning rate : uniform([0.05,0.01])
window size : random value from {3,…,19}
number of hidden units : random value from {100,200}
embedding dimension : random value from {50,100}
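
A random search over these ranges could be sketched as follows. The function name and dictionary keys are hypothetical. One assumption is made explicit in the code: only odd window sizes are sampled, because contextwin asserts that the window size is odd.

```python
import random


def sample_hyperparameters(rng):
    """Draw one random configuration from the ranges listed above."""
    return {
        # Learning rate drawn uniformly between 0.01 and 0.05.
        'lr': rng.uniform(0.01, 0.05),
        # Assumption: odd sizes only, since contextwin requires win % 2 == 1.
        'win': rng.choice(range(3, 20, 2)),
        'nhidden': rng.choice([100, 200]),
        'emb_dimension': rng.choice([50, 100]),
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
configs = [sample_hyperparameters(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Each sampled configuration would then be trained and compared on the validation set.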

Running the code

python code/rnnslu.py

('NEW BEST: epoch', 25, 'valid F1', 96.84, 'best test F1', 93.79)
[learning] epoch 26 >> 100.00% completed in 28.76 (sec) <<
[learning] epoch 27 >> 100.00% completed in 28.76 (sec) <<
...
('BEST RESULT: epoch', 57, 'valid F1', 97.23, 'best test F1', 94.2, 'with the model', 'rnnslu')

Timing

On an i7 CPU 950 @ 3.07GHz, this takes no more than 40 seconds and 200 MB of memory.

Performance

NEW BEST: epoch 28 valid F1 96.61 best test F1 94.19
NEW BEST: epoch 29 valid F1 96.63 best test F1 94.42
[learning] epoch 30 >> 100.00% completed in 35.04 (sec) <<
[learning] epoch 31 >> 100.00% completed in 34.80 (sec) <<
[...]
NEW BEST: epoch 40 valid F1 97.25 best test F1 94.34
[learning] epoch 41 >> 100.00% completed in 35.18 (sec) <<
NEW BEST: epoch 42 valid F1 97.33 best test F1 94.48
[learning] epoch 43 >> 100.00% completed in 35.39 (sec) <<
[learning] epoch 44 >> 100.00% completed in 35.31 (sec) <<
[...]