Because the task this paper tackles is extremely compute-intensive, my small laptop simply cannot handle it, and unlike the models implemented earlier it does not come with a single clear evaluation metric, so I did not implement it myself. Instead I found a simplified implementation on GitHub, which makes the following simplifications in data processing, model evaluation, and other respects:
- Replace NCE loss with Adaptive Softmax.
- Remove restricted training on fixed-size sentences (20, for now) and extend to account for all varied sentence lengths.
- Implement Weight Normalisation for faster convergence (see the sketch after this list).
- Train extensively on deeper models to match the results with the paper.
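For reference, weight normalisation (the third item above) reparameterises each weight vector as w = g · v / ‖v‖, so that the scale g and the direction v / ‖v‖ are learned separately, which typically speeds up convergence. A minimal NumPy sketch of the idea (illustrative only, not code from the repository):

```python
import numpy as np

def weight_norm(v, g):
    # Weight normalisation: w = g * v / ||v||; the scale g and the direction
    # v / ||v|| of a weight vector become separate learnable parameters.
    return g * v / np.linalg.norm(v)

w = weight_norm(np.array([3.0, 4.0]), g=2.0)  # -> [1.2, 1.6]; the norm of w equals g
```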
It uses the Google 1 Billion Word dataset. The training set consists of 100 files, each containing roughly 300,000 sentences of about 20 words each: 30,301,028 sentences in total, about one billion tokens, and a vocabulary of roughly 800K words, which makes it arguably the largest language-modeling dataset. The code simplifies this by keeping only sentences of length 18 as the training set and padding each sentence:
if len(tokens) == conf.context_size - 2:
    words.extend(['<pad>'] * (conf.filter_h // 2) + ['<s>'] + tokens + ['</s>'])
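To make the padding concrete, here is a small illustrative sketch; it assumes conf.filter_h = 5 and conf.context_size = 20, which are only example values, not necessarily the repository's defaults:

```python
filter_h, context_size = 5, 20                         # assumed values, for illustration only
tokens = ['w%d' % i for i in range(context_size - 2)]  # a sentence of exactly 18 words
padded = ['<pad>'] * (filter_h // 2) + ['<s>'] + tokens + ['</s>']
print(len(padded))                                     # 2 + 1 + 18 + 1 = 22 tokens per stored sentence
```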
Data processing
import numpy as np
import collections
import os

def read_words(conf):
    # Read every training file, keep only the sentences of length 18
    # (conf.context_size - 2) as the training set, and pad each of them.
    words = []
    for file in os.listdir(conf.data_dir):
        with open(os.path.join(conf.data_dir, file), 'r') as f:
            for line in f:
                tokens = line.split()
                # NOTE Currently, only sentences with a fixed size are chosen
                # to account for fixed convolutional layer size.
                if len(tokens) == conf.context_size - 2:
                    words.extend(['<pad>'] * (conf.filter_h // 2)
                                 + ['<s>'] + tokens + ['</s>'])
    return words
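A hypothetical usage sketch (the configuration values below are placeholders; the real conf object is built elsewhere in the repository):

```python
from types import SimpleNamespace

# Placeholder configuration, for illustration only.
conf = SimpleNamespace(data_dir='data/1-billion-word', context_size=20,
                       filter_h=5, vocab_size=2000)

words = read_words(conf)   # flat token list: <pad> <pad> <s> w1 ... w18 </s> per kept sentence
print(len(words), words[:6])
```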
def index_words(words, conf):
    # Keep the conf.vocab_size - 1 most frequent words as the vocabulary.
    # This is another big simplification: the vocabulary shrinks from roughly
    # 800K words to 2000, and everything else is mapped to '<unk>'.
    word_counter = collections.Counter(words).most_common(conf.vocab_size - 1)
    word_to_idx = {'<unk>': 0}
    idx_to_word = {0: '<unk>'}
    for i, (word, _) in enumerate(word_counter):
        word_to_idx[word] = i + 1
        idx_to_word[i + 1] = word
    data = []
    # Convert every word in the training set to its index so that the
    # embedding layer can use it later.
    for word in words:
        idx = word_to_idx.get(