Because the task this paper tackles is extremely compute-intensive, my small laptop simply cannot handle it, and unlike the models implemented earlier it does not come with a single clear evaluation metric, so I did not implement it myself. Instead I found a simplified implementation on GitHub, which makes the following simplifications in data processing, model evaluation, and other respects:
- Replace NCE loss with Adaptive Softmax.
- Remove restricted training on fixed-size sentences (20, for now) and extend to account for all varied sentence lengths.
- Implement Weight Normalisation for faster convergence (see the sketch after this list).
- Train extensively on deeper models to match the results with the paper.
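For reference, weight normalisation (the third item above) reparameterises each weight vector as w = g · v / ‖v‖, so that the scale g and the direction v / ‖v‖ are learned separately, which typically speeds up convergence. A minimal NumPy sketch of the idea (illustrative only, not code from the repository):

```python
import numpy as np

def weight_norm(v, g):
    # Weight normalisation: w = g * v / ||v||; the scale g and the direction
    # v / ||v|| of a weight vector become separate learnable parameters.
    return g * v / np.linalg.norm(v)

w = weight_norm(np.array([3.0, 4.0]), g=2.0)  # -> [1.2, 1.6]; the norm of w equals g
```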
It uses the Google 1 Billion Word dataset. The training set consists of 100 files, each containing roughly 300,000 sentences of about 20 words each: 30,301,028 sentences in total, about one billion tokens, and a vocabulary of roughly 800K words, which makes it arguably the largest language-modeling dataset. The code simplifies this by keeping only sentences of length 18 as the training set and padding each sentence:
if len(tokens) == conf.context_size - 2:
    words.extend(['<pad>'] * (conf.filter_h // 2) + ['<s>'] + tokens + ['</s>'])
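To make the padding concrete, here is a small illustrative sketch; it assumes conf.filter_h = 5 and conf.context_size = 20, which are only example values, not necessarily the repository's defaults:

```python
filter_h, context_size = 5, 20                         # assumed values, for illustration only
tokens = ['w%d' % i for i in range(context_size - 2)]  # a sentence of exactly 18 words
padded = ['<pad>'] * (filter_h // 2) + ['<s>'] + tokens + ['</s>']
print(len(padded))                                     # 2 + 1 + 18 + 1 = 22 tokens per stored sentence
```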
Data processing
import numpy as np
import collections
import os

def read_words(conf):
    # Read every training file, keep only the sentences of length 18
    # (conf.context_size - 2) as the training set, and pad each of them.
    words = []
    for file in os.listdir(conf.data_dir):
        with open(os.path.join(conf.data_dir, file), 'r') as f:
            for line in f:
                tokens = line.split()
                # NOTE Currently, only sentences with a fixed size are chosen
                # to account for fixed convolutional layer size.
                if len(tokens) == conf.context_size - 2:
                    words.extend(['<pad>'] * (conf.filter_h // 2)
                                 + ['<s>'] + tokens + ['</s>'])
    return words
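A hypothetical usage sketch (the configuration values below are placeholders; the real conf object is built elsewhere in the repository):

```python
from types import SimpleNamespace

# Placeholder configuration, for illustration only.
conf = SimpleNamespace(data_dir='data/1-billion-word', context_size=20,
                       filter_h=5, vocab_size=2000)

words = read_words(conf)   # flat token list: <pad> <pad> <s> w1 ... w18 </s> per kept sentence
print(len(words), words[:6])
```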
def index_words(words, conf):
    # Keep the conf.vocab_size - 1 most frequent words as the vocabulary.
    # This is another big simplification: the vocabulary shrinks from roughly
    # 800K words to 2000, and everything else is mapped to '<unk>'.
    word_counter = collections.Counter(words).most_common(conf.vocab_size - 1)
    word_to_idx = {'<unk>': 0}
    idx_to_word = {0: '<unk>'}
    for i, (word, _) in enumerate(word_counter):
        word_to_idx[word] = i + 1
        idx_to_word[i + 1] = word
    data = []
    # Convert every word in the training set to its index so that the
    # embedding layer can use it later.
    for word in words:
        idx = word_to_idx.get(