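This section implements a model that learns word vectors, a powerful way to represent words for NLP tasks.
Representing words by their co-occurrence statistics is a long-standing approach to the problem of semantic relatedness. The basic idea is to traverse a large text corpus and, for each word, count the words that appear within a fixed distance of it. Each word is then represented by the normalized counts of its neighbors. The underlying assumption is that words used in similar contexts are also semantically similar. PCA or a similar method can then reduce the dimensionality of these co-occurrence vectors to obtain a denser representation. Although this approach performs well, it requires keeping track of the full co-occurrence matrix, a square matrix whose width and height both equal the vocabulary size.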
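The co-occurrence approach described above can be sketched in a few lines of NumPy. This is only an illustration on a toy sentence: the function names `cooccurrence_matrix` and `reduce_with_pca`, the window size, and the sample text are assumptions made for this example, and PCA is computed here through an SVD of the centered matrix.

```python
import numpy as np


def cooccurrence_matrix(tokens, vocabulary, window=2):
    """Count, for each word, the words within `window` positions of it."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(vocabulary), len(vocabulary)))
    for position, word in enumerate(tokens):
        start = max(0, position - window)
        neighbors = (tokens[start:position] +
                     tokens[position + 1:position + window + 1])
        for neighbor in neighbors:
            counts[index[word], index[neighbor]] += 1
    return counts


def reduce_with_pca(counts, dimensions):
    """Normalize each row to frequencies, then project onto the top axes."""
    row_sums = counts.sum(axis=1, keepdims=True)
    normalized = counts / np.maximum(row_sums, 1)
    centered = normalized - normalized.mean(axis=0)
    # The rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dimensions].T


# Toy corpus (hypothetical); a real corpus would be far larger.
tokens = 'the cat sat on the mat the dog sat on the rug'.split()
vocabulary = sorted(set(tokens))
counts = cooccurrence_matrix(tokens, vocabulary)
dense = reduce_with_pca(counts, dimensions=2)
print(counts.shape, dense.shape)
```

Note that `counts` already has one row and one column per vocabulary word, which is exactly the memory cost the paragraph above points out.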
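In 2013, Mikolov et al. proposed a practical and efficient way to compute such representations from context. Their skip-gram model starts from random representations and uses a simple classifier that tries to predict a context word from the current word. The error is propagated through both the classifier weights and the word representations, and both are adjusted to reduce the prediction error. It has been found that training this model on a large corpus makes the representation vectors approximate compressed co-occurrence vectors. Below, we implement the skip-gram model in TensorFlow.
Reference: Mikolov et al. (2013), "Efficient Estimation of Word Representations in Vector Space".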
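To make the training signal concrete, here is a minimal NumPy sketch of one gradient step on a single (current word, context word) pair. It uses a plain full-softmax classifier rather than the noise-contrastive loss used later in this section, and all sizes, the learning rate, and the names `embeddings`, `weights`, and `loss_and_step` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size, learning_rate = 5, 3, 0.5

# Random initial word representations and classifier weights.
embeddings = rng.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))
weights = rng.normal(0.0, 0.1, (vocabulary_size, embedding_size))


def loss_and_step(current, context):
    """Run one SGD step on a single pair and return the loss before it."""
    vector = embeddings[current].copy()
    logits = weights @ vector
    probabilities = np.exp(logits - logits.max())
    probabilities /= probabilities.sum()
    loss = -np.log(probabilities[context])
    # Gradient of the cross-entropy loss with respect to the logits.
    error = probabilities.copy()
    error[context] -= 1.0
    # The same error signal adjusts both the word representation
    # and the classifier weights.
    embeddings[current] -= learning_rate * (weights.T @ error)
    weights[:] -= learning_rate * np.outer(error, vector)
    return loss


losses = [loss_and_step(current=1, context=3) for _ in range(20)]
print(losses[0], losses[-1])
```

Repeating the step on the same pair should drive the loss down, which shows both parameter sets moving toward a better prediction.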
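Preparing the Wikipedia corpus: we use a dump file of the English Wikipedia. By default, a dump contains the complete revision history of all pages. The available dumps are listed at:
https://dumps.wikimedia.org/backup-index.html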
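A few more steps are needed to get the data into the right format. Data collection and cleaning is a demanding and important task. The goal is to iterate over Wikipedia pages represented as one-hot encoded words. The steps are:
1) Download the dump file and extract the pages and the words they contain.
2) Count word occurrences and build a vocabulary from the most frequent words.
3) Encode the extracted pages using that vocabulary.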
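Steps 2 and 3 above can be sketched on in-memory strings before dealing with the real dump. The token pattern is the one used by the Wikipedia class in this section; the helper names `tokenize`, `build_vocabulary`, and `encode`, and the two sample "pages", are assumptions made for this example.

```python
import collections
import re

TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')


def tokenize(text):
    """Split text into lowercase words and punctuation tokens."""
    return [token.lower() for token in TOKEN_REGEX.findall(text)]


def build_vocabulary(pages, vocabulary_size):
    """Keep the most frequent words; index 0 is reserved for unknowns."""
    counter = collections.Counter()
    for page in pages:
        counter.update(page)
    frequent = [word for word, _ in
                counter.most_common(vocabulary_size - 1)]
    return ['<unk>'] + frequent


def encode(page, vocabulary):
    """Map each word to its vocabulary index, or 0 if it is unknown."""
    indices = {word: i for i, word in enumerate(vocabulary)}
    return [indices.get(word, 0) for word in page]


# Hypothetical sample pages standing in for extracted Wikipedia text.
raw_pages = ['The cat saw the cat.', 'The dog saw the cat!']
pages = [tokenize(text) for text in raw_pages]
vocabulary = build_vocabulary(pages, vocabulary_size=4)
encoded = [encode(page, vocabulary) for page in pages]
print(vocabulary)
print(encoded)
```

Each index can be turned into a one-hot vector when needed; storing indices keeps the encoded pages compact.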
Model structure:
Noise-contrastive classifier:
Training the model: the full corpus is available at https://dumps.wikimedia.org/enwiki/20160501/enwiki-20160501-pages-meta-current.xml.bz2
Wikipedia.py
import bz2
import collections
import os
import re

from lxml import etree

from helpers import download


class Wikipedia:

    # Tokens are runs of ASCII letters or single punctuation characters.
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, url, cache_dir, vocabulary_size=10000):
        self._cache_dir = os.path.expanduser(cache_dir)
        self._pages_path = os.path.join(self._cache_dir, 'pages.bz2')
        self._vocabulary_path = os.path.join(self._cache_dir, 'vocabulary.bz2')
        # Both files are cached on disk and only rebuilt when missing.
        if not os.path.isfile(self._pages_path):
            print('Read pages')
            self._read_pages(url)
        if not os.path.isfile(self._vocabulary_path):
            print('Build vocabulary')
            self._build_vocabulary(vocabulary_size)
        with bz2.open(self._vocabulary_path, 'rt') as vocabulary:
            print('Read vocabulary')
            self._vocabulary = [x.strip() for x in vocabulary]
        self._indices = {x: i for i, x in enumerate(self._vocabulary)}

    def __iter__(self):
        """Iterate over pages represented as lists of word indices."""
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                words = [self.encode(x) for x in words]
                yield words

    @property
    def vocabulary_size(self):
        return len(self._vocabulary)

    def encode(self, word):
        """Get the vocabulary index of a string word."""
        # Unknown words map to index 0, the '<unk>' entry.
        return self._indices.get(word, 0)

    def decode(self, index):
        """Get back the string word from a vocabulary index."""
        return self._vocabulary[index]

    def _read_pages(self, url):
        """
        Extract plain words from a Wikipedia dump and store them to the pages
        file. Each page will be a line with words separated by spaces.
        """
        wikipedia_path = download(url, self._cache_dir)
        with bz2.open(wikipedia_path) as wikipedia, \
                bz2.open(self._pages_path, 'wt') as pages:
            for _, element in etree.iterparse(wikipedia, tag='{*}page'):
                # Skip redirect pages; they carry no article text.
                if element.find('./{*}redirect') is not None:
                    continue
                page = element.findtext('./{*}revision/{*}text')
                words = self._tokenize(page)
                pages.write(' '.join(words) + '\n')
                # Free the parsed element to keep memory usage bounded.
                element.clear()

    def _build_vocabulary(self, vocabulary_size):
        """
        Count words in the pages file and write a list of the most frequent
        words to the vocabulary file.
        """
        counter = collections.Counter()
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                counter.update(words)
        # most_common() returns (word, count) pairs, so extract the words
        # before prepending '<unk>'; indexing '<unk>' itself would yield '<'.
        common = [x[0] for x in counter.most_common(vocabulary_size - 1)]
        common = ['<unk>'] + common
        with bz2.open(self._vocabulary_path, 'wt') as vocabulary:
            for word in common:
                vocabulary.write(word + '\n')

    @classmethod
    def _tokenize(cls, page):
        words = cls.TOKEN_REGEX.findall(page)
        words = [x.lower() for x in words]
        return words
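The `download` function imported from `helpers` is not shown in this section. The following is a minimal sketch of what such a helper might look like, assuming it only needs to fetch a URL into the cache directory once and return the local path; the actual helper may behave differently. The demonstration pre-populates a temporary cache so that no network access takes place, and the example.org URL is a placeholder.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download(url, cache_dir):
    """Download `url` into `cache_dir` unless it is already cached.

    Hypothetical stand-in for helpers.download; returns the local path.
    """
    cache_dir = os.path.expanduser(cache_dir)
    os.makedirs(cache_dir, exist_ok=True)
    filepath = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.isfile(filepath):
        urlretrieve(url, filepath)
    return filepath


# Demonstration without network access: the file already exists in the
# cache, so the function returns its path instead of downloading.
with tempfile.TemporaryDirectory() as cache_dir:
    cached = os.path.join(cache_dir, 'dump.xml.bz2')
    open(cached, 'wb').close()
    path = download('https://example.org/dump.xml.bz2', cache_dir)
    was_cached = (path == cached)
print(was_cached)
```

Caching matters here because the full dump is large, so repeated runs should not fetch it again.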
batch.py
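import numpy as np


def batched(iterator, batch_size):
    """Group a numerical stream into batches and yield them as Numpy arrays."""
    while True:
        # Word indices are integers, so use an integer dtype.
        data = np.zeros(batch_size, dtype=np.int32)
        target = np.zeros(batch_size, dtype=np.int32)
        for index in range(batch_size):
            try:
                data[index], target[index] = next(iterator)
            except StopIteration:
                # Stream exhausted: drop the incomplete batch and stop.
                # An uncaught StopIteration inside a generator becomes a
                # RuntimeError on Python 3.7 and later.
                return
        yield data, target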
skipgrams.py
import random


def skipgrams(pages, max_context):
    """Form training pairs according to the skip-gram model."""
    for words in pages:
        for index, current in enumerate(words):
            # Sample a context size so nearer words are paired more often.
            context = random.randint(1, max_context)
            for target in words[max(0, index - context): index]:
                yield current, target
            for target in words[index + 1: index + context + 1]:
                yield current, target
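To see which pairs the skip-gram generator produces, it helps to fix the context size instead of sampling it. The sketch below is a deterministic variant written for illustration only; the name `skipgram_pairs`, the sample page of word indices, and the batch size of 4 are assumptions.

```python
import numpy as np


def skipgram_pairs(words, context):
    """Deterministic variant of skipgrams() with a fixed context size."""
    for index, current in enumerate(words):
        for target in words[max(0, index - context): index]:
            yield current, target
        for target in words[index + 1: index + context + 1]:
            yield current, target


# A hypothetical encoded page: each number is a vocabulary index.
page = [4, 7, 2, 9]
pairs = list(skipgram_pairs(page, context=1))
print(pairs)

# The first four pairs form one batch of inputs and one of targets.
data, target = np.array(pairs[:4], dtype=np.int32).T
print(data, target)
```

Every word is paired with each neighbor on both sides, so interior words contribute two pairs per unit of context and words at the page boundary contribute fewer.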
EmbeddingModel.py
import tensorflow as tf
import numpy as np

from helpers import lazy_property


class EmbeddingModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        # Access each lazy property once so the graph is built up front.
        self.embeddings
        self.cost
        self.optimize

    @lazy_property
    def embeddings(self):
        # One randomly initialized row per vocabulary word.
        initial = tf.random_uniform(
            [self.params.vocabulary_size, self.params.embedding_size],
            -1.0, 1.0)
        return tf.Variable(initial)

    @lazy_property
    def optimize(self):
        optimizer = tf.train.MomentumOptimizer(
            self.params.learning_rate, self.params.momentum)
        return optimizer.minimize(self.cost)

    @lazy_property
    def cost(self):
        embedded = tf.nn.embedding_lookup(self.embeddings, self.data)
        weight = tf.Variable(tf.truncated_normal(
            [self.params.vocabulary_size, self.params.embedding_size],
            stddev=1.0 / self.params.embedding_size ** 0.5))
        bias = tf.Variable(tf.zeros([self.params.vocabulary_size]))
        target = tf.expand_dims(self.target, 1)
        # Keyword arguments avoid relying on the positional order of
        # `labels` and `inputs`, which was swapped in TensorFlow 1.0.
        return tf.reduce_mean(tf.nn.nce_loss(
            weights=weight, biases=bias, labels=target, inputs=embedded,
            num_sampled=self.params.contrastive_examples,
            num_classes=self.params.vocabulary_size))
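The `lazy_property` decorator imported from `helpers` is not shown in this section. A minimal sketch of one possible implementation follows, assuming its purpose is to evaluate the decorated method once and cache the result on the instance, so that each part of the graph is created a single time. The `Example` class and the `_lazy_` attribute prefix are hypothetical.

```python
import functools


def lazy_property(function):
    """Turn a method into a property that is computed once and cached.

    Hypothetical stand-in for helpers.lazy_property.
    """
    attribute = '_lazy_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        # Compute on first access, then reuse the stored value.
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper


class Example:
    """Small class used only to demonstrate the caching behavior."""

    def __init__(self):
        self.calls = 0

    @lazy_property
    def value(self):
        self.calls += 1
        return 42


example = Example()
first, second = example.value, example.value
print(first, second, example.calls)
```

With this behavior, the bare attribute accesses in the EmbeddingModel constructor would build each graph node exactly once, and later accesses would return the same tensor or operation.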