TensorFlow for Machine Intelligence in Practice, Part 7: Word Vector Embeddings

In this section we implement a model that learns word embeddings, a powerful way of representing words for NLP tasks.

Representing words by their co-occurrence statistics is a long-standing approach to the problem of capturing semantic relatedness. The basic idea is to walk through a large text corpus and, for every word, count how often every other word appears within a certain distance of it. Each word is then represented by the normalized counts of its nearby words. The underlying assumption is that words used in similar contexts are semantically similar. The resulting co-occurrence vectors can be compressed with PCA or a similar method to obtain a denser representation. While this approach performs well, it requires keeping track of the full co-occurrence matrix, a square matrix whose width and height both equal the size of the vocabulary.
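To make the co-occurrence approach concrete, the following sketch counts, for a small tokenized corpus, how often each vocabulary word appears near every other word and normalizes the counts per row. The function name and the window size are illustrative only; nothing below is used by the skip-gram code later in this post.

import numpy as np

def cooccurrence_counts(pages, vocabulary, window=2):
    """For each vocabulary word, the normalized counts of its neighbours."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros((len(vocabulary), len(vocabulary)))
    for words in pages:                      # each page is a list of tokens
        for i, word in enumerate(words):
            if word not in index:
                continue
            context = words[max(0, i - window): i] + words[i + 1: i + window + 1]
            for other in context:
                if other in index:
                    counts[index[word], index[other]] += 1
    # Normalize each row so that frequent and rare words become comparable.
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)

Each row of the result is one word's co-occurrence vector, and a method such as sklearn.decomposition.PCA can project the rows down to a few hundred dimensions. The vocabulary-by-vocabulary matrix itself is exactly the memory cost that the skip-gram model below avoids.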


In 2013, Mikolov et al. proposed a practical and efficient way of computing such representations from context. Their skip-gram model starts from random word representations and uses a simple classifier that tries to predict a context word from the current word. The error is propagated both through the classifier weights and through the word representations, and both are adjusted to reduce the prediction error. It was found that training this model on a large corpus makes the representation vectors approximate the compressed co-occurrence vectors. For example, in the fragment "the quick brown fox", the word "brown" would be used to predict its neighbours "quick" and "fox". Below we implement the skip-gram model with TensorFlow.

Reference: Mikolov et al., "Efficient Estimation of Word Representations in Vector Space", 2013.

Preparing the Wikipedia corpus: we use an English Wikipedia dump. The default dump contains the complete revision history of every page; the smaller "pages-meta-current" dump with only the latest revision of each page (used below) is sufficient. The available dumps are listed at:

                     https://dumps.wikimedia.org/backup-index.html

Several steps are needed to get the data into the right format; data collection and cleaning is a demanding and important task. Eventually we want to iterate over Wikipedia pages represented as one-hot encoded words, that is, as sequences of vocabulary indices. The required steps, all implemented by the Wikipedia class below (a short usage sketch follows its listing), are:

1) Download the dump and extract the pages and the words they contain.

2) Count the word occurrences and build a vocabulary of the most common words.

3) Encode the extracted pages using this vocabulary.

Model structure: the EmbeddingModel class below defines the embedding matrix, the cost function, and the optimization operation.

Noise-contrastive classifier: instead of computing a full softmax over the whole vocabulary, the cost uses noise-contrastive estimation (tf.nn.nce_loss), which only evaluates a small number of sampled contrastive examples per batch.

Training the model: the complete corpus is available at https://dumps.wikimedia.org/enwiki/20160501/enwiki-20160501-pages-meta-current.xml.bz2. A minimal training-loop sketch is given after the model code below.

Wikipedia.py

import bz2
import collections
import os
import re
from lxml import etree
from helpers import download

class Wikipedia:
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')
    def __init__(self, url, cache_dir, vocabulary_size=10000):
        self._cache_dir = os.path.expanduser(cache_dir)
        self._pages_path = os.path.join(self._cache_dir, 'pages.bz2')
        self._vocabulary_path = os.path.join(self._cache_dir, 'vocabulary.bz2')
        if not os.path.isfile(self._pages_path):
            print('Read pages')
            self._read_pages(url)
        if not os.path.isfile(self._vocabulary_path):
            print('Build vocabulary')
            self._build_vocabulary(vocabulary_size)
        with bz2.open(self._vocabulary_path, 'rt') as vocabulary:
            print('Read vocabulary')
            self._vocabulary = [x.strip() for x in vocabulary]
        self._indices = {x: i for i, x in enumerate(self._vocabulary)}

    def __iter__(self):
        """Iterate over pages represented as lists of word indices."""
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                words = [self.encode(x) for x in words]
                yield words

    @property
    def vocabulary_size(self):
        return len(self._vocabulary)

    def encode(self, word):
        """Get the vocabulary index of a string word."""
        return self._indices.get(word, 0)

    def decode(self, index):
        """Get back the string word from a vocabulary index."""
        return self._vocabulary[index]

    def _read_pages(self, url):
        """
        Extract plain words from a Wikipedia dump and store them to the pages
        file. Each page will be a line with words separated by spaces.
        """
        wikipedia_path = download(url, self._cache_dir)
        with bz2.open(wikipedia_path) as wikipedia, \
                bz2.open(self._pages_path, 'wt') as pages:
            for _, element in etree.iterparse(wikipedia, tag='{*}page'):
                if element.find('./{*}redirect') is not None:
                    continue
                page = element.findtext('./{*}revision/{*}text')
                if page:  # some pages have no text element; skip them
                    words = self._tokenize(page)
                    pages.write(' '.join(words) + '\n')
                element.clear()

    def _build_vocabulary(self, vocabulary_size):
        """
        Count words in the pages file and write a list of the most frequent
        words to the vocabulary file.
        """
        counter = collections.Counter()
        with bz2.open(self._pages_path, 'rt') as pages:
            for page in pages:
                words = page.strip().split()
                counter.update(words)
        # Reserve index 0 for unknown words, then keep the most frequent words.
        common = ['<unk>'] + [x[0] for x in counter.most_common(vocabulary_size - 1)]
        with bz2.open(self._vocabulary_path, 'wt') as vocabulary:
            for word in common:
                vocabulary.write(word + '\n')

    @classmethod
    def _tokenize(cls, page):
        words = cls.TOKEN_REGEX.findall(page)
        words = [x.lower() for x in words]
        return words
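
A short usage sketch (the cache directory is a placeholder; on the first run the class downloads the dump and builds the pages and vocabulary files, which takes a long time and a lot of disk space):

corpus = Wikipedia(
    'https://dumps.wikimedia.org/enwiki/20160501/'
    'enwiki-20160501-pages-meta-current.xml.bz2',
    '/tmp/wikipedia',             # cache directory, adjust as needed
    vocabulary_size=10000)
print(corpus.vocabulary_size)     # 10000
for page in corpus:               # each page is a list of vocabulary indices
    print(page[:10])
    break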

batch.py

import numpy as np

def batched(iterator, batch_size):
    """Group a numerical stream into batches and yield them as Numpy arrays."""
    while True:
        data = np.zeros(batch_size)
        target = np.zeros(batch_size)
        for index in range(batch_size):
            try:
                data[index], target[index] = next(iterator)
            except StopIteration:
                # Stop cleanly when the stream is exhausted; letting StopIteration
                # escape a generator is an error since Python 3.7 (PEP 479).
                return
        yield data, target

skipgrams.py

import random

def skipgrams(pages, max_context):
    """Form training pairs according to the skip-gram model."""
    for words in pages:
        for index, current in enumerate(words):
            context = random.randint(1, max_context)
            for target in words[max(0, index - context): index]:
                yield current, target
            for target in words[index + 1: index + context + 1]:
                yield current, target
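
As a toy illustration of the pairs this generator produces (in the real pipeline the pages yield vocabulary indices rather than strings, but any sequence works):

pages = [['the', 'quick', 'brown', 'fox']]   # one tokenized "page"
pairs = list(skipgrams(pages, max_context=1))
# With max_context=1 every word predicts only its direct neighbours:
# ('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
# ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')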

EmbeddingModel.py

import tensorflow as tf
import numpy as np
from helpers import lazy_property

class EmbeddingModel:
    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
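        # Touch the lazy properties once so that the whole graph (embeddings,
        # cost and optimizer) is built when the model object is constructed.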
        self.embeddings
        self.cost
        self.optimize

    @lazy_property
    def embeddings(self):
        initial = tf.random_uniform(
            [self.params.vocabulary_size, self.params.embedding_size],
            -1.0, 1.0)
        return tf.Variable(initial)

    @lazy_property
    def optimize(self):
        optimizer = tf.train.MomentumOptimizer(
            self.params.learning_rate, self.params.momentum)
        return optimizer.minimize(self.cost)

    @lazy_property
    def cost(self):
        embedded = tf.nn.embedding_lookup(self.embeddings, self.data)
        weight = tf.Variable(tf.truncated_normal(
            [self.params.vocabulary_size, self.params.embedding_size],
            stddev=1.0 / self.params.embedding_size ** 0.5))
        bias = tf.Variable(tf.zeros([self.params.vocabulary_size]))
        target = tf.expand_dims(self.target, 1)
        # Keyword arguments guard against the labels/inputs positional order
        # change that tf.nn.nce_loss underwent in TensorFlow 1.0.
        return tf.reduce_mean(tf.nn.nce_loss(
            weights=weight, biases=bias, labels=target, inputs=embedded,
            num_sampled=self.params.contrastive_examples,
            num_classes=self.params.vocabulary_size))
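
To tie the pieces together, here is a minimal training-loop sketch for the "Training the model" step above. The parameter values are illustrative, the cache and output paths are placeholders, and a plain types.SimpleNamespace stands in for the parameter container from the book's helpers module, which is not shown in this post.

import collections
import types

import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x API used throughout this post

from batch import batched
from skipgrams import skipgrams
from Wikipedia import Wikipedia
from EmbeddingModel import EmbeddingModel

params = types.SimpleNamespace(
    vocabulary_size=10000,
    max_context=10,
    embedding_size=200,
    contrastive_examples=100,
    learning_rate=0.5,
    momentum=0.5,
    batch_size=1000,
)

data = tf.placeholder(tf.int32, [None])
target = tf.placeholder(tf.int32, [None])
model = EmbeddingModel(data, target, params)

corpus = Wikipedia(
    'https://dumps.wikimedia.org/enwiki/20160501/'
    'enwiki-20160501-pages-meta-current.xml.bz2',
    '/tmp/wikipedia',  # cache directory, adjust as needed
    params.vocabulary_size)
examples = skipgrams(corpus, params.max_context)
batches = batched(examples, params.batch_size)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    average = collections.deque(maxlen=100)
    for index, (current, context) in enumerate(batches):
        feed_dict = {data: current, target: context}
        cost, _ = sess.run([model.cost, model.optimize], feed_dict)
        average.append(cost)
        if (index + 1) % 100 == 0:
            print('{}: {:5.1f}'.format(index + 1, sum(average) / len(average)))
        if index > 100000:
            break
    # Save the learned embedding matrix for later use.
    embeddings = sess.run(model.embeddings)
    np.save('/tmp/wikipedia/embeddings.npy', embeddings)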

