NLP | Word2Vec: CBOW and Skip-gram Models Based on Negative Sampling


The previous post covered the Hierarchical Softmax versions of skip-gram and CBOW. Replacing the conventional softmax output with a Huffman tree does make training more efficient, but if the center word w of a training sample is a rare word, we still have to walk a long way down the Huffman tree. Can we do without such a complicated Huffman tree and make the model simpler?

Negative Sampling is exactly such an alternative way of training word2vec: it drops the Huffman tree entirely and instead solves the model with negative sampling. Let's look at how this works.

[Figures from the original post: the Negative Sampling derivation.]
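For reference, the negative-sampling objective from the original word2vec paper maximizes, for each (center word, context word) pair $(w_I, w_O)$ with $k$ noise words drawn from a noise distribution $P_n(w)$:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right], \qquad P_n(w) \propto U(w)^{3/4},$$

where $\sigma$ is the sigmoid, $v$ and $v'$ are the input and output embeddings, and $U(w)$ is the unigram distribution. A minimal NumPy sketch of how negatives are typically drawn from the 3/4-power unigram distribution (the word counts and function name below are hypothetical, for illustration only):

import numpy as np

def make_negative_sampler(word_counts, power=0.75):
    # Unigram distribution raised to the 3/4 power, as in the word2vec paper
    probs = np.asarray(word_counts, dtype=np.float64) ** power
    probs /= probs.sum()
    vocab = np.arange(len(word_counts))
    return lambda k: np.random.choice(vocab, size=k, p=probs)

# Hypothetical toy counts: index 0 is the most frequent word
sampler = make_negative_sampler([5000, 1200, 300, 40, 7])
print(sampler(3))  # three sampled negative word indices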

Skip-gram

Implementing Skip-gram in TensorFlow (movie review dataset)

Import the required libraries.

# Import libraries
import tensorflow as tf
from tensorflow.python.framework import ops
ops.reset_default_graph()
sess = tf.Session()
import matplotlib.pyplot as plt
import numpy as np
import random
import os        # the os module handles the file/directory operations we would otherwise do by hand
import string    # string-manipulation helpers (here: string.punctuation)
import requests  # requests is an HTTP client library for Python
import collections
"""
collections provides several handy data types:
1. namedtuple(): a tuple subclass whose elements can be accessed by name
2. deque: a double-ended queue with fast appends and pops on either end
3. Counter: a counter, mainly used for counting
4. OrderedDict: a dictionary that remembers insertion order
5. defaultdict: a dictionary with default values
"""
import io        # StringIO, BytesIO, ...
import gzip
import tarfile   # compressing and extracting archives
import urllib.request
"""
urllib is Python's built-in HTTP library and includes the following modules:
urllib.request      issuing requests
urllib.error        exception handling
urllib.parse        URL parsing
urllib.robotparser  robots.txt parsing
"""
from nltk.corpus import stopwords

Declare some model parameters.

batch_size = 50                  # look up 50 (target, context) pairs per batch
embedding_size = 200             # each word embedding is a vector of length 200
vocabulary_size = 10000          # only consider the 10,000 most frequent words
generations = 50000              # run 50,000 training iterations
print_loss_every = 500           # print the loss every 500 iterations
num_sampled = int(batch_size/2)  # number of negative examples to sample
window_size = 3                  # consider window_size context words on each side of the target word
stops = stopwords.words('english')
print_valid_every = 2000         # print the nearest-neighbor words every 2,000 iterations
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']

Load the data. First check whether the dataset has already been downloaded; if it has, load it from disk, otherwise download it.

def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.neg')
    if not os.path.exists(os.path.join(save_folder_name, 'rt-polaritydata')):
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
        req = requests.get(movie_data_url, stream=True)
        with open('temp_movie_review_temp.tar.gz', 'wb') as f:
            for chunk in req.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        # Extract tar.gz file into temp folder
        tar = tarfile.open('temp_movie_review_temp.tar.gz', "r:gz")
        tar.extractall(path='temp')
        tar.close()
    pos_data = []
    with open(pos_file, 'r', encoding='latin-1') as f:
        for line in f:
            pos_data.append(line.encode('ascii', errors='ignore').decode())
    f.close()
    pos_data = [x.rstrip() for x in pos_data]
    neg_data = []
    with open(neg_file, 'r', encoding='latin-1') as f:
        for line in f:
            neg_data.append(line.encode('ascii', errors='ignore').decode())
    f.close()
    neg_data = [x.rstrip() for x in neg_data]  # rstrip() strips trailing whitespace (by default) from each line
    texts = pos_data + neg_data
    target = [1] * len(pos_data) + [0] * len(neg_data)
    return texts, target
texts, target = load_movie_data()

Output:
[‘the rock is destined to be the 21st century’s new " conan " and that he’s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .’, ‘the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson’s expanded vision of j . r . r . tolkien’s middle-earth .’, ‘effective but too-tepid biopic’, ‘if you sometimes like to go to the movies to have fun , wasabi is a good place to start .’, “emerges as something rare , an issue movie that’s so honest and keenly observed that it doesn’t feel like one .”]
[1, 1, 1, 1, 1]

Create a text normalization function: lowercase, strip punctuation, digits, extra whitespace, and stop words.

def normalize_text(texts, stops):
    texts = [x.lower() for x in texts]
    texts = [''.join(x for x in text if x not in string.punctuation) for text in texts]
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [''.join(x for x in text if x not in '0123456789') for text in texts]
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [' '.join([word for word in text.split() if word not in stops]) for text in texts]
    print(texts[0:2])
    print("---------------------------------------------------------------------------------------")
    texts = [' '.join(text.split()) for text in texts]
    return texts
texts = normalize_text(texts, stops)
print(texts[0:5])

Output:
['the rock is destined to be the 21st centurys new conan and that hes going to make a splash even greater than arnold schwarzenegger jeanclaud van damme or steven segal ', 'the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe cowriterdirector peter jacksons expanded vision of j r r tolkiens middleearth ']

['the rock is destined to be the st centurys new conan and that hes going to make a splash even greater than arnold schwarzenegger jeanclaud van damme or steven segal ', 'the gorgeously elaborate continuation of the lord of the rings trilogy is so huge that a column of words cannot adequately describe cowriterdirector peter jacksons expanded vision of j r r tolkiens middleearth ']

[‘rock destined st centurys new conan hes going make splash even greater arnold schwarzenegger jeanclaud van damme steven segal’, ‘gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe cowriterdirector peter jacksons expanded vision j r r tolkiens middleearth’]

[‘rock destined st centurys new conan hes going make splash even greater arnold schwarzenegger jeanclaud van damme steven segal’, ‘gorgeously elaborate continuation lord rings trilogy huge column words cannot adequately describe cowriterdirector peter jacksons expanded vision j r r tolkiens middleearth’, ‘effective tootepid biopic’, ‘sometimes like go movies fun wasabi good place start’, ‘emerges something rare issue movie thats honest keenly observed doesnt feel like one’]

To make sure each review is informative, require it to be at least three words long. Then build the vocabulary, i.e. a {word: index} mapping over the most frequent words.

target = [target[ix] for ix,x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

# Build the vocabulary: a function that maps each word to an integer index; words that do not make the frequency cut are lumped into 'RARE'
def build_dictionary(sentences, vocabulary_size):
    # Turn sentences (list of strings) into lists of words
    split_sentences = [s.split() for s in sentences]
    print(split_sentences[0:5])
    print("=============================================================")
    words = [x for sublist in split_sentences for x in sublist]
    print(words[0:5])
    print("=============================================================")
    # Initialize list of [word, word_count] for each word, starting with unknown
    count = [['RARE', -1]]
    # Now add most frequent words, limited to the N-most frequent (N=vocabulary size)
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    # Now create the dictionary
    word_dict = {}
    # For each word, that we want in the dictionary, add it, then make it
    # the value of the prior dictionary length
    for word, word_count in count:
        word_dict[word] = len(word_dict)
    return word_dict
word_dict = build_dictionary(texts, vocabulary_size)
for k, v in word_dict.items():
    print(k + ":" + str(v))

Output:
[[‘rock’, ‘destined’, ‘st’, ‘centurys’, ‘new’, ‘conan’, ‘hes’, ‘going’, ‘make’, ‘splash’, ‘even’, ‘greater’, ‘arnold’, ‘schwarzenegger’, ‘jeanclaud’, ‘van’, ‘damme’, ‘steven’, ‘segal’], [‘gorgeously’, ‘elaborate’, ‘continuation’, ‘lord’, ‘rings’, ‘trilogy’, ‘huge’, ‘column’, ‘words’, ‘cannot’, ‘adequately’, ‘describe’, ‘cowriterdirector’, ‘peter’, ‘jacksons’, ‘expanded’, ‘vision’, ‘j’, ‘r’, ‘r’, ‘tolkiens’, ‘middleearth’], [‘effective’, ‘tootepid’, ‘biopic’], [‘sometimes’, ‘like’, ‘go’, ‘movies’, ‘fun’, ‘wasabi’, ‘good’, ‘place’, ‘start’], [‘emerges’, ‘something’, ‘rare’, ‘issue’, ‘movie’, ‘thats’, ‘honest’, ‘keenly’, ‘observed’, ‘doesnt’, ‘feel’, ‘like’, ‘one’]]

[‘rock’, ‘destined’, ‘st’, ‘centurys’, ‘new’]

thats:39
idiocy:4039
tenacious:5294
stand:685
authenticity:3966
worry:3967
familial:3092
... (remaining output omitted) ...

Convert each sentence into a list of word indices; these index lists are what gets fed into the embedding lookup.

def text_to_numbers(sentences, word_dict):
    data = []
    for sentence in sentences:
        sentence_data = []
        for word in sentence.split():
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0  # out-of-vocabulary words map to 'RARE' (index 0)
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return data

# Build the word dictionary and convert the sentences to lists of word indices
word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))  # swap keys and values
text_data = text_to_numbers(texts, word_dictionary)
print(text_data[0:5])
# Look up the indices of the validation words in the dictionary
valid_examples = [word_dictionary[x] for x in valid_words]

print(valid_examples)
print(word_dictionary["rock"])
print(word_dictionary_rev[546])

[[2708, 0, 1872, 1044, 0, 0, 753, 0, 0, 0, 2408, 753, 0, 0, 0, 0, 0, 1872, 753, 2408, 0, 1775, 2708, 0, 0, 0, 2408, 753, 2754, 0, 1872, 0, 2408, 0, 2408, 0, 2044, 753, 0, 0, 2637, 0, 0, 2408, 2637, 0, 0, 0, 1044, 753, 0, 0, 4294, 4227, 0, 0, 2044, 0, 753, 8725, 753, 2408, 0, 2637, 2708, 753, 0, 0, 753, 2708, 0, 0, 2708, 2408, 0, 4227, 0, 0, 0, 1872, 2044, 2754, 0, 2708, 0, 753, 2408, 753, 2637, 2637, 753, 2708, 0, 1150, 753, 0, 2408, 1872, 4227, 0, 1775, 0, 0, 8725, 0, 2408, 0, 0, 0, 0, 0, 753, 0, 0, 0, 753, 8725, 753, 2408, 0, 0, 753, 2637, 0, 4227], [2637, 0, 2708, 2637, 753, 0, 1775, 0, 4227, 0, 0, 753, 4227, 0, 1363, 0, 2708, 0, 0, 753, 0, 1872, 0, 2408, 0, 0, 2408, 1775, 0, 0, 0, 0, 2408, 0, 4227, 0, 2708, 0, 0, 2708, 0, 2408, 2637, 0, 0, 0, 2708, 0, 4227, 0, 2637, 0, 0, 2044, 1775, 2637, 753, 0, 1872, 0, 4227, 1775, 0, 2408, 0, 2754, 0, 2708, 0, 0, 0, 1872, 0, 2408, 2408, 0, 0, 0, 0, 0, 753, 2864, 1775, 0, 0, 753, 4227, 0, 0, 0, 753, 0, 1872, 2708, 0, 1363, 753, 0, 1872, 0, 2754, 2708, 0, 0, 753, 2708, 0, 0, 2708, 753, 1872, 0, 0, 2708, 0, 4294, 753, 0, 753, 2708, 0, 1150, 0, 1872, 1044, 0, 0, 2408, 0, 0, 753, 979, 4294, 0, 2408, 0, 753, 0, 0, 8725, 0, 0, 0, 0, 2408, 0, 1150, 0, 2708, 0, 2708, 0, 0, 0, 4227, 1044, 0, 753, 2408, 0, 0, 0, 0, 0, 0, 4227, 753, 753, 0, 2708, 0, 2044], [753, 9010, 9010, 753, 1872, 0, 0, 8725, 753, 0, 0, 0, 0, 0, 753, 4294, 0, 0, 0, 1363, 0, 0, 4294, 0, 1872], [0, 0, 0, 753, 0, 0, 0, 753, 0, 0, 4227, 0, 1044, 753, 0, 2637, 0, 0, 0, 0, 8725, 0, 753, 0, 0, 9010, 1775, 2408, 0, 2754, 0, 0, 0, 1363, 0, 0, 2637, 0, 0, 0, 0, 4294, 4227, 0, 1872, 753, 0, 0, 0, 0, 2708, 0], [753, 0, 753, 2708, 2637, 753, 0, 0, 0, 0, 0, 753, 0, 2044, 0, 2408, 2637, 0, 2708, 0, 2708, 753, 0, 0, 0, 0, 1775, 753, 0, 0, 0, 8725, 0, 753, 0, 0, 2044, 0, 0, 0, 0, 2044, 0, 2408, 753, 0, 0, 0, 1044, 753, 753, 2408, 4227, 0, 0, 0, 1363, 0, 753, 2708, 8725, 753, 0, 0, 0, 0, 753, 0, 2408, 0, 0, 9010, 753, 753, 4227, 0, 4227, 0, 1044, 753, 0, 0, 2408, 753]]

[1423, 28, 952, 200, 360]

546

rock

Create the batch-generation function for the skip-gram model. We train on word pairs: the first word is the input (the target word) and the other is the output. For example, in "the cat in the hat" with target word "in" and a context window of size 2, the training pairs are (in, the), (in, cat), (in, the), (in, hat).
Note: in the example a window size of 2 refers to the four surrounding words, while in the code each window slice also contains the target word itself (up to 2*window_size + 1 words). The resulting (target, context) pairs are the same, as the toy example below illustrates.
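
A small, self-contained sketch (the toy word indices are made up for illustration) that mirrors the windowing logic of the function below:

# "the cat in the hat" as word indices (the=0, cat=1, in=2, hat=3)
toy_sentence = [0, 1, 2, 0, 3]
toy_window = 2
# Window of up to 2*toy_window + 1 words around each position
windows = [toy_sentence[max(ix - toy_window, 0):(ix + toy_window + 1)]
           for ix, x in enumerate(toy_sentence)]
# Index of the center word inside each window
centers = [ix if ix < toy_window else toy_window for ix, x in enumerate(windows)]
# (target, context) pairs
pairs = [(w[c], ctx) for w, c in zip(windows, centers) for ctx in w[:c] + w[c + 1:]]
print(pairs)  # the pairs for the center word "in" (index 2) are (2, 0), (2, 1), (2, 0), (2, 3)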

# Create a function that returns batches of data for the skip-gram model
def generate_batch_data(sentences, batch_size, window_size, method='skip-gram'):
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        # Pick a random sentence to start from
        rand_sentence = sentences[np.random.choice(len(sentences))]
        # Build the (up to 2*window_size + 1)-word window around each position
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix,x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
        # Pull out the center word of each window and pair it with its context words
        if method == 'skip-gram':  # predict the context from the target word
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]  # (target word, surrounding word)
        elif method == 'cbow':  # predict the target word from the context
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            tuple_data = [(x_, y) for x,y in batch_and_labels for x_ in x]  # (surrounding word, target word)
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        batch, labels = [list(x) for x in zip(*tuple_data)]  # extract batch and labels
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and labels at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return batch_data, label_data
batch_data, label_data = generate_batch_data(text_data, batch_size, window_size, method='skip-gram')

for i in range(4):
    print(str(batch_data[i+8]) + ' ' + str(label_data[i+8]))

Output:
0 [2408]
0 [0]
0 [4294]
0 [753]

Initialize the embedding matrix, declare the placeholders and the embedding lookup.

# Initialize the embedding matrix
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size]))

# Create the placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Embedding lookup
embed = tf.nn.embedding_lookup(embeddings, x_inputs)

Softmax is the usual loss for multi-class prediction, but here the target is one of 10,000 classes, so the output is extremely sparse and the model would be hard to fit and slow to converge. To work around this we use the noise-contrastive estimation (NCE) loss instead: NCE turns the problem into a binary classification that distinguishes the true target word from randomly sampled noise words.

# NCE loss: output weights and biases
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Get loss from prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))
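
For intuition, here is a rough NumPy sketch (all names below are hypothetical, not from the post) of the binary objective that tf.nn.nce_loss approximates for a single example; the real op samples the negative classes itself (log-uniform by default) and applies corrections for the sampling probabilities:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss_sketch(embed_vec, true_idx, sampled_idx, nce_w, nce_b):
    # Logit for the true class and for each sampled noise class
    true_logit = nce_w[true_idx] @ embed_vec + nce_b[true_idx]
    noise_logits = nce_w[sampled_idx] @ embed_vec + nce_b[sampled_idx]
    # Binary targets: 1 for the true (input, target) pair, 0 for each noise word
    return -(np.log(sigmoid(true_logit)) + np.sum(np.log(sigmoid(-noise_logits))))

# Hypothetical toy shapes: vocabulary of 10 words, embedding size 4
rng = np.random.default_rng(0)
W, b = rng.normal(size=(10, 4)), np.zeros(10)
print(nce_loss_sketch(rng.normal(size=4), true_idx=3, sampled_idx=[1, 7, 9], nce_w=W, nce_b=b))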

Find the words closest to each validation word by computing the cosine similarity between the validation-word embeddings and all the word embeddings.

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

Declare the optimizer and initialize the model variables.

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

# Add variable initializer.
init = tf.global_variables_initializer()
sess.run(init)

Train the word embeddings iteratively.

loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size)
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}

    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)

    # Return the loss
    if (i + 1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print('Loss at step {} : {}'.format(i+1, loss_val))
      
    # Validation: Print some random words and top 5 related words
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = valid_words[j]
            # top_k = number of nearest neighbors
            top_k = 5
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = '{} {},'.format(log_str, close_word)
            print(log_str)

Output:
... (earlier output omitted) ...
Loss at step 48500 : 1.0923460721969604
Loss at step 49000 : 0.956238865852356
Loss at step 49500 : 0.19344420731067657
Loss at step 50000 : 1.9008288383483887
Nearest to cliche: enticing, mcgrath, media, hitandmiss, guiltypleasure,
Nearest to love: witty, rant, punchdrunk, brisk, scene,
Nearest to hate: commerce, cannot, proven, constant, triste,
Nearest to silly: dramatization, wellshot, fabulous, tuned, death,
Nearest to sad: chouchou, tomcats, enduring, kicking, patric,
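
matplotlib is imported above but never used; a minimal sketch (not in the original post) for plotting the loss values collected in loss_vec during training:

plt.plot(loss_x_vec, loss_vec, 'k-')
plt.title('NCE loss per generation (skip-gram)')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()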

CBOW

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import pickle
import string
import requests
import collections
import io
import tarfile
import urllib.request
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
ops.reset_default_graph()

tf.set_random_seed(42)
np.random.seed(42)

data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)

sess = tf.Session()

batch_size = 500
embedding_size = 200
vocabulary_size = 2000
generations = 50000
model_learning_rate = 0.25
num_sampled = int(batch_size/2)    # Number of negative examples to sample.
window_size = 3       # How many words to consider left and right.
save_embeddings_every = 5000
print_valid_every = 5000
print_loss_every = 100
stops = stopwords.words('english')
valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']

# Normalize text
def normalize_text(texts, stops):
    texts = [x.lower() for x in texts]
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
    texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]
    texts = [' '.join(x.split()) for x in texts]
    return(texts)

# Build dictionary of words
def build_dictionary(sentences, vocabulary_size):
    split_sentences = [s.split() for s in sentences]  # split_sentences looks like [['word', 'word', ...], [...], ...]
    words = [x for sublist in split_sentences for x in sublist]
    count = [['RARE', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size-1))
    word_dict = {}
    for word, word_count in count:
        word_dict[word] = len(word_dict)  
    return(word_dict)
    
# Turn text data into lists of integers from dictionary
def text_to_numbers(sentences, word_dict):
    data = []
    for sentence in sentences:
        sentence_data = []
        for word in sentence.split(' '):
            if word in word_dict:
                word_ix = word_dict[word]
            else:
                word_ix = 0
            sentence_data.append(word_ix)
        data.append(sentence_data)
    return(data)
    
# Generate data randomly (N words behind, target, N words ahead)
def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
    batch_data = []
    label_data = []
    while len(batch_data) < batch_size:
        rand_sentence_ix = int(np.random.choice(len(sentences), size=1))
        rand_sentence = sentences[rand_sentence_ix]
        window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
        # Denote which element of each window is the center word of interest
        label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
        # Pull out center word of interest for each window and create a tuple for each window
        batch, labels = [], []
        if method=='skip_gram':
            batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
            # Make it in to a big list of tuples (target word, surrounding word)
            tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
            if len(tuple_data) > 0:
                batch, labels = [list(x) for x in zip(*tuple_data)]
        elif method=='cbow':
            batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
            # Only keep windows with consistent 2*window_size
            batch_and_labels = [(x,y) for x,y in batch_and_labels if len(x)==2*window_size]
            if len(batch_and_labels) > 0:
                batch, labels = [list(x) for x in zip(*batch_and_labels)]
        elif method=='doc2vec':
            # For doc2vec we keep LHS window only to predict target word
            batch_and_labels = [(rand_sentence[i:i+window_size], rand_sentence[i+window_size]) for i in range(0, len(rand_sentence)-window_size)]
            batch, labels = [list(x) for x in zip(*batch_and_labels)]
            # Add document index to batch!! Remember that we must extract the last index in batch for the doc-index
            batch = [x + [rand_sentence_ix] for x in batch]
        else:
            raise ValueError('Method {} not implemented yet.'.format(method))
        # extract batch and labels
        batch_data.extend(batch[:batch_size])
        label_data.extend(labels[:batch_size])
    # Trim batch and label at the end
    batch_data = batch_data[:batch_size]
    label_data = label_data[:batch_size]
    # Convert to numpy array
    batch_data = np.array(batch_data)
    label_data = np.transpose(np.array([label_data]))
    return batch_data, label_data

# Load the movie review data
# Check if data was downloaded, otherwise download it and save for future use
def load_movie_data():
    save_folder_name = 'temp'
    pos_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.pos')
    neg_file = os.path.join(save_folder_name, 'rt-polaritydata', 'rt-polarity.neg')

    # Check if files are already downloaded
    if not os.path.exists(os.path.join(save_folder_name, 'rt-polaritydata')):
        movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'

        # Save tar.gz file
        req = requests.get(movie_data_url, stream=True)
        with open('temp_movie_review_temp.tar.gz', 'wb') as f:
            for chunk in req.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)
                    f.flush()
        # Extract tar.gz file into temp folder
        tar = tarfile.open('temp_movie_review_temp.tar.gz', "r:gz")
        tar.extractall(path='temp')
        tar.close()
    pos_data = []
    with open(pos_file, 'r', encoding='latin-1') as f:
        for line in f:
            pos_data.append(line.encode('ascii',errors='ignore').decode())
    f.close()
    pos_data = [x.rstrip() for x in pos_data]
    neg_data = []
    with open(neg_file, 'r', encoding='latin-1') as f:
        for line in f:
            neg_data.append(line.encode('ascii',errors='ignore').decode())
    f.close()
    neg_data = [x.rstrip() for x in neg_data]
    texts = pos_data + neg_data
    target = [1]*len(pos_data) + [0]*len(neg_data)
    return(texts, target)

data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)
    
texts, target = load_movie_data()
texts = normalize_text(texts, stops)
target = [target[ix] for ix,x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

# Build the word dictionary for word lookups
word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_to_numbers(texts, word_dictionary)
valid_examples = [word_dictionary[word] for word in valid_words]

# Initialize the word embeddings to be learned and declare the model's data placeholders
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1., 1.))
x_inputs = tf.placeholder(shape=[batch_size, 2*window_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[batch_size, 1], dtype=tf.int32)
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# CBOW: sum the embeddings of the 2*window_size context words to form the input representation
embed = tf.zeros([batch_size, embedding_size])
for element in range(2*window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
   
# NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,embedding_size], stddev=1.0/np.sqrt(embedding_size)))
nce_bias = tf.Variable(tf.zeros(vocabulary_size))
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_bias,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))

# Use cosine similarity to find the closest words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings,normalized_embeddings,transpose_b=True )

# Saver for the word embeddings
saver = tf.train.Saver({"embeddings":embeddings})

# Declare the optimizer and initialize the variables
optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)
init = tf.global_variables_initializer()
sess.run(init)

# Training loop: run the optimizer, print the loss, and periodically save the embeddings
loss_vec = []
loss_x_vec = []
text_data = [x for x in text_data if len(x) >= (2*window_size+1)]  # keep only sentences long enough for a full window
for i in range(generations):
    batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size, method="cbow")
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}
    sess.run(optimizer, feed_dict=feed_dict)
    if (i+1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print('loss at step {}:{}'.format(i+1, loss_val))
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5  # number of nearest neighbors to print
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = '{} {},'.format(log_str, close_word)
            print(log_str)
    if (i+1) % save_embeddings_every == 0:
        with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
            pickle.dump(word_dictionary, f)
        model_checkpoint_path = os.path.join(os.getcwd(), data_folder_name, 'cbow_movie_embeddings.ckpt')
        save_path = saver.save(sess, model_checkpoint_path)
        print('Model saved in file: {}'.format(save_path))

Output:
... (earlier output omitted) ...
loss at step 49200:2.3909201622009277
loss at step 49300:2.5774049758911133
loss at step 49400:2.058192491531372
loss at step 49500:2.1763217449188232
loss at step 49600:2.239227056503296
loss at step 49700:2.235402822494507
loss at step 49800:2.0298542976379395
loss at step 49900:2.2141225337982178
loss at step 50000:2.3587381839752197
Nearest to love side
Nearest to hate talent
Nearest to happy taste
Nearest to sad energy
Nearest to man irritating
Nearest to woman chemistry
Model saved in file: D:\anaconda\envs\tensorflow\Scripts\Tensorflow机器学习实战指南\temp\cbow_movie_embeddings.ckpt
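
A minimal sketch (not part of the original post) of how the saved vocabulary and CBOW embeddings could be reloaded later, reusing the file names and shapes from the training script above:

import os
import pickle
import tensorflow as tf

# Load the saved {word: index} dictionary
with open(os.path.join('temp', 'movie_vocab.pkl'), 'rb') as f:
    word_dictionary = pickle.load(f)

# Re-create a variable with the same shape and restore it under the same saver key
embeddings = tf.Variable(tf.random_uniform([2000, 200], -1., 1.))
saver = tf.train.Saver({"embeddings": embeddings})
with tf.Session() as sess:
    saver.restore(sess, os.path.join('temp', 'cbow_movie_embeddings.ckpt'))
    vectors = sess.run(embeddings)
    print(vectors[word_dictionary['love']][:5])  # first few dimensions of the vector for 'love'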

Reference: https://www.cnblogs.com/pinard/p/7249903.html
