Implementing the Skip-Gram Model with TensorFlow


ML_BOY

Categories: Deep Learning, tensorflow. Tags: skip-gram, tensorflow, word vectors

Article link: https://blog.csdn.net/qq1483661204/article/details/78975847

This article implements the skip-gram model in TensorFlow. The code uses the TensorFlow 1.x API (tf.placeholder, tf.Session).

Import the required libraries

import tensorflow as tf
import zipfile
from collections import Counter
import numpy as np
import random

Unzip and read the data

zip_file_path = './embedding/text8.zip'
des_folder_path = './data/'
with zipfile.ZipFile(zip_file_path) as zip_ref:
    zip_ref.extractall(des_folder_path)
with open('./data/text8') as f:
    text = f.read()

Preprocess the data

Preprocessing consists of replacing punctuation with tokens, removing low-frequency words, and subsampling high-frequency words.

text = text.lower()  # lowercase everything so the same word in different cases maps to one token
# Replace punctuation with tokens. English text is split on whitespace, but punctuation often has no
# space on one or both sides, so it would stay attached to the neighboring words. For example:
# x = 'I come from china,how about you?'
# x.split()
# gives ['I', 'come', 'from', 'china,how', 'about', 'you?'], with words and punctuation glued together.
# Each replacement token is padded with a space on both sides so that it is split off as its own token.
text = text.replace('.', ' <PERIOD> ')
text = text.replace(',', ' <COMMA> ')
text = text.replace('"', ' <QUOTATION_MARK> ')
text = text.replace(';', ' <SEMICOLON> ')
text = text.replace('!', ' <EXCLAMATION_MARK> ')
text = text.replace('?', ' <QUESTION_MARK> ')
text = text.replace('(', ' <LEFT_PAREN> ')
text = text.replace(')', ' <RIGHT_PAREN> ')
text = text.replace('--', ' <HYPHENS> ')
text = text.replace('\n', ' <NEW_LINE> ')
text = text.replace(':', ' <COLON> ')
# The text is one long string, so split it into individual words
words = text.split()
# Remove low-frequency words to reduce noise: drop every word that occurs 5 times or fewer
word_fre = Counter(words)
words = [word for word in words if word_fre[word]>5]
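
The effect of these two steps is easy to check on a toy string. The sketch below is illustrative only: the sample sentence, the reduced token table, and the cutoff of 1 (instead of 5) are assumptions chosen to keep the output small.

```python
from collections import Counter

# Hypothetical sample text; the real input is the text8 corpus.
sample = 'the cat sat,the dog sat. the end'

# A reduced punctuation table (assumption: only two of the tokens used in the article).
for mark, token in [('.', ' <PERIOD> '), (',', ' <COMMA> ')]:
    sample = sample.replace(mark, token)

tokens = sample.split()
counts = Counter(tokens)

# Keep words that occur more than min_count times (1 here, 5 in the article).
min_count = 1
kept = [w for w in tokens if counts[w] > min_count]
print(tokens)
print(kept)
```

Without the padded tokens, 'sat,the' would survive as a single word and be counted separately from 'sat' and 'the'.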

Build the word mapping dictionaries

# Given words, build a one-to-one mapping between words and integer ids
set_words = set(words)
word_to_int = {word:idx for idx,word in enumerate(set_words)}
int_to_word = {idx:word for word,idx in word_to_int.items()}
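
Because a Python set of strings has no guaranteed iteration order, the ids assigned this way can differ between runs, which matters if a saved embedding matrix is later paired with a freshly built dictionary. One possible alternative, not used in this article, is to assign ids by descending frequency. The helper name build_vocab is hypothetical.

```python
from collections import Counter

def build_vocab(words):
    """Assign ids by descending frequency, breaking ties alphabetically."""
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    w2i = {word: idx for idx, word in enumerate(ordered)}
    i2w = {idx: word for word, idx in w2i.items()}
    return w2i, i2w

w2i, i2w = build_vocab(['b', 'a', 'b', 'c', 'a', 'b'])
print(w2i)
```

With this ordering the most frequent word always gets id 0, and the mapping is reproducible for a given corpus.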

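Handling high-frequency words. In a large corpus, words such as "the" and "is" occur tens of thousands of times yet carry little useful information, so we subsample them. We use the same scheme as Google's open-source word2vec tool:

$P(w) = 1 - \left(\sqrt{\frac{t}{f(w)}}+\frac{t}{f(w)}\right)$
f(w) is the frequency of the word w in the corpus
t is a threshold that we have to choose
P(w) is the probability that the word is discarded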

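# Set the threshold
t = 1e-5
# Count word frequencies
word_fre = Counter(words)
n_words = len(words)
# Build a dictionary with the discard probability of each distinct word
words_pro_dict = {word:1-(np.sqrt(t*n_words/count)+t*n_words/count) for word,count in word_fre.items()}
# Subsample: keep each occurrence of a word with probability 1 - P(w)
words = [word for word in words if np.random.rand() > words_pro_dict[word]]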
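
To get a feel for the formula, the sketch below evaluates P(w) for a few relative frequencies with t = 1e-5. The frequencies are made-up round numbers, not measurements from text8, and discard_prob is a hypothetical helper.

```python
import math

def discard_prob(freq, t=1e-5):
    """P(w) = 1 - (sqrt(t / f(w)) + t / f(w)) for a relative frequency f(w)."""
    ratio = t / freq
    return 1 - (math.sqrt(ratio) + ratio)

# Made-up relative frequencies: very common, moderately common, rare.
for freq in (5e-2, 1e-3, 1e-5):
    print(freq, round(discard_prob(freq), 4))
```

For rare words the expression is negative. Since np.random.rand() never returns a negative number, the comparison in the sampling step is always true for them and they are always kept.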

Convert the word sequence into an integer sequence

# Map each word to its integer id
int_word = [word_to_int[word] for word in words]

Build the training data

This part prepares the model inputs.

def get_y(batch,ind,window_size=5):
    # batch: the words (as integer ids) in the current batch
    # ind: position of the current word within batch
    # window_size: maximum number of words taken on each side of the current word, so at most
    # 2×window_size context words in total.
    # Words that are closer together tend to be more related, so the actual window size is drawn at
    # random from [1, window_size]; nearby words are therefore used for training more often.
    window_size = np.random.randint(1,window_size+1)
    # We take up to window_size words on each side and never compensate on the other side when one
    # side is short. For example, with len(batch)=20, ind=3 and window_size=5 the result is
    # batch[0:3] + batch[4:9], not 10 words in total.
    # max() keeps the start index from going negative near the beginning of the batch. No special
    # case is needed near the end: for x=[1,2,5,6,7,8], x[3:100] is the same as x[3:] and does not
    # raise an index error.
    start = max(0,ind-window_size)
    # The slice after the current word starts at ind+1 so that the current word is not its own target.
    # set() removes duplicate words.
    return list(set(batch[start:ind] + batch[ind+1:ind+window_size+1]))


def get_batch(int_word,batch_size,window_size=5):
    # Number of full batches
    n_batch = len(int_word)//batch_size
    # Drop the trailing words that do not fill a whole batch
    int_word = int_word[:n_batch*batch_size]

    for j in range(0,len(int_word),batch_size):
        x_batch = []
        y_batch = []
        batch_word = int_word[j:j+batch_size]
        for n,m in enumerate(batch_word):
            x_b = m
            y_b = get_y(batch_word,n,window_size)
            x_batch.extend([x_b]*len(y_b))
            y_batch.extend(y_b)
        yield x_batch,y_batch
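
The pairing logic is easier to see with a fixed window and a tiny sequence. This is a simplified sketch rather than the article's generator: it uses a fixed window instead of a random one, keeps duplicates, and excludes the center word from its own context. The function name skipgram_pairs is hypothetical.

```python
def skipgram_pairs(seq, window_size):
    """Return parallel lists of center ids and context ids."""
    centers, contexts = [], []
    for i, center in enumerate(seq):
        start = max(0, i - window_size)
        # Words before and after position i, excluding the center itself.
        neighbours = seq[start:i] + seq[i + 1:i + window_size + 1]
        centers.extend([center] * len(neighbours))
        contexts.extend(neighbours)
    return centers, contexts

x, y = skipgram_pairs([10, 11, 12, 13], window_size=1)
print(list(zip(x, y)))
```

Each center id is repeated once per context id, so x and y always have the same length. This is why the training loop can reshape y_batch into a single column with np.array(y_batch)[:, None] before feeding it as labels.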

Build the model

Define the input placeholders

x_input = tf.placeholder(tf.int32,[None],name='input')
y_label = tf.placeholder(tf.int32,[None,None],name='label')

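Define the parameters

embedding_size = 300  # dimension of the word embedding: each word ends up as a 1×300 vector
vocab_size = len(int_to_word)
num_sampled = 100  # number of classes sampled per batch by the sampled softmax loss
lr = 0.001
epochs = 100 # number of training epochs
batch_size = 1000
window_size = 10
top_n = 8 # number of nearest words to print for each validation word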

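Define the loss and optimizer

# Weights of the hidden layer, i.e. the embedding matrix
weight_emb = tf.Variable(tf.random_uniform([vocab_size,embedding_size],-1,1),name='weight_emb')
# tf.nn.embedding_lookup selects the rows of weight_emb indexed by x_input
embed_vec = tf.nn.embedding_lookup(weight_emb,x_input)
# Output layer
out_w = tf.Variable(tf.truncated_normal([vocab_size, embedding_size],mean=0,stddev=0.1))
out_b = tf.Variable(tf.zeros(vocab_size))
# tf.nn.sampled_softmax_loss is one of the candidate sampling losses that TensorFlow provides (there are
# others). With a vocabulary this large, a full softmax over every class makes training very slow, so each
# step evaluates the loss on the true class plus a small sample of the other classes.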
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(out_w,out_b,y_label,embed_vec,num_sampled,vocab_size))
optimizer = tf.train.AdamOptimizer(learning_rate=lr).minimize(loss)
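
The idea behind candidate sampling can be sketched without TensorFlow: instead of normalizing over all vocab_size classes, the softmax is computed over the true class plus a few sampled classes. The code below is a deliberately simplified illustration in plain Python. It samples negatives uniformly and leaves out any correction for the sampling probabilities, so it is not a reimplementation of tf.nn.sampled_softmax_loss; all names and sizes are made up.

```python
import math
import random

def toy_sampled_softmax_loss(embed, out_w, out_b, true_id, sampled_ids):
    """Cross-entropy over the true class and the sampled classes only."""
    classes = [true_id] + list(sampled_ids)
    # Logit of class c: dot product of its output row with the embedding, plus its bias.
    logits = [sum(w * e for w, e in zip(out_w[c], embed)) + out_b[c] for c in classes]
    # Numerically stable log-sum-exp over the selected classes.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[0]  # negative log-softmax of the true class

random.seed(0)
vocab_size, embedding_size, num_sampled = 50, 4, 5
out_w = [[random.uniform(-1, 1) for _ in range(embedding_size)] for _ in range(vocab_size)]
out_b = [0.0] * vocab_size
embed = [random.uniform(-1, 1) for _ in range(embedding_size)]

true_id = 7
negatives = random.sample([c for c in range(vocab_size) if c != true_id], num_sampled)
loss = toy_sampled_softmax_loss(embed, out_w, out_b, true_id, negatives)
print(loss)
```

Only the num_sampled + 1 selected rows of out_w take part in each evaluation, which is where the saving over a full softmax comes from.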

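Define the validation words; during training we print the top_n nearest words for each of them

# To show how training progresses, we print some information while training.
# norm holds the length of each row of weight_emb. Each row is the vector of one word, so dividing
# every row by its length turns the word vectors into unit vectors. Vectors taken from norm_embedd
# with embedding_lookup are then unit vectors as well, so a single matrix multiplication gives the
# cosine similarity between the validation words and every word in the vocabulary. Sorting that
# result yields the most similar words.
norm = tf.sqrt(tf.reduce_sum(tf.square(weight_emb),axis=1,keep_dims=True))
norm_embedd = weight_emb / norm
# Randomly pick top_n (8) words whose nearest neighbors we will print
val_data = random.choices(words,k=top_n)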
val_int_data = tf.constant([word_to_int[i] for i in val_data],dtype=tf.int32)
val_int_data_embed = tf.nn.embedding_lookup(norm_embedd,val_int_data)
similarity = tf.matmul(val_int_data_embed,tf.transpose(norm_embedd))
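
The same nearest-neighbor lookup can be written in plain Python to make the steps explicit: normalize each vector, take dot products, and sort. The three-word embedding table here is invented for illustration, and nearest is a hypothetical helper.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]

def nearest(word, embeddings, top_n):
    """Rank the other words by cosine similarity to word."""
    unit = {w: normalize(v) for w, v in embeddings.items()}
    query = unit[word]
    # For unit vectors the dot product equals the cosine similarity.
    sims = {w: sum(a * b for a, b in zip(query, v)) for w, v in unit.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:top_n]

# Invented 2-d embeddings.
embeddings = {'king': [1.0, 0.1], 'queen': [0.9, 0.2], 'apple': [-0.2, 1.0]}
print(nearest('king', embeddings, top_n=2))
```

The article's version drops the query word by skipping the first argsort result, on the assumption that a word is most similar to itself; the sketch filters it out explicitly instead.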

Start training

with tf.Session() as sess:
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./graphs/embed',sess.graph)

    for epoch in range(epochs+1):
        total_loss = 0
        for x_batch,y_batch in get_batch(int_word,batch_size,window_size=window_size):

            _,tmp_loss = sess.run([optimizer,loss],feed_dict={x_input:x_batch,y_label:np.array(y_batch)[:, None]})

            total_loss += tmp_loss
        if epoch % 10 == 0:
            print('Epoch {}/{} train_loss {}'.format(epoch,epochs,total_loss))
        if epoch % 15 == 0:
            sim = similarity.eval()
            for i,j in enumerate(val_data):
                # Negate sim so that argsort, which sorts in ascending order, returns indices from most to least similar
                nearest_n = (-sim[i,:]).argsort()[1:top_n+1]
                logg = 'Nearest to %s is :' % j
                for ind,ner_int_word in enumerate(nearest_n):
                    nearest_word = int_to_word[ner_int_word]
                    logg = '%s  %s'%(logg,nearest_word)
                print(logg)       
    save_path = saver.save(sess, "checkpoints/text8.ckpt")
    embed_mat = sess.run(norm_embedd)
    writer.close()
