NLP【04】Implementing Word2vec in TensorFlow (with a detailed code walkthrough)

Previous post: NLP【03】GloVe explained in plain language

Next post: NLP【05】Implementing GloVe word vectors in PyTorch (with a detailed code walkthrough)

Full code download: https://github.com/ttjjlw/NLP/tree/main/Word_vector%E8%AF%8D%E5%90%91%E9%87%8F/word2vec/tf1.x

I. Preface

It is almost 2021. Does that mean nobody trains word vectors with word2vec, GloVe or fastText any more and everyone just uses BERT? Of course not. The biggest advantage of word2vec, GloVe and fastText over BERT is that they are lightweight; much like logistic regression, they still get used all the time in day-to-day work.

II. Code structure

tf1.x
  data        training data
  dataset.py  data preprocessing
  main.py     program entry point
  model.py    the word2vec model
  README.md

III. Code walkthrough

1. Data processing: dataset.py

# F:\itsoftware\Anaconda
# -*- coding:utf-8 -*-
# Author = TJL
# date:2020/11/16
import numpy as np
import pandas as pd
import os, pickle, re, jieba, collections

def delete_tag(s):
    '''
    Filter a piece of text.
    :param s: string, raw text
    :return: string, cleaned text
    '''
    s = re.sub('\{IMG:.?.?.?\}', '', s)  # images
    s = re.sub(re.compile(r'[a-zA-Z]+://[^\u4e00-\u9fa5]+'), '', s)  # URLs
    s = re.sub(re.compile('<.*?>'), '', s)  # HTML tags
    s = re.sub(re.compile('&[a-zA-Z]+;?'), ' ', s)  # HTML entities
    # s = re.sub(re.compile('[a-zA-Z0-9]*[./]+[a-zA-Z0-9./]+[a-zA-Z0-9./]*'), ' ', s)
    # s = re.sub("\?{2,}", "", s)
    # s = re.sub("\r", "", s)
    # s = re.sub("\n", ",", s)
    s = re.sub("\t", "", s)
    s = re.sub("(", "", s)
    s = re.sub(")", "", s)
    s = re.sub("\u3000", "", s)  # full-width space (Chinese punctuation)
    s = re.sub(" ", "", s)
    r4 = re.compile('\d{4}[-/]\d{2}[-/]\d{2}')  # dates
    s = re.sub(r4, '某时', s)  # replace dates with a placeholder token
    s = re.sub('“', '"', s)
    s = re.sub('”', '"', s)
    return s


def get_dictionary(col):
    '''
    Build the dictionary containing every word in the corpus.
    :param col: iterable of text lines (the corpus)
    :return: dict, word -> id
    '''
    corpus = []
    for line in col:
        words = jieba.lcut(line)
        corpus.extend(words)
    counter = dict(collections.Counter(corpus).most_common(100000))
    word2id = {}
    for i, w in enumerate(counter):
        word2id[w] = i + 1
    word2id['<pad>'] = 0
    word2id['<unk>'] = len(word2id)
    # print(word2id['<unk>'])
    # print(max(list(word2id.values())))
    return word2id


def text2id(text, vocab):
    '''
    Convert a piece of text into a list of word ids.
    :param text: string, a piece of text
    :param vocab: dict, the corpus dictionary
    :return: list of word ids
    '''
    word = jieba.lcut(text)
    id = [vocab.get(w, len(vocab) - 1) for w in word]  # note: len(vocab)-1 is the <unk> id
    return id

def get_data(data, vocab):
    '''
    Convert lines of text into one flat list of word ids.
    :param data: iterable of text lines
    :param vocab: dict, the corpus dictionary
    :return: list of word ids
    '''
    input_ids = []
    for text in data:
        input_id = text2id(text, vocab=vocab)
        input_ids.extend(input_id)
    return input_ids

data_index=0
def build_batch(raw_data,vocab,batch_size,window_size=1):
    '''
    If raw_data maps to the id sequence [1,2,3], this function returns train_batch=[2,2] and train_label=[1,3].
    :param raw_data: the raw lines of text
    :param vocab: dict, the corpus dictionary
    :param batch_size: must be a multiple of window_size*2
    :param window_size: window size on each side of the center word
    :return: (train_batch, train_label)
    '''
    global data_index           # data_index cannot be defined inside the function: each new batch must continue from where the previous one stopped
    num_skip=window_size*2
    assert batch_size % num_skip == 0
    data=get_data(raw_data,vocab)
    train_batch=np.ndarray(shape=(batch_size),dtype=np.int32)
    train_label=np.ndarray(shape=(batch_size,1),dtype=np.int32)
    span=2*window_size+1 # length of the sliding window
    deque=collections.deque(maxlen=span) # double-ended queue: if deque=[1,2,3] and we call deque.append(4), then deque=[2,3,4]
    # initialise the deque with the first span elements of data
    for _ in range(span):
        deque.append(data[data_index])  # revised 2019-12-28
        data_index+=1
    for i in range(batch_size//num_skip):
        for j in range(span):
            if j>window_size:
                train_batch[num_skip*i+j-1]=deque[window_size]  # num_skip*i: each i-iteration adds num_skip elements, so the write position shifts forward by num_skip
                train_label[num_skip*i+j-1,0]=deque[j]  # context word to the right of the center word
            elif j==window_size:
                continue
            else:
                train_batch[num_skip*i+j]=deque[window_size]  # train_batch holds the center word
                train_label[num_skip*i+j,0]=deque[j]  # context word to the left of the center word
        deque.append(data[data_index])
        data_index+=1
        data_index%=len(data)  # keep data_index from running past the end of data on the last batch
    return train_batch, train_label


if __name__ == '__main__':
    train = pd.read_csv('./data/train.csv', sep='\t', encoding='utf-8', header=0)
    print(train.count())
    vocab = get_dictionary(train.text)
    input_id, label=build_batch(raw_data=['采荷一小是分校吧','房本都是五年外的'],vocab=vocab,batch_size=4)
    print(input_id)
    print(label)
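
The core of build_batch is the sliding window that pairs every center word with its left and right context. The following minimal standalone sketch (not part of the repo: the helper name toy_skipgram_pairs is made up, and it works on a hand-made id sequence instead of jieba-segmented text) produces the same (center word, context word) pairs:

import collections
import numpy as np

def toy_skipgram_pairs(data, batch_size, window_size=1):
    '''Generate (center, context) training pairs the same way build_batch does.'''
    num_skip = window_size * 2
    assert batch_size % num_skip == 0
    span = 2 * window_size + 1
    deque = collections.deque(data[:span], maxlen=span)   # the sliding window
    idx = span
    batch = np.zeros(batch_size, dtype=np.int32)
    label = np.zeros((batch_size, 1), dtype=np.int32)
    for i in range(batch_size // num_skip):
        k = 0
        for j in range(span):
            if j == window_size:                          # skip the center position itself
                continue
            batch[num_skip * i + k] = deque[window_size]  # center word
            label[num_skip * i + k, 0] = deque[j]         # one of its context words
            k += 1
        deque.append(data[idx % len(data)])               # slide the window one word forward
        idx += 1
    return batch, label

# toy id sequence; with window_size=1 the generated pairs are (2,1), (2,3), (3,2), (3,4)
batch, label = toy_skipgram_pairs([1, 2, 3, 4, 5], batch_size=4)
print(batch)          # [2 2 3 3]
print(label.ravel())  # [1 3 2 4]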

2. Model: model.py

# F:\itsoftware\Anaconda
# -*- coding:utf-8 -*-
# Author = TJL
# date:2020/11/23
import tensorflow as tf
import os,pickle
import numpy as np
from dataset import build_batch
class Word2vec(object):
    def __init__(self,args):
        self.mode=args.mode
        self.vocab2id=args.vocab
        self.id2vocab=dict(zip(args.vocab.values(),args.vocab.keys()))
        self.embed_dim=args.embed_dim
        self.init_rate=args.init_rate
        self.neg_samples=args.neg_samples
        self.decay_steps=args.decay_steps
        self.raw_data=args.raw_data
        self.batch_size=args.batch_size
        self.window_size=args.window_size
        self.valid_example=args.valid_example
        self.valid_size=args.valid_size
        self.top_k = args.top_k
        self.epochs=args.epochs
        self.log_per_steps=args.log_per_steps
        self.save_path=args.save_path
        self.is_save_vector=args.is_save_vector
        self.embeddings_save_path=args.embeddings_save_path
        self.is_load=args.is_load
    def build_graph(self):
        self.train_x=tf.placeholder(tf.int32,[None],name='train_x')
        self.train_y=tf.placeholder(tf.int32,[None,1],name='train_y')
        self.embeddings=tf.Variable(tf.random_uniform([len(self.vocab2id),self.embed_dim],-1,1),name='embeddings')
        if self.mode=='train':
            self.valid_data = tf.constant(self.valid_example, tf.int32, name='valid_data')
        elif self.mode=='predict':
            self.valid_data = tf.placeholder(tf.int32, shape=None)
        else:
            raise ValueError("mode must be 'train' or 'predict'")
        # embeddinga=tf.Variable(tf.random_normal([vocabulary_size,embedding_size]))
        # Note on encoding: high-frequency words should get small ids. If the word ids do not start from zero,
        # vocabulary_size should equal the largest id + 1 rather than the length of the dictionary.
        self.embed=tf.nn.embedding_lookup(self.embeddings,self.train_x)
        self.nce_weight=tf.Variable(tf.truncated_normal([len(self.vocab2id),self.embed_dim],
                                                   stddev=1.0/np.sqrt(self.embed_dim)),name='nce_weight')
        self.nce_bias=tf.Variable(tf.zeros([len(self.vocab2id)]),name='nce_bias')
        norm = tf.sqrt(tf.reduce_sum(tf.square(self.embeddings), axis=1, keep_dims=True))
        self.normalized_embeddings = self.embeddings / norm  # divide each row by its L2 norm to get unit-length vectors
        self.valid_embeddings = tf.nn.embedding_lookup(self.normalized_embeddings,
                                              self.valid_data)  # e.g. id 64 selects row 64 of normalized_embeddings
        self.similarity = tf.matmul(self.valid_embeddings, self.normalized_embeddings, transpose_b=True)  # shape (valid_size, vocab_size): cosine similarity between each validation word and every word in the vocabulary
        print('graph built successfully!')
    def add_loss(self):
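        # NCE (noise-contrastive estimation) loss: instead of a full softmax over the whole vocabulary,
        # tf.nn.nce_loss samples neg_samples negative classes per example and trains the model to separate
        # the true context word from the sampled noise words, which keeps training cheap for large vocabularies.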
        self.nce_loss = tf.reduce_mean(tf.nn.nce_loss(inputs=self.embed, weights=self.nce_weight, biases=self.nce_bias, num_sampled=self.neg_samples, labels=self.train_y,
                           num_classes=len(self.vocab2id)))
        self.global_step=tf.Variable(0,trainable=False)
        assert self.decay_steps>0
        self.learning_rate = tf.train.exponential_decay(self.init_rate, self.global_step,self.decay_steps, 0.96)
        # self.train_ = tf.train.AdamOptimizer(self.learning_rate).minimize(self.nce_loss, global_step=self.global_step)
        self.train_ = tf.train.GradientDescentOptimizer(self.learning_rate).minimize(self.nce_loss,global_step=self.global_step)
        self.init=tf.global_variables_initializer()
    def save(self,sess):
        saver=tf.train.Saver()                      # 1/save model
        self.saved_path=saver.save(sess,self.save_path,global_step=self.global_step)
        print('{}-save model finished!'.format(self.saved_path))
    def restore(self,sess):
        restored=tf.train.Saver()
        restored.restore(sess,tf.train.latest_checkpoint(self.save_path))
        print('{}-restored Finished!'.format(tf.train.latest_checkpoint(self.save_path)))
    def train(self):
        with tf.Session() as sess:
            sess.run(self.init)
            if self.is_load:
                self.restore(sess)
            for epoch in range(self.epochs):
                for steps in range((len(self.raw_data)+self.batch_size-1)//self.batch_size):
                    train_batch,train_label=build_batch(self.raw_data,self.vocab2id,self.batch_size,self.window_size)
                    feed_dict={self.train_x:train_batch,self.train_y:train_label}
                    _,loss,learn_rate=sess.run([self.train_,self.nce_loss,self.learning_rate],feed_dict=feed_dict)
                    # print progress every log_per_steps steps
                    if steps%self.log_per_steps==0:
                        print('learning_rate:',learn_rate)
                        print('loss:',loss)
                    # every log_per_steps steps, compute the similarity between the validation words and the
                    # whole vocabulary and print the top_k most similar words for each validation word
                    if steps % self.log_per_steps== 0:
                        sim = self.similarity.eval()
                        for i in range(self.valid_size):
                            valid_word = self.id2vocab[self.valid_example[i]]  # the validation word
                            nearest = (-sim[i, :]).argsort()[0:self.top_k+1]  # ids of the top_k+1 most similar words (the first one is the validation word itself)
                            log_str = "Nearest to %s:" % valid_word
                            for index in nearest:
                                close_word_similarity = sim[i, index]
                                close_word = self.id2vocab[index]
                                log_str = "%s %s(%s)," % (log_str, close_word, close_word_similarity)
                            print(log_str)
                # save the model once per epoch
                self.save(sess)
    def predict(self):
        with tf.Session() as sess:
            self.restore(sess)
            # save the word vectors
            if self.is_save_vector:
                embed =self.normalized_embeddings.eval()
                with open(self.embeddings_save_path+'embed.pkl','wb') as f:
                    pickle.dump(embed,f)
                print('成功保存词向量!')
            while 1:
                word = input('请输入:')
                print(word)
                if word in ['退出', 'q']:
                    break
                if word not in self.vocab2id:
                    print('该词不在语料库中')                     # word not in the corpus; use continue (not return) so the prompt loop keeps running
                    continue
                value_int = self.vocab2id[word]
                value_int = np.array([value_int])
    
                sim, word_emberdding = sess.run([self.similarity, self.valid_embeddings], feed_dict={self.valid_data: value_int})
                sim_sort = (-sim[0, :]).argsort()  # word ids sorted by similarity, highest first
                nearest = sim_sort[1:self.top_k + 1]  # the top_k most similar words, excluding the word itself
                log_str = "Nearest to %s:" % (word)
                for index in nearest:
                    close_word_similarity = sim[0, index]
                    close_word = self.id2vocab[index]
                    log_str = "%s: %s(%s)," % (log_str, close_word, close_word_similarity)
                print(log_str)
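
The similarity tensor built in build_graph is plain cosine similarity: every embedding row is divided by its L2 norm, so multiplying the normalized matrix by its own transpose gives the cosine of the angle between word vectors. A small numpy sketch of the same computation, using toy random numbers rather than trained embeddings:

import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.uniform(-1, 1, size=(2000, 128))    # toy (vocab_size, embed_dim) matrix

# normalize each row to unit L2 norm, mirroring normalized_embeddings in build_graph
norm = np.sqrt(np.sum(np.square(embeddings), axis=1, keepdims=True))
normalized = embeddings / norm

valid_ids = np.array([3, 64])                        # toy "validation" word ids
valid_vecs = normalized[valid_ids]                   # the embedding_lookup step
similarity = valid_vecs @ normalized.T               # shape (2, 2000): cosine similarities

# the word most similar to any word is the word itself, with similarity ~1.0
nearest = (-similarity[0]).argsort()[:6]
print(nearest[0] == valid_ids[0], similarity[0, nearest[0]])

This is also why, in the run output further down, every validation word lists itself first with a similarity of about 1.0.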

3. Program entry point: main.py (all parameters can be set here)

# F:\itsoftware\Anaconda
# -*- coding:utf-8 -*-
# Author = TJL
# date:2020/11/23
import os,pickle,json
import numpy as np
import argparse,random
from model import Word2vec
import pandas as pd
from dataset import get_dictionary
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # only show errors
parser = argparse.ArgumentParser(description='skip model generate word2vec')
parser.add_argument('--mode', type=str, default='train',help='the mode of model,input train or predict')
parser.add_argument('--vocab', type=dict, default={},help='the vocab of corpus')
parser.add_argument('--embed_dim', type=int, default=128,help='the dim of word2vec')
parser.add_argument('--init_rate', type=float, default=0.001,help='the init learn rate')
parser.add_argument('--neg_samples', type=int, default=5,help='number of negative samples; Mikolov et al. suggest 5-20 for small datasets and 2-5 for large ones')
parser.add_argument('--epochs', type=int, default=10,help='the train epochs')
parser.add_argument('--log_per_steps', type=int, default=100,help='print logging info every this many steps')
parser.add_argument('--decay_steps', type=int, default=1000,help='learn rate decay steps')
parser.add_argument('--is_load', type=int, default=0,help='whether to load a saved model before training')
parser.add_argument('--save_path', type=str, default='export/model/',help='the save path of word2vec model')
parser.add_argument('--embeddings_save_path', type=str, default='export/embed/',help='the save path of word2vec')
parser.add_argument('--is_save_vector', type=bool, default=False,help='whether to also save the word vectors when predicting')
parser.add_argument('--batch_size', type=int, default=64,help='batch_size of every train')
parser.add_argument('--window_size', type=int, default=1,help='window_size of center word')
parser.add_argument('--valid_size', type=int, default=20,help='number of words used for validation during training')
parser.add_argument('--valid_window', type=int, default=80,help='validation words are drawn only from the valid_window most frequent words')
parser.add_argument('--top_k', type=int, default=5,help='print the top_k words most similar to each validation word')
args = parser.parse_args()

if not os.path.exists(args.save_path):os.makedirs(args.save_path)
if not os.path.exists(args.embeddings_save_path):os.makedirs(args.embeddings_save_path)

train = pd.read_csv('./data/train.csv', sep='\t', encoding='utf-8', header=0)
#valid_window must be smaller than the vocabulary size
#sample valid_size validation word ids from 0..valid_window
args.valid_example=np.random.choice(range(0,args.valid_window+1),args.valid_size,replace=False)#replace=False means no repeats; returns an array of ids
print(args)
args.vocab = get_dictionary(train.text)
with open(args.embeddings_save_path+'vocab.json','w') as f:
    json.dump(args.vocab,f)
args.raw_data=train.text
print('vocab_size:%d'%len(args.vocab))


args.mode='train'
args.is_save_vector=True
model=Word2vec(args)
model.build_graph()
if args.mode=='train':
    model.add_loss()
    model.train()
if args.mode=='predict':model.predict()
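
Training is started simply with python main.py (note that the script overrides args.mode to 'train' near the end, so switching to prediction means editing that assignment). Once predict() has run with is_save_vector=True, the normalized embedding matrix is pickled to export/embed/embed.pkl, and the vocabulary dictionary is written to export/embed/vocab.json by main.py. A minimal sketch, assuming those default paths and an illustrative helper name word_vector, of how a downstream script might load them and look up a word's vector:

import json
import pickle

# the paths below assume the default --embeddings_save_path of export/embed/
with open('export/embed/vocab.json', 'r') as f:
    vocab2id = json.load(f)          # word -> id, written by main.py
with open('export/embed/embed.pkl', 'rb') as f:
    embeddings = pickle.load(f)      # (vocab_size, embed_dim) array of normalized vectors

def word_vector(word):
    # unknown words fall back to the <unk> id, mirroring text2id in dataset.py
    idx = vocab2id.get(word, len(vocab2id) - 1)
    return embeddings[idx]

vec = word_vector('学校')
print(vec.shape)                     # (128,) with the default embed_dim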

4. Run output

learning_rate Tensor("ExponentialDecay:0", shape=(), dtype=float32)
loss: 35.634575
Nearest to 可以: 可以(1.0), 两点(0.32112208), 大兴(0.31846204), 乙(0.3163019), 龍崗(0.31288782), 北街(0.30585745),
Nearest to ?: ?(0.9999995), 相较(0.32597366), 上车(0.3186155), 樾(0.31433615), 可逸豪苑(0.3025732), 石榴(0.28142884),
Nearest to 在: 在(1.0000005), 开建(0.34457743), 称道(0.31910366), 宣武(0.30779788), 临河(0.28048518), 毛坯(0.27829736),
Nearest to 约: 约(0.9999997), 13500(0.3159388), FPHONE(0.3008902), 赞成(0.29938617), 要交(0.29682663), 不得(0.293569),
Nearest to 还: 还(1.0000001), 安静(0.32537153), 央产(0.30649182), 假日(0.30287346), 惠济(0.30234322), 顾兆田(0.2873835),
Nearest to 装修: 装修(1.0000002), 杂七杂八(0.3107523), 读延奎(0.29153657), 八小(0.2885743), 样子(0.28697652), 子女(0.28611988),
Nearest to 那: 那(1.0000002), 后头(0.32418457), 传媒大学(0.3079832), 九龙湖(0.30544233), 你好(0.29563653), 没装过(0.29190364),
Nearest to 学校: 学校(1.0000001), 40%(0.2966747), 公办(0.28961998), 万五(0.28938365), 装配(0.2892762), 海亮(0.28491762),
Nearest to 房子: 房子(0.9999998), 笋(0.3246024), 冬晴园(0.31347537), 豪(0.31011298), 读们(0.30903813), id(0.29701883),
Nearest to 去: 去(1.0000001), 87(0.34158114), 经济(0.339583), 小加雨(0.29129082), 完全(0.28747457), 明真宫(0.28261346),
Nearest to 也: 也(0.9999998), 更改(0.3038134), 1.2(0.29568094), 春蕾(0.28864315), 城楼(0.28849733), 点价(0.28108528),
Nearest to 请问: 请问(0.99999994), 八个(0.33434284), 缺点(0.3296368), 侯杨庄(0.31732684), 过道(0.29825735), 买满(0.28074092),
Nearest to 吧: 吧(1.0000001), 正南(0.3443914), 发多(0.33085355), 本子(0.30240005), 宝能(0.30063546), 面谈(0.2904038),
Nearest to ,: ,(0.99999994), 是不是(0.3025281), 联强(0.28256297), 西丰(0.2755395), 北京四中(0.2705009), 平米(0.26250428),
Nearest to 这个: 这个(1.0000001), 460(0.32231426), 西门(0.32161713), 865(0.30170307), 欢迎(0.29995114), 沙沟(0.29513478),
Nearest to 税费: 税费(1.0), 云溪(0.28983387), 被选为(0.28937012), 可贷(0.28244895), 非承重墙(0.2776642), 居城(0.2767468),
Nearest to 就: 就(0.9999999), 小兄弟(0.34578022), 纠纷(0.34218168), 两居(0.32131922), 了解(0.31834626), 连赎楼(0.30643332),
Nearest to 都: 都(1.0000001), 主干路(0.32485265), 10.6%(0.30476782), 接送(0.30181143), 这里(0.2980861), 多点(0.2907191),
Nearest to 是: 是(0.9999994), 祥源城(0.30970925), 3205(0.28684098), 天气(0.28475177), 税点(0.2831169), 佛(0.28231636),
Nearest to 现在: 现在(1.0000001), 伊(0.35392264), 挂下来(0.33401194), 居嘉苑(0.3044447), 要出(0.30434543), 五所(0.30244544),

 
