Word Vectors

Zh823275484

Article link: https://blog.csdn.net/Zh823275484/article/details/88296725

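Word vectors are a basic building block of natural language processing and support tasks such as text analysis, sentiment analysis, and word-sense analysis. The core idea is to map each word to a dense vector so that similar words end up with similar vectors.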

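1. Representing word vectors

Words are usually represented as vectors in one of two ways: discrete or distributed. The discrete form is known as the one-hot representation. Its drawback is that it cannot express any relationship between words, since every pair of distinct words is equally far apart; its advantage is that in such a high-dimensional space many tasks become linearly separable.

The distributed form is known as the distributed representation: each word is mapped to a continuous, fixed-length dense vector. Its advantage is that distances between vectors reflect relationships between words, and each dimension can be read as a latent feature shared by all words.

A practical difference between the two: with one-hot features, individual dimensions of the feature vector can be pruned, because each dimension belongs to exactly one word; with a distributed representation they cannot, because every dimension contributes to every word.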
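The contrast between the two representations can be made concrete with a small sketch. The three-word vocabulary and the dense values below are hand-picked for illustration only; they are not trained embeddings.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

vocab = ["king", "queen", "apple"]  # hypothetical toy vocabulary

# One-hot: one dimension per word, a single 1 in each vector.
one_hot = np.eye(len(vocab))

# Dense vectors: illustrative values chosen by hand, not learned from data.
dense = np.array([
    [0.9, 0.8, 0.1],   # king
    [0.8, 0.9, 0.2],   # queen
    [0.1, 0.2, 0.9],   # apple
])

one_hot_sim = cosine(one_hot[0], one_hot[1])    # distinct one-hot vectors are orthogonal
dense_sim_close = cosine(dense[0], dense[1])    # king vs queen
dense_sim_far = cosine(dense[0], dense[2])      # king vs apple
print(one_hot_sim, dense_sim_close, dense_sim_far)
```

Any two distinct one-hot vectors have similarity 0, while the dense vectors can place "king" closer to "queen" than to "apple".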

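2. Training word vectors

2.1 Count-based methods

2.1.1 Co-occurrence matrix

Count how often words co-occur within a window, and use the co-occurrence counts of a word's neighbours as that word's vector. This partly alleviates the problem that any two one-hot vectors have zero similarity, but it does not solve the sparsity and high dimensionality of the vectors.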
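As a minimal sketch of the counting step, assuming a toy corpus of two tokenised sentences (the helper name `cooccurrence_matrix` is made up for this example):

```python
import numpy as np

def cooccurrence_matrix(sentences, window_size=1):
    """Count how often each pair of words appears within `window_size` positions."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sentence in sentences:
        for i, center in enumerate(sentence):
            lo = max(0, i - window_size)
            hi = min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    matrix[index[center], index[sentence[j]]] += 1
    return vocab, matrix

sentences = [["i", "like", "nlp"], ["i", "like", "cats"]]  # toy corpus
vocab, matrix = cooccurrence_matrix(sentences, window_size=1)
print(vocab)
print(matrix)
```

In this toy corpus "nlp" and "cats" never co-occur, yet their rows are identical because both appear next to "like"; this is the sense in which co-occurrence vectors recover similarity that one-hot vectors cannot.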

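2.1.2 Singular value decomposition

To address these problems of the co-occurrence matrix, the original word vectors can be reduced in dimension to obtain dense, continuous vectors. Applying SVD to the matrix yields an orthogonal matrix; keeping its leading columns and normalising the rows gives the word vectors.

The advantage of this method is that it captures, to some extent, semantic similarity between words as well as linear relationships between them. However, because many word pairs never co-occur, the matrix is extremely sparse, and word frequencies need extra processing to get good results; the matrix is also very large and high-dimensional.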
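A minimal sketch of the reduction step with NumPy, assuming a small hand-written count matrix and a target dimension k = 2. Scaling the left singular vectors by the singular values before normalising is one common choice, not necessarily the one the author used.

```python
import numpy as np

# Toy co-occurrence counts (hypothetical values) for 4 words.
vocab = ["cats", "i", "like", "nlp"]
counts = np.array([
    [0, 0, 1, 0],
    [0, 0, 2, 0],
    [1, 2, 0, 1],
    [0, 0, 1, 0],
], dtype=np.float64)

k = 2  # target dimension, an assumption for this example

# counts = U @ diag(S) @ Vt, with the columns of U orthonormal.
U, S, Vt = np.linalg.svd(counts)

# Keep the k leading directions, scaled by their singular values.
vectors = U[:, :k] * S[:k]

# Normalise every row to unit length so dot products are cosine similarities.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / np.maximum(norms, 1e-12)
print(vectors)
```

Each row of `vectors` is now a dense k-dimensional word vector; on a real vocabulary k would be far smaller than the number of words.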

The code for word vectors based on the co-occurrence matrix is as follows:

# Build a word-word co-occurrence matrix and use its rows as word vectors
import collections

file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
word_dimension = 200  # word-vector dimension (not used in this snippet)
window_size = 5  # context window size

def load_data(file_path=file_path):
    # The text is read from the second comma-separated field of every line
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # keep words longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ') if 'nbsp' not in word and len(word) > 1])
    return dataset
dataset = load_data()

# Count word frequencies and keep the words above min_count
def build_word_dict():
    words = []
    for data in dataset:
        words.extend(data)
    reserved_words = [item for item in collections.Counter(words).most_common() if item[1] > min_count]
    word_dict = {item[0]: item[1] for item in reserved_words}
    return word_dict

# Slide a context window over every sentence and count co-occurrences
def build_word2word_dict():
    word2word_dict = {}
    for data in dataset:
        for index in range(len(data)):
            left = data[max(0, index - window_size):index]
            right = data[index + 1:index + window_size + 1]
            context = left + [data[index]] + right  # all words inside the current window
            for word in context:
                co_counts = word2word_dict.setdefault(word, {})
                for co_word in context:
                    if co_word != word:
                        co_counts[co_word] = co_counts.get(co_word, 0) + 1
    return word2word_dict

# Build the co-occurrence matrix (each row is normalised to sum to 1)
def build_word2word_matrix():
    word2word_dict = build_word2word_dict()
    word_dict = build_word_dict()
    word_list = list(word_dict)  # only the keys (the words) are kept
    word2word_matrix = []
    for word1 in word_list:
        co_counts = word2word_dict.get(word1, {})
        sumtf = sum(co_counts.values())
        # guard against words that never co-occur with anything
        row = [co_counts.get(word2, 0) / sumtf if sumtf else 0.0 for word2 in word_list]
        word2word_matrix.append(row)
    return word2word_matrix

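2.2 Language-model-based methods

With a language model, word vectors are a by-product of training a neural network. The network usually has three layers: an input layer, a hidden layer, and an output layer. The best-known method is word2vec, which comes in two variants, CBOW and skip-gram.

word2vec has two optimisations: one based on hierarchical softmax, the other based on negative sampling.

In the basic model, the bottleneck is the softmax between the hidden layer and the output layer: the softmax probability of every word in the vocabulary has to be computed before the most probable one can be found. word2vec's first optimisation, hierarchical softmax, replaces the neurons of the hidden and output layers with a Huffman tree to avoid this cost. The leaf nodes of the tree play the role of the output neurons. After the tree is built, its leaves are encoded: leaves with high weight sit close to the root and leaves with low weight sit far from it, so high-weight leaves get short codes and low-weight leaves get long ones. This agrees with information theory: the more frequent a word, the shorter its code. Note that at every internal node of the tree, a sigmoid function decides whether to go to the left or the right child. Training therefore reduces to solving for the hierarchical softmax parameters by computing their gradients.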
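The relationship between word frequency and code length can be checked with a small Huffman-coding sketch. The word frequencies are invented for illustration, and `huffman_codes` is a helper written for this example rather than part of word2vec. The length of a word's code equals the number of left/right sigmoid decisions needed to reach its leaf.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Return {word: binary code string} for a {word: frequency} mapping."""
    counter = itertools.count()  # tie-breaker so heapq never compares dicts
    heap = [(f, next(counter), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(counter), merged))
    return heap[0][2]

freqs = {"the": 50, "of": 30, "cat": 12, "sat": 5, "zygote": 3}  # made-up counts
codes = huffman_codes(freqs)
for word in freqs:
    print(word, codes[word], len(codes[word]))
```

With these counts the most frequent word needs a single decision, while the two rarest words need four, instead of one softmax over the whole vocabulary.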

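Negative sampling drops the Huffman tree, because when the centre word of a training sample is a rare word, the path to it through the Huffman tree is long. Take one training sample with centre word w and 2c surrounding context words, denoted context(w). Since w really does occur together with context(w), this pair is a positive example. Negative sampling then draws neg words w_i, i = 1, 2, ..., neg, that differ from w; each pair (context(w), w_i) is a negative example. Binary logistic regression on the one positive example and the neg negative examples yields the model parameters theta for each word w_i, as well as the word vector of every word.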
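The binary logistic regression described above can be written as a loss over one positive and neg negative examples. This is a sketch under stated assumptions: `h` stands for the representation of context(w) (for example the mean of the context word vectors), all vectors are random placeholders, and the function name is made up for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, theta_pos, theta_negs):
    """Logistic loss for one positive word and `neg` sampled negative words.

    h          : representation of context(w), shape (dim,)
    theta_pos  : output parameters of the true centre word w, shape (dim,)
    theta_negs : output parameters of the sampled words w_i, shape (neg, dim)
    """
    pos_term = np.log(sigmoid(theta_pos @ h))              # want sigmoid -> 1
    neg_term = np.sum(np.log(sigmoid(-(theta_negs @ h))))  # want sigmoid -> 0
    return float(-(pos_term + neg_term))

rng = np.random.default_rng(0)
dim, neg = 8, 5
h = rng.normal(size=dim)
theta_pos = rng.normal(size=dim)
theta_negs = rng.normal(size=(neg, dim))
loss = negative_sampling_loss(h, theta_pos, theta_negs)
print(loss)
```

Only the parameters of the positive word and the neg sampled words appear in the loss, so only those receive gradients in a training step.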

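A brief summary of negative sampling:

Suppose the vocabulary has 10,000 words and the word vectors have 300 dimensions. Then each layer of the network has 3 million parameters (10,000 × 300), and the output layer amounts to a multi-class problem with 10,000 possible classes, which is very expensive to compute. The idea of sampling is extremely simple. The softmax output of the network is a vector in which only one entry, the one that should receive the highest probability, corresponds to the correct word; all the other words are called negative samples. If only 5 negative samples are selected, the output is just a 6-dimensional vector, and the number of parameters to consider drops from 3 million to 1,800 (6 × 300). This looks like a shortcut, but it works well in practice and greatly improves computational efficiency.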
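The arithmetic behind these numbers, spelled out with the figures from the paragraph above:

```python
vocab_size = 10_000     # words in the vocabulary
embedding_dim = 300     # dimensions per word vector
num_negative = 5        # negative samples drawn per training example

# Full softmax: every output row takes part in each update.
full_output_params = vocab_size * embedding_dim

# Negative sampling: only the positive word and the sampled negatives do.
updated_rows = 1 + num_negative
sampled_output_params = updated_rows * embedding_dim

print(full_output_params)      # 3000000
print(sampled_output_params)   # 1800
```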

2.2.1 CBOW (continuous bag-of-words model)

This model predicts the probability of the current word given its context, where the context is selected with a sliding window. The TensorFlow code used in this article to train CBOW word vectors with negative sampling (implemented here with tf.nn.sampled_softmax_loss) is as follows:

# CBOW (continuous bag-of-words): predict the current word from its context
# Uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session). On TensorFlow 2.x,
# use `import tensorflow.compat.v1 as tf` followed by `tf.disable_v2_behavior()`.
import collections
import math
import numpy as np
import tensorflow as tf
file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
batch_size = 200  # number of samples per iteration
embedding_size = 200  # dimension of the word vectors
window_size = 5  # context window size
num_sampled = 100  # number of negative samples
num_steps = 10000  # maximum number of iterations
def load_data(file_path=file_path):
    # The text is read from the second comma-separated field of every line
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # keep words longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ') if 'nbsp' not in word and len(word) > 1])
    return dataset
dataset = load_data()
# Flatten the dataset into one list of words
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words
# Build the vocabulary and convert the words to ids
def build_dataset(words, min_count):
    count = [['unk', -1]]  # id 0 is reserved for filtered-out or unseen words
    reserved_words = [item for item in collections.Counter(words).most_common() if item[1] > min_count]
    count.extend(reserved_words)
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
# Generate training samples
data_index = 0
def generate_batch(batch_size, window_size, data):
    # data is a list of word ids
    global data_index  # the read position is kept between calls
    span = 2 * window_size + 1  # context on both sides plus the centre word
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    buffer = collections.deque(maxlen=span)

    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)  # wrap around at the end of data
    for i in range(batch_size):
        col_idx = 0
        for j in range(span):
            if j == window_size:  # skip the centre word
                continue
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[window_size]  # the centre word is the label
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
# Train the model
def train_word2vec(vocabulary_size, batch_size, embedding_size, window_size, num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default(), tf.device('/cpu:0'):
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * window_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        embedding = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # Unlike skip-gram, the input of CBOW is the mean of the context vectors.
        # Equivalent one-liner: tf.reduce_mean(tf.nn.embedding_lookup(embedding, train_dataset), axis=1)
        context_embedding = []
        for i in range(2 * window_size):  # look up each context column, then average
            context_embedding.append(tf.nn.embedding_lookup(embedding, train_dataset[:, i]))
        ave_embed = tf.reduce_mean(tf.stack(axis=0, values=context_embedding), 0, keepdims=False)
        softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
        # Loss function
        loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
            weights=softmax_weights,
            biases=softmax_biases,
            inputs=ave_embed,
            labels=train_labels,
            num_sampled=num_sampled,
            num_classes=vocabulary_size
        ))
        # The original used a learning rate of 1.0, which is very large for Adam;
        # 0.001 (the TensorFlow default) is used here instead.
        opt = tf.train.AdamOptimizer(0.001).minimize(loss)
        # L2 norm of every word vector, used for normalisation
        norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
        normalized_embeddings = embedding / norm
        init = tf.global_variables_initializer()
    with tf.Session(graph=graph) as session:
        session.run(init)
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, window_size, data)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([opt, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 200 == 0:
                if step > 0:
                    average_loss = average_loss / 200
                print('average loss at step', step, ':', average_loss)
                average_loss = 0
        final_embedding = session.run(normalized_embeddings)
    return final_embedding
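To see what a CBOW training sample looks like without TensorFlow, here is a small sketch that builds (context ids, centre id) pairs from a list of word ids. The ids are invented, and unlike the batch generator above this helper does not wrap around at the end of the data.

```python
def cbow_pairs(token_ids, window_size):
    """Return (context_ids, target_id) for every position with a full window."""
    pairs = []
    for i in range(window_size, len(token_ids) - window_size):
        context = token_ids[i - window_size:i] + token_ids[i + 1:i + window_size + 1]
        pairs.append((context, token_ids[i]))
    return pairs

token_ids = [4, 7, 1, 9, 3]  # hypothetical word ids
pairs = cbow_pairs(token_ids, window_size=1)
print(pairs)  # [([4, 1], 7), ([7, 9], 1), ([1, 3], 9)]
```

Each context list corresponds to one row of the batch fed to the network, and each target id corresponds to one label.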

2.2.2 skip-gram

The principle is largely the same as for CBOW, except that the input is the centre word and the output is the surrounding words.
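The reversed direction can be illustrated by listing the (centre, context) pairs that skip-gram trains on. This is a sketch with invented word ids; the helper name is made up for the example.

```python
def skipgram_pairs(token_ids, window_size):
    """Return every (centre_id, context_id) pair within `window_size` positions."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window_size)
        hi = min(len(token_ids), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # the centre word is not its own context
                pairs.append((center, token_ids[j]))
    return pairs

pairs = skipgram_pairs([4, 7, 1], window_size=1)
print(pairs)  # [(4, 7), (7, 4), (7, 1), (1, 7)]
```

Where CBOW merges a whole window into one sample, skip-gram produces one sample per context word.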

The TensorFlow code for training skip-gram word vectors with negative sampling (implemented here with tf.nn.nce_loss) is as follows:

# Train word vectors with skip-gram: predict the context from the current word
# Uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session). On TensorFlow 2.x,
# use `import tensorflow.compat.v1 as tf` followed by `tf.disable_v2_behavior()`.
import collections
import math
import random
import numpy as np
import tensorflow as tf
file_path = r"D:\workspace\project\NLPcase\word2vec\data\data.txt"
model_path = r"D:\workspace\project\NLPcase\word2vec\model\skipgram_word2vec.txt"
min_count = 5  # minimum word frequency
batch_size = 200  # number of samples per iteration
embedding_size = 200  # dimension of the word vectors
window_size = 5  # context window size
num_sampled = 100  # number of negative samples
num_steps = 10000  # maximum number of iterations
def load_data(file_path=file_path):
    # The text is read from the second comma-separated field of every line
    dataset = []
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            line = line.strip().split(',')
            # keep words longer than one character and drop 'nbsp' residue
            dataset.append([word for word in line[1].split(' ') if 'nbsp' not in word and len(word) > 1])
    return dataset
dataset = load_data()
# Flatten the dataset into one list of words
def read_data(dataset):
    words = []
    for data in dataset:
        words.extend(data)
    return words
# Build the vocabulary and convert the words to ids
def build_dataset(words, min_count):
    # Filter out low-frequency words and number the rest by descending frequency
    count = [['UNK', -1]]  # counts words that are filtered out or unseen
    count.extend([item for item in collections.Counter(words).most_common() if item[1] > min_count])
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)  # assign ids
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))  # id -> word
    return data, dictionary, reverse_dictionary

# Generate training samples
data_index = 0
def generate_batch(batch_size, window_size, data):
    # data is a list of word ids.
    # Every centre word yields 2*window_size (centre, context) pairs,
    # so batch_size must be a multiple of 2*window_size.
    # Example with window_size=1 and batch_size=8 on
    #   data   [0, 5241, 3082, 12, 6, 195, 2, 3137, 46, 59]
    #   words  ['UNK', 'anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used']
    #   batch  -> [5241, 5241, 3082, 3082, 12, 12, 6, 6]
    #   labels -> [[0], [3082], [5241], [12], [3082], [6], [12], [195]]
    #   (the order of the two labels of each centre word is random)
    global data_index  # the read position is kept between calls
    assert batch_size % (2 * window_size) == 0
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)  # centre-word ids
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # one context-word id per entry
    span = 2 * window_size + 1  # size of one full window
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // (window_size * 2)):  # one centre word per iteration
        target = window_size  # position of the centre word
        target2avoid = [window_size]  # the centre word itself is excluded first
        for j in range(window_size * 2):  # use every context position exactly once
            while target in target2avoid:
                target = random.randint(0, span - 1)
            target2avoid.append(target)
            batch[i * window_size * 2 + j] = buffer[window_size]
            labels[i * window_size * 2 + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
# Build the network and train it
def train_wordvec(vocabulary_size, batch_size, embedding_size, window_size, num_sampled, num_steps, data):
    graph = tf.Graph()
    with graph.as_default():
        # Input data
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        # Train on the CPU
        with tf.device('/cpu:0'):
            # Initialise the embedding matrix
            embedding = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            # Look up the embeddings of the input words
            embed = tf.nn.embedding_lookup(embedding, train_inputs)
            # Parameters of the output layer
            nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
            nce_bias = tf.Variable(tf.zeros([vocabulary_size]))
            # Loss function
            loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                                 biases=nce_bias,
                                                 labels=train_labels,
                                                 inputs=embed,
                                                 num_classes=vocabulary_size,
                                                 num_sampled=num_sampled))
            # Optimiser
            opt = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
            # L2 norm of every word vector, used for normalisation
            norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keepdims=True))
            normalized = embedding / norm
            # Initialise the model variables
            init = tf.global_variables_initializer()

        # Train the network built above
        with tf.Session(graph=graph) as session:
            # Run the initialiser
            session.run(init)
            # Running average of the loss
            average_loss = 0
            for step in range(num_steps):
                batch_inputs, batch_labels = generate_batch(batch_size, window_size, data)
                feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
                # Loss of the current iteration (kept separate from the loss tensor)
                _, loss_val = session.run([opt, loss], feed_dict=feed_dict)
                average_loss += loss_val
                # Print the average loss at regular intervals
                if step % 200 == 0:
                    if step > 0:
                        average_loss /= 200
                    print('average loss at step', step, ":", average_loss)
                    average_loss = 0
            final_embedding = session.run(normalized)
    return final_embedding
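Because the returned embeddings are normalised to unit length, the dot product of two rows is their cosine similarity, which gives a simple way to check that similar words have nearby vectors. The sketch below uses tiny hand-written stand-ins for the outputs of build_dataset and train_wordvec; the helper `nearest_words` is hypothetical and not part of the original code.

```python
import numpy as np

def nearest_words(word, embeddings, dictionary, reverse_dictionary, top_k=3):
    """Return the top_k words whose unit-length vectors are closest to `word`."""
    word_id = dictionary[word]
    scores = embeddings @ embeddings[word_id]   # cosine similarity for unit rows
    order = np.argsort(-scores)                 # most similar first
    return [reverse_dictionary[int(i)] for i in order if int(i) != word_id][:top_k]

# Hypothetical stand-ins for the real vocabulary and trained vectors.
dictionary = {"UNK": 0, "king": 1, "queen": 2, "apple": 3}
reverse_dictionary = {i: w for w, i in dictionary.items()}
raw = np.array([[0.0, -1.0], [0.9, 0.8], [0.8, 0.9], [-0.9, 0.2]])
embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)

neighbours = nearest_words("king", embeddings, dictionary, reverse_dictionary, top_k=1)
print(neighbours)  # ['queen']
```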

References:

https://blog.csdn.net/mawenqi0729/article/details/80698350

http://www.cnblogs.com/pinard/p/7160330.html

https://blog.csdn.net/u014595019/article/details/54093161

http://www.cnblogs.com/pinard/p/7249903.html

https://blog.csdn.net/rxt2012kc/article/details/71123052

https://blog.csdn.net/leadai/article/details/80249999

https://github.com/liuhuanyong/Word2Vector
