Paper Reproduction Series, Part 3: GloVe




GloVe

Paper | GloVe: Global Vectors for Word Representation

Link | https://nlp.stanford.edu/projects/glove/

Authors | Jeffrey Pennington / Richard Socher / Christopher D. Manning

Published | 2014

I. Overview

Before diving in, let's parse the title. "Global" signals that GloVe works with global corpus statistics, and "Vectors for Word Representation" tells us it is a way of representing words as vectors (word embeddings). As the authors themselves put it, the name GloVe comes from the fact that "the global corpus statistics are captured directly by the model."

GloVe combines the global matrix factorization behind LSA with the local context window methods of word2vec. LSA's strength is that it exploits global statistics of the corpus, but it performs poorly on word analogies; word2vec maps words into a vector space where analogy tasks work well, but it does not use global corpus statistics. The authors combined the strengths of both and proposed the GloVe model.

II. How GloVe Works

1. Building the word co-occurrence matrix

As in word2vec, words are divided into center words and context words, and we count how many times each pair appears together in the same window. This is where the notion of a window comes in; if you are not familiar with it, see my previous reproduction project on word2vec.

Example: suppose the corpus consists of three sentences: "jjy like eat", "jjy like sleep", "jjy very enjoy NLP". With a window size of 1, we obtain the following co-occurrence matrix:
| count | jjy | like | eat | sleep | very | enjoy | NLP |
|---|---|---|---|---|---|---|---|
| jjy   |   | 2 |   |   | 1 |   |   |
| like  | 2 |   | 1 | 1 |   |   |   |
| eat   |   | 1 |   |   |   |   |   |
| sleep |   | 1 |   |   |   |   |   |
| very  | 1 |   |   |   |   | 1 |   |
| enjoy |   |   |   |   | 1 |   | 1 |
| NLP   |   |   |   |   |   | 1 |   |
The co-occurrence matrix is always symmetric. The authors also note that the entries are not raw co-occurrence counts, but counts multiplied by a decreasing weighting function of distance, so that within a window, (center, context) pairs that are closer together receive a larger weight and pairs further apart a smaller one.
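Below is a minimal sketch of building such a distance-weighted co-occurrence matrix, using the toy corpus above (this is my own illustration of the 1/d weighting idea; the project code later in this post uses plain counts without the distance weighting):

```python
import numpy as np

def build_cooccurrence(sentences, vocab, window=1):
    """Co-occurrence counts weighted by 1/d, where d is the distance
    between the center word and the context word inside the window."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for sent in sentences:
        ids = [word2idx[w] for w in sent if w in word2idx]
        for pos, center in enumerate(ids):
            for d in range(1, window + 1):
                for ctx_pos in (pos - d, pos + d):
                    if 0 <= ctx_pos < len(ids):
                        X[center, ids[ctx_pos]] += 1.0 / d  # closer pairs get more weight
    return X

vocab = ["jjy", "like", "eat", "sleep", "very", "enjoy", "NLP"]
sentences = [["jjy", "like", "eat"],
             ["jjy", "like", "sleep"],
             ["jjy", "very", "enjoy", "NLP"]]
print(build_cooccurrence(sentences, vocab, window=1))  # reproduces the table above
```

With a window of 1 the 1/d factor is always 1, so this reproduces the table; with a larger window, more distant pairs contribute fractional counts.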

2. Computing probabilities from the co-occurrence matrix

We need to use the co-occurrence information between words. Here we define the co-occurrence probability of word $i$ and word $j$ as
$$P_{ij}=P(j \mid i)=\frac{x_{ij}}{x_i}$$
with $x_i=\sum_j x_{ij}$.

In this notation:

  • $x$: the co-occurrence matrix
  • $i$: the center word
  • $j$: the context word
  • $x_{ij}$: the number of co-occurrences of center word $i$ and context word $j$
  • $P_{ij}$: the probability that word $j$ appears around center word $i$, i.e. the probability of $j$ occurring in the context of $i$

In other words, this probability is just the number of times words $i$ and $j$ appear together divided by the total number of occurrences of word $i$, taken as the probability that $j$ appears given that $i$ appears.
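As a quick illustration (reusing the toy matrix from the example above; the variable names are my own), the co-occurrence probabilities are simply row-normalized counts:

```python
import numpy as np

# Toy co-occurrence matrix for the vocabulary
# [jjy, like, eat, sleep, very, enjoy, NLP] from the example above.
X = np.array([
    [0, 2, 0, 0, 1, 0, 0],   # jjy
    [2, 0, 1, 1, 0, 0, 0],   # like
    [0, 1, 0, 0, 0, 0, 0],   # eat
    [0, 1, 0, 0, 0, 0, 0],   # sleep
    [1, 0, 0, 0, 0, 1, 0],   # very
    [0, 0, 0, 0, 1, 0, 1],   # enjoy
    [0, 0, 0, 0, 0, 1, 0],   # NLP
], dtype=np.float32)

x_i = X.sum(axis=1, keepdims=True)  # x_i = sum_j x_ij
P = X / x_i                         # P_ij = x_ij / x_i

print(P[0, 1])  # P(like | jjy) = 2 / 3
```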

Based on this, the authors observed a telling pattern.

3. Co-occurrence probability ratios

The GloVe paper reports the following co-occurrence probabilities and ratios for a set of probe words:

| $w_k$ | solid | gas | water | fashion |
|---|---|---|---|---|
| $p_1 = P(w_k \mid \text{ice})$ | 0.00019 | 0.000066 | 0.003 | 0.000017 |
| $p_2 = P(w_k \mid \text{steam})$ | 0.000022 | 0.00078 | 0.0022 | 0.000018 |
| $p_1 / p_2$ | 8.9 | 0.085 | 1.36 | 0.96 |

We can observe the following:

  • For a word $k$ related to ice but not to steam, e.g. $k=$ solid, we expect a large co-occurrence probability ratio $\frac{P_{ik}}{P_{jk}}$, e.g. the 8.9 in the last row above.

  • For a word $k$ related to steam but not to ice, e.g. $k=$ gas, we expect a small ratio $\frac{P_{ik}}{P_{jk}}$, e.g. the 0.085 in the last row above.

  • For a word $k$ related to both ice and steam, e.g. $k=$ water, we expect the ratio $\frac{P_{ik}}{P_{jk}}$ to be close to 1, e.g. the 1.36 in the last row above.

  • For a word $k$ related to neither ice nor steam, e.g. $k=$ fashion, we expect the ratio $\frac{P_{ik}}{P_{jk}}$ to be close to 1, e.g. the 0.96 in the last row above.

In summary: when $P_{ik}$ is large, i.e. when $i$ and $k$ are strongly related, the ratio grows; likewise, when $P_{jk}$ is small, i.e. when $j$ and $k$ are only weakly related, the ratio also grows. When both probabilities are large or both are small, the ratio stays near 1. This cleanly separates strongly related, weakly related, and unrelated words.

So the co-occurrence probability ratio expresses the relationship between words rather directly.

4. Expressing the co-occurrence probability ratio with word vectors

The core of GloVe is to express the co-occurrence probability ratio with word vectors. Again, word $i$ is the center word and word $j$ the context word. We use $\boldsymbol{v}$ and $\tilde{\boldsymbol{v}}$ for center-word and context-word vectors respectively. Any such ratio involves three words $i$, $j$ and $k$, whose vectors we denote $\boldsymbol{v}_i, \boldsymbol{v}_j, \tilde{\boldsymbol{v}}_k$.

For the co-occurrence probability $P_{ij}=P(j \mid i)$, we look for a function $f$ of the word vectors that expresses the co-occurrence probability ratio:
$$f(\boldsymbol{v}_i, \boldsymbol{v}_j, \tilde{\boldsymbol{v}}_k)=\frac{P_{ik}}{P_{jk}}$$
Note that the choice of $f$ is not unique. To preserve the linear structure of the vector space, we use a vector difference and rewrite the above as
$$f(\boldsymbol{v}_i-\boldsymbol{v}_j, \tilde{\boldsymbol{v}}_k)=\frac{P_{ik}}{P_{jk}}$$
Since the ratio on the right is a scalar while the arguments on the left are vectors, we use an inner product to turn the argument of $f$ into a scalar:
$$f\big((\boldsymbol{v}_i-\boldsymbol{v}_j)^T \tilde{\boldsymbol{v}}_k\big)=\frac{P_{ik}}{P_{jk}}$$
Because word co-occurrence is symmetric, we would like the following two properties to hold at the same time:

  • Any word's center-word vector and context-word vector should be equal: for any word $i$, $\boldsymbol{v}_i=\tilde{\boldsymbol{v}}_i$.
  • The word co-occurrence matrix $X$ should be symmetric: for any words $i$ and $j$, $x_{ij}=x_{ji}$.

To satisfy both properties, on the one hand we let
$$f\big((\boldsymbol{v}_i-\boldsymbol{v}_j)^T \tilde{\boldsymbol{v}}_k\big)=\frac{f(\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k)}{f(\boldsymbol{v}_j^T \tilde{\boldsymbol{v}}_k)}$$
which naturally suggests $f(x)=\exp(x)$. Combining this with the ratio above gives
$$\exp(\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k)=P_{ik}=\frac{x_{ik}}{x_i}$$
Taking the logarithm of both sides,
$$\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k=\log(x_{ik})-\log(x_i)$$
The term $\log(x_i)$ on the right does not depend on $k$, and to satisfy the symmetry requirement we replace $\log(x_i)$ with the sum of two bias terms $b_i+\tilde{b}_k$, giving
$$\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k=\log(x_{ik})-b_i-\tilde{b}_k$$
Swapping the indices $i$ and $k$, one can verify that both symmetry properties are satisfied.
Therefore, for any pair of words $i$ and $k$, expressing the co-occurrence probability ratio with their word vectors ultimately reduces to fitting the logarithm of their co-occurrence count:
$$\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k+b_i+\tilde{b}_k=\log(x_{ik})$$
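To spell out the symmetry check mentioned above (my own filling-in of the step, using the two assumed properties $\boldsymbol{v}=\tilde{\boldsymbol{v}}$ and $x_{ik}=x_{ki}$): swapping $i$ and $k$ gives

$$\boldsymbol{v}_k^T \tilde{\boldsymbol{v}}_i+b_k+\tilde{b}_i=\log(x_{ki})$$

Because $\boldsymbol{v}_k^T \tilde{\boldsymbol{v}}_i=\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_k$ when center and context vectors coincide, and $\log(x_{ki})=\log(x_{ik})$ by symmetry of the matrix, this is the same equation as before, so the model treats the two words interchangeably.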

5. Defining the loss function

The co-occurrence counts above are obtained directly from the training data. To learn the word vectors and the corresponding bias terms, we want the left-hand side of the equation above to be as close as possible to the right-hand side. Given a vocabulary of size $V$ and a weighting function $f(x_{ij})$, we define the loss as
$$J=\sum_{i, j=1}^{V} f(x_{ij})\big(\boldsymbol{v}_i^T \tilde{\boldsymbol{v}}_j+b_i+\tilde{b}_j-\log(x_{ij})\big)^2$$
The weight $f(x)$ adjusts the contribution of each word pair. The authors argue that not all pairs should be weighted equally: pairs that co-occur often and pairs that rarely co-occur carry different amounts of information.

A suggested choice is: when $x<x_{\max}$ (e.g. $x_{\max}=100$), let $f(x)=(x/x_{\max})^{\alpha}$ (e.g. $\alpha=0.75$); otherwise let $f(x)=1$:
$$f(x)=\begin{cases}(x/x_{\max})^{\alpha} & \text{if } x<x_{\max} \\ 1 & \text{otherwise}\end{cases}$$
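As a quick sanity check on this weighting function, here is a minimal NumPy sketch ($x_{\max}$ and $\alpha$ are the illustrative defaults mentioned above; the function name is my own):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, capped at 1 above it."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x < x_max, np.power(x / x_max, alpha), 1.0)

# Rare pairs are down-weighted, frequent pairs are capped at 1.
print(glove_weight([1, 10, 100, 500]))  # ~[0.032, 0.178, 1.0, 1.0]
```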

Note that the cost of evaluating the loss is linear in the number of nonzero entries of the co-occurrence matrix $x$. We can therefore sample mini-batches of nonzero entries from $x$ and update the word vectors and bias terms with stochastic gradient descent.
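To make the pieces explicit, here is a rough NumPy sketch of evaluating this loss on a mini-batch of nonzero entries (all names are illustrative; the actual Paddle implementation follows in the code section):

```python
import numpy as np

def glove_batch_loss(V, V_tilde, b, b_tilde, X, pairs, x_max=100.0, alpha=0.75):
    """Weighted squared loss on a mini-batch of nonzero (i, j) entries of X.

    V, V_tilde : center / context embedding matrices, shape (vocab, dim)
    b, b_tilde : bias vectors, shape (vocab,)
    pairs      : integer array of shape (batch, 2) with X[i, j] > 0
    """
    i, j = pairs[:, 0], pairs[:, 1]
    x_ij = X[i, j].astype(np.float32)
    weight = np.where(x_ij < x_max, (x_ij / x_max) ** alpha, 1.0)
    pred = np.sum(V[i] * V_tilde[j], axis=1) + b[i] + b_tilde[j]
    return np.mean(weight * (pred - np.log(x_ij)) ** 2)
```

Sampling `pairs` only from the nonzero entries is what keeps training linear in the number of nonzero co-occurrences.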

As training goes on the error shrinks, and we end up with $w$ and $\tilde{w}$: the same word's vector as a center word and as a context word. Since the co-occurrence matrix $x$ is symmetric, $w$ and $\tilde{w}$ should in theory be identical, but different random initializations make them differ. The authors argue that summing the two vectors reduces overfitting and noise, so the sum is usually taken as the final word vector.

Finally, the derivation behind GloVe is not fully rigorous; it mostly lays out design choices, i.e. how the model of the co-occurrence counts should look in order to satisfy the two symmetry conditions. So it is perfectly normal if some step of the derivation is hard to follow; you can skip it as long as you understand the properties of the final loss function. The paper also studies how factors such as vector dimension and window size affect the results; interested readers can consult the original paper.

III. Code Implementation

The dataset is text8, the corpus most deep learning frameworks use to test word2vec training.

A sample of the data:

anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word…

That is, English text with no punctuation, with words separated by single spaces.

# Run this only once
!unzip data/data189923/text8.zip
Archive:  data/data189923/text8.zip
  inflating: text8/text8.dev.txt     
  inflating: text8/text8.test.txt    
  inflating: text8/text8.train.txt   
# Model definition


import paddle
import paddle.nn as nn  # neural network building blocks
import paddle.nn.functional as F  # functional neural network ops
import numpy as np
import sys
import math


class GloveModel(nn.Layer):
    def __init__(self, vocab_size, embed_size):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size

        # Declare v and w as embedding tables (center-word and context-word vectors)
        self.v = nn.Embedding(vocab_size, embed_size)
        self.w = nn.Embedding(vocab_size, embed_size)
        self.biasv = nn.Embedding(vocab_size, 1)
        self.biasw = nn.Embedding(vocab_size, 1)

        # Randomly initialize the embedding weights in a small range
        initrange = 0.5 / self.embed_size
        self.v.weight.set_value(paddle.uniform(shape=[vocab_size, embed_size], min=-initrange, max=initrange))
        self.w.weight.set_value(paddle.uniform(shape=[vocab_size, embed_size], min=-initrange, max=initrange))

    def forward(self, i, j, co_occur, weight):
        vi = self.v(i)                  # [batch, embed_size]
        wj = self.w(j)                  # [batch, embed_size]
        bi = self.biasv(i).squeeze(-1)  # [batch]
        bj = self.biasw(j).squeeze(-1)  # [batch]

        # Per-sample dot product between the center and context vectors
        similarity = paddle.sum(paddle.multiply(vi, wj), axis=-1)

        loss = similarity + bi + bj - paddle.log(co_occur)

        # The 0.5 factor is because the final embedding is the sum of v and w;
        # weight was precomputed when the dataset was built to save time here.
        loss = 0.5 * weight * loss * loss

        return loss.mean()

    def gloveMatrix(self):
        '''
        Return the final word vectors: the sum of the center-word
        and context-word embedding tables.
        '''
        return self.v.weight.numpy() + self.w.weight.numpy()
# Dataset definition

import paddle.io as tud
import paddle

class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, co_matrix, weight_matrix):
         
        self.co_matrix = co_matrix
        self.weight_matrix = weight_matrix
        self.train_set = []

        for i in range(self.weight_matrix.shape[0]):
            for j in range(self.weight_matrix.shape[1]):
                if weight_matrix[i][j] != 0:
                    # Keep only entries with a nonzero weight:
                    # a co-occurrence count of 0 would make log(X) nan
                    self.train_set.append((i, j))

    def __len__(self):
        '''
        Required override.
        :return: the size of the training set
        '''
        return len(self.train_set)

    def __getitem__(self, index):
        '''
        Required override.
        :param index: sample index
        :return: one training sample
        '''
        (i, j) = self.train_set[index]
        return i, j, paddle.to_tensor(self.co_matrix[i][j], dtype = "float32"), self.weight_matrix[i][j]
from collections import Counter
from sklearn.metrics.pairwise import  cosine_similarity

import pandas as pd
import numpy as np
import scipy

import time
import math
import random
import sys
import matplotlib.pyplot as plt

# Hyperparameters
EMBEDDING_SIZE = 50
MAX_VOCAB_SIZE = 2000
WINDOW_SIZE = 5

NUM_EPOCHS = 1  # Default is 1 epoch to save time; more epochs give better results (the results shown below use 6 epochs)
BATCH_SIZE = 10
LEARNING_RATE = 0.05

TEXT_SIZE = 20000000
LOG_FILE = "logs/glove-{}.log".format(EMBEDDING_SIZE)
WEIGHT_FILE = "weights/glove-{}.th".format(EMBEDDING_SIZE)

# Data preprocessing
def getCorpus(filetype, size):
    if filetype == 'dev':
        filepath = 'text8/text8.dev.txt'
    elif filetype == 'test':
        filepath = 'text8/text8.test.txt'
    else:
        filepath = 'text8/text8.train.txt'

    with open(filepath, "r") as f:
        # Read the whole file
        text = f.read()
        # Lowercase and split into tokens
        text = text.lower().split()
        # Truncate the text to at most `size` tokens
        text = text[: min(len(text), size)]
        # Keep the MAX_VOCAB_SIZE - 1 most frequent words
        vocab_dict = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))
        # All remaining words are mapped to <unk>; its count is the number of uncovered tokens
        vocab_dict['<unk>'] = len(text) - sum(list(vocab_dict.values()))
        # Build idx_to_word and word_to_idx
        idx_to_word = list(vocab_dict.keys())
        word_to_idx = {word:ind for ind, word in enumerate(idx_to_word)}
        # Total count of each word
        word_counts = np.array(list(vocab_dict.values()), dtype=np.float32)
        # Word frequencies
        word_freqs = word_counts / sum(word_counts)
        print("Words list length:{}".format(len(text)))
        print("Vocab size:{}".format(len(idx_to_word)))
    return text, idx_to_word, word_to_idx, word_counts, word_freqs

# Build the co-occurrence matrix
def buildCooccuranceMatrix(text, word_to_idx):
    vocab_size = len(word_to_idx)
    maxlength = len(text)
    
    # Look up each word's index; unknown words fall back to the index of <unk>
    text_ids = [word_to_idx.get(word, word_to_idx["<unk>"]) for word in text]
    # Initialize the co-occurrence matrix
    cooccurance_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    print("Co-Matrix consumed mem:%.2fMB" % (sys.getsizeof(cooccurance_matrix)/(1024*1024)))
    # Slide the window over the text and fill in the matrix
    for i, center_word_id in enumerate(text_ids):
        # Indices of the context window around position i
        window_indices = list(range(i - WINDOW_SIZE, i)) + list(range(i + 1, i + WINDOW_SIZE + 1))
        # Handle boundary positions by wrapping around
        window_indices = [i % maxlength for i in window_indices]
        # Map window positions to word ids
        window_word_ids = [text_ids[index] for index in window_indices]
        # Increment the matrix entry for each (center, context) pair
        for context_word_id in window_word_ids:
            cooccurance_matrix[center_word_id][context_word_id] += 1
        if (i+1) % 1000000 == 0:
            print(">>>>> Process %dth word" % (i+1))
    print(">>>>> Build co-occurance matrix completed.")
    return cooccurance_matrix 

# Build the weight matrix
def buildWeightMatrix(co_matrix):
    xmax = 100.0
    # Initialize the weight matrix as a zero matrix with the same shape as co_matrix
    weight_matrix = np.zeros_like(co_matrix, dtype=np.float32)
    print("Weight-Matrix consumed mem:%.2fMB" % (sys.getsizeof(weight_matrix) / (1024 * 1024)))
    # Apply the weighting function f(x) to each entry (see section II.5, "Defining the loss function", above)
    for i in range(co_matrix.shape[0]):
        for j in range(co_matrix.shape[1]):
            weight_matrix[i][j] = math.pow(co_matrix[i][j] / xmax, 0.75) if co_matrix[i][j] < xmax else 1
        if (i+1) % 1000 == 0:
            print(">>>>> Process %dth weight" % (i+1))
    print(">>>>> Build weight matrix completed.")
    return weight_matrix

def asMinutes(s):
    h = math.floor(s / 3600)
    s = s - h * 3600
    m = math.floor(s / 60)
    s -= m * 60
    return '%dh %dm %ds' % (h, m, s)

# Estimate how much longer the task will take
def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / percent
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

# Call this function to load a previously saved model directly
def loadModel():
    path = WEIGHT_FILE
    model = GloveModel(MAX_VOCAB_SIZE, EMBEDDING_SIZE)
    model.set_state_dict(paddle.load(path))
    return model
text, idx_to_word, word_to_idx, word_counts, word_freqs = getCorpus('train', size=TEXT_SIZE)    # Load and preprocess the corpus
co_matrix = buildCooccuranceMatrix(text, word_to_idx)    # Build the co-occurrence matrix
weight_matrix = buildWeightMatrix(co_matrix)             # Build the weight matrix
dataset = WordEmbeddingDataset(co_matrix, weight_matrix) # Create the dataset
dataloader = tud.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
model = GloveModel(MAX_VOCAB_SIZE, EMBEDDING_SIZE) # Create the model
#model = loadModel()
optimizer = paddle.optimizer.Adagrad(parameters=model.parameters(), learning_rate=LEARNING_RATE) # Use the Adagrad optimizer


print_every = 10000
save_every = 50000
epochs = NUM_EPOCHS
iters_per_epoch = int(dataset.__len__() / BATCH_SIZE)
total_iterations = iters_per_epoch * epochs
print("Iterations: %d per one epoch, Total iterations: %d " % (iters_per_epoch, total_iterations))

start = time.time()
for epoch in range(epochs):
    loss_print_avg = 0
    iteration = iters_per_epoch * epoch
    for i, j, co_occur, weight in dataloader:
        iteration += 1
        optimizer.clear_grad()   # Reset cached gradients before each batch
        loss = model(i, j, co_occur, weight)    # Forward pass
        loss.backward()     # Backward pass
        optimizer.step()    # Update parameters
        loss_print_avg += loss.item()

        if iteration % print_every == 0:
            time_desc = timeSince(start, iteration / total_iterations)
            iter_percent = iteration / total_iterations * 100
            loss_avg = loss_print_avg / print_every
            loss_print_avg = 0
            with open(LOG_FILE, "a") as fout:
                fout.write("epoch: %d, iter: %d (%.4f%%), loss: %.5f, %s\n" %
                            (epoch, iteration, iter_percent, loss_avg, time_desc))
            print("epoch: %d, iter: %d/%d (%.4f%%), loss: %.5f, %s" %
                    (epoch, iteration, total_iterations, iter_percent, loss_avg, time_desc))
        if iteration % save_every == 0:
            paddle.save(model.state_dict(), WEIGHT_FILE)
paddle.save(model.state_dict(), WEIGHT_FILE)
Words list length:15304686
Vocab size:2000
Co-Matrix consumed mem:15.26MB
>>>>> Process 1000000th word
>>>>> Process 2000000th word
......................................
>>>>> Process 15000000th word
>>>>> Build co-occurance matrix completed.
Weight-Matrix consumed mem:15.26MB
>>>>> Process 1000th weight
>>>>> Process 2000th weight
>>>>> Build weight matrix completed.
epoch: 0, iter: 10000/1604874 (0.6231%), loss: 20.68169, 0h 2m 7s (- 5h 38m 5s)
epoch: 0, iter: 20000/1604874 (1.2462%), loss: 13.95302, 0h 4m 12s (- 5h 34m 3s)
......................................
epoch: 0, iter: 260000/1604874 (16.2006%), loss: 6.19664, 0h 54m 47s (- 4h 43m 27s)
epoch: 1, iter: 270000/1604874 (16.8238%), loss: 1.53651, 0h 56m 56s (- 4h 41m 31s)
......................................
epoch: 5, iter: 1590000/1604874 (99.0732%), loss: 4.98604, 5h 37m 27s (- 0h 3m 9s)
epoch: 5, iter: 1600000/1604874 (99.6963%), loss: 4.97203, 5h 39m 35s (- 0h 1m 2s)

# 1. Measure similarity between vectors with cosine similarity

def find_nearest(word, embedding_weights):
    index = word_to_idx[word]
    embedding = embedding_weights[index]
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx_to_word[i] for i in cos_dis.argsort()[:10]]  # the 10 nearest words

glove_matrix = model.gloveMatrix()
for word in ["good", "one", "green", "like", "america", "queen", "better", "paris", "work", "computer", "language"]:
    print(word, find_nearest(word, glove_matrix))


good ['good', 'do', 'make', 'things', 'always', 'actually', 'not', 'should', 'way', 'longer']
one ['one', 'four', 'six', 'eight', 'seven', 'three', 'nine', 'five', 'born', 'zero']
green ['green', '<unk>', 'starting', 'sides', 'shape', 'directly', 'count', 'opening', 'ends', 'square']
like ['like', 'be', 'just', 'with', 'how', 'fact', 'even', 'do', 'same', 'this']
america ['america', 'africa', 'northern', 'australia', 'europe', 'united', 'ireland', 'west', 'britain', 'central']
queen ['queen', 'lord', 'brother', 'prince', 'pope', 'iv', 'henry', 'edward', 'daughter', 'reign']
better ['better', 'look', 'way', 'make', 'find', 'real', 'things', 'will', 'problem', 'hard']
paris ['paris', 'lincoln', 'nine', 'three', 'january', 'april', 'july', 'york', 'louis', 'march']
work ['work', 'life', 'an', 'how', 'works', 'about', 'little', 'time', 'way', 'made']
computer ['computer', 'software', 'programming', 'users', 'programs', 'systems', 'application', 'program', 'internet', 'digital']
language ['language', 'languages', 'modern', 'related', 'see', 'non', 'historical', 'such', 'common', 'source']
# 2. Examine relations between word vectors with vector arithmetic

def findRelationshipVector(word1, word2, word3):
    word1_idx = word_to_idx[word1]
    word2_idx = word_to_idx[word2]
    word3_idx = word_to_idx[word3]
    embedding = glove_matrix[word2_idx] - glove_matrix[word1_idx] + glove_matrix[word3_idx]
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in glove_matrix])
    for i in cos_dis.argsort()[:5]:
        print("{} to {} as {} to {}".format(word1, word2, word3, idx_to_word[i]))


findRelationshipVector('man', 'king', 'woman')
findRelationshipVector('america', 'washington', 'france')
findRelationshipVector('good', 'better', 'little')
man to king as woman to king
man to king as woman to married
man to king as woman to james
man to king as woman to born
man to king as woman to ii
america to washington as france to washington
america to washington as france to april
america to washington as france to august
america to washington as france to jean
america to washington as france to ended
good to better as little to better
good to better as little to little
good to better as little to successful
good to better as little to real
good to better as little to success
# 3. Reduce the word vectors with SVD and visualize them

candidate_words = ['one','two','three','four','five','six','seven','eight','night','ten','color','green','blue','red','black',
                    'man','woman','king','queen','wife','son','daughter','brown','zero','computer','hardware','software','system','program',
                    'america','china','france','washington','good','better','bad']
candidate_indexes = [word_to_idx[word] for word in candidate_words]
choosen_indexes = candidate_indexes
choosen_vectors = [glove_matrix[index] for index in choosen_indexes]

U, S, VH = np.linalg.svd(choosen_vectors, full_matrices=False)
for i in range(len(choosen_indexes)):
    plt.text(U[i, 0], U[i, 1], idx_to_word[choosen_indexes[i]])

coordinate = U[:, 0:2]
plt.xlim((np.min(coordinate[:, 0]) - 0.1, np.max(coordinate[:, 0]) + 0.1))
plt.ylim((np.min(coordinate[:, 1]) - 0.1, np.max(coordinate[:, 1]) + 0.1))
plt.show()
[Figure: 2-D SVD projection of the selected word vectors (main_files/main_12_1.png)]

That wraps up the walkthrough of this paper.
