CS224n Homework: Assignment 1: Exploring Word Vectors, Part 1

Latest recommended article published 2024-07-08 15:19:40

ylclll

Article link: https://blog.csdn.net/ylclll/article/details/140251332

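I have recently been working through CS224n at speed. Having finished the lectures and notes, I am now trying the homework.

I followed the cs224n 2021 lectures on Bilibili. They are well taught, but they do not quite match the notes I downloaded, so the two are hard to cross-reference.

My English is weak, so I have translated the assignment text as best I can.

First, import the required packages: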

# All Import Statements Defined Here
# Note: Do not add to this list.
# ----------------

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = '<START>'
END_TOKEN = '<END>'

np.random.seed(0)
random.seed(0)
# ----------------

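Because I was reading the course materials on the Stanford site, I had a VPN running the whole time, and the package downloads kept failing at first. Turning the VPN off fixed it.

Skipping the introductory material on plotting co-occurrence word embeddings, the corpus-reading code is as follows: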

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)

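First, Question 1:

Implement `distinct_words`:

Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it is more efficient to do it with Python list comprehensions. In particular, this may be useful for flattening a list of lists. If you are not familiar with Python list comprehensions, the original assignment links to further reading.

Your returned corpus_words should be sorted. You can use Python's sorted function for this.

You may find it useful to use Python sets to remove duplicate words.

Here we simply iterate over the input list of lists of strings; the question hints that a Python list comprehension can be used.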

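# Counter is used below but is not in the notebook's import cell, so import it here.
from collections import Counter

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # ------------------
    # Write your implementation here.
    # Flatten the corpus into a single list of words:
    # `for sublist in corpus` walks over each document,
    # `for word in sublist` walks over each word in that document.
    all_words = [word for sublist in corpus for word in sublist]

    # Counter (from collections) maps each word to its number of occurrences.
    word_counts = Counter(all_words)

    # The keys are the distinct words; sort them.
    corpus_words = sorted(word_counts.keys())

    # Number of distinct words.
    num_corpus_words = len(corpus_words)

    # ------------------

    return corpus_words, num_corpus_words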
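
The hint about sets suggests a shorter variant. As a minimal sketch (the name `distinct_words_set` and the toy corpus are made up for illustration and are not part of the assignment), a set comprehension can flatten the corpus and remove duplicates in one step, without counting occurrences:

```python
def distinct_words_set(corpus):
    """Hypothetical set-based variant of distinct_words (illustration only)."""
    # The set comprehension flattens the list of lists and drops duplicates.
    corpus_words = sorted({word for doc in corpus for word in doc})
    return corpus_words, len(corpus_words)


# Made-up toy corpus with two short documents.
toy_corpus = [
    ["<START>", "all", "that", "glitters", "is", "not", "gold", "<END>"],
    ["<START>", "all", "is", "well", "<END>"],
]
words, num_words = distinct_words_set(toy_corpus)
print(words, num_words)
```

Note that `sorted` orders strings by code point, so the `<END>` and `<START>` tokens come before the lowercase words.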

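Question 2:

Implement `compute_co_occurrence_matrix`:

Write a method that constructs a co-occurrence matrix for a certain window size n (with a default of 4), considering the n words before and the n words after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you are not familiar with NumPy, there is a NumPy tutorial in the second half of the cs231n Python NumPy tutorial.

First initialize the matrix and the word-to-index map. Then iterate over every word in each document, compute the bounds of its window, and for every other position in the window (j != i) add 1 to the corresponding entry of M.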

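def compute_co_occurrence_matrix(corpus, window_size=4):
    """
    Compute the co-occurrence matrix for the given corpus and window size (default of 4).

    Note: each word in a document should be at the center of a window. Words near the edges
    will have a smaller number of co-occurring words.

    For example, if we take the document "<START> All that glitters is not gold <END>" with a window size of 4,
    "All" will co-occur with "<START>", "that", "glitters", "is", and "not".

    Params:
        corpus (list of list of strings): corpus of documents
        window_size (int): size of the context window

    Return:
        M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
            co-occurrence matrix of word counts.
            The ordering of the words in the rows/columns should be the same as the ordering given by distinct_words.
        word2ind (dict): dictionary that maps a word to its index (i.e. row/column number) in matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2ind = {}

    # ------------------
    # Write your implementation here.
    # Initialize the matrix
    M = np.zeros((num_words, num_words))
    # Build the word-to-index map
    word2ind = {word: idx for idx, word in enumerate(words)}

    # Iterate over every document in the corpus
    for doc in corpus:
        # Within each document, iterate over every word
        for i, word in enumerate(doc):
            # Index of the current (center) word
            central_word_idx = word2ind[word]
            # Start and end positions of the window, clipped to the document
            start = max(i - window_size, 0)
            end = min(i + window_size + 1, len(doc))
            # Update the co-occurrence matrix
            for j in range(start, end):
                if i != j:
                    # Index of the context word
                    context_word_idx = word2ind[doc[j]]
                    # Increment the co-occurrence count
                    M[central_word_idx, context_word_idx] += 1
    # ------------------

    return M, word2ind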
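
To check the window logic by hand, it can help to run the same counting loop on a tiny input. The following is a minimal sketch for a single made-up document with a window size of 1; the helper name `toy_co_occurrence` and the document are assumptions for illustration only:

```python
import numpy as np


def toy_co_occurrence(doc, window_size=1):
    """Illustrative single-document version of the counting loop above."""
    vocab = sorted(set(doc))
    word2ind = {word: idx for idx, word in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, word in enumerate(doc):
        # Clip the window to the document boundaries.
        start = max(i - window_size, 0)
        end = min(i + window_size + 1, len(doc))
        for j in range(start, end):
            if j != i:  # skip the center word itself
                M[word2ind[word], word2ind[doc[j]]] += 1
    return M, word2ind


toy_doc = ["<START>", "a", "b", "a", "<END>"]
M_toy, toy_word2ind = toy_co_occurrence(toy_doc, window_size=1)
print(toy_word2ind)
print(M_toy)
```

Because every (center, context) pair is also visited with the roles swapped, the resulting matrix is symmetric. Here "a" and "b" are adjacent twice, so their shared entries are 2.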

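Question 3:

Implement `reduce_to_k_dim`:

Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.

Note: all of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD.

The question already tells us to use sklearn, so we use it directly.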

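def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following truncated SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction

        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S.
    """
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # ------------------
    # Write your implementation here.
    # Create the TruncatedSVD object with k target dimensions
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)

    # Fit and transform the data to get the reduced matrix
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced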
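
The docstring's remark that the result is U * S can be illustrated on a small matrix. This is only a sketch: it uses NumPy's exact `np.linalg.svd` as a stand-in rather than sklearn's randomized `TruncatedSVD`, and the matrix values are made up. The two approaches should agree up to the sign of each component, which SVD does not fix.

```python
import numpy as np

# Small symmetric matrix standing in for a co-occurrence matrix (made-up values).
M_toy = np.array([
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 3.0],
    [1.0, 3.0, 0.0],
])
k = 2

# Full SVD: M = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(M_toy)

# Keep the top-k components; each row is a k-dimensional embedding (U * S).
M_reduced_toy = U[:, :k] * S[:k]

# Projecting M onto the top-k right singular vectors gives the same matrix.
projected = M_toy @ Vt[:k].T
print(M_reduced_toy.shape)
print(np.allclose(M_reduced_toy, projected))
```

The second view (projecting M onto the leading right singular vectors) is why the reduced rows can be read as coordinates of each word along the k strongest directions of the co-occurrence data.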

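Question 4:

Implement `plot_embeddings`:

Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).

For this example, you may find it useful to adapt the sample code that the original assignment links to. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they provide.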

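def plot_embeddings(M_reduced, word2ind, words):
    """
    Plot in a scatterplot the embeddings of the words specified in the list "words".
    Note: do not plot all the words listed in M_reduced / word2ind.
    Include a label next to each point.

    Params:
        M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
        word2ind (dict): dictionary that maps a word to its index in matrix M
        words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Look up the embedding coordinates of the words to visualize
    coordinates = [M_reduced[word2ind[word]] for word in words]
    # Unpack the list of coordinates into x and y coordinates
    x_coords, y_coords = zip(*coordinates)
    # Create the scatter plot
    plt.figure(figsize=(10, 10))
    plt.scatter(x_coords, y_coords, marker='x', color='red')
    # Add a label next to each point
    for label, x, y in zip(words, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')

    plt.show()
    # ------------------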

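Question 5:

Co-occurrence plot analysis:

Now we put together all the parts you have written! We will compute the co-occurrence matrix with a fixed window of 4 (the default window size) over the Reuters "crude" (oil) corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we need to normalize the returned vectors so that all the vectors appear around the unit circle (therefore closeness is directional closeness). Note: the line of code below that does the normalizing uses the NumPy concept of broadcasting. If you do not know about broadcasting, check out "Computation on Arrays: Broadcasting" by Jake VanderPlas.

Run the cell below to produce the plot. It will probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What does not cluster together that you think should have? Note: "bpd" stands for "barrels per day" and is a commonly used abbreviation in crude oil articles.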

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'iraq']

plot_embeddings(M_normalized, word2ind_co_occurrence, words)
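
The normalization step in the cell above relies on broadcasting. As a minimal sketch with made-up vectors (the array values and variable names here are illustrative assumptions), dividing an (n, 2) array by an (n, 1) column of row lengths rescales every row to unit length:

```python
import numpy as np

# Made-up 2-D vectors; each row plays the role of one word embedding.
vectors = np.array([
    [3.0, 4.0],
    [0.0, 2.0],
    [-1.0, 0.0],
])

# One Euclidean length per row, shape (3,).
lengths = np.linalg.norm(vectors, axis=1)

# lengths[:, np.newaxis] has shape (3, 1), so it broadcasts across both columns.
unit_vectors = vectors / lengths[:, np.newaxis]
print(unit_vectors)
```

Without `np.newaxis` the division would try to align the length-3 vector with the 2 columns and fail with a shape error, which is why the extra axis is needed.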

Finally, run the cell to produce the plot.

I hope everyone succeeds in their studies.
