Task 1: Introduction and Word Vectors
Contents
Theory
- What NLP studies
- How to represent the meaning of a word
- The basic principles of the Word2Vec method
School: Stanford
Teacher: Prof. Christopher Manning
Library: PyTorch
Lecture Plan
- The course (10 mins)
- Human language and word meaning (15 mins)
- Word2vec introduction (15 mins)
- Word2vec objective function gradients (25 mins)
- Optimization basics (5 mins)
- Looking at word vectors (10 mins or less)
1. How do we represent the meaning of a word?
How do we have usable meaning in a computer?
Problems with resources like WordNet
Representing words as discrete symbols
Problem with words as discrete symbols
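To make the problem with discrete symbols concrete, here is a minimal sketch assuming a hypothetical four-word vocabulary. With one-hot vectors, any two distinct words have a dot product of 0, so "motel" looks no closer to "hotel" than to "banana". The vocabulary and the helper `one_hot` are illustrative assumptions, not part of the course code.

```python
import numpy as np

# Hypothetical toy vocabulary (illustration only).
vocab = ["motel", "hotel", "banana", "king"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Distinct one-hot vectors are orthogonal, so their dot product carries
# no information about how related the two words are.
sim_motel_hotel = one_hot("motel") @ one_hot("hotel")
sim_motel_banana = one_hot("motel") @ one_hot("banana")
print(sim_motel_hotel, sim_motel_banana)
```

Dense word vectors avoid this because related words can end up with a large dot product.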
Representing words by their context
Word vectors
Word meaning as a neural word vector – visualization
2. Word2vec: Overview
Word2vec: objective function
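For reference, the objective can be written as follows, assuming the lecture's usual notation: T is the number of positions in the corpus, m is the window size, w_t is the center word at position t, and theta collects all parameters to be optimized.

```latex
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)

J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```

Minimizing J(theta) is equivalent to maximizing the likelihood L(theta).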
Word2Vec Overview with Vectors
Word2vec: prediction function
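The prediction function is a softmax over dot products between the center-word vector and every outside-word vector. A minimal NumPy sketch, assuming one outside vector per row of a matrix; the function name `predict_context_probs` and the random toy parameters are hypothetical:

```python
import numpy as np

def predict_context_probs(center_vec, outside_vecs):
    """Softmax over dot products: P(o | c) for every word o in the vocabulary.

    center_vec:   shape (d,), the center-word vector v_c
    outside_vecs: shape (V, d), one outside-word vector u_w per row
    """
    scores = outside_vecs @ center_vec   # u_w . v_c for each word w
    scores = scores - scores.max()       # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical toy parameters: 5 words, 3-dimensional vectors.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
probs = predict_context_probs(v_c, U)
print(probs, probs.sum())
```

The exponential makes every score positive, and the normalization turns the scores into a probability distribution over the vocabulary.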
Training a model by optimizing parameters
To train the model: Compute all vector gradients!
3. Word2vec derivations of gradient
Chain Rule
Interactive Whiteboard Session!
Whiteboard derivation
My derivation is as follows:
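The key step can be sketched as follows, assuming the lecture's notation (v_c for the center-word vector, u_w for outside-word vectors, V for the vocabulary). This is the standard chain-rule derivation for the softmax prediction function, not a transcription of the whiteboard.

```latex
\log P(o \mid c) = u_o^{\top} v_c - \log \sum_{w \in V} \exp(u_w^{\top} v_c)

\frac{\partial}{\partial v_c} \log P(o \mid c)
  = u_o - \frac{\sum_{x \in V} \exp(u_x^{\top} v_c)\, u_x}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
  = u_o - \sum_{x \in V} P(x \mid c)\, u_x
```

In words, the gradient is the observed outside vector minus the outside vector expected under the model's current prediction.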
Calculating all gradients!
Word2vec: More details
4. Optimization: Gradient Descent
Gradient Descent
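The update rule is theta_new = theta_old - alpha * grad J(theta), where alpha is the step size (learning rate). A minimal sketch on a hypothetical quadratic objective, chosen only because its minimum is known; the helper `gradient_descent` is illustrative:

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.1, num_steps=100):
    """Repeatedly apply theta <- theta - learning_rate * grad J(theta)."""
    theta = np.asarray(theta, dtype=float).copy()
    for _ in range(num_steps):
        theta = theta - learning_rate * grad_fn(theta)
    return theta

# Hypothetical objective J(theta) = ||theta - target||^2,
# whose gradient is 2 * (theta - target).
target = np.array([1.0, -2.0])
theta_final = gradient_descent(lambda th: 2.0 * (th - target), theta=[0.0, 0.0])
print(theta_final)
```

Each step moves against the gradient, so the parameters approach the minimum at `target`.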
Stochastic Gradient Descent
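Computing the gradient over every window in the corpus is expensive, so stochastic gradient descent updates the parameters after each sampled window. A minimal sketch, assuming the naive softmax loss and a hypothetical set of (center, outside) index pairs; the names, sizes, and learning rate are illustrative choices, not values from the course:

```python
import numpy as np

def naive_softmax_loss_and_grads(c, o, V_center, U_outside):
    """Loss -log P(o | c) and its gradients for one (center, outside) pair."""
    v_c = V_center[c]
    scores = U_outside @ v_c
    scores = scores - scores.max()            # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    loss = -np.log(probs[o])
    delta = probs.copy()
    delta[o] -= 1.0                           # P(x | c) - 1[x == o]
    grad_v_c = U_outside.T @ delta            # gradient w.r.t. the center vector
    grad_U = np.outer(delta, v_c)             # gradient w.r.t. all outside vectors
    return loss, grad_v_c, grad_U

rng = np.random.default_rng(0)
vocab_size, dim, lr = 6, 4, 0.05
V_center = rng.normal(scale=0.1, size=(vocab_size, dim))
U_outside = rng.normal(scale=0.1, size=(vocab_size, dim))
# Hypothetical training data: (center index, outside index) pairs.
pairs = [(0, 1), (1, 0), (2, 3), (3, 2), (4, 5), (5, 4)]

def mean_loss():
    losses = [naive_softmax_loss_and_grads(c, o, V_center, U_outside)[0]
              for c, o in pairs]
    return float(np.mean(losses))

loss_before = mean_loss()
for step in range(2000):
    c, o = pairs[rng.integers(len(pairs))]    # sample one pair, not the whole corpus
    _, grad_v_c, grad_U = naive_softmax_loss_and_grads(c, o, V_center, U_outside)
    V_center[c] -= lr * grad_v_c              # only the sampled center vector changes
    U_outside -= lr * grad_U
loss_after = mean_loss()
print(loss_before, loss_after)
```

Each update is a noisy estimate of the full gradient, but it is far cheaper to compute, which is why SGD is the practical choice for large corpora.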
Practice
# Gensim word vector visualization of various word vectors
import numpy as np
# Matplotlib for plotting
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
# Convert the GloVe file format to the word2vec file format
# (raw strings keep the Windows backslashes from being read as escapes)
glove_file = datapath(r'D:\Python-text\nlp_text\nlp_datawhale\task01\glove.6B\glove.6B.100d.txt')
word2vec_glove_file = get_tmpfile(r"D:\Python-text\nlp_text\nlp_datawhale\task01\glove.6B\glove.6B.100d.word2vec.txt")
print(glove2word2vec(glove_file, word2vec_glove_file))
# Load the pretrained word vector model
model = KeyedVectors.load_word2vec_format(word2vec_glove_file)
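# Words most similar to 'obama'
print(model.most_similar('obama'))
# Words most similar to 'banana'
print(model.most_similar('banana'))
# Words most dissimilar to 'banana' (closest to its negated vector)
print(model.most_similar(negative='banana'))
# Analogy: king - man + woman
result = model.most_similar(positive=['woman', 'king'], negative=['man'])
print("{}: {:.4f}".format(*result[0]))
def analogy(x1, x2, y1):
    # Solve "x1 is to x2 as y1 is to ?" and return the top-ranked word
    result = model.most_similar(positive=[y1, x2], negative=[x1])
    return result[0][0]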
# Scatter plot visualizing the word vectors in 2D
def display_pca_scatterplot(model, words=None, sample=0):
    if words is None:
        if sample > 0:
            # model.vocab is the gensim 3.x API; gensim 4+ uses model.key_to_index
            words = np.random.choice(list(model.vocab.keys()), sample)
        else:
            words = [word for word in model.vocab]
    word_vectors = np.array([model[w] for w in words])
    # Keep only the first two principal components
    twodim = PCA().fit_transform(word_vectors)[:, :2]
    plt.figure(figsize=(6, 6))
    plt.scatter(twodim[:, 0], twodim[:, 1], edgecolors='k', c='r')
    for word, (x, y) in zip(words, twodim):
        plt.text(x + 0.05, y + 0.05, word)
display_pca_scatterplot(model,
                        ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',
                         'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',
                         'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',
                         'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',
                         'homework', 'assignment', 'problem', 'exam', 'test', 'class',
                         'school', 'college', 'university', 'institute'])
plt.show()
# Scatter plot of a random sample of 300 words
display_pca_scatterplot(model, sample=300)
plt.show()
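The call `PCA().fit_transform(word_vectors)[:, :2]` in the script keeps the first two principal components of the word vectors. A minimal NumPy sketch of that projection, assuming plain SVD-based PCA (scikit-learn's output may differ in the sign of each axis); `pca_2d` and the random input are illustrative stand-ins, not part of the original script:

```python
import numpy as np

def pca_2d(vectors):
    """Project the rows of `vectors` onto their top two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Hypothetical stand-in for word vectors: 50 random 100-dimensional rows.
rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(50, 100))
coords = pca_2d(fake_vectors)
print(coords.shape)
```

The two output columns are the x and y coordinates that the scatter plot draws for each word.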
Results:
[References]
Stanford CS224n (2019) course page: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/
Bilibili video: https://www.bilibili.com/video/BV1s4411N7fC?p=2