Word Vector Models for Natural Language Processing: Word2Vec

最白の白菜

Article link: https://blog.csdn.net/qq_43966129/article/details/122665381

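Natural Language Processing and Deep Learning

Spell checking, keyword search, ...
Text mining (product prices, dates, times, locations, person names, company names)
Text classification
Machine translation
Customer service systems
Complex dialogue systems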

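The basic model of deep learning is the neural network: once a learning objective is specified, training moves the model toward that optimization objective.

Why do we need deep learning?

Hand-crafted features are time-consuming and labor-intensive, and they are hard to extend
Automatically learned features are fast to obtain and easy to extend
Deep learning provides a general learning framework that can represent world, visual, and linguistic information
Deep learning supports both unsupervised and supervised learning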

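Applications of language models: machine translation, spelling correction, intelligent question answering.

Example sentence: 我今天下午打篮球 ("I played basketball this afternoon")

$p(S)=p(w_1,w_2,\ldots,w_n)$
$=p(w_1)\,p(w_2\mid w_1)\,p(w_3\mid w_1,w_2)\cdots p(w_n\mid w_1,w_2,\ldots,w_{n-1})$

$p(S)$ is called the language model: a model that computes the probability of a sentence.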

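What problems does this language model have?
1. The data is too sparse
2. The parameter space is too large
Both come from the way each conditional probability has to be estimated, because long histories rarely occur in any corpus:
$p(w_i\mid w_1,w_2,\ldots,w_{i-1})=\frac{p(w_1,w_2,\ldots,w_{i-1},w_i)}{p(w_1,w_2,\ldots,w_{i-1})}$

Solution:

Assume that the next word depends only on the single word before it:
$p(S)=p(w_1)\,p(w_2\mid w_1)\,p(w_3\mid w_1,w_2)\cdots p(w_n\mid w_1,w_2,\ldots,w_{n-1})$
$=p(w_1)\,p(w_2\mid w_1)\,p(w_3\mid w_2)\cdots p(w_n\mid w_{n-1})$

More generally, if each word depends on the previous $n-1$ words and the vocabulary size is $N$, the number of model parameters is on the order of $O\left(N^{n}\right)$.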
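
The one-word-history assumption above can be sketched as a bigram model estimated from counts. This is a minimal illustration, not the author's code: the toy corpus, the `<s>` start marker (used in place of a separate $p(w_1)$ term), and the function names `train_bigram` and `sentence_prob` are all assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences.

    A "<s>" start marker is prepended so that the first word is also
    conditioned on something (an assumption of this sketch).
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(tokens, unigrams, bigrams):
    """p(S) as a product of p(w_i | w_{i-1}), using maximum-likelihood counts."""
    prob = 1.0
    padded = ["<s>"] + tokens
    for prev, cur in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["I", "play", "basketball"],
    ["I", "play", "football"],
    ["I", "watch", "basketball"],
]
unigrams, bigrams = train_bigram(corpus)
p_seen = sentence_prob(["I", "play", "basketball"], unigrams, bigrams)
p_unseen = sentence_prob(["I", "watch", "football"], unigrams, bigrams)
print(p_seen, p_unseen)
```

The second sentence is perfectly plausible, yet it receives probability zero because the pair ("watch", "football") never occurs in the corpus. That is the data sparsity problem in miniature.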

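Word Vectors

The word is the most basic unit, so it has to be converted into a form a computer can work with: a vector. In this vector space words have distances between them, and similar words lie close together.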
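
One common way to measure how close two word vectors are is cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration; they are not trained embeddings, and the helper name `cosine_similarity` is an assumption of this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: two sports words and one unrelated word.
vectors = {
    "basketball": [0.9, 0.8, 0.1],
    "football": [0.8, 0.9, 0.2],
    "afternoon": [0.1, 0.2, 0.9],
}
sim_close = cosine_similarity(vectors["basketball"], vectors["football"])
sim_far = cosine_similarity(vectors["basketball"], vectors["afternoon"])
print(sim_close, sim_far)
```

With these toy values the two sports words score much higher than the unrelated pair, which is the behaviour trained word vectors are expected to show.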

Vector spaces built from different languages have similar structure.

Neural Network Model

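The first layer is the input layer and the second is the projection layer, which concatenates the vectors of the three preceding input words.

Training sample: $(\text{Context}(w), w)$, where $\text{Context}(w)$ consists of the vectors of the $n-1$ words before $w$; assume each word vector has size $m$
Projection layer: one large vector of length $(n-1)m$, formed by concatenating those vectors head to tail
Output: $\mathbf{y}_{w}=\left(y_{w, 1}, y_{w, 2}, \cdots, y_{w, N}\right)^{\top}$, where $N$ is the vocabulary size
After normalization, component $y_{w,i}$ gives the probability that the next word is exactly the $i$-th word of the vocabulary when the context is $\text{Context}(w)$
Normalization: $p(w \mid \text{Context}(w))=\frac{e^{y_{w, i_{w}}}}{\sum_{i=1}^{N} e^{y_{w, i}}}$, where $i_w$ is the index of $w$ in the vocabulary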
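
The projection and normalization steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: the projection vector is mapped straight to the output scores by a single weight matrix `W` and bias `b` (the hidden layer that such models commonly place in between is omitted), all parameters are random, and the names `nnlm_forward` and `softmax` are hypothetical.

```python
import math
import random

def softmax(scores):
    """Normalize raw scores y_{w,i} into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nnlm_forward(context_ids, embeddings, W, b):
    # Projection layer: concatenate the (n-1) context vectors head to tail.
    x = [value for idx in context_ids for value in embeddings[idx]]
    # Output layer: one unnormalized score per vocabulary word.
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return softmax(y)

random.seed(0)
N, m, n = 5, 4, 4  # vocabulary size, vector size, n-gram order (toy values)
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(N)]
W = [[random.uniform(-0.5, 0.5) for _ in range((n - 1) * m)] for _ in range(N)]
b = [0.0] * N

probs = nnlm_forward([0, 2, 3], embeddings, W, b)
print(probs)
```

Note that the output layer computes a score for every one of the $N$ vocabulary words, so the cost of one prediction grows with the vocabulary size.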

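Advantage:

S1 = "我今天去网咖" ("I went to the internet café today", using the word 网咖) appears 1000 times
S2 = "我今天去网吧" (the same meaning, using the synonym 网吧) appears 10 times
For an N-gram model: P(S1) >> P(S2)
For the neural network model: P(S1) ≈ P(S2)

Because similar words have similar vectors, as long as one of these sentences appears in the corpus, the probabilities of the similar sentences rise as well.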

Hierarchical Softmax

That is, a softmax organized as a hierarchy (a binary tree over the vocabulary).

CBOW stands for Continuous Bag-of-Words Model. It predicts the probability of the current word from the words in its context.

$\mathcal{L}=\sum_{w \in \mathcal{C}} \log p(w \mid \text { Context }(w))$

Introduction to Huffman Trees

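At each internal node, choosing the left or the right subtree is a binary classification problem, which is usually handled with logistic regression.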
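
To make the tree concrete, here is a minimal sketch that builds a Huffman tree from word frequencies with the standard-library `heapq` module and reads off each word's binary code. The word counts are hypothetical, and labelling the left branch 0 and the right branch 1 is an arbitrary convention chosen for this example.

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Return {word: binary code string} for a {word: frequency} mapping."""
    tie = count()  # unique tie-breaker so heap entries never compare nodes
    heap = [(freq, next(tie), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    # Repeatedly merge the two least frequent nodes into a new internal node.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a word
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes

# Hypothetical word counts.
freqs = {"I": 15, "like": 8, "watching": 6, "Brazil": 5, "football": 3, "World Cup": 2}
codes = huffman_codes(freqs)
for word, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(word, code)
```

Frequent words end up near the root with short codes, so they need fewer left/right decisions than rare words.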

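The input layer of CBOW consists of the word vectors of the context words. (When training a CBOW model the word vectors are only a by-product; more precisely, they are parameters of the model. They start out as random values and are updated continually as training proceeds.)
The projection layer sums them; the sum is plain vector addition.
The output layer outputs the most likely w. Because the vocabulary of the corpus is fixed at |C| words, the whole process can be viewed as a multi-class classification problem: given the features, pick one of |C| classes.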
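
The projection step is small enough to show directly. In this sketch the embedding table and the function name `cbow_projection` are hypothetical; the point is only that the context vectors are added element-wise, so the result keeps the size of a single word vector no matter how many context words there are.

```python
def cbow_projection(context_ids, embeddings):
    """Sum the word vectors of the context words (plain vector addition)."""
    dim = len(embeddings[0])
    x_w = [0.0] * dim
    for idx in context_ids:
        for k in range(dim):
            x_w[k] += embeddings[idx][k]
    return x_w

# Hypothetical 2-dimensional embedding table for a 4-word vocabulary.
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [-1.0, 1.0]]
x_w = cbow_projection([0, 1, 3], embeddings)
print(x_w)
```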

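$p^{w}$: the path from the root node to the leaf node corresponding to $w$
$l^{w}$: the number of nodes on the path $p^{w}$
$p_{1}^{w}, p_{2}^{w}, \cdots, p_{l^{w}}^{w}$: the nodes on the path $p^{w}$
$d_{2}^{w}, d_{3}^{w}, \cdots, d_{l^{w}}^{w} \in\{0,1\}$: the Huffman code of $w$, where $d_{j}^{w}$ is the code of the $j$-th node on the path (the root has no code)
$\theta_{1}^{w}, \theta_{2}^{w}, \cdots, \theta_{l^{w}-1}^{w}$: the parameter vectors of the internal nodes on the path
$\mathbf{x}_{w}$: the sum of the context word vectors produced by the projection layer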

$\begin{array}{l} p\left(d_{j}^{w} \mid \mathbf{x}_{w}, \theta_{j-1}^{w}\right)=\left[\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]^{1-d_{j}^{w}} \cdot\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]^{d_{j}^{w}} \\ \mathcal{L}=\sum_{w \in \mathcal{C}} \log p(w \mid \operatorname{Context}(w)) \\ \mathcal{L}=\sum_{w \in \mathcal{C}} \log \prod_{j=2}^{l^{w}}\left\{\left[\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]^{1-d_{j}^{w}} \cdot\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]^{d_{j}^{w}}\right\} \\ \quad=\sum_{w \in \mathcal{C}} \sum_{j=2}^{l^{w}}\left\{\left(1-d_{j}^{w}\right) \cdot \log \left[\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]+d_{j}^{w} \cdot \log \left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]\right\} \end{array}$
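
The product over the path can be evaluated numerically. The sketch below is an illustration under assumptions: a hypothetical three-leaf tree (leaf A with code 0, leaves B and C with codes 10 and 11), made-up node parameters, and a made-up projection vector. It follows the convention of the formula above, where code 0 contributes $\sigma(\cdot)$ and code 1 contributes $1-\sigma(\cdot)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def path_probability(x_w, thetas, code):
    """p(w | Context(w)) as a product of binary decisions along the path.

    thetas[k] is the parameter vector of the k-th internal node on the path
    and code[k] is the Huffman code bit chosen at that node.
    """
    prob = 1.0
    for theta, d in zip(thetas, code):
        s = sigmoid(sum(x * t for x, t in zip(x_w, theta)))
        prob *= s if d == 0 else 1.0 - s
    return prob

# Hypothetical values: projection vector and two internal nodes.
x_w = [0.4, -0.2, 0.1]
theta_root = [0.3, 0.5, -0.7]
theta_inner = [-0.6, 0.2, 0.9]

p_a = path_probability(x_w, [theta_root], [0])
p_b = path_probability(x_w, [theta_root, theta_inner], [1, 0])
p_c = path_probability(x_w, [theta_root, theta_inner], [1, 1])
total = p_a + p_b + p_c
print(p_a, p_b, p_c, total)
```

The three leaf probabilities add up to 1 without ever computing a sum over the whole vocabulary, which is what makes the hierarchical form cheaper than the full softmax.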

Solving by Gradient Ascent

$\frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^{w}}=\frac{\partial}{\partial \theta_{j-1}^{w}}\left\{\left(1-d_{j}^{w}\right) \cdot \log \left[\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]+d_{j}^{w} \cdot \log \left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right]\right\}$

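The derivative of the sigmoid function is $\sigma^{\prime}(x)=\sigma(x)[1-\sigma(x)]$.
Substituting it into the expression above gives: $\left(1-d_{j}^{w}\right)\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right] \mathbf{x}_{w}-d_{j}^{w} \sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right) \mathbf{x}_{w}$
Collecting like terms gives: $\left[1-d_{j}^{w}-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right] \mathbf{x}_{w}$
With learning rate $\eta$, the update for $\theta_{j-1}^{w}$ is therefore $\theta_{j-1}^{w}:=\theta_{j-1}^{w}+\eta\left[1-d_{j}^{w}-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right] \mathbf{x}_{w}$
Because $\mathbf{x}_{w}$ and $\theta_{j-1}^{w}$ play symmetric roles in $\mathcal{L}(w, j)$, the gradient with respect to $\mathbf{x}_{w}$ has the same form: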

$\frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_{w}}=\left[1-d_{j}^{w}-\sigma\left(\mathbf{x}_{w}^{\top} \theta_{j-1}^{w}\right)\right] \theta_{j-1}^{w}$

$\mathbf{v}(\widetilde{w}):=\mathbf{v}(\widetilde{w})+\eta \sum_{j=2}^{l^{w}} \frac{\partial \mathcal{L}(w, j)}{\partial \mathbf{x}_{w}}, \quad \widetilde{w} \in \operatorname{Context}(w)$
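
One gradient-ascent step for a single training sample can be sketched as follows. This is a hedged illustration rather than a reference implementation: the embeddings, node parameters, Huffman code, and learning rate are made-up toy values, and the function names are hypothetical. Each node's contribution to the context-vector update is accumulated using that node's parameters before they are changed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(context_ids, embeddings):
    dim = len(embeddings[0])
    return [sum(embeddings[i][k] for i in context_ids) for k in range(dim)]

def log_likelihood(context_ids, embeddings, thetas, code):
    """Sum over the path of (1-d) log(sigma) + d log(1-sigma)."""
    x_w = project(context_ids, embeddings)
    total = 0.0
    for theta, d in zip(thetas, code):
        s = sigmoid(dot(x_w, theta))
        total += (1 - d) * math.log(s) + d * math.log(1.0 - s)
    return total

def hs_cbow_step(context_ids, embeddings, thetas, code, eta):
    """Update node parameters and context word vectors in place."""
    dim = len(embeddings[0])
    x_w = project(context_ids, embeddings)
    e = [0.0] * dim  # accumulated eta * dL/dx_w over the path
    for theta, d in zip(thetas, code):
        g = eta * (1 - d - sigmoid(dot(x_w, theta)))
        for k in range(dim):
            e[k] += g * theta[k]     # uses theta before its own update
            theta[k] += g * x_w[k]   # theta := theta + eta * [1 - d - sigma] * x_w
    for i in context_ids:            # every context word gets the same update
        for k in range(dim):
            embeddings[i][k] += e[k]

# Hypothetical toy values.
embeddings = [[0.1, 0.2], [0.3, -0.1]]
thetas = [[0.2, -0.3], [-0.1, 0.4]]  # internal nodes on the path of w
code = [1, 0]                        # Huffman code bits of w
context = [0, 1]

before = log_likelihood(context, embeddings, thetas, code)
hs_cbow_step(context, embeddings, thetas, code, eta=0.1)
after = log_likelihood(context, embeddings, thetas, code)
print(before, after)
```

With a small learning rate the log-likelihood of the sample rises after the step, as expected for gradient ascent.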

Negative Sampling Model

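Define the label of a word $\widetilde{w}$, which is 1 for the positive sample and 0 for a negative sample:

$L^{w}(\widetilde{w})=\left\{\begin{array}{ll} 1, & \widetilde{w}=w \\ 0, & \widetilde{w} \neq w \end{array}\right.$

There are so many negative samples; how should they be chosen?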

For a given positive sample $(\operatorname{Context}(w), w)$ with negative sample set $NEG(w)$, we want to maximize

$g(w)=\prod_{u \in\{w\} \cup N E G(w)} p(u \mid \text{Context}(w))$

where

$p(u \mid \text{Context}(w))=\left\{\begin{array}{ll} \sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right), & L^{w}(u)=1 \\ 1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right), & L^{w}(u)=0 \end{array}\right.$

$g(w)=\sigma\left(\mathbf{x}_{w}^{\top} \theta^{w}\right) \prod_{u \in N E G(w)}\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]$

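Here $\sigma\left(\mathbf{x}_{w}^{\top} \theta^{w}\right)$ is the probability of predicting the center word as $w$ when the context is $\text{Context}(w)$, while $\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)$, $u \in NEG(w)$, is the probability of predicting the center word as $u$ for the same context. Maximizing $g(w)$ therefore raises the probability of the positive sample and lowers the probabilities of the negative samples.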
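
As a small numeric check of $g(w)$, the sketch below evaluates the product for one positive word and two negative words. The projection vector, the parameter vectors, and the name `neg_sampling_objective` are hypothetical values chosen for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_objective(x_w, theta, target, negatives):
    """g(w) = sigma(x.theta_w) * product over negatives of (1 - sigma(x.theta_u))."""
    g = sigmoid(dot(x_w, theta[target]))
    for u in negatives:
        g *= 1.0 - sigmoid(dot(x_w, theta[u]))
    return g

# Hypothetical values: the positive word "w" and two negative samples.
x_w = [1.0, 0.0]
theta = {"w": [2.0, 0.0], "u1": [-2.0, 0.0], "u2": [0.0, 1.0]}
g = neg_sampling_objective(x_w, theta, "w", ["u1", "u2"])
print(g)
```

Only the positive word and the sampled negatives enter the product, so the cost per sample depends on the number of negatives rather than on the vocabulary size.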

For a given corpus $\mathcal{C}$, the overall optimization objective is

$G=\prod_{w \in \mathcal{C}} g(w)$

$\begin{aligned} \mathcal{L} &=\log G=\log \prod_{w \in \mathcal{C}} g(w)=\sum_{w \in \mathcal{C}} \log g(w) \\ &=\sum_{w \in \mathcal{C}} \log \prod_{u \in\{w\} \cup N E G(w)}\left\{\left[\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]^{L^{w}(u)} \cdot\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]^{1-L^{w}(u)}\right\} \\ &=\sum_{w \in \mathcal{C}} \sum_{u \in\{w\} \cup N E G(w)}\left\{L^{w}(u) \cdot \log \left[\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]+\left[1-L^{w}(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]\right\} \end{aligned}$

$\begin{aligned} \frac{\partial \mathcal{L}(w, u)}{\partial \theta^{u}} &=\frac{\partial}{\partial \theta^{u}}\left\{L^{w}(u) \cdot \log \left[\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]+\left[1-L^{w}(u)\right] \cdot \log \left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]\right\} \\ &=L^{w}(u)\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right] \mathbf{x}_{w}-\left[1-L^{w}(u)\right] \sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right) \mathbf{x}_{w} \\ &=\left\{L^{w}(u)\left[1-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right]-\left[1-L^{w}(u)\right] \sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right\} \mathbf{x}_{w} \\ &=\left[L^{w}(u)-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right] \mathbf{x}_{w} \end{aligned}$

The update formula for $\theta^{u}$ can then be written as $\theta^{u}:=\theta^{u}+\eta\left[L^{w}(u)-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right] \mathbf{x}_{w}$

$\frac{\partial \mathcal{L}(w, u)}{\partial \mathbf{x}_{w}}=\left[L^{w}(u)-\sigma\left(\mathbf{x}_{w}^{\top} \theta^{u}\right)\right] \theta^{u}$

$\mathbf{v}(\widetilde{w}):=\mathbf{v}(\widetilde{w})+\eta \sum_{u \in\{w\} \cup N E G(w)} \frac{\partial \mathcal{L}(w, u)}{\partial \mathbf{x}_{w}}, \widetilde{w} \in \operatorname{Context}(w)$
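
The two update rules can be combined into one training step for a single sample. The sketch below is an illustration under assumptions: the negative words are supplied by hand rather than drawn by a sampler, all parameter values are made up, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(context_ids, embeddings):
    dim = len(embeddings[0])
    return [sum(embeddings[i][k] for i in context_ids) for k in range(dim)]

def log_g(context_ids, embeddings, theta, target, negatives):
    """log g(w) for one sample: the positive term plus the negative terms."""
    x_w = project(context_ids, embeddings)
    total = math.log(sigmoid(dot(x_w, theta[target])))
    for u in negatives:
        total += math.log(1.0 - sigmoid(dot(x_w, theta[u])))
    return total

def ns_cbow_step(context_ids, embeddings, theta, target, negatives, eta):
    """One gradient-ascent step; updates theta and embeddings in place."""
    dim = len(embeddings[0])
    x_w = project(context_ids, embeddings)
    e = [0.0] * dim  # accumulated eta * dL/dx_w over {w} and NEG(w)
    for u in [target] + list(negatives):
        label = 1.0 if u == target else 0.0              # L^w(u)
        g = eta * (label - sigmoid(dot(x_w, theta[u])))
        for k in range(dim):
            e[k] += g * theta[u][k]    # uses theta^u before its own update
            theta[u][k] += g * x_w[k]  # theta^u := theta^u + eta * [L - sigma] * x_w
    for i in context_ids:              # every context word gets the same update
        for k in range(dim):
            embeddings[i][k] += e[k]

# Hypothetical toy values: 4-word vocabulary, 2-dimensional vectors.
embeddings = [[0.1, 0.2], [0.3, -0.1], [0.2, 0.2], [-0.1, 0.1]]
theta = [[0.2, -0.3], [-0.1, 0.4], [0.3, 0.1], [0.0, -0.2]]
context, target, negatives = [0, 1], 2, [3]

before = log_g(context, embeddings, theta, target, negatives)
ns_cbow_step(context, embeddings, theta, target, negatives, eta=0.1)
after = log_g(context, embeddings, theta, target, negatives)
print(before, after)
```

After the step the positive word scores higher and the negative word scores lower for this context, so $\log g(w)$ increases.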