NLP Model Notes 2022-16: Word Vectors, Training Chinese Word Vectors, and a Survey of Chinese Word Vector Papers

源代码杀手

Last modified 2022-07-06 22:57:38


First published 2022-06-13 09:42:57

Article link: https://blog.csdn.net/weixin_41194129/article/details/125254368


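This article introduces word vector techniques, including one-hot and distributed representations, and shows how a co-occurrence matrix followed by SVD dimensionality reduction yields dense word vectors. It then discusses how language models such as CBOW produce word vectors, and lists several resources for training Chinese word vectors along with a survey of related papers.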


Introduction

In short, word vector techniques map words to dense vectors in such a way that similar words have nearby vectors.

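In natural language processing tasks, the first question is how to represent words inside a computer. There are typically two options: the one-hot representation and the distributed representation.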
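To make the limitation of the one-hot representation concrete, here is a minimal sketch. The vocabulary and the helper names `one_hot` and `cosine` are assumptions introduced for illustration only; they are not from the original post.

```python
import numpy as np

# Toy vocabulary (hypothetical, chosen only for illustration).
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word`: all zeros except a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two different words never share a non-zero position,
# so their similarity is 0 however related their meanings are.
sim_like_enjoy = cosine(one_hot("like"), one_hot("enjoy"))
sim_like_like = cosine(one_hot("like"), one_hot("like"))
print(sim_like_enjoy, sim_like_like)
```

The vector length also equals the vocabulary size, which is why one-hot vectors become very high-dimensional and sparse for realistic vocabularies.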

Generating word vectors:

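Count how often words co-occur within a window of a pre-specified size, and use the co-occurrence counts of the surrounding words as the vector of the current word. Concretely, we define the word representation by building a co-occurrence matrix from a large text corpus.
For example, given the following corpus: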

I like deep learning.
I like NLP.
I enjoy flying.

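The co-occurrence matrix is then as follows. Each entry counts how many times a word appears immediately before or after the key word; for example, like is adjacent to I, deep, and NLP.
[Figure: co-occurrence matrix for the corpus above]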
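The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the original author's code: the window size is 1, the full stop is treated as its own token, and the function name `cooccurrence_matrix` is hypothetical.

```python
import numpy as np

# The three-sentence corpus from the text, with "." split off as a token (an assumption).
corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
sentences = [s.split() for s in corpus]

# Vocabulary in order of first appearance.
vocab = list(dict.fromkeys(w for s in sentences for w in s))

def cooccurrence_matrix(sentences, vocab, window=1):
    """Count, for every word, how often each other word appears within `window` positions."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sent in sentences:
        for pos, word in enumerate(sent):
            lo = max(0, pos - window)
            hi = min(len(sent), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[index[word], index[sent[ctx_pos]]] += 1
    return counts, index

counts, index = cooccurrence_matrix(sentences, vocab)
print(vocab)
print(counts[index["like"]])  # row of counts used as the vector for "like"
```

With this vocabulary order (I, like, deep, learning, ., NLP, enjoy, flying), the row for like is [2, 0, 1, 0, 0, 1, 0, 0]: I appears next to it twice, deep and NLP once each.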

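Word vectors defined by the co-occurrence matrix partly alleviate the problem that any two distinct one-hot vectors have zero similarity, but they do not solve data sparsity or the curse of dimensionality.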

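Since the discrete word vectors obtained from the co-occurrence matrix are high-dimensional and sparse, a natural remedy is to reduce the dimensionality of the original vectors, which yields dense, continuous word vectors.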

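Apply SVD to the co-occurrence matrix (for the computation, see https://blog.csdn.net/qq_56780627/article/details/122163081) to obtain the orthogonal matrix U; normalizing U gives the following matrix:
[Figure: normalized matrix U]
SVD yields a dense matrix of word vectors with several useful properties: semantically similar words lie close together in the vector space, and to some extent the vectors can even reflect linear relationships between words.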
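A minimal sketch of the SVD step, assuming the window-1 co-occurrence matrix of the three-sentence corpus with the full stop as a token. Keeping the first `k = 2` columns of U and scaling each row to unit length is one reading of "normalizing U"; the original post does not specify either choice.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]

# Window-1 co-occurrence counts for the toy corpus (rows and columns follow `words`).
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# Full SVD: X = U @ diag(S) @ Vt, with U and Vt orthogonal.
U, S, Vt = np.linalg.svd(X)

# Keep the k leading left singular vectors as dense word vectors (k is an assumed choice).
k = 2
dense = U[:, :k]

# Normalize each word vector to unit length, guarding against division by zero.
norms = np.linalg.norm(dense, axis=1, keepdims=True)
dense = dense / np.maximum(norms, 1e-12)

for word, vec in zip(words, dense):
    print(f"{word:>8s} {vec}")
```

Each word is now described by `k` real numbers instead of a sparse row of length 8. On a real corpus `k` would be far smaller than the vocabulary size.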

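Language-model-based word vectors are obtained by training a neural network language model (NNLM); the word vectors are a by-product of the language model. The basic idea behind NNLM is to predict the words that appear in a given context, and this context prediction is in essence also a way of learning co-occurrence statistics.
Well-known methods for generating word vectors in this family include Skip-gram, CBOW, LBL, NNLM, C&W, and GloVe (strictly speaking, GloVe is fitted to global co-occurrence counts rather than trained as a neural language model). The following takes CBOW, one of the most widely used models, as an example of how a language model generates word vectors.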
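As a rough illustration of the CBOW idea (predict the center word from the average of its context word vectors), here is a small NumPy sketch. It is an assumption-laden toy: it uses a full softmax rather than the hierarchical softmax or negative sampling used in practical word2vec implementations, and the names `training_pairs` and `train_cbow` and all hyperparameters are hypothetical.

```python
import numpy as np

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
sentences = [s.split() for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

def training_pairs(sentences, window=1):
    """Yield (context word ids, center word id) pairs."""
    for sent in sentences:
        ids = [index[w] for w in sent]
        for pos, target in enumerate(ids):
            context = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
            if context:
                yield context, target

def train_cbow(pairs, vocab_size, dim=8, lr=0.1, epochs=200, seed=0):
    """Train CBOW with a full softmax; return input embeddings and per-epoch mean loss."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # word vectors (the by-product we keep)
    W_out = rng.normal(scale=0.1, size=(dim, vocab_size))  # output layer weights
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in pairs:
            h = W_in[context].mean(axis=0)        # average of the context word vectors
            scores = h @ W_out
            scores -= scores.max()                # stabilize the softmax numerically
            probs = np.exp(scores) / np.exp(scores).sum()
            total += -np.log(probs[target])       # cross-entropy for the true center word
            grad_scores = probs.copy()
            grad_scores[target] -= 1.0
            grad_h = W_out @ grad_scores          # computed before W_out is updated
            W_out -= lr * np.outer(h, grad_scores)
            # Each context word receives an equal share of the gradient.
            np.subtract.at(W_in, context, lr * grad_h / len(context))
        losses.append(total / len(pairs))
    return W_in, losses

pairs = list(training_pairs(sentences))
W_in, losses = train_cbow(pairs, len(vocab))
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

After training, row `index[w]` of `W_in` is the vector for word `w`. On a corpus this small the vectors carry little meaning; the sketch only shows the mechanics of the forward pass and the update.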

Quoted from: https://blog.csdn.net/mawenqi0729/article/details/80698350

Papers

Component-Enhanced Chinese Character Embeddings
This paper was published at EMNLP (Empirical Methods in Natural Language Processing) in 2015; the author, 李嫣然, is from The Hong Kong Polytechnic University.

Joint Learning of Character and Word Embeddings
This paper was published at IJCAI (International Joint Conference on Artificial Intelligence) in 2015; the authors, 陈新雄 and 徐磊, are from Tsinghua University.

Improve Chinese Word Embeddings by Exploiting Internal Structure
This paper was published at NAACL-HLT (Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies) in 2016; the author, Jian Xu, is from the University of Science and Technology of China.

Multi-Granularity Chinese Word Embedding
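This paper was published at EMNLP (Empirical Methods in Natural Language Processing) in 2016; the author, 殷荣超, is from the National Engineering Laboratory for Information Content Security Technology (信息内容安全技术国家工程实验室).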

Learning Chinese Word Representations From Glyphs Of Characters
This paper was published at EMNLP (Empirical Methods in Natural Language Processing) in 2017; the authors, Tzu-Ray Su and Hung-Yi Lee, are from National Taiwan University.

Joint Embeddings of Chinese Words, Characters, and Fine-grained Subcharacter Components
This paper was published at EMNLP (Empirical Methods in Natural Language Processing) in 2017; the author, Jinxing Yu, is from the Hong Kong University of Science and Technology.

Enriching Word Vectors with Subword Information
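This paper was published in 2017 in TACL (Transactions of the Association for Computational Linguistics), a journal rather than the ACL conference; the authors, Piotr Bojanowski and Edouard Grave, are from Facebook AI Research.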

cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
This paper was published at AAAI 2018 (Association for the Advancement of Artificial Intelligence 2018); the author, 曹绍升, is from the AI department of Ant Financial (蚂蚁金服).

Radical Enhanced Chinese Word Embedding
This paper was published at CCL 2018 (The Seventeenth China National Conference on Computational Linguistics) in 2018; the authors, Zheng Chen and Keqi Hu, are from the University of Electronic Science and Technology of China.

Glyce: Glyph-vectors for Chinese Character Representations
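In 2019, Shannon.AI (香侬科技) proposed Glyce, a glyph vector for Chinese characters. Following the evolution of Chinese characters, it draws on a range of historical and contemporary scripts and several writing styles, and it uses a Tianzige-CNN (田字格 CNN) architecture designed specifically for modeling Chinese logographic characters. Glyce is reported to perform well on 13 tasks.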

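[Paper notes] A survey of Chinese word vector papers (Part 1)
[Paper notes] A survey of Chinese word vector papers (Part 2)
[Paper notes] A survey of Chinese word vector papers (Part 3)
[Paper notes] A survey of Chinese word vector papers (Part 4)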

Chinese word vectors

https://fasttext.cc/docs/en/crawl-vectors.html
https://blog.csdn.net/promisejia/article/details/102923919
https://fasttext.cc/docs/en/aligned-vectors.html
FastText word vector training: https://blog.csdn.net/liaoningxinmin/article/details/122921392
https://github.com/zlsdu/Word-Embedding/blob/master/fasttext_report.md
https://blog.csdn.net/sinat_28015305/article/details/109467311
https://zhuanlan.zhihu.com/p/66950847
https://icode.best/i/89914246262657
https://www.daimajiaoliu.com/daima/479437f4e1003fe
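The pre-trained fastText vectors linked above are distributed, among other formats, as plain-text `.vec` files: a header line with the word count and the dimension, then one word per line followed by its vector components. The sketch below reads that layout without any fastText dependency. The function name `load_vec`, the `limit` parameter, and the two-word in-memory sample are assumptions for illustration; the sample values are made up and are not real vectors.

```python
import io
import numpy as np

def load_vec(handle, limit=None):
    """Read word vectors in the textual .vec layout.

    The first line is "<count> <dim>"; every following line is
    "<word> <v1> ... <vdim>". `limit` optionally caps how many words are kept.
    """
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines instead of failing
        vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors, dim

# Tiny made-up sample standing in for a real file.
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
vectors, dim = load_vec(sample)
print(dim, sorted(vectors))
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8")`, where `path` points to the Chinese `.vec` file from the fastText page; passing a `limit` keeps memory use manageable, since the full files contain a very large vocabulary.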

