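Word Vectors and Word Vector Concatenation: How to Embrace Embedding? A Technical Walkthrough from Word Vectors to Sentence Vectors - Alibaba Cloud Developer Community...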

Latest recommended article published 2022-04-12 16:50:21

weixin_39639643


Article tags: word vectors and word vector concatenation

Article link: https://blog.csdn.net/weixin_39639643/article/details/111535749


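Editor's introduction: Word2vec burst onto the scene in 2013 and set off a wave of NLP techniques built on pre-trained word embeddings. Six years later, embedding has become standard equipment in nn4nlp (neural networks for NLP) and has been improved at several levels. Today we review the theoretical foundations of embedding, trace its technical evolution, examine the technical details of the mainstream embeddings, and finish with some hands-on cases.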

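From a practical standpoint, fastText is now generally the first choice for word embedding; if more contextual information is needed, contextual embeddings such as ELMo can be used. Since 2018, a new paradigm based on pre-trained language models, such as ULMFiT and BERT, has become increasingly popular.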

On the sentence embedding side, from Skip-Thought in 2015 to Quick-Thought in 2018, unsupervised pre-trained sentence embeddings are seeing more and more use in industry.

The mind map of the technical topics is as follows:

Vector Semantics

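From a linguistic point of view, word meaning can be broken down into several aspects:

Synonyms: couch/sofa, car/automobile

Antonyms: long/short, big/little

Word similarity: cat and dog are not synonyms, but they are similar.

Word relatedness: words can be related without being similar. coffee and cup are not similar, but they are clearly related.

Semantic fields / topic models (semantic field, topic models such as LDA): some words belong to the same semantic domain and are strongly related to one another, e.g. restaurant (waiter, menu, plate, food, chef) and house (door, roof, kitchen, family, bed).

Semantic frames and roles: some words denote roles in the same event; buy, sell, and pay are different roles in a single purchase.

Hypernyms and hyponyms: a word that is the parent category of another is its hypernym, and the reverse is a hyponym, e.g. vehicle/car, mammal/dog, fruit/mango.

Connotations, emotions, sentiment, opinions: positive and negative emotion words such as happy/sad, and positive and negative evaluation words such as (great, love)/(terrible, hate).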

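A perfect vector representation would capture every aspect of word meaning listed above, but that is clearly unrealistic. The most successful model for representing word meaning so far is Vector Semantics, which is the foundation of word embedding techniques. Vector Semantics consists of two parts:

The distributional hypothesis (the theoretical basis of all semantic vectors): words that occur in similar contexts tend to have similar meanings, so a word is defined by its distribution in texts.

Defining the meaning of a word w as a vector, a point in N-dimensional semantic space, which is learned directly from the word's distribution in texts.

Representing word meaning as vectors makes it much more convenient to compute word similarity.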
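To make the similarity computation concrete, here is a minimal sketch using cosine similarity. Cosine is one common choice and is an assumption here, since the text does not name a specific measure. The three-dimensional vectors are made up for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "cat": [2.0, 1.0, 0.0],
    "dog": [2.0, 1.0, 0.5],
    "cup": [0.0, 0.5, 3.0],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_cup = cosine_similarity(toy_vectors["cat"], toy_vectors["cup"])
print(sim_cat_dog, sim_cat_cup)
```

With these toy values, cat scores much closer to dog than to cup, which is the kind of comparison a vector representation is meant to support.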

Co-occurrence Matrix and Basic Vector Semantics models

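Vector Semantics models are usually built on a co-occurrence matrix. A co-occurrence matrix records how elements co-occur, and is in effect a concrete implementation of the distributional hypothesis mentioned above. Depending on the elements involved, there are two main kinds of co-occurrence matrix: the term-document matrix (mostly used in retrieval) and the word-word matrix (mostly used for word embedding).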

■ Term-document matrix and TF-IDF model

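Term-document matrix: each row is a word in the vocabulary, each column is a document in the collection, and each value is the number of times that word (row) occurs in that document (column). A document can then be represented as a column vector, and two documents that contain similar words are generally similar (all sports docs will have similar entries).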
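The construction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; the function name `term_document_matrix` and the three toy documents are hypothetical.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the count of vocab[i] in docs[j]."""
    # Assumption: whitespace tokenization is enough for this toy example.
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[word] for c in counts] for word in vocab]
    return vocab, matrix


# Hypothetical toy corpus: two sports documents and one cooking document.
docs = [
    "the team won the match",
    "the team lost the match",
    "the chef cooked the food",
]
vocab, matrix = term_document_matrix(docs)

for word, row in zip(vocab, matrix):
    print(f"{word:8s}", row)

# A document is the column vector across all terms.
doc0_vector = [row[0] for row in matrix]
```

The two sports documents share the rows for team and match, while the cooking document does not, so their column vectors end up closer to each other.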

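The term-document matrix is widely used in information retrieval to find similar documents. The basic term-document matrix has a problem, however: when raw co-occurrence counts are used directly, frequent but uninformative words (stopwords) receive very large weights, which is why the TF-IDF model is needed.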

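TF-IDF (term frequency–inverse document frequency) is a basic weighting technique in NLP and the mainstream co-occurrence matrix weighting technique in information retrieval. It measures how important a word is to a document. The main idea is that if a word occurs frequently in one document (high TF) and rarely in other documents, the word is considered to discriminate well between categories. The formula is as follows: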
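One widely used formulation of this weighting is sketched below. It is an assumption offered for reference, not necessarily the exact variant the article displayed; other variants use raw counts for TF or add smoothing to IDF. Here count(t, d) is the number of times term t occurs in document d, N is the number of documents, and df_t is the number of documents that contain t.

```latex
\mathrm{tf}_{t,d} = \log_{10}\bigl(\mathrm{count}(t,d) + 1\bigr), \qquad
\mathrm{idf}_{t} = \log_{10}\frac{N}{\mathrm{df}_{t}}, \qquad
w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t}
```

Under this formulation, a term that occurs in every document has idf equal to 0, so its weight is 0 regardless of how often it occurs.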

The term-document matrix can therefore be improved into a TF-IDF weighting matrix, which suppresses very frequent, uninformative words (such as good).

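Practical aside: to compute the TF-IDF values for a corpus, the intuitive approach is to first count occurrences and build the term-document matrix, and then compute the TF-IDF weighting matrix from it.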
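The two-step approach can be sketched as follows, starting from an already-built count matrix. The weighting variant used here, log10(count + 1) for TF and log10(N / df) for IDF, is an assumption; the function name `tfidf_matrix` and the toy counts are hypothetical.

```python
import math


def tfidf_matrix(counts):
    """counts[i][j] is the raw count of term i in document j.

    Returns a matrix of the same shape holding tf * idf weights, with
    tf = log10(count + 1) and idf = log10(N / df) (an assumed variant).
    """
    n_docs = len(counts[0]) if counts else 0
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)  # number of documents containing the term
        idf = math.log10(n_docs / df) if df else 0.0
        weighted.append([math.log10(c + 1) * idf for c in row])
    return weighted


# Hypothetical counts: rows are terms, columns are three documents.
terms = ["the", "team", "chef"]
counts = [
    [2, 2, 2],  # "the" appears in every document
    [1, 1, 0],  # "team" appears in two documents
    [0, 0, 1],  # "chef" appears in one document
]
weights = tfidf_matrix(counts)

for term, row in zip(terms, weights):
    print(term, [round(w, 3) for w in row])
```

In this toy input, the word that appears in every document ends up with weight 0 everywhere, while the word confined to a single document gets the largest weight.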

■ Word-word matrix and PMI model

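Word-word matrix: similar to the term-document matrix, except that each column is a context word, and each cell is the number of times the target word (row) and the context word (column) co-occur in the corpus. The context is usually a window around the target word. The word-word matrix is often used to compute count-based word embeddings, and it captures syntactic/POS information well with a small context window and semantic information with a large context window.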
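A windowed co-occurrence count can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, a symmetric window, and no crossing of sentence boundaries; the function name `cooccurrence_counts` and the toy sentences are hypothetical. The counts are stored sparsely in nested dictionaries rather than as a dense matrix.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count how often each context word falls within `window` tokens of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target position itself
                    counts[target][tokens[j]] += 1
    return counts


# Hypothetical toy corpus.
sentences = [
    "i drink coffee from a cup",
    "i drink tea from a cup",
]
narrow = cooccurrence_counts(sentences, window=2)
wide = cooccurrence_counts(sentences, window=3)

print(dict(narrow["coffee"]))
print(dict(wide["coffee"]))
```

With a window of 2, coffee and cup never co-occur in these sentences; widening the window to 3 brings cup into the context of coffee, which illustrates how the window size changes what the matrix captures.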
