[Literature Reading] Term Set Expansion (Seed Word Expansion)

WinniToast

Last modified 2022-06-04 16:57:58


Tags: nlp

First published 2022-06-04 15:48:00

Article link: https://blog.csdn.net/awater_17/article/details/125121926


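This article takes a close look at the term set expansion problem in NLP, focusing on an algorithm based on multi-context term embeddings. The paper proposes an end-to-end workflow for finding terms with a similar functional meaning. The article first introduces two types of term set expansion: one based on topical similarity and one based on functional similarity. It then describes an iterative algorithm in detail, including generating the initial seed set, training the word embedding model, setting a threshold, and screening out wrong words with a binary classification model. Finally, the post's author promises to share a hands-on example to deepen understanding.

Summary generated by CSDN using AI technology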

Hello! Today I am going to read some literature about NLP, data governance, and platform digital enablement. To keep a record, I’ll put my reading notes on my CSDN blog. Feel free to get in touch with me!

Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow. Mamou et al. 2018

0 Overview

Overall, this paper is short and pithy. It has 4 sections and mainly proposes an algorithm for expanding a set of terms that share a similar functional meaning.

1 Introduction

To quickly understand which problem this paper addresses, we should first ask: what is Term Set Expansion (TSE)? The following figure shows two forms of similarity among terms.

TSE based on topical similarity
Given a word, find other words that share its topic. For example, we input the word “python” and want to find words expressing the same theme in our corpus, which is represented with word vectors trained on linear bag-of-words contexts. As a result, we find “bytecode”, “high-level programming language”... These words and phrases are descriptions of “python”.
TSE based on functional similarity
Again, we input the word “python”. This time, term set expansion generates terms like Java, C++, C#... Note the difference: Java plays a similar role to Python, but it is not a description of or a supplement to Python.
Knowing what term set expansion means, we can consider a more complicated situation. Please look at the following figure. Now we no longer input a single word; we input a set of terms and look for its expanded set. We add two new words and form two independent seed sets. In the first set, “yellow” and “orange” are both colors, so the words in the expanded set must also be color terms. The second seed set works the same way.
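
To make the seed-set idea concrete, here is a minimal sketch of my own (not code from the paper). It assumes tiny hand-made 3-dimensional vectors instead of a trained embedding model, and the second seed set (“apple”, “orange”) is a hypothetical example, since the notes above only spell out the color set. The helper names `cosine`, `centroid` and `expand` are also my own.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# The dimensions loosely mean: [color-ness, fruit-ness, programming-ness].
TOY_VECTORS = {
    "yellow": [0.9, 0.1, 0.0],
    "orange": [0.6, 0.6, 0.0],
    "red":    [0.95, 0.05, 0.0],
    "blue":   [0.9, 0.0, 0.1],
    "apple":  [0.1, 0.9, 0.0],
    "banana": [0.2, 0.9, 0.0],
    "python": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand(seed_set, vectors, top_k=2):
    """Rank non-seed terms by similarity to the centroid of the seed set."""
    center = centroid([vectors[term] for term in seed_set])
    candidates = [term for term in vectors if term not in seed_set]
    ranked = sorted(candidates, key=lambda t: cosine(vectors[t], center), reverse=True)
    return ranked[:top_k]


color_expansion = expand(["yellow", "orange"], TOY_VECTORS)
fruit_expansion = expand(["apple", "orange"], TOY_VECTORS, top_k=1)
print(color_expansion)
print(fruit_expansion)
```

The point of the sketch is that the same word, “orange”, pulls in different neighbors depending on which other seed it is grouped with, because the centroid of the seed set moves.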

2 Term Set Expansion Algorithm Overview

In this section, the authors introduce the term set expansion algorithm in detail. I understand it as a loop; you can look at the following flow chart.

Generate the original seed set for the first iteration by manual collection.
Train your word embedding model.
Set a threshold and find the words that have high similarity with the centroid of the seed set. At this step, the candidates are likely to include words that should not be placed in the expanded set.
Train a binary classification model to screen out the wrong words we don’t need.
Iterate!
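
The loop above can be sketched roughly as follows. This is my own simplified reading of the steps, not the paper's implementation: the embedding training step is replaced by hypothetical hand-made 2-dimensional vectors, and the binary classification model is replaced by a stand-in function `is_valid` that just looks up a hand-labelled reject list. The threshold value 0.9 and all function names are assumptions chosen for the toy data.

```python
import math

# Step 2 stand-in: hypothetical toy vectors instead of a trained embedding model.
TOY_VECTORS = {
    "python":   [0.9, 0.3],
    "java":     [0.95, 0.1],
    "c++":      [0.9, 0.15],
    "c#":       [0.85, 0.2],
    "bytecode": [0.7, 0.5],
    "snake":    [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def expand_once(seed_set, vectors, threshold):
    """Step 3: keep non-seed terms whose similarity to the seed centroid passes the threshold."""
    center = centroid([vectors[term] for term in seed_set])
    return {
        term
        for term, vec in vectors.items()
        if term not in seed_set and cosine(vec, center) >= threshold
    }


def iterate_expansion(seed_set, vectors, threshold, is_valid, max_rounds=5):
    """Steps 3 to 5: expand, filter the candidates, and repeat until nothing new is accepted."""
    current = set(seed_set)
    for _ in range(max_rounds):
        candidates = expand_once(current, vectors, threshold)
        accepted = {term for term in candidates if is_valid(term)}  # step 4 stand-in
        if not accepted:
            break
        current |= accepted
    return current


# Step 4 stand-in: a hand-labelled reject list instead of a trained binary classifier.
REJECTED = {"bytecode"}


def is_valid(term):
    return term not in REJECTED


# Step 1: a manually collected seed set.
first_round = expand_once({"python", "java"}, TOY_VECTORS, threshold=0.9)
result = iterate_expansion({"python", "java"}, TOY_VECTORS, threshold=0.9, is_valid=is_valid)
print(sorted(first_round))
print(sorted(result))
```

In this toy run, “bytecode” passes the similarity threshold but is removed by the filter, which is exactly the kind of wrong word that step 4 is meant to catch. A real setup would plug a trained classifier into `is_valid` and real embeddings into `TOY_VECTORS`.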

[Figure: flow chart of the algorithm]
You can read the following PPT for more details.