Annoyingly Simple Sentence Clustering


Making the machine understand the modalities of any language is at the core of almost all NLP tasks. Understanding patterns in the data and being able to segregate data points on the basis of similarities and differences can help in several ways. Some use cases are:


  • One may scale down massive data that has many unique data points which are nonetheless contextually similar in meaning. For example, in question-answer verification systems, one of the major problems is the repetition of answer statements; these data points are effectively duplicates and should be removed based on a threshold of semantic similarity.

  • In classification tasks, and especially while dealing with macaronic languages, one can map a cluster of similar data points to a label by relying on the similarity metric calculated.

  • Many natural language applications, sentiment and non-sentiment alike, such as semantic search, summarization, question answering, document classification, plagiarism detection, and sentiment analysis, depend on sentence similarity.


Macaronic Languages and the peril involved:


While working with macaronic languages, where there’s no grammar structure, no defined vocabulary and no basic rules to govern sentence formation, one can only rely on numbers to tell what the data says. That is, and will always be, something AI/ML engineers have to keep in mind: feed the data in a language the machine understands, because at the end of the day machines crunch numbers to tell the story hidden in the data.


Let’s begin with an example. The sentences are in Hinglish (Hindi written using Latin characters, with English words mixed in), to illustrate the difference between semantic and lexical similarity:


Sentence 1: “Mood kharab hai yaar aaj”


Sentence 2: “Mood kharab mat kar”


While sentence 1 has tones of sadness and disappointment, sentence 2 carries more of an anger connotation. These sentences are close in terms of lexical similarity but are placed at a distance in terms of semantic similarity.


To make the machine understand the difference between the two and map them to the respective labels, a solution can be devised using what Annoy (Approximate Nearest Neighbors Oh Yeah) has to offer (https://github.com/spotify/annoy). It is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.


Let’s understand more with a task example:


Say we want to build a model that predicts emojis depending on the text input given by a user. The data has been prepared in the following format:


  • badhai ho 💐
  • many many returns of the day happy birthday 🎂
  • be the way u r always strong personality 💪

One emoji per sentence. But an issue arose when sentences with similar context, and in some cases the same sentences with minor changes in wording, were mapped to different emojis. This would confuse the model, so we tweaked the task from Multi-Class Classification to Multi-Label Classification. But even in that case, mapping sentences with similar context to a unique emoji cluster was the ideal way forward.


Strategy Adopted:


The first and foremost step is to get the word vectors of the words in our corpus; remember, it’s a macaronic language, so no pre-trained model will work here.


  • Word Vectors using FastText:


A popular idea in modern machine learning is to represent words by vectors. These vectors capture hidden information about a language, like word analogies or semantics, and are also used to improve the performance of text classifiers. A FastText model was trained in an unsupervised fashion on the gathered user chat data, with emojis and other special characters removed and only one sentence per line.

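A minimal sketch of this step, assuming the fastText Python package and a cleaned corpus file named chat_corpus.txt (the filename, and dim=40 to match the Annoy index built below, are our assumptions, not from the original post):

import fasttext

# Unsupervised skip-gram training on the cleaned chat corpus
# (one sentence per line, emojis and special characters stripped).
model = fasttext.train_unsupervised('chat_corpus.txt', model='skipgram', dim=40)
model.save_model('hinglish_vectors.bin')

vec = model.get_word_vector('kharab')  # one 40-dimensional vector per word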

  • Sentence Vectors:


The word vectors were used to form a sentence vector by summing all the vectors fetched (one vector for each word of a sentence) and dividing by the number of words present, i.e. taking their average.

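A minimal sketch of this averaging, assuming the FastText model trained above (the helper name sentence_vector is ours):

import numpy as np

def sentence_vector(sentence, model):
    # One vector per word, averaged into a single sentence vector.
    vectors = [model.get_word_vector(w) for w in sentence.split()]
    if not vectors:
        return np.zeros(model.get_dimension())
    return np.mean(vectors, axis=0)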

  • Clustering similar sentences using ANNOY:


Using ANNOY we built a forest of trees that stores the sentence vectors index-wise, so that sentences similar to a given sentence vector can be found. For a given query sentence, it returns a list of the k most similar sentences and their index positions in the dataset. Based on the angular distance between the query sentence and its neighbours, a threshold was decided, and as many sentences as fulfilled that threshold criterion were collected from our dataset. These clusters of sentences were then mapped to the 5 most frequent emojis in the cluster. However, at times fewer than 5 emojis were found, so each similar cluster formed was mapped to a minimum of one and a maximum of 5 emojis, depending on the threshold.


  • The below function returns a set of indices of all the sentences that were found to be similar, i.e. at an angular distance ≤ threshold. input_df is the dataframe which holds the sentences.


def get_neighbours(index):
    k = 50  # number of neighbours being considered
    sentence_vector = t.get_item_vector(index)
    ids, distance = t.get_nns_by_vector(sentence_vector, k, include_distances=True)
    similarity = distance[-1]  # distance of the last (farthest) neighbour returned
    threshold = n  # to be decided for the respective task
    while similarity < threshold:
        k = 2 * k  # search for more sentences that lie within the threshold criterion
        ids, distance = t.get_nns_by_vector(sentence_vector, k, include_distances=True)
        similarity = distance[-1]
    indices = extract_index(ids, distance, threshold)
    return indices
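Once the indices of a cluster are collected, the cluster can be mapped to its most frequent emojis. A hedged sketch of that mapping step (the column name 'emoji' and the helper name cluster_emoji_labels are our assumptions, not from the original post):

from collections import Counter

def cluster_emoji_labels(indices, input_df, max_labels=5):
    # Count the emojis attached to each sentence in the cluster and keep
    # the (at most) 5 most frequent ones as the cluster's label set.
    emojis = [input_df.iloc[i]['emoji'] for i in indices]
    return [e for e, _ in Counter(emojis).most_common(max_labels)]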

The tree is built by calling the given Python code:


from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)
t.build(10)  # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann')  # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000))  # will find the 1000 nearest neighbors
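The snippet above indexes random vectors for illustration; in our task the items are the sentence vectors. A sketch under the same assumptions as before (model, sentence_vector and input_df come from the earlier steps):

from annoy import AnnoyIndex

f = model.get_dimension()  # 40 in our case
t = AnnoyIndex(f, 'angular')
for i, sent in enumerate(input_df['sentence']):
    t.add_item(i, sentence_vector(sent, model).tolist())
t.build(10)  # 10 trees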
  • E.g.:


Query Sentence: “happy birthday” (the closer the angular distance is to 0, the better the similarity).


Similar sentences and respective angular distances:


birthday happy 0.0
happy birthday happy birthday 0.0
birthday happy birthday happy 0.0
happy birthday happy birthday yaar 0.14156264066696167
happy birthday happy birthday hap 0.15268754959106445
happy happy wala birthday birthday 0.16257968544960022
happy birthday maa happy birthday 0.17669659852981567

This entire cluster was mapped to this emoji mapping: [🎂, 😘, 😍, 🙏, 😂]


Similarly:


i’m very very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]
so i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i’m so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
yes i m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
yes am very happy today [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i also very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
im so very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very very hppy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very happy friends [‘😊’, ‘😍’, ‘😁’, ‘😘’]
oh i’m very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very very happ [‘😊’, ‘😍’, ‘😁’, ‘😘’]
im very happy kal [‘😊’, ‘😍’, ‘😁’, ‘😘’]
i am very haappy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
sister l am very happy [‘😊’, ‘😍’, ‘😁’, ‘😘’]
today i’m so very happpy [‘😊’, ‘😍’, ‘😁’, ‘😘’]

This entire exercise helped us map similar context in sentences to the same emoji vector. It is important for the model to understand that such context can be mapped to these 5 emojis, not with each emoji as a different class but as different labels. Thus, it becomes a problem of multi-label classification. Multi-label classification originated from the investigation of the text categorization problem, where each document may belong to several predefined topics simultaneously. In our case, each text can belong to one or several emojis.

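As an illustration of what multi-label targets look like, a sketch of ours (not from the original post) using scikit-learn's MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer

# Each sentence carries a set of emoji labels; binarize them into a 0/1
# indicator matrix suitable for training multi-label classifiers.
labels = [['😊', '😍'], ['🎂'], ['😊', '😁', '😘']]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # shape: (n_sentences, n_unique_emojis)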

Source: https://medium.com/bobble-engineering/annoyingly-simple-sentence-clustering-12de1316abf4
