Quality Metrics for NLU/Chatbot Training Data, Part 2: Embeddings


What are Embeddings? What is similarity, cohesion and separation?


This article series provides an introduction to important quality metrics for your NLU engine and your chatbot training data. We will focus on practical usage of the introduced metrics, not on the mathematical and statistical background — I will add links to other articles for this purpose.


This is part 2 of the Quality Metrics for NLU/Chatbot Training Data series of articles.


For this article series, you should have an understanding of what NLU and NLP are, of the vocabulary involved (intent, entity, utterance), and of the concepts (intent resolution, intent confidence, user examples).

What are Embeddings?

Embeddings are a type of word or sentence representation that allows words or sentences with similar meaning to have a similar representation.


While this sounds complex, the concept is easy to understand when looking at a scatter chart and an example:

  • each colored dot represents a word or a sentence
  • the smaller the distance between two dots, the more similar the words or sentences are (in this case, from a semantic point of view)
  • the larger the distance, the less similar they are

As an example:


  • “I’d like to order a drink”
  • “I want iced coffee”
  • “not interested”

The first two sentences will be rather close in the Embeddings space, while the third one will appear distant to both of the first two.


Mathematically speaking, an embedding is a vector in an n-dimensional space; the higher n, the more complex the concepts that can be handled. It is not a trivial task to map natural language into an n-dimensional space while preserving semantic similarity. Fortunately, there are ready-to-use models available for the most widely spoken languages, for example the Universal Sentence Encoder developed by Google.
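To make "close in the embedding space" concrete, here is a minimal sketch of how similarity between embedding vectors is commonly measured, using cosine similarity. The three vectors are hypothetical 4-dimensional stand-ins for the example sentences above; a real encoder such as the Universal Sentence Encoder would produce 512-dimensional vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings for the three example sentences.
order_drink    = [0.9, 0.1, 0.3, 0.0]   # "I'd like to order a drink"
want_coffee    = [0.8, 0.2, 0.4, 0.1]   # "I want iced coffee"
not_interested = [-0.2, 0.9, -0.1, 0.5] # "not interested"

print(cosine_similarity(order_drink, want_coffee))     # high: similar meaning
print(cosine_similarity(order_drink, not_interested))  # low: dissimilar
```

With real sentence embeddings the same code applies unchanged, only the vectors are longer.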

An encoder is a neural network that takes an input and outputs a feature map/vector/tensor, i.e. a point in n-dimensional space.

Reducing this n-dimensional vector to a 2D representation that can be visualized on a flat scatter chart is a matter of Principal Component Analysis (PCA).
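As a rough sketch of that reduction step, PCA can be done by hand with a singular value decomposition: center the embedding matrix and project it onto the top two right singular vectors. The 16-dimensional random "embeddings" here are placeholders for real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of 6 sentence embeddings in a 16-dimensional space.
embeddings = rng.normal(size=(6, 16))

# PCA via SVD: center the data, then project onto the top-2 principal axes.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T  # one (x, y) point per sentence

print(points_2d.shape)  # (6, 2): ready to plot on a flat scatter chart
```

In practice you would use a ready-made implementation such as scikit-learn's `PCA`, but the result is the same 2D coordinates per sentence.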

Using Embeddings for Training Data Analysis

When training an NLU engine for chatbots, you typically have labeled training data available: a list of intents, each with a couple of training phrases. Our tool of choice for showing a sample data analysis workflow is Botium Box.

Botium first generates semantic embeddings of the training phrases using the Universal Sentence Encoder module and visualizes them on a 2D map. Based on the similarity between the training phrases, the average similarity between intents is computed (separation), as well as the average similarity of phrases within an intent (cohesion). This approach helps identify training phrases that might confuse your chatbot, based on their similarity in the embedding space.

Utterance Similarity

Training phrases in different intents that have a high similarity value can confuse the NLU engine, and could lead to the user input being directed to the wrong intent.
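A check of this kind can be sketched as follows: compute pairwise cosine similarities between phrases belonging to different intents and flag pairs above a threshold. The phrase embeddings and the threshold value are hypothetical; a real pipeline would obtain the vectors from a sentence encoder.

```python
import numpy as np
from itertools import product

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical, already-computed embeddings, keyed by (intent, phrase).
phrases = {
    ("order_drink", "I'd like to order a drink"): np.array([0.9, 0.1, 0.3]),
    ("order_drink", "I want iced coffee"):        np.array([0.8, 0.2, 0.4]),
    ("decline",     "not interested"):            np.array([-0.2, 0.9, -0.1]),
    ("decline",     "I would like to decline"):   np.array([0.7, 0.3, 0.3]),
}

# Flag phrase pairs from *different* intents whose similarity is suspiciously high.
THRESHOLD = 0.9
confusing = [
    (p1, p2, cos(v1, v2))
    for ((i1, p1), v1), ((i2, p2), v2) in product(phrases.items(), repeat=2)
    if i1 < i2 and cos(v1, v2) > THRESHOLD
]
for p1, p2, s in confusing:
    print(f"{p1!r} vs {p2!r}: similarity {s:.2f}")
```

In this toy data, "I would like to decline" sits close to both drink-ordering phrases, which is exactly the kind of pair an utterance similarity table surfaces for review.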

[Image: Utterance similarity]

Intent Separation

Given two intents, the average distance between each pair of training phrases across the two intents is shown.

[Image: Intent separation]

Intent Cohesion

Cohesion is the average similarity value between each pair of training phrases in the same intent; that value is computed for each intent. The higher the intent cohesion value, the better the intent's training phrases.
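The two metrics can be sketched directly from their definitions: cohesion averages pairwise similarity within one intent, separation averages pairwise distance (here taken as 1 minus cosine similarity) across two intents. The embeddings are hypothetical toy vectors, and the exact distance formula may differ from what a specific tool uses.

```python
import numpy as np
from itertools import combinations, product

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical phrase embeddings grouped by intent.
intents = {
    "order_drink": [np.array([0.9, 0.1, 0.3]), np.array([0.8, 0.2, 0.4])],
    "decline":     [np.array([-0.2, 0.9, -0.1]), np.array([-0.1, 0.8, 0.0])],
}

def cohesion(vectors):
    """Average pairwise similarity inside one intent (higher is better)."""
    return float(np.mean([cos(a, b) for a, b in combinations(vectors, 2)]))

def separation(vs1, vs2):
    """Average pairwise distance (1 - similarity) between two intents (higher is better)."""
    return float(np.mean([1 - cos(a, b) for a, b in product(vs1, vs2)]))

print({name: round(cohesion(vs), 2) for name, vs in intents.items()})
print(round(separation(intents["order_drink"], intents["decline"]), 2))
```

Both intents in this toy data are cohesive and well separated; in real training data, a low cohesion or low separation score points at the phrases to rework.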

[Image: Intent cohesion]

Improve Chatbot Training Phrases

To improve the quality of the training phrases for your intents, consider the following approaches:

  • Find the phrases in different intents with high similarity in the Utterance Similarity table, and change or remove them
  • For intents with low cohesion, add more meaningful training phrases
  • For intent pairs with low separation, investigate training phrases

Give Botium Box a test drive today, starting with the free Community Edition. We are happy to hear from you if you find it useful!

Looking for contributors

Please take part in the Botium community to bring chatbots forward! By contributing, you help increase the quality of chatbots worldwide, leading to greater end-user acceptance, which in turn will bring your own chatbot forward! Start here

Translated from: https://medium.com/analytics-vidhya/quality-metrics-for-nlu-chatbot-training-data-part-2-embeddings-57aa341d81fa
