Quality Metrics for NLU/Chatbot Training Data, Part 2: Embeddings


What are Embeddings? What is similarity, cohesion and separation?


This article series provides an introduction to important quality metrics for your NLU engine and your chatbot training data. We will focus on practical usage of the introduced metrics, not on the mathematical and statistical background — I will add links to other articles for this purpose.


This is part 2 of the Quality Metrics for NLU/Chatbot Training Data series of articles.


For this article series, you should have an understanding of what NLU and NLP are, of the vocabulary involved (intent, entity, utterance), and of the concepts (intent resolution, intent confidence, user examples).

What are Embeddings?

Embeddings are a type of word or sentence representation that allows words or sentences with similar meaning to have a similar representation.


While this sounds complex, the concept is easy to understand when looking at a scatter chart and an example:

  • each colored dot represents a word or a sentence
  • the smaller the distance between two dots, the more similar the words or sentences are (in this case, from a semantic point of view)
  • the larger the distance, the less similar they are

As an example:


  • “I’d like to order a drink”
  • “I want iced coffee”
  • “not interested”

The first two sentences will be rather close in the Embeddings space, while the third one will appear distant to both of the first two.


Mathematically speaking, an embedding is a vector in an n-dimensional space; the higher n, the more complex the concepts that can be handled. It is not a trivial task to map natural language into an n-dimensional space while preserving semantic similarity. Fortunately, there are ready-to-use models available for the most widely spoken languages, for example the Universal Sentence Encoder developed by Google.
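To make "close in the embedding space" concrete, here is a minimal sketch of how similarity between embedding vectors is commonly measured, using cosine similarity. The three vectors are hypothetical 4-dimensional stand-ins for the example sentences above; a real encoder such as the Universal Sentence Encoder would produce 512-dimensional vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 = same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings for the three example sentences.
order_drink    = [0.9, 0.1, 0.3, 0.0]   # "I'd like to order a drink"
want_coffee    = [0.8, 0.2, 0.4, 0.1]   # "I want iced coffee"
not_interested = [-0.2, 0.9, -0.1, 0.5] # "not interested"

print(cosine_similarity(order_drink, want_coffee))     # high: similar meaning
print(cosine_similarity(order_drink, not_interested))  # low: dissimilar
```

With real sentence embeddings the same code applies unchanged, only the vectors are longer.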

An encoder is a neural network that takes an input and outputs a feature map/vector/tensor, i.e. a point in n-dimensional space.

Reducing this n-dimensional vector to a 2D representation that can be visualized on a flat scatter chart is a matter of Principal Component Analysis (PCA).
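As a rough sketch of that reduction step, PCA can be done by hand with a singular value decomposition: center the embedding matrix and project it onto the top two right singular vectors. The 16-dimensional random "embeddings" here are placeholders for real encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical batch of 6 sentence embeddings in a 16-dimensional space.
embeddings = rng.normal(size=(6, 16))

# PCA via SVD: center the data, then project onto the top-2 principal axes.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ vt[:2].T  # one (x, y) point per sentence

print(points_2d.shape)  # (6, 2): ready to plot on a flat scatter chart
```

In practice you would use a ready-made implementation such as scikit-learn's `PCA`, but the result is the same 2D coordinates per sentence.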

Using Embeddings for Training Data Analysis

When training an NLU engine for chatbots, you typically have labeled training data available: a list of intents, each with a couple of training phrases. Our tool of choice for showing a sample data analysis workflow is Botium Box.

Botium first generates semantic embeddings of the training phrases using the Universal Sentence Encoder module and visualizes them on a 2D map. Based on the similarity between the training phrases, the average similarity between intents is computed (separation), as well as the average similarity of phrases within an intent (cohesion). This approach helps identify training phrases that might confuse your chatbot, based on their similarity in the embedding space.

Utterance Similarity

Training phrases in different intents that have a high similarity value can confuse the NLU engine, and could lead to the user input being directed to the wrong intent.
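A check of this kind can be sketched as follows: compute pairwise cosine similarities between phrases belonging to different intents and flag pairs above a threshold. The phrase embeddings and the threshold value are hypothetical; a real pipeline would obtain the vectors from a sentence encoder.

```python
import numpy as np
from itertools import product

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical, already-computed embeddings, keyed by (intent, phrase).
phrases = {
    ("order_drink", "I'd like to order a drink"): np.array([0.9, 0.1, 0.3]),
    ("order_drink", "I want iced coffee"):        np.array([0.8, 0.2, 0.4]),
    ("decline",     "not interested"):            np.array([-0.2, 0.9, -0.1]),
    ("decline",     "I would like to decline"):   np.array([0.7, 0.3, 0.3]),
}

# Flag phrase pairs from *different* intents whose similarity is suspiciously high.
THRESHOLD = 0.9
confusing = [
    (p1, p2, cos(v1, v2))
    for ((i1, p1), v1), ((i2, p2), v2) in product(phrases.items(), repeat=2)
    if i1 < i2 and cos(v1, v2) > THRESHOLD
]
for p1, p2, s in confusing:
    print(f"{p1!r} vs {p2!r}: similarity {s:.2f}")
```

In this toy data, "I would like to decline" sits close to both drink-ordering phrases, which is exactly the kind of pair an utterance similarity table surfaces for review.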

[Image: Utterance similarity]

Intent Separation

Given two intents, the average distance between each pair of training phrases across the two intents is shown.

[Image: Intent separation]

Intent Cohesion

Cohesion is the average similarity value between each pair of training phrases in the same intent; that value is computed for each intent. The higher the intent cohesion value, the better the intent's training phrases.
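The two metrics can be sketched directly from their definitions: cohesion averages pairwise similarity within one intent, separation averages pairwise distance (here taken as 1 minus cosine similarity) across two intents. The embeddings are hypothetical toy vectors, and the exact distance formula may differ from what a specific tool uses.

```python
import numpy as np
from itertools import combinations, product

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical phrase embeddings grouped by intent.
intents = {
    "order_drink": [np.array([0.9, 0.1, 0.3]), np.array([0.8, 0.2, 0.4])],
    "decline":     [np.array([-0.2, 0.9, -0.1]), np.array([-0.1, 0.8, 0.0])],
}

def cohesion(vectors):
    """Average pairwise similarity inside one intent (higher is better)."""
    return float(np.mean([cos(a, b) for a, b in combinations(vectors, 2)]))

def separation(vs1, vs2):
    """Average pairwise distance (1 - similarity) between two intents (higher is better)."""
    return float(np.mean([1 - cos(a, b) for a, b in product(vs1, vs2)]))

print({name: round(cohesion(vs), 2) for name, vs in intents.items()})
print(round(separation(intents["order_drink"], intents["decline"]), 2))
```

Both intents in this toy data are cohesive and well separated; in real training data, a low cohesion or low separation score points at the phrases to rework.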

[Image: Intent cohesion]

Improve Chatbot Training Phrases

To improve the quality of the training phrases for your intents, consider the following approaches:

  • Find the phrases in different intents with high similarity in the Utterance Similarity table, and change or remove them
  • For intents with low cohesion, add more meaningful training phrases
  • For intent pairs with low separation, investigate training phrases

Give Botium Box a test drive today, starting with the free Community Edition. We are happy to hear from you if you find it useful!

Looking for contributors

Please take part in the Botium community to bring chatbots forward! By contributing, you help increase the quality of chatbots worldwide, leading to greater end-user acceptance, which in turn will bring your own chatbot forward! Start here

Translated from: https://medium.com/analytics-vidhya/quality-metrics-for-nlu-chatbot-training-data-part-2-embeddings-57aa341d81fa
