Rasa: Training Data Format

Latest recommended article published on 2024-02-22 01:16:26

codelyq

Article link: https://blog.csdn.net/weixin_41748874/article/details/136223244

Training Data Format

This page describes the different types of training data that go into a Rasa assistant and how this training data is structured.

Overview

Rasa uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.

What does this tell us? That Rasa treats NLU data, stories, and rules alike as training data.

You can split the training data over any number of YAML files, and each file can contain any combination of NLU data, stories, and rules. The training data parser determines the training data type using top level keys.

Note the concept introduced here: top level keys. The key at the outermost level of a file is what tells the parser which kind of training data follows.
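To make the idea of top level keys concrete, here is a minimal sketch that lists the unindented keys in a piece of YAML text. It is only an illustration and not how Rasa's parser works: the helper name top_level_keys and the sample text are my own, and a real project would use a proper YAML loader instead of a regular expression.

```python
import re
from typing import List

# Keys named in the format description; used only to sanity-check the result.
KNOWN_KEYS = {"version", "nlu", "stories", "rules"}


def top_level_keys(yaml_text: str) -> List[str]:
    """Return the unindented `key:` names in the order they appear.

    Hypothetical helper: it only looks at lines that start in column 0,
    so list items ("- intent: ...") and nested keys are ignored.
    """
    keys = []
    for line in yaml_text.splitlines():
        match = re.match(r"^([A-Za-z_][A-Za-z0-9_]*):", line)
        if match:
            keys.append(match.group(1))
    return keys


sample = """\
version: "3.1"

nlu:
- intent: greet
  examples: |
    - Hey

rules:
- rule: Greet user
  steps:
  - intent: greet
"""

found = top_level_keys(sample)
print(found)  # ['version', 'nlu', 'rules']
```

The file above would therefore be read as containing NLU data and rules, but no stories.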

The domain uses the same YAML format as the training data and can also be split across multiple files or combined in one file. The domain includes the definitions for responses and forms. See the documentation for the domain for information on how to format your domain file.

This tells us two things:

Training data is managed in YAML and covers NLU data, stories, and rules.

Domain data is also managed in YAML and covers, among other things, responses and forms.

LEGACY FORMATS

Looking for Markdown data format?

The Markdown format was removed in Rasa 3.0, but you can still find the documentation for markdown NLU data and markdown stories. If you still have your training data in Markdown format then the recommended approach is to use Rasa 2.x to convert your data from Markdown to YAML. The migration guide explains how to do this.

In other words, training data and stories could previously be written in Markdown, and Rasa 2.x includes tooling to convert Markdown to YAML. At least, that is my understanding.

High-Level Structure

Each file can contain one or more keys with corresponding training data.

One file can contain multiple keys, but each key can only appear once in a single file.

The available keys are:

version

nlu

stories

rules

You should specify the version key in all YAML training data files. If you don’t specify a version key in your training data file, Rasa will assume you are using the latest training data format specification supported by the version of Rasa you have installed. Training data files whose version is greater than the one supported by the Rasa version installed on your machine will be skipped. Currently, the latest training data format specification for Rasa 3.x is 3.1.
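The skipping rule can be sketched in a few lines. This is my own illustration of the rule as described, not Rasa's actual implementation; the function names parse_version and should_skip are hypothetical.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as "3.1" into a comparable tuple (3, 1)."""
    return tuple(int(part) for part in text.split("."))


def should_skip(file_version: Optional[str], supported_version: str) -> bool:
    """Return True when a file declares a newer format than is supported.

    A missing version key (None) is treated as "latest supported format",
    so such a file is never skipped.
    """
    if file_version is None:
        return False
    return parse_version(file_version) > parse_version(supported_version)


print(should_skip("3.1", "3.1"))  # False
print(should_skip("3.2", "3.1"))  # True
print(should_skip(None, "3.1"))   # False
```

Comparing tuples of integers rather than raw strings avoids the trap where "3.10" would sort before "3.9".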

Example

Here’s a short example which keeps all training data in a single file:

version: "3.1"

nlu:
- intent: greet
  examples: |
    - Hey
    - Hi
    - hey there [Sara](name)

- intent: faq/language
  examples: |
    - What language do you speak?
    - Do you only handle english?

stories:
- story: greet and faq
  steps:
  - intent: greet
  - action: utter_greet
  - intent: faq
  - action: utter_faq

rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet

Here stories capture conversation flows, while rules capture fixed patterns that the assistant should always follow.

To specify your test stories, you need to put them into a separate file:

stories:
- story: greet and ask language
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_greet
  - user: |
      what language do you speak
    intent: faq/language
  - action: utter_faq

In a test story, each turn cycles through three things: user, intent, and action.

Test stories use the same format as the story training data and should be placed in a separate file with the prefix test_.
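As a small illustration of the naming convention, the sketch below separates test story files from training files by the test_ prefix. The file paths and the helper name split_by_role are made up for this example, and this is not Rasa's own file discovery logic.

```python
from pathlib import PurePath
from typing import Iterable, List, Tuple


def split_by_role(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (training files, test files) by the `test_` prefix."""
    training, tests = [], []
    for raw in paths:
        # Only the file name matters, not the directory it lives in.
        name = PurePath(raw).name
        if name.startswith("test_"):
            tests.append(raw)
        else:
            training.append(raw)
    return training, tests


# Hypothetical project layout.
training, tests = split_by_role(
    ["data/nlu.yml", "data/stories.yml", "tests/test_stories.yml"]
)
print(training)  # ['data/nlu.yml', 'data/stories.yml']
print(tests)     # ['tests/test_stories.yml']
```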

THE | SYMBOL

As shown in the above examples, the user and examples keys are followed by the | (pipe) symbol.

In YAML, | marks a literal block scalar: a multi-line string whose line breaks and indentation are preserved.

This keeps special symbols such as " and ' usable in the training examples without escaping.
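Because of the | symbol, the whole examples block reaches the program as one multi-line string rather than as a YAML list. The sketch below shows one way such a string could be turned into individual examples. It is an assumption about how the text might be consumed, written with a hypothetical helper split_examples, and it does not reproduce Rasa's parser.

```python
from typing import List


def split_examples(block: str) -> List[str]:
    """Turn a literal block such as "- Hey\n- Hi\n" into ["Hey", "Hi"]."""
    examples = []
    for line in block.splitlines():
        line = line.strip()
        # Each example is written as a "- " bullet inside the string.
        if line.startswith("- "):
            examples.append(line[2:])
    return examples


# The string a YAML loader would hand over for an `examples: |` block.
block = "- Hey\n- Hi\n- What's the \"balance\" on my account?\n"

for example in split_examples(block):
    print(example)
```

Note that the quotes and the apostrophe in the last example survive untouched, which is the point of using a literal block.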

NLU Training Data

NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.

NLU training data is defined under the nlu key. Items that can be added under this key are:

Training examples

Training examples grouped by user intent, optionally with annotated entities (entity and value annotations):

nlu:
- intent: check_balance
  examples: |
    - What's my [credit](account) balance?
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

Synonyms

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

Regular expressions

nlu:
- regex: account_number
  examples: |
    - \d{10,12}

Lookup tables (effectively enumerations)

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America

Training Examples

Training examples are grouped by intent and listed under the examples key.

Usually, you’ll list one example per line as follows:

nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up

However, it’s also possible to use an extended format if you have a custom NLU component and need metadata for your examples:

nlu:
- intent: greet
  examples:
  - text: |
      hi
    metadata:
      sentiment: neutral
  - text: |
      hey there!

In the example above, sentiment is the metadata key and neutral is its value.

The metadata key can contain arbitrary key-value data that is tied to an example and accessible by the components in the NLU pipeline. In the example above, the sentiment metadata could be used by a custom component in the pipeline for sentiment analysis.

You can also specify this metadata at the intent level:

nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
  - text: |
      hi
  - text: |
      hey there!

In this case, the content of the metadata key is passed to every intent example.
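The effect of intent-level metadata can be pictured as copying one dictionary onto every example of that intent. The sketch below does exactly that on a plain Python dictionary shaped like the YAML above. The function name with_intent_metadata and the output layout are my own choices for illustration; how Rasa stores the metadata internally may well differ.

```python
from typing import Any, Dict, List


def with_intent_metadata(intent_block: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Attach a copy of the intent-level metadata to every example."""
    shared = intent_block.get("metadata", {})
    result = []
    for example in intent_block["examples"]:
        result.append(
            {
                "intent": intent_block["intent"],
                # A literal block ends with a newline, so strip it.
                "text": example["text"].strip(),
                # Copy so that examples do not share one mutable dict.
                "metadata": dict(shared),
            }
        )
    return result


# The greet intent from above, as a loader might represent it.
greet = {
    "intent": "greet",
    "metadata": {"sentiment": "neutral"},
    "examples": [{"text": "hi\n"}, {"text": "hey there!\n"}],
}

for item in with_intent_metadata(greet):
    print(item["text"], item["metadata"])
```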

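So besides ordinary intents there are also retrieval intents. My understanding is that a retrieval intent is one whose reply has to be looked up, in this case from a set of predefined responses.

If you want to specify retrieval intents, then your NLU examples will look as follows: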

nlu:
- intent: chitchat/ask_name
  examples: |
    - What is your name?
    - May I know your name?
    - What do people call you?
    - Do you have a name for yourself?

- intent: chitchat/ask_weather
  examples: |
    - What's the weather like today?
    - Does it look sunny outside today?
    - Oh, do you mind checking the weather for me please?
    - I like sunny days in Berlin.

All retrieval intents have a suffix added to them which identifies a particular response key for your assistant. In the above example, ask_name and ask_weather are the suffixes. The suffix is separated from the retrieval intent name by a / delimiter.

SPECIAL MEANING OF /

As shown in the above examples, the / symbol is reserved as a delimiter to separate retrieval intents from their associated response keys. Make sure not to use it in the name of your intents.

In chitchat/ask_name above, chitchat is the retrieval intent and ask_name is the response key.
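Splitting a name on the / delimiter is simple enough to show directly. This is only a sketch with a hypothetical helper split_retrieval_intent, not code taken from Rasa.

```python
from typing import Optional, Tuple


def split_retrieval_intent(name: str) -> Tuple[str, Optional[str]]:
    """Split "chitchat/ask_name" into ("chitchat", "ask_name").

    An ordinary intent without "/" is returned with None as response key.
    """
    intent, delimiter, response_key = name.partition("/")
    return intent, (response_key if delimiter else None)


print(split_retrieval_intent("chitchat/ask_name"))  # ('chitchat', 'ask_name')
print(split_retrieval_intent("greet"))              # ('greet', None)
```

This also shows why / must not appear in an ordinary intent name: the part after it would be mistaken for a response key.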

Entities

Entities are structured pieces of information that can be extracted from a user's message.

Entities are annotated in training examples with the entity's name.

In addition to the entity name, you can annotate an entity with synonyms, roles, or groups.

In training examples, entity annotation would look like this:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [savings](account) account
    - how much money is in my [checking]{"entity": "account"} account
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

The full possible syntax for annotating an entity is:

[<entity-text>]{"entity": "<entity name>", "role": "<role name>", "group": "<group name>", "value": "<entity synonym>"}

The keywords role, group, and value are optional in this notation. The value field refers to synonyms. To understand what the labels role and group are for, see the section on entity roles and groups.
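To see what the two annotation styles carry, here is a minimal sketch that extracts the plain text and the entities from one annotated example. It is my own regular-expression approximation of the syntax described above, not Rasa's parser: the names ANNOTATION and parse_example are hypothetical, and the pattern assumes a flat JSON object with no nested braces.

```python
import json
import re

# Matches [text](entity) or [text]{"entity": "...", ...}
ANNOTATION = re.compile(
    r"\[(?P<text>[^\]]+)\]"
    r"(?:\((?P<short>[^)]+)\)|(?P<long>\{[^}]+\}))"
)


def parse_example(example):
    """Return (plain_text, entities) for one annotated training example."""
    parts, entities, cursor = [], [], 0
    for match in ANNOTATION.finditer(example):
        parts.append(example[cursor:match.start()])
        start = sum(len(part) for part in parts)  # offset in the plain text
        text = match.group("text")
        if match.group("short"):
            info = {"entity": match.group("short")}
        else:
            info = json.loads(match.group("long"))
        entity = {
            "start": start,
            "end": start + len(text),
            "entity": info["entity"],
            # Without an explicit value, the literal text is the value.
            "value": info.get("value", text),
        }
        for optional in ("role", "group"):
            if optional in info:
                entity[optional] = info[optional]
        entities.append(entity)
        parts.append(text)
        cursor = match.end()
    parts.append(example[cursor:])
    return "".join(parts), entities


text, entities = parse_example("What's my [credit](account) balance?")
print(text)      # What's my credit balance?
print(entities)  # [{'start': 10, 'end': 16, 'entity': 'account', 'value': 'credit'}]
```

In the long form, the value field replaces the literal text, which is how an inline synonym such as [credit card account]{"entity":"account","value":"credit"} ends up with the value credit.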

Synonyms

Synonyms normalize your training data by mapping an extracted entity to a value other than the literal text extracted. You can define synonyms using the format:

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

You can also define synonyms in-line in your training examples by specifying the value of the entity:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [credit card account]{"entity": "account", "value": "credit"}
    - how much do I owe on my [credit account]{"entity": "account", "value": "credit"}

Read more about synonyms on the NLU Training Data page.
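Conceptually, a synonym definition is a lookup from surface forms to one canonical value. The sketch below inverts the credit definition from above into such a lookup and applies it to extracted entity values. The helper names build_synonym_map and normalize are hypothetical, and the lowercasing is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, List


def build_synonym_map(definitions: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert {canonical: [surface forms]} into {surface form: canonical}."""
    mapping = {}
    for canonical, surface_forms in definitions.items():
        for form in surface_forms:
            mapping[form.lower()] = canonical
    return mapping


def normalize(value: str, mapping: Dict[str, str]) -> str:
    """Replace an extracted value by its canonical form, if one is defined."""
    return mapping.get(value.lower(), value)


# Mirrors the `synonym: credit` block shown above.
synonyms = build_synonym_map(
    {"credit": ["credit card account", "credit account"]}
)

print(normalize("credit card account", synonyms))  # credit
print(normalize("savings", synonyms))              # savings
```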

Regular Expressions

You can use regular expressions to improve intent classification and entity extraction using the RegexFeaturizer and RegexEntityExtractor components.

The format for defining a regular expression is as follows:

nlu:
- regex: account_number
  examples: |
    - \d{10,12}

Here account_number is the name of the regular expression. When used as features for the RegexFeaturizer the name of the regular expression does not matter. When using the RegexEntityExtractor, the name of the regular expression should match the name of the entity you want to extract.

Read more about when and how to use regular expressions with each component on the NLU Training Data page.
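It can help to try a pattern on its own before putting it into the training data. The sketch below runs the account_number pattern from above with Python's re module. The sample messages and the helper name find_account_numbers are made up for illustration, and nothing here involves Rasa's components.

```python
import re
from typing import List

# The pattern from the `regex: account_number` block above.
ACCOUNT_NUMBER = re.compile(r"\d{10,12}")


def find_account_numbers(message: str) -> List[str]:
    """Return every run of 10 to 12 digits found in the message."""
    return ACCOUNT_NUMBER.findall(message)


print(find_account_numbers("my account is 1234567890 thanks"))  # ['1234567890']
print(find_account_numbers("call 12345"))                       # []
# A 13-digit number still yields a match on its first 12 digits,
# because the pattern has no word boundaries.
print(find_account_numbers("1234567890123"))                    # ['123456789012']
```

If longer digit runs should be rejected, the pattern would need boundaries, for example \b\d{10,12}\b.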

Lookup Tables

Lookup tables are lists of words used to generate case-insensitive regular expression patterns. The format is as follows:

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Bank of America

When you supply a lookup table in your training data, the contents of that table are combined into one large regular expression. This regex is used to check each training example to see if it contains matches for entries in the lookup table.

Lookup table regexes are processed identically to the regular expressions directly specified in the training data and can be used either with the RegexFeaturizer or with the RegexEntityExtractor. The name of the lookup table is subject to the same constraints as the name of a regex feature.
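The idea of combining a lookup table into one case-insensitive pattern can be sketched as follows. This is an approximation for illustration only: the helper name lookup_pattern, the word boundaries, and the longest-first ordering are my own assumptions, and the exact pattern Rasa builds may differ.

```python
import re
from typing import Iterable, Pattern


def lookup_pattern(entries: Iterable[str]) -> Pattern[str]:
    """Combine lookup table entries into one case-insensitive regex."""
    # Escape each entry so it is matched literally, and try longer
    # entries first so they win over shorter ones sharing a prefix.
    alternatives = sorted((re.escape(entry) for entry in entries), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# The `lookup: banks` table from the overview above.
banks = lookup_pattern(["JPMC", "Comerica", "Bank of America"])

print(banks.findall("I bank with jpmc and Bank of America"))
# ['jpmc', 'Bank of America']
```

The lowercase jpmc is found because of re.IGNORECASE, while the standalone word bank is not, since it is not an entry in the table.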