Rasa Tutorial Series - NLU - 4 - Components

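Note:
For clarity, the pre-defined pipelines were officially renamed as of Rasa NLU 0.15 to reflect what they do rather than which libraries they use. The tensorflow_embedding pipeline is now called supervised_embeddings, and spacy_sklearn is now called pretrained_embeddings_spacy. If you are using either of these, please update your code.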


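This article is a reference for the configuration options of every built-in component in Rasa NLU. If you want to build a custom component, see Custom NLU Components.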

1. Word Vector Sources

1.1 MitieNLP

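MitieNLP overview
Short: MITIE initializer
Outputs:
Requires:
Description: Initializes MITIE structures. Every MITIE component relies on this, so it should be placed at the beginning of every pipeline that uses any MITIE component.
Configuration: The MITIE library needs a language model file, which must be specified in the configuration as follows: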
pipeline:
- name: "MitieNLP"
  # language model to load
  model: "data/total_word_feature_extractor.dat"

For more information about MITIE, refer to the MITIE project documentation.

1.2 SpacyNLP

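SpacyNLP overview
Short: spaCy language initializer
Outputs:
Requires:
Description: Initializes spaCy structures. Every spaCy component relies on this, so it should be placed at the beginning of every pipeline that uses any spaCy component.
Configuration: The language model; by default the configured language is used. If the spaCy model you want to use has a name that differs from the language tag ("en", "de", etc.), specify the model name with the model configuration variable. The name is passed to spacy.load(name):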
pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"

  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `true`.
  case_sensitive: false
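
The effect of case_sensitive can be pictured with a small lookup table. The following is only a sketch of the idea, not spaCy's or Rasa's code; the VECTORS table and the lookup helper are hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy vector table; a real model stores many more entries.
VECTORS: Dict[str, List[float]] = {"hello": [0.1, 0.3], "Hello": [0.2, 0.5]}


def lookup(word: str, case_sensitive: bool = False) -> Optional[List[float]]:
    """Return the vector for a word, lower-casing it first unless case matters."""
    key = word if case_sensitive else word.lower()
    return VECTORS.get(key)


# With case_sensitive=False both spellings resolve to the same vector.
print(lookup("hello") == lookup("Hello"))
# With case_sensitive=True the two spellings are looked up separately.
print(lookup("hello", True) == lookup("Hello", True))
```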

2. Text Featurizers

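Text featurizers fall into two categories: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors in which most entries are missing values, i.e. zeros. Because such vectors would normally take up a lot of memory, they are stored as sparse features, which keep only the non-zero values and their positions in the vector. This saves a lot of memory and makes it possible to train on larger datasets.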
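
The sparse storage idea can be sketched in a few lines of plain Python. This is not how Rasa stores features internally; the helpers to_sparse and to_dense are hypothetical names used only to show that keeping the non-zero values and their positions is enough to rebuild the full vector.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only non-zero entries, keyed by their position in the vector."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild the full vector from the stored positions and values."""
    return [sparse.get(i, 0.0) for i in range(size)]


dense_vector = [0.0, 0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 0.0]
sparse_vector = to_sparse(dense_vector)
print(sparse_vector)  # {2: 3.0, 4: 1.0}
print(to_dense(sparse_vector, len(dense_vector)) == dense_vector)  # True
```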

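By default, every featurizer returns a matrix of size (1 x feature-dimension). All featurizers except the ConveRTFeaturizer can optionally return a sequence instead. If the return_sequence flag is set to True, the featurizer returns a matrix of size (token-length x feature-dimension), so the returned matrix has one entry per token. Otherwise the matrix has a single entry for the whole sentence. If you want to pass custom features to the CRFEntityExtractor, set return_sequence to True. For more details, see Passing Custom Features to CRFEntityExtractor.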
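
The two output shapes can be illustrated with a toy featurizer. Everything here is an assumption made for illustration: the three features, the featurize helper, and the choice to collapse token rows by summing them are not taken from Rasa.

```python
from typing import List


def token_features(token: str) -> List[float]:
    # Toy 3-dimensional features: length, title-case flag, digit flag.
    return [float(len(token)), float(token.istitle()), float(token.isdigit())]


def featurize(text: str, return_sequence: bool = False) -> List[List[float]]:
    rows = [token_features(t) for t in text.split()]
    if return_sequence:
        # One row per token: (token-length x feature-dimension).
        return rows
    # One row for the whole sentence: (1 x feature-dimension).
    # Summing the token rows is an illustrative choice only.
    return [[sum(col) for col in zip(*rows)]]


sentence_matrix = featurize("Book 2 tickets")
token_matrix = featurize("Book 2 tickets", return_sequence=True)
print(len(sentence_matrix), len(sentence_matrix[0]))  # 1 3
print(len(token_matrix), len(token_matrix[0]))  # 3 3
```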

2.1 MitieFeaturizer

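MitieFeaturizer overview
Short: MITIE intent featurizer
Outputs: nothing; used as input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)
Requires: MitieNLP
Type: Dense featurizer
Description: Creates features for intent classification using the MITIE featurizer. Note that it is not used by the MitieIntentClassifier component. Currently, only SklearnIntentClassifier is able to use precomputed features.
Configuration: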
pipeline:
- name: "MitieFeaturizer"

2.2 SpacyFeaturizer

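SpacyFeaturizer overview
Short: spaCy intent featurizer
Outputs: nothing; used as input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)
Requires: SpacyNLP
Type: Dense featurizer
Description: Creates features for intent classification using the spaCy featurizer.
Configuration: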
pipeline:
- name: "SpacyFeaturizer"

2.3 ConveRTFeaturizer

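ConveRTFeaturizer overview
Short: Creates a vector representation of the user message and the response (if specified) using the ConveRT model
Outputs: nothing; used as input to intent classifiers and response selectors that need intent features and response features respectively (e.g. EmbeddingIntentClassifier and ResponseSelector)
Requires:
Type: Dense featurizer
Description: Creates features for intent classification and response selection. It uses the default signature to compute vector representations of the input text. Notes: (1) Because the ConveRT model is trained only on an English corpus, this featurizer should only be used if your training data is in English. (2) tensorflow_text and tensorflow_hub must be installed before use; both can be installed with pip install rasa[convert]. (3) If return_sequence is set to True, Rasa raises an error stating that this option is currently not supported. Do not combine this featurizer with any other featurizer that has return_sequence set to True, otherwise training will fail. It can, however, be combined with any other featurizer as long as return_sequence is set to False.
Configuration: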
pipeline:
- name: "ConveRTFeaturizer"

2.4 RegexFeaturizer

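RegexFeaturizer overview
Short: Creates regex features to support intent and entity classification
Outputs: text_features and tokens.pattern
Requires:
Type: Sparse featurizer
Description: Creates features for entity extraction and intent classification. During training, the regex intent featurizer builds a list of regular expressions defined in the training data format. For each regex, a feature is set that marks whether the expression was found in the input. This feature is later fed into the intent classifier / entity extractor to simplify classification (assuming the classifier learned during training that this feature set indicates a certain intent). Regex features for entity extraction are currently supported only by the CRFEntityExtractor component!

Note:
A tokenizer must come before this featurizer in the pipeline!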
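
The "one feature per regex" idea described above can be sketched as follows. This is a simplified illustration, not the component's implementation; the PATTERNS table and its two entries are hypothetical stand-ins for regexes that would normally come from the training data.

```python
import re
from typing import Dict, List

# Hypothetical patterns; in Rasa they would be defined in the training data.
PATTERNS: Dict[str, str] = {
    "zipcode": r"\b[0-9]{5}\b",
    "greet": r"\bhey\b",
}


def regex_features(text: str) -> List[float]:
    """One binary feature per pattern: 1.0 if it matches anywhere in the text."""
    return [1.0 if re.search(p, text) else 0.0 for p in PATTERNS.values()]


print(regex_features("hey, ship it to 10115"))  # [1.0, 1.0]
print(regex_features("hello there"))  # [0.0, 0.0]
```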

2.5 CountVectorsFeaturizer

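CountVectorsFeaturizer overview
Short: Creates a bag-of-words representation of user messages and labels (intents and responses)
Outputs: nothing; used as input to intent classifiers that take bag-of-words intent features (e.g. EmbeddingIntentClassifier)
Requires:
Type: Sparse featurizer
Description: Creates features for intent classification and response selection. It builds a bag-of-words representation of the user message and label features using sklearn's CountVectorizer. All tokens that consist only of digits (e.g. 123 and 99, but not a123d) are assigned to the same feature.
Configuration: The analyzer parameter configures the featurizer to use word or character n-grams. By default analyzer is set to word, so word token counts are used as features. To use character n-grams, set analyzer to char or char_wb. char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space. This option can be used to create Subword Semantic Hashing. For character n-grams, do not forget to increase the min_ngram and max_ngram parameters, otherwise the vocabulary will contain only single letters.

Handling out-of-vocabulary (OOV) words: because training is performed on limited vocabulary data, there is no guarantee that the algorithm will not encounter an unknown word (a word not seen during training) at prediction time. To teach the algorithm how to treat unknown words, some words in the training data can be replaced by the generic word OOV_token. In this case, during prediction all unknown words are treated as the generic word OOV_token.

For example, you can create a separate intent outofscope in the training data, containing messages with varying numbers of OOV_token and possibly some additional general words. The algorithm will then likely classify a message containing unknown words as the intent outofscope.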

pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  use_shared_vocab: False
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: '(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: None  # {'ascii', 'unicode', None}
  # list of stop words
  stop_words: None  # string {'english'}, list, or None (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: None  # int or None
  # if convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: None  # string or None
  OOV_words: []  # list of strings

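Note:
If the words of the model language cannot be split by whitespace, a language-specific tokenizer is required in the pipeline before this component (e.g. JiebaTokenizer for Chinese).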
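
The digit handling and the OOV replacement described for this featurizer can be sketched without sklearn. This is a plain-Python approximation under stated assumptions: the helper names, the "oov" token value, and the "__NUMBER__" placeholder are chosen for illustration, and splitting on whitespace stands in for a real tokenizer.

```python
import re
from collections import Counter
from typing import Dict, List

OOV_TOKEN = "oov"  # plays the role of OOV_token; the value is arbitrary


def tokenize(text: str) -> List[str]:
    tokens = text.lower().split()
    # Tokens made only of digits share a single placeholder feature.
    return ["__NUMBER__" if re.fullmatch(r"[0-9]+", t) else t for t in tokens]


def build_vocab(examples: List[str]) -> Dict[str, int]:
    words = sorted({t for ex in examples for t in tokenize(ex)} | {OOV_TOKEN})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(text: str, vocab: Dict[str, int]) -> List[int]:
    # Words unseen during training are counted under the OOV token.
    counts = Counter(t if t in vocab else OOV_TOKEN for t in tokenize(text))
    return [counts.get(w, 0) for w in vocab]


vocab = build_vocab(["book 2 tickets", "book 99 seats"])
print(list(vocab))  # ['__NUMBER__', 'book', 'oov', 'seats', 'tickets']
print(bag_of_words("book 123 flights", vocab))  # [1, 1, 1, 0, 0]
```

Here 123 lands on the same feature as 2 and 99, and the unseen word flights is counted as the OOV token.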

3. Intent Classifiers

3.1 MitieIntentClassifier

MitieIntentClassifier overview
Short: MITIE intent classifier (using a text categorizer)
Outputs: intent
Requires: a tokenizer and a featurizer
Output-Example: {"intent": {"name": "greet", "confidence": 0.98343}}
Description: This classifier uses MITIE to perform intent classification. The underlying classifier is a multi-class linear SVM with a sparse linear kernel (see the MITIE trainer code).
Configuration:
pipeline:
- name: "MitieIntentClassifier"

3.2 SklearnIntentClassifier

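SklearnIntentClassifier overview
Short: sklearn intent classifier
Outputs: intent and intent_ranking
Requires: a featurizer
Output-Example: {"intent": {"name": "greet", "confidence": 0.78343},"intent_ranking": [{"confidence": 0.1485910906220309,"name": "goodbye"},{"confidence": 0.08161531595656784,"name":"restaurant_search"}]}
Description: The sklearn intent classifier trains a linear SVM that is optimized using a grid search. It also provides a ranking of the labels that did not "win". This classifier must be preceded by a featurizer in the pipeline; the featurizer creates the features used for classification.
Configuration: During training of the SVM, a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters that will be tried: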
pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]

3.3 EmbeddingIntentClassifier

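EmbeddingIntentClassifier overview
Short: Embedding intent classifier
Outputs: intent and intent_ranking
Requires: a featurizer
Description: The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing the similarity between them. The algorithm is based on StarSpace. In this implementation, however, the loss function is slightly different, and additional hidden layers and dropout are added. The algorithm also provides similarity rankings of the labels that did not "win". The embedding intent classifier must be preceded by a featurizer in the pipeline; the featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer, which can optionally be preceded by SpacyNLP and SpacyTokenizer.
Configuration: The algorithm has a large number of hyperparameters, too many to describe one by one here.

These parameters can be specified in the configuration. The default values are defined in EmbeddingIntentClassifier.defaults: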

defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [],
    # Whether to share the hidden layer weights between input words and labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # default dense dimension used if no dense features are present
    "dense_dim": {"text": 512, "label": 20},
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for incorrect labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
}
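
The batch_size comment in the defaults says the batch size is increased linearly for each epoch, from the first value to the second. One plausible reading of that schedule is sketched below; the linear_batch_size helper and its rounding are assumptions, not the library's code.

```python
def linear_batch_size(epoch: int, epochs: int, start: int, end: int) -> int:
    """Interpolate linearly from `start` (first epoch) to `end` (last epoch)."""
    if epochs <= 1:
        return start
    return int(start + epoch * (end - start) / (epochs - 1))


# Using the defaults "batch_size": [64, 256] and "epochs": 300.
print(linear_batch_size(0, 300, 64, 256))  # 64
print(linear_batch_size(299, 300, 64, 256))  # 256
```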

Output-Example:

{
    "intent": {"name": "greet", "confidence": 0.8343},
    "intent_ranking": [
        {
            "confidence": 0.385910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.28161531595656784,
            "name": "restaurant_search"
        }
    ]
}

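Note:
If, at prediction time, a message contains only words that were not seen during training, and no out-of-vocabulary preprocessor was used, the empty intent None is predicted with confidence 0.0.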
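
How a ranking falls out of embedding messages and labels into the same space can be shown with hand-written vectors. This sketch skips training entirely: the two-dimensional LABELS vectors and the message vector are made up, and cosine similarity is only one of the similarity types the configuration mentions.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_labels(
    message_vec: List[float], label_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Score every label against the message and sort from most to least similar."""
    scores = [(name, cosine(message_vec, vec)) for name, vec in label_vecs.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Made-up label embeddings; a trained model would learn these.
LABELS = {
    "greet": [1.0, 0.0],
    "goodbye": [0.0, 1.0],
    "restaurant_search": [-1.0, 0.0],
}
ranking = rank_labels([0.9, 0.1], LABELS)
print([name for name, _ in ranking])  # ['greet', 'goodbye', 'restaurant_search']
```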

3.4 KeywordIntentClassifier

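KeywordIntentClassifier overview
Short: Simple keyword matching intent classifier, intended for small, short-term projects
Outputs: intent
Requires:
Output-Example: {"intent": {"name": "greet", "confidence": 1.0}}
Description: This classifier works by searching a message for keywords. Matching is case sensitive by default and searches only for exact matches of the keyword string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example. Note: this classifier is intended only for small projects or for getting started. If you have only a little NLU training data, you can try one of the pipelines described in Choosing a Pipeline.
Configuration: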
pipeline:
- name: "KeywordIntentClassifier"
  case_sensitive: True
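
The matching rule described above, where each whole training example acts as a keyword, can be sketched as follows. The TRAINING_EXAMPLES table and the classify helper are hypothetical, and the use of word boundaries around the keyword is an assumption of this sketch.

```python
import re
from typing import Dict, List, Optional

# Hypothetical training data: each whole example acts as one keyword.
TRAINING_EXAMPLES: Dict[str, List[str]] = {
    "greet": ["hello", "good morning"],
    "goodbye": ["bye", "see you"],
}


def classify(message: str, case_sensitive: bool = True) -> Optional[str]:
    """Return the first intent whose example occurs verbatim in the message."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for intent, examples in TRAINING_EXAMPLES.items():
        for keyword in examples:
            if re.search(r"\b" + re.escape(keyword) + r"\b", message, flags):
                return intent
    return None


print(classify("good morning everyone"))  # greet
print(classify("Hello"))  # None
print(classify("Hello", case_sensitive=False))  # greet
```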

4. Selectors

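Response Selector overview
Short: Response Selector
Outputs: a dictionary with direct_response_intent as key and a value containing response and ranking
Requires: a featurizer
Description: The Response Selector component can be used to build a response retrieval model that directly predicts a bot response from a set of candidate responses. The prediction of this model is used by Retrieval Actions. It embeds user inputs and response labels into the same space and follows the same neural network architecture and optimization as the EmbeddingIntentClassifier. The response selector must be preceded by a featurizer in the pipeline; the featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer, which can optionally be preceded by SpacyNLP.
Configuration: The algorithm includes all the hyperparameters that EmbeddingIntentClassifier uses. In addition, the component can be configured to train a response selector for a particular retrieval intent. The default values are defined in ResponseSelector.defaults: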
defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [256, 128],
    # Whether to share the hidden layer weights between input words and intent labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # default dense dimension used if no dense features are present
    "dense_dim": {"text": 512, "label": 20},
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct intent labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect intent labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for
    # incorrect intent labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different intent labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
    # selector config
    # name of the intent for which this response selector is to be trained
    "retrieval_intent": None,
}

Here retrieval_intent sets the name of the intent for which this response selector model is trained. The default is None.

Output-Example:

{
    "text": "What is the recommend python version to install?",
    "entities": [],
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "faq"},
        {"confidence": 0.1416153159565678, "name": "greet"}
    ],
    "response_selector": {
      "faq": {
        "response": {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
        "ranking": [
            {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
            {"confidence": 0.2134543431, "name": "You can ask me about how to get started"}
        ]
      }
    }
}
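
Application code would typically read the selected response out of this structure. The sketch below works on an abbreviated copy of the output above; the best_response helper and its threshold parameter are hypothetical and not part of Rasa.

```python
from typing import Any, Dict, Optional

# Abbreviated copy of the parsed output shown above.
parsed: Dict[str, Any] = {
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "response_selector": {
        "faq": {
            "response": {
                "confidence": 0.7356462617,
                "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6",
            },
        }
    },
}


def best_response(
    result: Dict[str, Any], retrieval_intent: str, threshold: float = 0.5
) -> Optional[str]:
    """Return the selected response text if it is confident enough."""
    selector = result.get("response_selector", {}).get(retrieval_intent)
    if not selector:
        return None
    response = selector["response"]
    return response["name"] if response["confidence"] >= threshold else None


print(best_response(parsed, "faq"))
print(best_response(parsed, "chitchat"))  # None
```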

5. Tokenizers

5.1 WhitespaceTokenizer

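WhitespaceTokenizer overview
Short: Tokenizer using whitespaces as a separator
Outputs:
Requires:
Description: Creates a token for every whitespace-separated character sequence. Can be used to define tokens for the MITIE entity extractor.
Configuration: If you want to split intents into multiple labels, e.g. to predict multiple intents or to model a hierarchical intent structure, use the intent_split_symbol flag. Whether tokenization is case sensitive can be set with case_sensitive.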
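
Both behaviours can be sketched in a few lines. This is an illustration rather than the component's code: the helper names are hypothetical, and "+" is only an assumed value for intent_split_symbol.

```python
import re
from typing import List, Tuple


def whitespace_tokenize(text: str) -> List[Tuple[str, int]]:
    """Return (token, start offset) for every whitespace-separated sequence."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def split_intent(intent: str, intent_split_symbol: str = "+") -> List[str]:
    """Split a combined intent label into its parts."""
    return intent.split(intent_split_symbol)


print(whitespace_tokenize("I moved to NYC"))
# [('I', 0), ('moved', 2), ('to', 8), ('NYC', 11)]
print(split_intent("greet+ask_weather"))  # ['greet', 'ask_weather']
```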

5.2 JiebaTokenizer

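JiebaTokenizer overview
Short: Tokenizer using Jieba
Outputs:
Requires:
Description: Tokenizer for Chinese. For other languages Jieba works like the WhitespaceTokenizer. Can be used to define tokens for the MITIE entity extractor.
Configuration: User-defined dictionary files can be loaded automatically by specifying the directory path via dictionary_path. For example: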
pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"

5.3 MitieTokenizer

MitieTokenizer overview
Short: Tokenizer using MITIE
Outputs:
Requires: MitieNLP
Description: Creates tokens using the MITIE tokenizer. Can be used to define tokens for the MITIE entity extractor.
Configuration:
pipeline:
- name: "MitieTokenizer"

5.4 SpacyTokenizer

SpacyTokenizer overview
Short: Tokenizer using spaCy
Outputs:
Requires: SpacyNLP
Description: Creates tokens using the spaCy tokenizer. Can be used to define tokens for the MITIE entity extractor.

6. Entity Extractors

6.1 MitieEntityExtractor

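MitieEntityExtractor overview
Short: MITIE entity extraction (using a MITIE NER trainer)
Outputs: entities
Requires: MitieNLP
Description: Uses MITIE entity extraction to find entities in a message. The underlying classifier is a multi-class linear SVM with a sparse linear kernel and custom features. The MITIE component does not provide entity confidence values.
Configuration: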
pipeline:
- name: "MitieEntityExtractor"

Output-Example:

{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "confidence": null,
                  "entity": "city",
                  "extractor": "MitieEntityExtractor"}]
}

6.2 SpacyEntityExtractor

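SpacyEntityExtractor overview
Short: spaCy entity extraction
Outputs: entities
Requires: SpacyNLP
Description: This component uses spaCy to predict the entities of a message. spaCy uses a statistical BILOU transition model. As of now, this component can only use spaCy's built-in entity extraction models and cannot be retrained. This extractor does not provide any confidence scores.
Configuration: Configure which dimensions, i.e. entity types, the spaCy component should extract. A full list of available dimensions can be found in the spaCy documentation. Leaving the dimensions option unspecified extracts all available dimensions. For example: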
pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]

Output-Example:

{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": null,
                  "extractor": "SpacyEntityExtractor"}]
}

6.3 EntitySynonymMapper

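EntitySynonymMapper overview
Short: Maps synonymous entity values to the same value
Outputs: modifies existing entities that previous entity extraction components found
Requires:
Description: If the training data contains defined synonyms (by using the value attribute on the entity examples), this component makes sure that detected entity values are mapped to the same value. For example, if your training data contains the following examples: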
[{
  "text": "I moved to New York City",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 11,
                "end": 24,
                "entity": "city"
               }]
},
{
  "text": "I got a new flat in NYC.",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 20,
                "end": 23,
                "entity": "city"
               }]
}]

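This component maps the entities New York City and NYC to nyc. Entity extraction returns nyc even though the message contains NYC. When this component changes an existing entity, it appends itself to the processor list of that entity.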
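
The mapping step can be sketched as a dictionary lookup over already extracted entities. This is a simplified illustration: the SYNONYMS table, the lower-cased lookup, and the "processors" key name are assumptions of this sketch rather than confirmed details of the component.

```python
from typing import Any, Dict, List

# Synonym table as it could be derived from the training examples above.
SYNONYMS: Dict[str, str] = {"new york city": "nyc", "nyc": "nyc"}


def map_synonyms(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace entity values by their synonym and record who changed them."""
    for entity in entities:
        original = str(entity["value"])
        replacement = SYNONYMS.get(original.lower())
        if replacement is not None and replacement != original:
            entity["value"] = replacement
            # Assumed key name for the processor list of the entity.
            entity.setdefault("processors", []).append("EntitySynonymMapper")
    return entities


extracted = [{"value": "NYC", "start": 20, "end": 23, "entity": "city"}]
print(map_synonyms(extracted))
```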

6.4 CRFEntityExtractor

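CRFEntityExtractor overview
Short: Conditional random field entity extraction
Outputs: entities
Requires: a tokenizer
Description: This component uses conditional random fields (CRF) for named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalisation, POS tagging, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags; the most likely set of tags is then calculated and returned. If POS features are used (pos or pos2), spaCy must be installed. If you want to use additional features, such as pre-trained word embeddings from a dense featurizer, use "text_dense_features". Make sure to set "return_sequence" to True in the corresponding featurizer.
Configuration: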
pipeline:
- name: "CRFEntityExtractor"
  # The features are a ``[before, word, after]`` array with
  # before, word, after holding keys about which
  # features to use for each word, for example, ``"title"``
  # in array before will have the feature
  # "is the preceding word in title case?".
  # Available features are:
  # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
  # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
  # ``bias``, ``upper``, ``digit``, ``pattern``, and ``text_dense_features``
  features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]

  # The flag determines whether to use BILOU tagging or not. BILOU
  # tagging is more rigorous however
  # requires more examples per entity. Rule of thumb: use only
  # if more than 100 examples per entity.
  BILOU_flag: true

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  max_iterations: 50

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L1 regularization coefficient.
  L1_c: 0.1

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L2 regularization coefficient.
  L2_c: 0.1

Output-Example:

{
    "entities": [{"value":"New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": 0.874,
                  "extractor": "CRFEntityExtractor"}]
}
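
The [before, word, after] feature layout from the configuration can be made concrete with a small feature builder. This is a sketch only: it covers a handful of the listed features, omits pos and pos2 because they need spaCy, and the key format and the BOS/EOS markers are assumptions of this illustration.

```python
from typing import Any, Callable, Dict, List

# A subset of the feature names listed in the configuration comments.
AVAILABLE: Dict[str, Callable[[str], Any]] = {
    "low": lambda t: t.lower(),
    "title": lambda t: t.istitle(),
    "upper": lambda t: t.isupper(),
    "digit": lambda t: t.isdigit(),
    "suffix3": lambda t: t[-3:],
    "bias": lambda t: "bias",
}


def word_features(token: str, names: List[str], prefix: str) -> Dict[str, Any]:
    return {f"{prefix}:{name}": AVAILABLE[name](token) for name in names}


def sentence_features(tokens: List[str], config: List[List[str]]) -> List[Dict[str, Any]]:
    """Build one feature dict per token from the [before, word, after] config."""
    before, word, after = config
    rows = []
    for i, token in enumerate(tokens):
        feats = word_features(token, word, "0")
        if i > 0:
            feats.update(word_features(tokens[i - 1], before, "-1"))
        else:
            feats["BOS"] = True  # no preceding word
        if i < len(tokens) - 1:
            feats.update(word_features(tokens[i + 1], after, "+1"))
        else:
            feats["EOS"] = True  # no following word
        rows.append(feats)
    return rows


config = [["low", "title"], ["bias", "suffix3"], ["upper"]]
for row in sentence_features(["to", "New", "York"], config):
    print(row)
```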

6.5 DucklingHTTPExtractor

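DucklingHTTPExtractor overview
Short: Duckling lets you extract common entities such as dates, amounts of money and distances, in a number of languages.
Outputs: entities
Requires:
Description: To use this component you need to run a Duckling server. The easiest option is to use a docker container: docker run -p 8000:8000 rasa/duckling. Alternatively, you can install Duckling directly on your machine and start the server. Duckling recognizes dates, numbers, distances and other structured entities and normalizes them. Note that Duckling tries to extract as many entity types as possible without providing a ranking. For example, take the text I will be there in 10 minutes. If both the number and time dimensions are specified for the Duckling component, it extracts two entities: 10 as a number and 10 minutes as a time. In such a situation, your application has to decide which entity type is the correct one. The extractor always returns a confidence of 1.0, because it is a rule-based system.
Configuration: Configure which dimensions, i.e. entity types, the Duckling component should extract. A full list of available dimensions can be found in the Duckling documentation. Leaving the dimensions option unspecified extracts all available dimensions. For example: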
pipeline:
- name: "DucklingHTTPExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3

Output-Example:

{
    "entities": [{"end": 53,
                  "entity": "time",
                  "start": 48,
                  "value": "2017-04-10T00:00:00.000+02:00",
                  "confidence": 1.0,
                  "extractor": "DucklingHTTPExtractor"}]
}
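
Since Duckling returns overlapping entities without a ranking, the application has to pick one. One possible policy, shown here purely as an example, is to keep the longest span among overlapping entities. The keep_longest helper, the "text" field, and the character offsets for the sample sentence are assumptions of this sketch.

```python
from typing import Any, Dict, List


def overlaps(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    return a["start"] < b["end"] and b["start"] < a["end"]


def keep_longest(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Among overlapping entities keep the one covering the most characters."""
    kept: List[Dict[str, Any]] = []
    for ent in sorted(entities, key=lambda e: e["end"] - e["start"], reverse=True):
        if not any(overlaps(ent, k) for k in kept):
            kept.append(ent)
    return sorted(kept, key=lambda e: e["start"])


# Entities for "I will be there in 10 minutes" with offsets computed by hand.
found = [
    {"entity": "number", "start": 19, "end": 21, "text": "10"},
    {"entity": "time", "start": 19, "end": 29, "text": "10 minutes"},
]
print(keep_longest(found))
```

Other policies, such as preferring a dimension that a slot expects, are equally valid; the choice depends on the application.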