Dialogue Systems Deep Learning lab-2: Configuration file

Latest recommended article published on 2023-11-29 21:57:18

huangq_qiao

Category: DeepPavlov framework study

Original link: http://docs.deeppavlov.ai/en/master/intro/configuration.html

Configuration file

Translated from: Dialogue Systems Deep Learning lab-2, Configuration file

An NLP pipeline config is a JSON file that contains one required element, chainer:

{
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      ...
    ],
    "out": ["y_predicted"]
  }
}

Chainer is a core concept of the DeepPavlov library: it builds a pipeline from heterogeneous components (rule-based, ML, DL) and lets you train the pipeline or run inference on it as a whole. Each component in the pipeline specifies its inputs and outputs as arrays of names, for example "in": ["tokens", "features"] and "out": ["token_embeddings", "features_embeddings"], and you can chain the outputs of one component to the inputs of other components:

{
  "class_name": "deeppavlov.models.preprocessors.str_lower:StrLower",
  "in": ["x"],
  "out": ["x_lower"]
},
{
  "class_name": "nltk_tokenizer",
  "in": ["x_lower"],
  "out": ["x_tokens"]
},
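To make the chaining idea concrete, here is a minimal sketch in plain Python. It is not DeepPavlov's implementation: run_pipe, str_lower and split_tokens are hypothetical names, and the two functions only stand in for StrLower and nltk_tokenizer. Each step reads its "in" names from a shared dictionary and writes its results back under its "out" names.

```python
from typing import Dict, List


def run_pipe(pipe: List[dict], memory: Dict[str, list]) -> Dict[str, list]:
    """Run each step in order, reading and writing named slots in `memory`."""
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        results = step["component"](*args)
        # A component with a single output returns one batch, not a tuple.
        if len(step["out"]) == 1:
            results = (results,)
        memory.update(zip(step["out"], results))
    return memory


# Hypothetical stand-ins for the two components in the config above.
def str_lower(batch):
    return [text.lower() for text in batch]


def split_tokens(batch):
    return [text.split() for text in batch]


pipe = [
    {"component": str_lower, "in": ["x"], "out": ["x_lower"]},
    {"component": split_tokens, "in": ["x_lower"], "out": ["x_tokens"]},
]
memory = run_pipe(pipe, {"x": ["Hello World", "Good Morning"]})
print(memory["x_tokens"])  # [['hello', 'world'], ['good', 'morning']]
```

The second step never sees "x" directly; it only sees whatever the first step stored under "x_lower", which is what lets components be rearranged by editing names in the config.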

Each component in the pipeline must implement the __call__() method and have a class_name parameter, which is either its registered codename or the full name of any Python class in the form "module_name:ClassName". It can also have any other parameters that repeat the arguments of its __init__() method. When the class instance is initialized, the config values override the default values of the __init__() arguments.
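The following sketch shows one way a registered codename and config values could turn into a class instance. It is a toy under stated assumptions: TOY_REGISTRY, toy_register, ToyTokenizer and build_component are hypothetical names, not the library's API, and the set of wiring keys that get dropped before construction is an assumption.

```python
TOY_REGISTRY = {}


def toy_register(name):
    """Store a class under a codename so a config can refer to it by string."""
    def decorator(cls):
        TOY_REGISTRY[name] = cls
        return cls
    return decorator


@toy_register("toy_tokenizer")
class ToyTokenizer:
    def __init__(self, lowercase=False, separator=" "):
        self.lowercase = lowercase
        self.separator = separator

    def __call__(self, batch):
        if self.lowercase:
            batch = [text.lower() for text in batch]
        return [text.split(self.separator) for text in batch]


def build_component(config: dict):
    params = dict(config)
    cls = TOY_REGISTRY[params.pop("class_name")]
    # Assumption: wiring keys describe the pipeline, not constructor arguments.
    for key in ("in", "out", "id"):
        params.pop(key, None)
    # Remaining keys override the __init__() defaults.
    return cls(**params)


tokenizer = build_component(
    {"class_name": "toy_tokenizer", "lowercase": True, "in": ["x"], "out": ["x_tokens"]}
)
print(tokenizer(["Hello World"]))  # [['hello', 'world']]
```

Here lowercase comes from the config while separator keeps its default, which mirrors the override rule described above.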

You can reuse components in the pipeline to process different parts of the data with the help of the id and ref parameters:

{
  "class_name": "nltk_tokenizer",
  "id": "tokenizer",
  "in": ["x_lower"],
  "out": ["x_tokens"]
},
{
  "ref": "tokenizer",
  "in": ["y"],
  "out": ["y_tokens"]
},
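A rough sketch of how id and ref could be resolved while a pipe is built. The names build_pipe, WhitespaceTokenizer and the factory argument are hypothetical, and the real library may resolve references differently. The point is that a "ref" entry reuses the object created for the matching "id" instead of constructing a second one.

```python
def build_pipe(pipe_config, factory):
    """Create components in order; a "ref" entry reuses an earlier "id"."""
    by_id = {}
    steps = []
    for cfg in pipe_config:
        if "ref" in cfg:
            component = by_id[cfg["ref"]]  # reuse, do not rebuild
        else:
            component = factory(cfg)
            if "id" in cfg:
                by_id[cfg["id"]] = component
        steps.append({"component": component, "in": cfg["in"], "out": cfg["out"]})
    return steps


class WhitespaceTokenizer:
    """Hypothetical stand-in for nltk_tokenizer."""

    def __call__(self, batch):
        return [text.split() for text in batch]


steps = build_pipe(
    [
        {"class_name": "nltk_tokenizer", "id": "tokenizer", "in": ["x_lower"], "out": ["x_tokens"]},
        {"ref": "tokenizer", "in": ["y"], "out": ["y_tokens"]},
    ],
    factory=lambda cfg: WhitespaceTokenizer(),
)
print(steps[0]["component"] is steps[1]["component"])  # True
```

Because both steps hold the same object, anything the tokenizer has loaded or learned is shared between the "x" branch and the "y" branch.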

Variables

As of version 0.1.0, every string value in a configuration file is interpreted as a format string whose fields are evaluated from the metadata.variables element:

{
  "chainer": {
    "in": ["x"],
    "pipe": [
      {
        "class_name": "my_component",
        "in": ["x"],
        "out": ["x"],
        "load_path": "{MY_PATH}/file.obj"
      },
      {
        "in": ["x"],
        "out": ["y_predicted"],
        "config_path": "{CONFIGS_PATH}/classifiers/intents_snips.json"
      }
    ],
    "out": ["y_predicted"]
  },
  "metadata": {
    "variables": {
      "MY_PATH": "/some/path",
      "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs"
    }
  }
}

The variable DEEPPAVLOV_PATH is always preset to the path of the deeppavlov Python module.

Because I installed TensorFlow and Python with Anaconda, on my machine the deeppavlov module is located at D:\Anaconda3\Lib\site-packages\deeppavlov (Python 3.6.2, TensorFlow 1.14.0).
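The format-string behaviour can be imitated with str.format. This is only a sketch: resolve_variables and expand are hypothetical helpers, the preset path is made up, and it assumes each variable refers only to preset names or to variables declared before it.

```python
def resolve_variables(variables: dict, preset: dict) -> dict:
    """Expand variables in declaration order, starting from preset names."""
    resolved = dict(preset)
    for name, value in variables.items():
        resolved[name] = value.format(**resolved)
    return resolved


def expand(node, resolved):
    """Recursively format every string value inside a config fragment."""
    if isinstance(node, str):
        return node.format(**resolved)
    if isinstance(node, list):
        return [expand(item, resolved) for item in node]
    if isinstance(node, dict):
        return {key: expand(value, resolved) for key, value in node.items()}
    return node


preset = {"DEEPPAVLOV_PATH": "/site-packages/deeppavlov"}  # hypothetical location
variables = {"MY_PATH": "/some/path", "CONFIGS_PATH": "{DEEPPAVLOV_PATH}/configs"}
resolved = resolve_variables(variables, preset)

component = expand(
    {
        "in": ["x"],
        "load_path": "{MY_PATH}/file.obj",
        "config_path": "{CONFIGS_PATH}/classifiers/intents_snips.json",
    },
    resolved,
)
print(component["load_path"])  # /some/path/file.obj
```

Strings without fields, such as "x", pass through unchanged, so the same expansion can be applied to the whole config.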

You can override configuration variables using environment variables with the prefix DP_. The environment variable DP_VARIABLE_NAME overrides VARIABLE_NAME inside a configuration file.

For example, setting DP_ROOT_PATH=/my_path/to/large_hard_drive makes most configs use this path for downloading and reading embeddings, models, and datasets.
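A small sketch of the DP_ override rule. apply_env_overrides is a hypothetical helper, and a plain dictionary stands in for the process environment so the example does not depend on the machine it runs on.

```python
import os


def apply_env_overrides(variables: dict, environ=None, prefix="DP_") -> dict:
    """Replace a variable's value when PREFIX + NAME is set in the environment."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(prefix + name, value) for name, value in variables.items()}


variables = {"ROOT_PATH": "~/.deeppavlov", "DOWNLOADS_PATH": "{ROOT_PATH}/downloads"}
fake_env = {"DP_ROOT_PATH": "/my_path/to/large_hard_drive"}  # stands in for os.environ

overridden = apply_env_overrides(variables, environ=fake_env)
downloads = overridden["DOWNLOADS_PATH"].format(ROOT_PATH=overridden["ROOT_PATH"])
print(downloads)  # /my_path/to/large_hard_drive/downloads
```

Because DOWNLOADS_PATH is defined in terms of ROOT_PATH, overriding the one root variable moves every path derived from it.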

Training

There are two abstract classes for trainable components: Estimator and NNModel.

An Estimator is fit once on any data, with no batching or early stopping, so fitting can safely be done at pipeline initialization. Every Estimator has to implement the fit() method. One example is Vocab.

An NNModel requires more complex training. It can only be trained in a supervised mode (as opposed to an Estimator, which can be trained in both supervised and unsupervised settings). Training takes multiple epochs with periodic validation and logging. Every NNModel has to implement the train_on_batch() method.

Training is triggered by the train_model() function.
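The difference between the two kinds of trainable component can be sketched with two toy base classes. None of the names below (ToyEstimator, ToyNNModel, ToyVocab, ToyLinearModel) come from the library; they only mirror the fit() and train_on_batch() contracts described above.

```python
from abc import ABC, abstractmethod


class ToyEstimator(ABC):
    @abstractmethod
    def fit(self, *args):
        """Fit once on the whole data, no batching."""


class ToyNNModel(ABC):
    @abstractmethod
    def train_on_batch(self, x_batch, y_batch):
        """Perform one supervised update and return the batch loss."""


class ToyVocab(ToyEstimator):
    def __init__(self):
        self.index = {}

    def fit(self, labels):
        for label in labels:
            self.index.setdefault(label, len(self.index))

    def __call__(self, labels):
        return [self.index[label] for label in labels]


class ToyLinearModel(ToyNNModel):
    """Single weight trained by gradient descent on mean squared error."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.learning_rate = learning_rate

    def train_on_batch(self, x_batch, y_batch):
        pairs = list(zip(x_batch, y_batch))
        grad = sum(2 * (self.weight * x - y) * x for x, y in pairs) / len(pairs)
        self.weight -= self.learning_rate * grad
        return sum((self.weight * x - y) ** 2 for x, y in pairs) / len(pairs)


vocab = ToyVocab()
vocab.fit(["greet", "bye", "greet"])  # one pass, done at initialization time

model = ToyLinearModel()
for epoch in range(50):  # many passes with repeated updates
    loss = model.train_on_batch([1.0, 2.0], [2.0, 4.0])
```

The vocabulary is complete after a single fit() call, while the model only approaches the target weight of 2.0 after repeated train_on_batch() calls, which is why it needs epochs, validation and logging.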

Train config

Estimators that are trained should also have a fit_on parameter, which contains a list of input parameter names. An NNModel should have an in_y parameter, which contains a list of ground truth answer names. (In machine learning, "ground truth" can be read as the true value or the reference answer.) For example:

[
  {
    "id": "classes_vocab",
    "class_name": "default_vocab",
    "fit_on": ["y"],
    "level": "token",
    "save_path": "vocabs/classes.dict",
    "load_path": "vocabs/classes.dict"
  },
  {
    "in": ["x"],
    "in_y": ["y"],
    "out": ["y_predicted"],
    "class_name": "intent_model",
    "save_path": "classifiers/intent_cnn",
    "load_path": "classifiers/intent_cnn",
    "classes_vocab": {
      "ref": "classes_vocab"
    }
  }
]

In addition to chainer, the config for training the pipeline should have three more elements: dataset_reader, dataset_iterator and train:

{
  "dataset_reader": {
    "class_name": ...,
    ...
  },
  "dataset_iterator": {
    "class_name": ...,
    ...
  },
  "chainer": {
    ...
  },
  "train": {
    ...
  }
}

A simplified version of the training pipeline contains two elements: dataset and train. The dataset element can currently be used to train on classification data in csv and json formats. Complete examples of the simplified training pipeline are in the intents_sample_csv.json and intents_sample_json.json config files.

The listing below is intents_sample_json.json, as found in the project's GitHub repository. It consists of four main blocks: dataset, chainer, train, and metadata (which holds variables).

{
  "dataset": {
    "type": "classification",
    "format": "json",
    "orient": "records",
    "lines": true,
    "data_path": "{DOWNLOADS_PATH}/sample",
    "train": "sample.json",
    "x": "text",
    "y": "intents",
    "url": "http://files.deeppavlov.ai/datasets/snips_intents/train.json",
    "seed": 42,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "{MODEL_PATH}/classes.dict",
        "load_path": "{MODEL_PATH}/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "fasttext",
        "load_path": "{DOWNLOADS_PATH}/embeddings/dstc2_fastText_model.bin",
        "pad_zero": true
      },
      {
        "in": "y_ids",
        "out": "y_onehot",
        "class_name": "one_hotter",
        "depth": "#classes_vocab.len",
        "single_vector": true
      },
      {
        "in": [
          "x_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": true,
        "class_name": "keras_classification_model",
        "save_path": "{MODEL_PATH}/model",
        "load_path": "{MODEL_PATH}/model",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "binary_crossentropy",
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": true
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 100,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "show_examples": false,
    "evaluation_targets": [
      "train",
      "valid"
    ],
    "class_name": "nn_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models",
      "MODEL_PATH": "{MODELS_PATH}/classifiers/intents_snips_v9"
    },
    "requirements": [
      "{DEEPPAVLOV_PATH}/requirements/tf.txt",
      "{DEEPPAVLOV_PATH}/requirements/fasttext.txt"
    ],
    "labels": {
      "telegram_utils": "IntentModel",
      "server_utils": "KerasIntentModel"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/datasets/snips_intents/train.json",
        "subdir": "{DOWNLOADS_PATH}/sample"
      },
      {
        "url": "http://files.deeppavlov.ai/deeppavlov_data/embeddings/dstc2_fastText_model.bin",
        "subdir": "{DOWNLOADS_PATH}/embeddings"
      },
      {
        "url": "http://files.deeppavlov.ai/deeppavlov_data/classifiers/intents_snips_v9.tar.gz",
        "subdir": "{MODELS_PATH}/classifiers"
      }
    ]
  }
}
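The dataset block above asks for the "train" field to be split into "train" and "valid" in proportions 0.9 and 0.1 with seed 42. The sketch below shows what such a split could look like; split_field is a hypothetical helper and the library's actual shuffling and rounding may differ.

```python
import random


def split_field(data, field_to_split, split_fields, split_proportions, seed=None):
    """Shuffle one field and cut it into the named parts by proportion."""
    samples = list(data[field_to_split])
    random.Random(seed).shuffle(samples)
    result = {name: items for name, items in data.items() if name != field_to_split}
    start = 0
    for i, (name, proportion) in enumerate(zip(split_fields, split_proportions)):
        # The last part takes the remainder so no sample is lost to rounding.
        if i == len(split_fields) - 1:
            end = len(samples)
        else:
            end = start + round(proportion * len(samples))
        result[name] = samples[start:end]
        start = end
    return result


data = {"train": [(f"text {i}", "intent") for i in range(100)]}
split = split_field(data, "train", ["train", "valid"], [0.9, 0.1], seed=42)
print(len(split["train"]), len(split["valid"]))  # 90 10
```

Fixing the seed makes the split reproducible, so validation scores from different runs are computed on the same held-out samples.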

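Train Parameters (class_name and metrics)

epochs: maximum number of epochs to train an NNModel; the default is -1 (unlimited)
batch_size: batch size
metrics: list of registered metrics used to evaluate the model; the first metric in the list is used for early stopping
Note: in neural network training, early stopping typically works as follows:

Split the training data into a training set and a validation set (for example, in a 2:1 ratio);
Train on the training set and evaluate on the validation set at intervals (for example, every 5 epochs); if the validation error starts to rise as the number of epochs grows, stop training;
Use the weights at the stopping point as the final parameters of the network.
Early stopping helps prevent overfitting.

metric_optimization: whether to maximize or minimize the metric; the default is maximize
validation_patience: how many validations in a row the metric may fail to improve before training stops early; the default is 5
val_every_n_epochs: how often to validate the pipeline; the default is -1 (never)
log_every_n_batches, log_every_n_epochs: how often to compute metrics on the training data; the default is -1 (never)
validate_best, test_best: flags that control whether the best saved model is evaluated on the validation and test data; the default is true
tensorboard_log_dir: path where metrics are logged during training; TensorBoard can then plot them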
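How validation_patience and metric_optimization interact can be sketched as a loop over per-epoch validation scores. run_early_stopping is a hypothetical function and the scores are invented; a real trainer would compute them by validating the pipeline and would save the model whenever the score improves.

```python
def run_early_stopping(validation_scores, validation_patience=5,
                       metric_optimization="maximize"):
    """Return (best_epoch, best_score, epochs_run) for a sequence of scores."""
    if metric_optimization not in ("maximize", "minimize"):
        raise ValueError("metric_optimization must be 'maximize' or 'minimize'")
    sign = 1 if metric_optimization == "maximize" else -1
    best_score, best_epoch = None, None
    patience_left = validation_patience
    epochs_run = 0
    for epoch, score in enumerate(validation_scores, start=1):
        epochs_run = epoch
        if best_score is None or sign * score > sign * best_score:
            # Improvement: remember it and reset the patience counter.
            best_score, best_epoch = score, epoch
            patience_left = validation_patience
        else:
            patience_left -= 1
            if patience_left == 0:
                break  # too many validations in a row without improvement
    return best_epoch, best_score, epochs_run


scores = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69, 0.68, 0.90]
result = run_early_stopping(scores, validation_patience=3)
print(result)  # (3, 0.72, 6)
```

Training stops after epoch 6 because epochs 4 to 6 did not beat the score from epoch 3. The late 0.90 is never reached, which is the trade-off a small patience value makes.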

The train element can contain a class_name parameter that references a trainer class (the default value is nn_trainer). All other parameters are passed as keyword arguments to the trainer class's constructor.

Metrics

"train": {
  "class_name": "nn_trainer",
  "metrics": [
    "f1",
    {
      "name": "accuracy",
      "inputs": ["y", "y_labels"]
    },
    {
      "name": "roc_auc",
      "inputs": ["y", "y_probabilities"]
    }
  ],
  ...
}

The first metric in the list is used for early stopping.

Each metric can be described as a JSON object with name and inputs properties, where name is the registered name of a metric function and inputs is a list of parameter names from the chainer's inner memory that will be passed to the metric function.

If a metric is described as a single string, this string is interpreted as a registered name.

The default value of the inputs parameter is the concatenation of the chainer's in_y and out parameters.
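The rules for metric descriptions can be sketched as a small normalization step. normalize_metrics and the accuracy function here are hypothetical and only illustrate the string form, the object form, and the default for inputs.

```python
def normalize_metrics(metrics, chainer_in_y, chainer_out):
    """Turn every metric entry into a dict with explicit name and inputs."""
    default_inputs = list(chainer_in_y) + list(chainer_out)
    normalized = []
    for metric in metrics:
        if isinstance(metric, str):
            metric = {"name": metric}  # a bare string is a registered name
        normalized.append(
            {"name": metric["name"], "inputs": list(metric.get("inputs", default_inputs))}
        )
    return normalized


def accuracy(y_true, y_predicted):
    """Toy metric function: fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_predicted)) / len(y_true)


specs = normalize_metrics(
    ["f1", {"name": "accuracy", "inputs": ["y", "y_labels"]}],
    chainer_in_y=["y"],
    chainer_out=["y_predicted"],
)

# The chainer's memory maps names to batches; inputs selects what a metric sees.
memory = {"y": ["a", "b", "a", "b"], "y_labels": ["a", "b", "b", "b"]}
score = accuracy(*[memory[name] for name in specs[1]["inputs"]])
print(specs[0], score)
```

The first entry falls back to in_y followed by out, while the second names its own inputs, which is how one config can score labels with accuracy and probabilities with roc_auc.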

DatasetReader

The DatasetReader class reads data and returns it in a specified format. A concrete DatasetReader class should inherit from this base class and be registered with a codename:

from deeppavlov.core.common.registry import register
from deeppavlov.core.data.dataset_reader import DatasetReader

@register('dstc2_datasetreader')
class DSTC2DatasetReader(DatasetReader):
    ...  # class body is not shown in the source

DataLearningIterator and DataFittingIterator

DataLearningIterator forms the sets of data ('train', 'valid', 'test') needed for training/inference and divides them into batches. A concrete DataLearningIterator class should be registered and can inherit from the deeppavlov.core.data.data_learning_iterator.DataLearningIterator class. This base class can also be used as a DataLearningIterator directly.

DataFittingIterator iterates over the provided dataset without a train/valid/test split and is useful for Estimators that do not require training.
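A toy iterator that splits data into named sets and yields batches might look like the sketch below. ToyLearningIterator is a hypothetical class written for illustration; its gen_batches method is only loosely modelled on the idea described above and is not the library's class.

```python
import random


class ToyLearningIterator:
    """Holds 'train', 'valid' and 'test' samples and yields (x, y) batches."""

    def __init__(self, data, seed=None):
        self.data = {split: list(data.get(split, []))
                     for split in ("train", "valid", "test")}
        self.random = random.Random(seed)

    def gen_batches(self, batch_size, data_type="train", shuffle=True):
        samples = list(self.data[data_type])
        if shuffle:
            self.random.shuffle(samples)
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            xs, ys = zip(*batch)  # separate inputs from ground truth answers
            yield list(xs), list(ys)


data = {
    "train": [(f"utterance {i}", i % 2) for i in range(5)],
    "valid": [("validation utterance", 0)],
}
iterator = ToyLearningIterator(data, seed=42)
batches = list(iterator.gen_batches(2, data_type="train", shuffle=False))
print(batches[0])  # (['utterance 0', 'utterance 1'], [0, 1])
```

The last batch may be smaller than batch_size, and a set that was not provided (here "test") simply yields no batches.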

Inference

All components that inherit from the Component abstract class can be used for inference. The __call__() method should return the standard output of a component. For example, a tokenizer should return tokens, a NER recognizer should return recognized entities, and a bot should return an utterance. The particular format of the returned data should be defined in __call__().

Inference is triggered by the interact_model() function. There is no need for a separate JSON config for inference.

Model Configuration

Each DeepPavlov model is determined by its configuration file. You can use existing config files or create your own. You can also take a config file and modify its preprocessors, tokenizers, embedders, and vectorizers. The components listed below share the same interface and are responsible for the same functions, so they can be used in the same parts of a config pipeline.

Here is a list of useful components for preprocessing, postprocessing, and vectorizing your data.

Preprocessors

A preprocessor is a component that processes a batch of samples.

Tokenizers

A tokenizer is a component that processes a batch of samples, where each sample is a text string.

Embedders

An embedder is a component that converts every token in a tokenized batch to a vector of a particular dimension (optionally, it returns a single vector per sample).
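As an illustration of the two output shapes an embedder can produce, here is a minimal sketch. embed_batch, its mean flag and the two-dimensional vectors are all hypothetical, and averaging is only one possible way to obtain a single vector per sample.

```python
def embed_batch(tokenized_batch, vectors, dim, mean=False):
    """Map tokens to vectors; optionally average them into one vector per sample."""
    zero = [0.0] * dim
    embedded = [[vectors.get(token, zero) for token in sample]
                for sample in tokenized_batch]
    if not mean:
        return embedded
    pooled = []
    for sample in embedded:
        if not sample:
            pooled.append(list(zero))  # empty sample: fall back to zeros
        else:
            pooled.append([sum(column) / len(sample) for column in zip(*sample)])
    return pooled


vectors = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}  # hypothetical embeddings
batch = [["hello", "world"], ["hello", "unknown"]]

per_token = embed_batch(batch, vectors, dim=2)
per_sample = embed_batch(batch, vectors, dim=2, mean=True)
print(per_sample)  # [[0.5, 0.5], [0.5, 0.0]]
```

Tokens missing from the table are mapped to a zero vector in this sketch, so every sample keeps one vector per token in the default mode.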

Vectorizers

A vectorizer is a component that converts a batch of text samples to a batch of vectors.
