Elasticsearch 6.4 Series, Part 16: Ingest Node

风吹千里

Last modified: 2024-05-19 23:49:13

Category: elasticsearch6.4. Tags: elasticsearch ingest node

First published: 2019-11-10 23:07:26

Article link: https://blog.csdn.net/maligebazi/article/details/103003553

Ingest Node

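Use an ingest node to pre-process documents before the actual indexing happens. The ingest node intercepts bulk and index requests, applies the transformations, and then passes the documents back to the index or bulk APIs.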

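By default, ingest is enabled on all nodes, so any node can handle ingest tasks. You can also create dedicated ingest nodes. To disable ingest on a node, configure the following setting in the elasticsearch.yml file: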

node.ingest: false

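To pre-process documents before indexing, define a pipeline that specifies a series of processors. Each processor transforms the document in some specific way. For example, a pipeline might have one processor that removes a field from the document, followed by another processor that renames a field. The cluster state then stores the configured pipelines.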

To use a pipeline, specify the pipeline parameter on an index or bulk request. This tells the ingest node which pipeline to use. For example:

PUT my-index/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}
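
As a rough illustration of where the pipeline parameter goes, the sketch below builds (but does not send) the same index request with Python's standard library. The base URL http://localhost:9200 and the helper name build_index_request are assumptions made for this example; they do not come from the original text.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: a local node on the default HTTP port; adjust for your cluster.
BASE_URL = "http://localhost:9200"

def build_index_request(index, doc_id, source, pipeline):
    """Build a PUT request that indexes `source` through the given pipeline."""
    query = urlencode({"pipeline": pipeline})
    url = f"{BASE_URL}/{index}/_doc/{doc_id}?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(source).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_index_request("my-index", "my-id", {"foo": "bar"}, "my_pipeline_id")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

The pipeline name travels as a query-string parameter, so the document body itself stays unchanged.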

For more information about creating, adding, and deleting pipelines, see Ingest APIs.

Pipeline Definition

A pipeline is a definition of a series of processors that are executed in the order in which they are declared. A pipeline consists of two main fields: a description and a list of processors:

{
  "description" : "...",
  "processors" : [ ... ]
}

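The description is a special field that stores a helpful description of what the pipeline does.

The processors parameter defines a list of processors to be executed in order.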
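
To make the "executed in declaration order" idea concrete, here is a minimal toy model in Python. It is not Elasticsearch code: the three processor functions and run_pipeline are hypothetical stand-ins that only mimic the shape of a pipeline definition.

```python
# Toy model of a pipeline: NOT Elasticsearch code, just the ordering idea.
def set_processor(doc, field, value):
    doc[field] = value

def remove_processor(doc, field):
    del doc[field]

def rename_processor(doc, field, target_field):
    doc[target_field] = doc.pop(field)

PROCESSORS = {
    "set": set_processor,
    "remove": remove_processor,
    "rename": rename_processor,
}

def run_pipeline(pipeline, source):
    """Apply each processor in declaration order to a copy of the document."""
    doc = dict(source)
    for entry in pipeline["processors"]:
        # Each entry is a single-key dict: {"PROCESSOR_NAME": {options}}
        (name, options), = entry.items()
        PROCESSORS[name](doc, **options)
    return doc

pipeline = {
    "description": "remove one field, then rename another",
    "processors": [
        {"remove": {"field": "tmp"}},
        {"rename": {"field": "foo", "target_field": "bar"}},
    ],
}

result = run_pipeline(pipeline, {"foo": 1, "tmp": "x"})
print(result)  # {'bar': 1}
```

Swapping the two entries in the processors list would not change this particular result, but in general the order matters because each processor sees the output of the previous one.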

Ingest APIs

The following ingest APIs are available for managing pipelines:

Put Pipeline API to add or update a pipeline
Get Pipeline API to return a specific pipeline
Delete Pipeline API to delete a pipeline
Simulate Pipeline API to simulate a call to a pipeline

Put Pipeline API

The put pipeline API adds pipelines and updates existing pipelines in the cluster.

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

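The put pipeline API also instructs all ingest nodes to reload their in-memory representation of pipelines, so that pipeline changes take effect immediately.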

Get Pipeline API

The get pipeline API returns pipelines based on ID. This API always returns a local reference of the pipeline.

GET _ingest/pipeline/my-pipeline-id

{
  "my-pipeline-id" : {
    "description" : "describe pipeline",
    "processors" : [
      {
        "set" : {
          "field" : "foo",
          "value" : "bar"
        }
      }
    ]
  }
}

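For each returned pipeline, the source and the version are returned. The version is useful for knowing which version of the pipeline the node has. You can specify multiple IDs to return more than one pipeline. Wildcards are also supported.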

Pipeline Versioning

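Pipelines can optionally add a version number, which can be any integer value, to simplify pipeline management by external systems. The version field is completely optional and is meant solely for external management of pipelines. To unset a version, replace the pipeline without specifying one.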

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "version" : 123,
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

To check the version, you can filter the response with filter_path to limit it to just the version:

GET /_ingest/pipeline/my-pipeline-id?filter_path=*.version

This should give a small response that is both easy and inexpensive to parse:

{
  "my-pipeline-id" : {
    "version" : 123
  }
}

Delete Pipeline API

The delete pipeline API deletes pipelines by ID or wildcard match (my-*, *).

DELETE _ingest/pipeline/my-pipeline-id

Simulate Pipeline API

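The simulate pipeline API executes a specific pipeline against the set of documents provided in the body of the request.

You can either specify an existing pipeline to execute against the provided documents, or supply a pipeline definition in the body of the request.

Here is the structure of a simulate request with a pipeline definition provided in the body of the request: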

POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    // pipeline definition here
  },
  "docs" : [
    { "_source": {/** first document **/} },
    { "_source": {/** second document **/} },
    // ...
  ]
}

Here is the structure of a simulate request against an existing pipeline:

POST _ingest/pipeline/my-pipeline-id/_simulate
{
  "docs" : [
    { "_source": {/** first document **/} },
    { "_source": {/** second document **/} },
    // ...
  ]
}

Here is an example of a simulate request with a pipeline defined in the request, followed by its response:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

{
   "docs": [
      {
         "doc": {
            "_id": "id",
            "_index": "index",
            "_type": "_doc",
            "_source": {
               "field2": "_value",
               "foo": "bar"
            },
            "_ingest": {
               "timestamp": "2017-05-04T22:30:03.187Z"
            }
         }
      },
      {
         "doc": {
            "_id": "id",
            "_index": "index",
            "_type": "_doc",
            "_source": {
               "field2": "_value",
               "foo": "rab"
            },
            "_ingest": {
               "timestamp": "2017-05-04T22:30:03.188Z"
            }
         }
      }
   ]
}

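Viewing Verbose Results

You can use the simulate pipeline API to see how each processor affects the ingest document as it passes through the pipeline. To see the intermediate results of each processor in the simulate request, add the verbose parameter to the request.

Here is an example of a verbose request and its response: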

POST _ingest/pipeline/_simulate?verbose
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "field" : "field2",
          "value" : "_value2"
        }
      },
      {
        "set" : {
          "field" : "field3",
          "value" : "_value3"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

Response:

{
   "docs": [
      {
         "processor_results": [
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field2": "_value2",
                     "foo": "bar"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.674Z"
                  }
               }
            },
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field3": "_value3",
                     "field2": "_value2",
                     "foo": "bar"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.675Z"
                  }
               }
            }
         ]
      },
      {
         "processor_results": [
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field2": "_value2",
                     "foo": "rab"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.676Z"
                  }
               }
            },
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field3": "_value3",
                     "field2": "_value2",
                     "foo": "rab"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.677Z"
                  }
               }
            }
         ]
      }
   ]
}
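
The difference between the plain and the verbose response can be mimicked with a few lines of Python. This is only a toy model under stated assumptions: the simulate function is hypothetical, it understands only the set processor, and it omits the _ingest timestamp that the real responses contain.

```python
import copy

def simulate(pipeline, docs, verbose=False):
    """Toy simulate: supports only the set processor used in the example above."""
    results = []
    for doc in docs:
        current = copy.deepcopy(doc)
        steps = []
        for processor in pipeline["processors"]:
            options = processor["set"]
            current["_source"][options["field"]] = options["value"]
            # Snapshot the document after every processor for verbose mode.
            steps.append({"doc": copy.deepcopy(current)})
        if verbose:
            results.append({"processor_results": steps})
        else:
            results.append({"doc": current})
    return {"docs": results}

pipeline = {
    "processors": [
        {"set": {"field": "field2", "value": "_value2"}},
        {"set": {"field": "field3", "value": "_value3"}},
    ]
}
docs = [{"_index": "index", "_id": "id", "_source": {"foo": "bar"}}]

plain = simulate(pipeline, docs)
detailed = simulate(pipeline, docs, verbose=True)
print(len(detailed["docs"][0]["processor_results"]))  # 2
```

In verbose mode each document yields one snapshot per processor, which is why the first snapshot lacks field3 while the second contains it.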

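Accessing Data in Pipelines

The processors in a pipeline have read and write access to documents that pass through the pipeline. The processors can access fields in the source of a document as well as the document's metadata fields.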

Accessing Fields in the Source

Accessing a field in the source is straightforward: refer to the field by its name. For example:

{
  "set": {
    "field": "my_field",
    "value": 582.1
  }
}

In addition, fields from the source are always accessible via the _source prefix:

{
  "set": {
    "field": "_source.my_field",
    "value": 582.1
  }
}

Accessing Metadata Fields

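You can access metadata fields in the same way that you access fields in the source. This is possible because Elasticsearch doesn't allow fields in the source that have the same name as metadata fields.

The following example sets the _id metadata field of a document to 1: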

{
  "set": {
    "field": "_id",
    "value": "1"
  }
}

The following metadata fields are accessible by a processor: _index, _type, _id, _routing.

Accessing Ingest Metadata Fields

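Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp to the ingest metadata under the _ingest.timestamp key. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.

Any processor can add ingest-related metadata during document processing. Ingest metadata is transient and is lost after a document has been processed by the pipeline. Therefore, ingest metadata won't be indexed.

The following example adds a field named received. Its value is the ingest timestamp: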

{
  "set": {
    "field": "received",
    "value": "{{_ingest.timestamp}}"
  }
}

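Unlike Elasticsearch metadata fields, the ingest metadata field name _ingest can be used as a valid field name in the source of a document. Use _source._ingest to refer to the field in the source document. Otherwise, _ingest is interpreted as an ingest metadata field.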
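
The lookup rules described in this section (plain names, the _source prefix, metadata fields, and the _ingest key) can be summarized in a small sketch. The resolve function and the document layout below are hypothetical; they model the rules from the text and are not how Elasticsearch implements them.

```python
METADATA_FIELDS = {"_index", "_type", "_id", "_routing"}

def lookup(mapping, dotted_path):
    """Follow a dotted path such as 'a.b.c' through nested dicts."""
    value = mapping
    for key in dotted_path.split("."):
        value = value[key]
    return value

def resolve(doc, path):
    """Resolve a field reference the way the text above describes it."""
    head, _, rest = path.partition(".")
    if head == "_source" and rest:
        return lookup(doc["_source"], rest)      # explicit source prefix
    if head == "_ingest" and rest:
        return lookup(doc["_ingest"], rest)      # ingest metadata
    if path in METADATA_FIELDS:
        return doc[path]                         # document metadata
    return lookup(doc["_source"], path)          # plain source field

doc = {
    "_index": "index",
    "_id": "1",
    "_source": {"my_field": 582.1, "_ingest": {"note": "from source"}},
    "_ingest": {"timestamp": "2017-05-04T22:30:03.187Z"},
}

print(resolve(doc, "my_field"), resolve(doc, "_source._ingest.note"))
```

Note how _ingest.timestamp and _source._ingest.note land in different places even though both mention _ingest.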

Accessing Fields and Metafields in Templates

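A number of processor settings also support templating. Settings that support templating can have zero or more template snippets. A template snippet begins with {{ and ends with }}. Accessing fields and metafields in templates is exactly the same as via regular processor field settings.

The following example adds a field named field_c. Its value is a concatenation of the values of field_a and field_b.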

{
  "set": {
    "field": "field_c",
    "value": "{{field_a}} {{field_b}}"
  }
}

The following example uses the value of the geoip.country_iso_code field in the source to set the index that the document will be indexed into:

{
  "set": {
    "field": "_index",
    "value": "{{geoip.country_iso_code}}"
  }
}

Dynamic field names are also supported. This example sets a field named after the value of service to the value of the field code:

{
  "set": {
    "field": "{{service}}",
    "value": "{{code}}"
  }
}
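
The snippet substitution described above can be approximated in Python with a regular expression. This is a simplified stand-in, not the template engine Elasticsearch uses: render, resolve, and set_processor are hypothetical names, and a missing field raises KeyError here, which is an assumption of this sketch.

```python
import re

# Matches {{ name }} with optional spaces around a dotted field path.
_SNIPPET = re.compile(r"\{\{\s*([^{}\s]+)\s*\}\}")

def resolve(doc, path):
    """Look up a dotted path such as 'geoip.country_iso_code' in nested dicts."""
    value = doc
    for key in path.split("."):
        value = value[key]
    return value

def render(template, doc):
    """Replace every {{field}} snippet with the field's value from doc."""
    return _SNIPPET.sub(lambda match: str(resolve(doc, match.group(1))), template)

def set_processor(doc, field, value):
    """Toy set processor where both the field name and the value are templated."""
    doc[render(field, doc)] = render(value, doc)

doc = {"field_a": "hello", "field_b": "world", "service": "http", "code": "200"}
set_processor(doc, "field_c", "{{field_a}} {{field_b}}")
set_processor(doc, "{{service}}", "{{code}}")
print(doc["field_c"], doc["http"])  # hello world 200
```

A setting with no snippets passes through render unchanged, which matches the "zero or more template snippets" wording.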

Handling Failures in Pipelines

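In its simplest use case, a pipeline defines a list of processors that are executed sequentially, and processing halts at the first exception. This behavior may not be desirable when failures are expected. For example, you may have logs that don't match the specified grok expression. Instead of halting execution, you may want to index such documents into a separate index.

To enable this behavior, use the on_failure parameter. The on_failure parameter defines a list of processors to be executed immediately following the failed processor. You can specify this parameter at the pipeline level as well as at the processor level. If a processor specifies an on_failure configuration, whether it is empty or not, any exceptions thrown by the processor are caught, and the pipeline continues executing the remaining processors. Because you can define further processors within the scope of an on_failure statement, you can nest failure handling.

The following example defines a pipeline that renames the foo field in the processed document to bar. If the document does not contain the foo field, the processor attaches an error message to the document for later analysis within Elasticsearch.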

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "field \"foo\" does not exist, cannot rename to \"bar\""
            }
          }
        ]
      }
    }
  ]
}

The following example defines an on_failure block on the whole pipeline to change the index to which failed documents get sent.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [ ... ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "failed-{{ _index }}"
      }
    }
  ]
}

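Alternatively, instead of defining the behavior in case of a processor failure, you can ignore the failure and continue with the next processor by specifying the ignore_failure setting.

If the field foo doesn't exist in the example below, the failure is caught and the pipeline continues to execute, which in this case means that the pipeline does nothing.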

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "ignore_failure" : true
      }
    }
  ]
}

The ignore_failure setting can be set on any processor and defaults to false.

Accessing Error Metadata From Processors Handling Exceptions

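You may want to retrieve the actual error message that was thrown by a failed processor. To do so, you can access the metadata fields named on_failure_message, on_failure_processor_type, and on_failure_processor_tag. These fields are only accessible from within the context of an on_failure block.

Here is an updated version of the example that you saw earlier. Instead of setting the error message manually, this example uses the on_failure_message metadata field to provide the error message.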

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
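
The three behaviors covered in this section (halt on error, on_failure handlers, and ignore_failure) can be modeled in a short Python sketch. Everything here is hypothetical: the function names, the wording of the error message, and the naive handling of the {{ _ingest.on_failure_message }} snippet are assumptions chosen to keep the example small.

```python
def rename(doc, meta, field, target_field):
    if field not in doc:
        # Assumption: the message text is made up for this sketch.
        raise ValueError(f"field [{field}] doesn't exist")
    doc[target_field] = doc.pop(field)

def set_field(doc, meta, field, value):
    # Very small stand-in for templating: only "{{ _ingest.<key> }}" is expanded.
    for key, meta_value in meta.items():
        value = value.replace("{{ _ingest.%s }}" % key, str(meta_value))
    doc[field] = value

PROCESSORS = {"rename": rename, "set": set_field}

def run_processor(doc, meta, entry):
    (name, options), = entry.items()
    options = dict(options)
    on_failure = options.pop("on_failure", None)
    ignore_failure = options.pop("ignore_failure", False)
    try:
        PROCESSORS[name](doc, meta, **options)
    except Exception as exc:
        if ignore_failure:
            return                      # swallow the error and move on
        if on_failure is None:
            raise                       # no handler: the pipeline halts
        # Error metadata is only visible while the handlers run.
        meta["on_failure_message"] = str(exc)
        meta["on_failure_processor_type"] = name
        try:
            for handler in on_failure:
                run_processor(doc, meta, handler)
        finally:
            meta.pop("on_failure_message", None)
            meta.pop("on_failure_processor_type", None)

def run_pipeline(processors, source):
    doc, meta = dict(source), {}
    for entry in processors:
        run_processor(doc, meta, entry)
    return doc

handled = [{"rename": {"field": "foo", "target_field": "bar", "on_failure": [
    {"set": {"field": "error", "value": "{{ _ingest.on_failure_message }}"}}]}}]
ignored = [{"rename": {"field": "foo", "target_field": "bar", "ignore_failure": True}}]

print(run_pipeline(handled, {"baz": 1}))
print(run_pipeline(ignored, {"baz": 1}))
```

Because handlers are themselves run through run_processor, a handler may carry its own on_failure list, which is the nesting the text mentions.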

Processors

All processors are defined in the following way within a pipeline definition:

{
  "PROCESSOR_NAME" : {
    ... processor configuration options ...
  }
}

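Each processor defines its own configuration parameters, but all processors have the ability to declare tag and on_failure fields. These fields are optional.

A tag is simply a string identifier of the specific instantiation of a certain processor in a pipeline. The tag field does not affect the processor's behavior, but is very useful for bookkeeping and for tracing errors to specific processors.

See Handling Failures in Pipelines to learn more about the on_failure field and error handling in pipelines.

The node info API can be used to figure out which processors are available in a cluster. The node info API provides a per-node list of the available processors.

Custom processors must be installed on all nodes. The put pipeline API fails if a processor specified in a pipeline doesn't exist on all nodes. If you rely on custom processor plugins, make sure to mark these plugins as mandatory by adding the plugin.mandatory setting to the config/elasticsearch.yml file, for example: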

plugin.mandatory: ingest-attachment,ingest-geoip

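A node will not start if any of these plugins is not available.

The node stats API can be used to fetch ingest usage statistics, globally and on a per-pipeline basis. This is useful for finding out which pipelines are used the most or spend the most time on preprocessing.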

Append Processor

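Appends one or more values to an existing array if the field already exists and it is an array. Converts a scalar to an array and appends one or more values to it if the field exists and it is a scalar. Creates an array containing the provided values if the field doesn't exist. Accepts a single value or an array of values.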

Table 28. Append Options

Name	Required	Default	Description
field	yes	-	The field to be appended to
value	yes	-	The value to be appended

{
  "append": {
    "field": "field1",
    "value": ["item2", "item3", "item4"]
  }
}
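
The three cases in the description (existing array, existing scalar, missing field) map onto a few lines of Python. The append function below is a hypothetical model of those rules, not the processor's actual implementation.

```python
def append(doc, field, value):
    """Toy append: mirrors the three cases described for the append processor."""
    values = value if isinstance(value, list) else [value]
    if field not in doc:
        doc[field] = list(values)             # missing field: create a new array
    elif isinstance(doc[field], list):
        doc[field].extend(values)             # existing array: append to it
    else:
        doc[field] = [doc[field], *values]    # scalar: convert to array, then append
    return doc

print(append({"field1": "item1"}, "field1", ["item2", "item3", "item4"]))
# {'field1': ['item1', 'item2', 'item3', 'item4']}
```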

Bytes Processor

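Converts a human-readable byte value (e.g. 1kb) to its value in bytes (e.g. 1024).

Supported human-readable units are "b", "kb", "mb", "gb", "tb", "pb", case insensitive. An error will occur if the field is not in a supported format or the resulting value exceeds 2^63.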

Table 29. Bytes Options

Name	Required	Default	Description
field	yes	-	The field to convert
target_field	no	field	The field to assign the converted value to, by default field is updated in-place
ignore_missing	no	false	If true and field does not exist or is null, the processor quietly exits without modifying the document

{
  "bytes": {
    "field": "foo"
  }
}
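
A minimal sketch of the conversion, assuming binary multiples (1kb = 1024 bytes, as in the example above) and whole-number inputs only. The parse_bytes helper is hypothetical, and treating the upper limit as "must fit in a signed 64-bit integer" is this sketch's interpretation of the 2^63 bound.

```python
import re

UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4, "pb": 1024**5}
_PATTERN = re.compile(r"\s*(\d+)\s*(b|kb|mb|gb|tb|pb)\s*", re.IGNORECASE)

def parse_bytes(text):
    """Convert a value such as '1kb' into a number of bytes."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported byte value: {text!r}")
    number, unit = match.groups()
    result = int(number) * UNITS[unit.lower()]
    # Assumption: reject anything that does not fit a signed 64-bit integer.
    if result > 2**63 - 1:
        raise ValueError(f"byte value too large: {text!r}")
    return result

print(parse_bytes("1kb"), parse_bytes("2MB"))  # 1024 2097152
```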

Convert Processor

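Converts an existing field's value to a different type, such as converting a string to an integer. If the field value is an array, all members are converted.

The supported types include: integer, long, float, double, string, boolean, and auto.

Specifying boolean sets the field to true if its string value is equal to true (ignoring case), to false if its string value is equal to false (ignoring case), or throws an exception otherwise.

Specifying auto attempts to convert the string-valued field into the closest non-string type. For example, a field whose value is "true" is converted to its respective boolean type: true. Note that float takes precedence over double in auto. A value of "242.15" is automatically converted to 242.15 of type float rather than double. If the provided field cannot be appropriately converted, the convert processor still processes successfully and leaves the field value as-is. In such a case, target_field is still updated with the unconverted field value.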

Table 30. Convert Options

Name	Required	Default	Description
field	yes	-	The field whose value is to be converted
target_field	no	field	The field to assign the converted value to, by default field is updated in-place
type	yes	-	The type to convert the existing value to
ignore_missing	no	false	If true and field does not exist or is null, the processor quietly exits without modifying the document

{
  "convert": {
    "field" : "foo",
    "type": "integer"
  }
}
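
The boolean and auto rules can be approximated in Python as shown below. This is a loose model under stated assumptions: convert_boolean and convert_auto are hypothetical helpers, and Python has a single float type, so the float-versus-double distinction from the text cannot be reproduced here.

```python
def convert_boolean(value):
    """'true'/'false' in any letter case become booleans; anything else is an error."""
    text = str(value).lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"[{value}] is not a boolean value")

def convert_auto(value):
    """Try boolean, then integer, then floating point; otherwise keep the value."""
    if not isinstance(value, str):
        return value
    try:
        return convert_boolean(value)
    except ValueError:
        pass
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value  # not convertible: leave the field value as-is

print(convert_auto("true"), convert_auto("42"), convert_auto("242.15"), convert_auto("abc"))
```

Returning the original value when nothing fits mirrors the "still processes successfully" behavior described for auto, whereas convert_boolean raises on unexpected input.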

Date Processor

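Parses dates from fields, and then uses the date or timestamp as the timestamp for the document. By default, the date processor adds the parsed date as a new field called @timestamp. You can specify a different field by setting the target_field configuration parameter. Multiple date formats are supported as part of the same date processor definition. They are tried sequentially, in the order in which they were defined in the processor definition, to parse the date field.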

Table 31. Date options

Name	Required	Default	Description
field	yes	-	The field to get the date from.
target_field	no	@timestamp	The field that will hold the parsed date.
formats	yes	-	An array of the expected date formats. Can be a Joda pattern or one of the following formats: ISO8601, UNIX, UNIX_MS, or TAI64N.
timezone	no	UTC	The timezone to use when parsing the date.
locale	no	ENGLISH	The locale to use when parsing the date, relevant when parsing month names or week days.

Here is an example that adds the parsed date to the timestamp field based on the initial_date field:

{
  "description" : "...",
  "processors" : [
    {
      "date" : {
        "field" : "initial_date",
        "target_field" : "timestamp",
        "formats" : ["dd/MM/yyyy hh:mm:ss"],
        "timezone" : "Europe/Amsterdam"
      }
    }
  ]
}
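
The "try each format in order" behavior can be sketched in Python as follows. Several things here are assumptions: parse_date is a hypothetical helper, the patterns are Python strptime patterns rather than Joda patterns, only the UNIX special format is modeled, and parsed dates are treated as UTC.

```python
from datetime import datetime, timezone

def parse_date(text, formats):
    """Return the result of the first format that parses `text`."""
    for fmt in formats:
        try:
            if fmt == "UNIX":
                # Seconds since the epoch.
                return datetime.fromtimestamp(float(text), tz=timezone.utc)
            # Assumption: strptime patterns stand in for Joda patterns.
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unable to parse date [{text}]")

parsed = parse_date("25/04/2016 12:02:45", ["%Y-%m-%d", "%d/%m/%Y %H:%M:%S"])
print(parsed.isoformat())  # 2016-04-25T12:02:45+00:00
```

Listing the most common format first keeps the number of failed attempts low, since parsing stops at the first match.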

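The timezone and locale processor parameters are templated. This means that their values can be extracted from fields within documents. The example below shows how to extract the locale/timezone details from existing fields, my_timezone and my_locale, in the ingested document that contain the timezone and locale values.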

{
  "description" : "...",
  "processors" : [
    {
      "date" : {
        "field" : "initial_date",
        "target_field" : "timestamp",
        "formats" : ["dd/MM/yyyy hh:mm:ss"],
        "timezone" : "{{my_timezone}}",
        "locale" : "{{my_locale}}"
      }
    }
  ]
}