01. A brief introduction to using ingest pipelines

1. What ingest is for

In an ES cluster every node takes on one or more roles, such as master, data, or ingest (see the node role settings for details); ingest is one of these roles.
An ingest node preprocesses indexing requests: changing a field's value, adding a field or two, dropping problematic docs, and so on.
It can also modify the _index metadata field to route a document into a different index.
The ingest role is configured in elasticsearch.yml:

node.ingest: true

This value defaults to true; set it to false to disable the role.
A node with the ingest role can run ingest programs. In ES an ingest program is called a pipeline, and the name is apt: like a pipe, it carries an indexing request through a series of processing steps. A pipeline defines a number of processors, and each processor is one processing unit.

To summarize, there are two key concepts: pipeline and processor.
pipeline: a program in ES that preprocesses docs before they are indexed
processor: a single processing unit inside a pipeline; all of a pipeline's capabilities come from its processors

node.ingest determines whether the current ES node can run ingest programs; if it cannot, the request is forwarded to a node that has node.ingest: true for processing.

2. Using pipelines

This section shows how to use pipelines in ES: a simple usage example, how a pipeline is defined, which data a pipeline can access, conditional (if) handling in a pipeline, and error handling in a pipeline.

1. A simple pipeline usage example

PUT _ingest/pipeline/my_pipeline_id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "new"
      }
    }
  ]
}

PUT my-index/_doc/my-id?pipeline=my_pipeline_id
{
  "foo": "bar"
}

GET my-index/_search

The search returns:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 1, "relation" : "eq" },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index",
        "_type" : "_doc",
        "_id" : "my-id",
        "_score" : 1.0,
        "_source" : {
          "foo" : "new"
        }
      }
    ]
  }
}
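
Two practical notes on this example. First, instead of passing ?pipeline= on every indexing request, a pipeline can be bound to an index through the index.default_pipeline setting, so it runs for every doc indexed into that index (a minimal sketch against the example index):

# binds the example pipeline to the example index
PUT my-index/_settings
{
  "index.default_pipeline": "my_pipeline_id"
}

Second, pipeline definitions are managed through the same _ingest/pipeline endpoint, so fetching and deleting a definition work as you would expect:

GET _ingest/pipeline/my_pipeline_id

DELETE _ingest/pipeline/my_pipeline_id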


2. The pipeline definition template

A pipeline definition generally looks like this:

{
  "description" : "...",
  "processors" : [ ... ]
}

description is only a human-readable note and has no functional effect.
processors is a list of processors, executed in order.
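
A pipeline definition may also carry an optional version field (it appears again in the logs_pipeline example later), which is handy for tracking revisions:

{
  "description" : "...",
  "version" : 1,
  "processors" : [ ... ]
}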

3. Data available to a processor

A processor has access to the following kinds of data:

  1. Fields defined in _source
  2. Index metadata fields
  3. Ingest metadata fields
  4. Field values inside templates
1. Fields defined in _source

A source field can be referenced either by its bare name or with an explicit _source prefix; the two forms below are equivalent.
{
  "set": {
    "field": "my_field",
    "value": 582.1
  }
}

{
  "set": {
    "field": "_source.my_field",
    "value": 582.1
  }
}
2. Index metadata fields

Index metadata fields such as _index, _type, _id, and _routing can be read and written like ordinary fields; here the doc's _id is set:
{
  "set": {
    "field": "_id",
    "value": "1"
  }
}

3. Ingest metadata fields

Ingest metadata is exposed under the _ingest key; _ingest.timestamp, for example, is the time at which the ingest node received the document:
{
  "set": {
    "field": "received",
    "value": "{{_ingest.timestamp}}"
  }
}
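
A quick run through the simulate API (covered in section 6) shows the effect: received is filled with the moment the node received the doc (the doc content here is illustrative):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "set": { "field": "received", "value": "{{_ingest.timestamp}}" } }
    ]
  },
  "docs": [
    { "_source": { "message": "hello" } }
  ]
}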

4. Field values inside templates

Processor settings support Mustache-style {{ }} templates for reading field values: you can compose a new value from existing fields, compute the target index dynamically, or even derive the field name itself:
{
  "set": {
    "field": "field_c",
    "value": "{{field_a}} {{field_b}}"
  }
}

{
  "set": {
    "field": "_index",
    "value": "{{geoip.country_iso_code}}"
  }
}

{
  "set": {
    "field": "{{service}}",
    "value": "{{code}}"
  }
}
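
With the last processor, for example, a doc whose service is "availability" and whose code is "green" ends up with a new field named availability whose value is green. This can be checked with the simulate API from section 6 (sample values are illustrative):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "set": {
          "field": "{{service}}",
          "value": "{{code}}"
        }
      }
    ]
  },
  "docs": [
    { "_source": { "service": "availability", "code": "green" } }
  ]
}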


4. Conditional execution of processors

Every processor accepts an if option that decides whether the processor runs or is skipped; the condition is a Painless expression that evaluates to true/false.
In the first example below, any doc with network_name equal to 'Guest' is dropped:

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "drop": {
        "if": "ctx.network_name == 'Guest'"
      }
    }
  ]
}

POST test/_doc/1?pipeline=drop_guests_network
{
  "network_name" : "Guest"
}
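
For contrast, a doc whose network_name is anything else fails the condition, so the drop processor is skipped and the doc is indexed normally (the value below is illustrative):

POST test/_doc/2?pipeline=drop_guests_network
{
  "network_name" : "Corp"
}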

Conditions support four scenarios worth covering:

  1. Handling nested fields in a condition
  2. Complex conditions
  3. Invoking another pipeline from a condition
  4. Using regular expressions in a condition
1. Handling nested fields in a condition

Two points need attention:

  1. Guard against null
  2. Flattened field names written with dots need special handling
1. Guarding against null

Take the following pipeline:

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "drop": {
        "if": "ctx.network.name == 'Guest'"
      }
    }
  ]
}

The following doc is dropped and never reaches the index:

POST test/_doc/1?pipeline=drop_guests_network
{
  "network": {
    "name": "Guest"
  }
}

But the following doc triggers an error, because network is null:

POST test/_doc/1?pipeline=drop_guests_network
{
"foo":"bar"
}

Rewriting the pipeline with Painless's null-safe operator ?. fixes this:

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "drop": {
        "if": "ctx.network?.name == 'Guest'"
      }
    }
  ]
}

Now the requirement is met: the doc above no longer errors; it is not dropped and goes straight into the index.

2. Dots in flattened field names

With the improved pipeline above, the doc below still gets indexed, which looks like a bug (intuitively it should be dropped, since it seems to match the drop condition). The reason is that the ingest pipeline treats a flattened field such as "network.name" as a single literal field named "network.name" rather than as a nested object, so ctx.network?.name evaluates to null. To have the dotted name parsed into the nested structure the condition expects, use the dot_expander processor.

POST test/_doc/2?pipeline=drop_guests_network
{
  "network.name": "Guest"
}

Adding a dot_expander processor in front makes it work:

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "dot_expander": {
        "field": "network.name"
      }
    },
    {
      "drop": {
        "if": "ctx.network?.name == 'Guest'"
      }
    }
  ]
}

POST test/_doc/3?pipeline=drop_guests_network
{
  "network.name": "Guest"
}


With the dot_expander in place, doc 3 above is dropped as expected.

Checking several conditions at once also works:

{
  "drop": {
    "if": "ctx.network?.name != null && ctx.network.name.contains('Guest')"
  }
}
2. Complex conditions

An if condition can be a full Painless script with statements and loops. The pipeline below drops every doc unless at least one of its tags contains 'prod' (case-insensitively):
PUT _ingest/pipeline/not_prod_dropper
{
  "processors": [
    {
      "drop": {
        "if": "Collection tags = ctx.tags;if(tags != null){for (String tag : tags) {if (tag.toLowerCase().contains('prod')) { return false;}}} return true;"
      }
    }
  ]
}
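
To sanity-check the condition, run sample docs through the simulate API (section 6); with these illustrative tags, the first doc below is kept and the second is dropped:

POST _ingest/pipeline/not_prod_dropper/_simulate
{
  "docs": [
    { "_source": { "tags": ["prod", "web"] } },
    { "_source": { "tags": ["dev", "web"] } }
  ]
}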

3. Invoking another pipeline from a condition

The pipeline processor invokes another pipeline, and combining it with if yields a pipeline of pipelines that routes docs by their content:
PUT _ingest/pipeline/logs_pipeline
{
  "description": "A pipeline of pipelines for log files",
  "version": 1,
  "processors": [
    {
      "pipeline": {
        "if": "ctx.service?.name == 'apache_httpd'",
        "name": "httpd_pipeline"
      }
    },
    {
      "pipeline": {
        "if": "ctx.service?.name == 'syslog'",
        "name": "syslog_pipeline"
      }
    },
    {
      "fail": {
        "message": "This pipeline requires service.name to be either `syslog` or `apache_httpd`"
      }
    }
  ]
}
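
A doc whose service.name matches neither branch skips both pipeline processors and reaches the fail processor, which rejects the request with the configured message; for example (the index name and the value nginx are illustrative):

POST logs/_doc/1?pipeline=logs_pipeline
{
  "service": { "name": "nginx" }
}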

4. Using regular expressions in a condition

The example below flags URLs that use plain http rather than https:
PUT _ingest/pipeline/check_url
{
  "processors": [
    {
      "set": {
        "if": "ctx.href?.url =~ /^http[^s]/",
        "field": "href.insecure",
        "value": true
      }
    }
  ]
}
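
With illustrative URLs, a doc whose href.url starts with plain http gets href.insecure set to true, while an https URL is left untouched; this can be verified with simulate:

POST _ingest/pipeline/check_url/_simulate
{
  "docs": [
    { "_source": { "href": { "url": "http://example.com" } } },
    { "_source": { "href": { "url": "https://example.com" } } }
  ]
}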

5. Processor error handling

In its simplest use case, a pipeline defines a list of processors that are executed sequentially, and processing halts at the first exception. This behavior may not be desirable when failures are expected. For example, you may have logs that don’t match the specified grok expression. Instead of halting execution, you may want to index such documents into a separate index.

To enable this behavior, you can use the on_failure parameter. The on_failure parameter defines a list of processors to be executed immediately following the failed processor. You can specify this parameter at the pipeline level, as well as at the processor level. If a processor specifies an on_failure configuration, whether it is empty or not, any exceptions that are thrown by the processor are caught, and the pipeline continues executing the remaining processors. Because you can define further processors within the scope of an on_failure statement, you can nest failure handling.

The following example defines a pipeline that renames the foo field in the processed document to bar. If the document does not contain the foo field, the processor attaches an error message to the document for later analysis within Elasticsearch.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "field \"foo\" does not exist, cannot rename to \"bar\""
            }
          }
        ]
      }
    }
  ]
}
on_failure can also be specified at the pipeline level; it then catches an exception thrown by any processor that does not handle the failure itself. Here failed documents are rerouted to a different index:

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [ ... ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "failed-{{ _index }}"
      }
    }
  ]
}
Inside an on_failure block, the _ingest.on_failure_message metadata field holds the error message from the processor that failed:

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "to" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
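
Related: if you only want a failing processor to be skipped without recording anything, each processor also accepts an ignore_failure flag, e.g.:

{
  "rename" : {
    "field" : "foo",
    "target_field" : "bar",
    "ignore_failure" : true
  }
}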

6. simulate pipeline

The simulate API lets you test what a pipeline does to sample docs without indexing anything. You can run an existing pipeline by id:

POST /_ingest/pipeline/my-pipeline-id/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

Or supply the pipeline definition inline in the request body:

POST /_ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}
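
When a pipeline chains several processors, appending the verbose query parameter makes simulate report each processor's intermediate output, which helps pinpoint the step that mangles or rejects a doc; a minimal sketch (processors and doc are illustrative):

POST /_ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "processors": [
      { "set": { "field": "field2", "value": "_value" } },
      { "lowercase": { "field": "foo" } }
    ]
  },
  "docs": [
    { "_source": { "foo": "BAR" } }
  ]
}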
