02. Commonly used pipeline processors


Ingest feels like an area Elasticsearch is investing in, since the number of ingest processors keeps growing.

  1. Append Processor: appends one or more values to an existing field
  2. Bytes Processor: converts human-readable byte values using the units "b", "kb", "mb", "gb", "tb", "pb" into their value in bytes
  3. Convert Processor: converts a field's value to a different type
  4. Date Processor: parses a date field in the source document into a timestamp field that Elasticsearch understands (by default @timestamp)
  5. Date Index Name Processor: routes documents into daily or monthly indices based on a date field
  6. Dissect Processor: like grok, but with a simpler syntax
  7. Dot Expander Processor: usually used together with other processors; it expands a field with dots in its name into an object so that subsequent processors can access the nested fields
  8. Drop Processor: drops the document
  9. Fail Processor: when its condition is met as a document passes through the pipeline, the configured error message is returned to the requester
  10. Foreach Processor: processes array fields by applying the same processor to every element of the array
  11. GeoIP Processor: resolves an IP address to geographic information such as latitude and longitude
  12. Grok Processor: like the powerful grok in Logstash, provides very strong log-parsing capabilities
  13. Gsub Processor: performs regex-based character replacement in a string
  14. HTML Strip Processor: strips HTML tags
  15. Join Processor: joins the elements of an array into a single string, much like Python's str.join
  16. JSON Processor: parses a JSON-formatted string into a structured object
  17. KV Processor: splits a field into key/value pairs using separators
  18. Lowercase Processor: converts a field's content to lowercase
  19. Pipeline Processor: executes another pipeline
  20. Remove Processor: removes fields
  21. Rename Processor: renames a field
  22. Script Processor: processes documents with an Elasticsearch script; any field a script can access is accessible here
  23. Set Processor: sets the value of a field, updating it if it exists and creating it otherwise; it can even modify _index
  24. Set Security User Processor: adds details about the authenticated user (such as username and roles) to the document
  25. Split Processor: splits a delimiter-separated string into an array field
  26. Sort Processor: sorts the elements stored in an array field in ascending or descending order
  27. Trim Processor: trims whitespace from both ends of a string
  28. Uppercase Processor: similar to the Lowercase Processor, but converts text to uppercase
  29. URL Decode Processor: URL-decodes a string
  30. User Agent Processor: extracts details from a standard HTTP User-Agent header

Below, only the processors I personally consider most commonly used are covered.

1. Set Processor: sets the value of a field; if the field exists its value is updated, otherwise the field is created and set. It can even modify _index.

Example: copy the value of one field into a new field.

PUT _ingest/pipeline/set_os
{
  "description": "sets the value of host.os.name from the field os",
  "processors": [
    {
      "set": {
        "field": "host.os.name",
        "value": "{{os}}"
      }
    }
  ]
}

POST _ingest/pipeline/set_os/_simulate
{
  "docs": [
    {
      "_source": {
        "os": "Ubuntu"
      }
    }
  ]
}

field: required. The field to insert, upsert, or update. Supports template snippets.

value: required. The value to be set for the field. Supports template snippets.

override: optional, defaults to true. Controls whether the processor updates fields that already have a non-null value. When set to false, such fields are not touched.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

2. Append Processor: appends one or more values to an existing field

field: required. The field to append the value to. Supports template snippets.

value: required. The value to be appended. Supports template snippets.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example

PUT script_test/_mapping
{
  "properties": {
    "name": {
      "type": "keyword"
    },
    "age": {
      "type": "integer"
    },
    "age_arr": {
      "type": "integer"
    }
  }
}



PUT script_test/_doc/2
{
  "name":"tengfei",
  "age":[22,23],
  "age_arr":[12,15,13,98,102]
}

PUT script_test/_doc/3
{
  "name":"tengfei",
  "age":22,
  "age_arr":[12,15,13,98,102]
}


PUT _ingest/pipeline/append_pipe
{
  "description": "append to friend",
  "processors": [
    {"append": {
      "field": "age",
      "value": [23,78]
    }}
  ]
}


PUT script_test/_doc/23?pipeline=append_pipe
{
  "name": "append test",
  "age": 88
}

The stored document is:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "23",
  "_score" : 1.0,
  "_source" : {
    "name" : "append test",
    "age" : [
      23,
      78,
      88
    ]
  }
}

Compare this with the equivalent script operation via _update_by_query:

POST script_test/_update_by_query
{
  "query":{
    "match_all":{}
  },
  "script":{
    "lang":"painless",
    "source":"ctx._source.age?.add(params.new_age)",
    "params":{
      "from":"china",
      "new_age":55
    }
  }
}

This operation fails because in some documents the age field is not an array but a plain integer:

 "script": "ctx._source.age?.add(params.new_age)",
    "lang": "painless",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "dynamic method [java.lang.Integer, add/1] not found"
    }

The same operation, however, runs normally inside an ingest pipeline.

3. Drop Processor: drops the document

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example

PUT _ingest/pipeline/drop_pipeline
{
  "description": "drop doc when name is chen",
  "processors": [
    {
      "drop": {
        "if": "ctx.name == 'chen'"
      }
    }
  ]
}


PUT script_test/_doc/31?pipeline=drop_pipeline
{
  "name":"chen",
  "age":88
}

Response:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "31",
  "_version" : -3,
  "result" : "noop", # noop: the document was dropped by the pipeline and not indexed
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "failed" : 0
  }
}



PUT script_test/_doc/32?pipeline=drop_pipeline
{
  "name":"chenchuang",
  "age":88
}

Response:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "32",
  "_version" : 1,
  "result" : "created", # created: the document was indexed successfully
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "_seq_no" : 21,
  "_primary_term" : 1
}

4. Remove Processor: removes fields

field: required. Fields to be removed. Supports template snippets.

ignore_missing: defaults to false. If true and field does not exist or is null, the processor quietly exits without modifying the document.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/remove_pipeline
{
  "description": "remove some fields",
  "processors": [
    {
      "remove": {
        "field": ["age01","age"]
      }
    }
  ]
}

PUT script_test/_doc/33?pipeline=remove_pipeline
{
  "name":"remove test",
  "age":[123,45,67],
  "age01":32,
  "age_arr":[34,21]
}

GET script_test/_doc/33

Response:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "33",
  "_version" : 1,
  "_seq_no" : 22,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "remove test",
    "age_arr" : [
      34,
      21
    ]
  }
}



5. Rename Processor: renames a field

field: required. The field to be renamed. Supports template snippets.

target_field: required. The new name of the field. Supports template snippets.

ignore_missing: defaults to false. If true and field does not exist or is null, the processor quietly exits without modifying the document.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/rename_pipeline
{
  "description": "rename fields",
  "processors": [
    {
      "rename": {
        "field": "age",
        "target_field": "life"
      }
    }
  ]
}

PUT script_test/_doc/35?pipeline=rename_pipeline
{
  "name":"rename test",
  "age":108
}

GET script_test/_doc/35

Response:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "35",
  "_version" : 1,
  "_seq_no" : 23,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "rename test",
    "life" : 108
  }
}


6. Join Processor: joins the array elements of a field into a single string, much like Python's str.join

field: required. The field containing the array values to join.

separator: required. The separator character.

target_field: The field to assign the joined value to; by default field is updated in-place.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/join_pipe
{
  "description": "join some fields",
  "processors": [
    {"join": {
      "field": "age_arr",
      "separator": "*",
      "target_field":"join_result"
    }}
  ]
}

PUT script_test/_doc/36?pipeline=join_pipe
{
  "name":"rename test",
  "age":108,
  "age_arr":[12,17,123,987,9]
}

GET script_test/_doc/36

Response:
"_source" : {
    "name" : "rename test",
    "join_result" : "12*17*123*987*9",
    "age_arr" : [
      12,
      17,
      123,
      987,
      9
    ],
    "age" : 108
  }

7. JSON Processor: parses a JSON-formatted string into a structured object

field: required. The field containing the JSON string to be parsed.

target_field: The field to insert the converted structured object into.

add_to_root: defaults to false. Flag that forces the serialized JSON to be injected into the top level of the document. target_field must not be set when this option is chosen.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/json_pipe
{
  "description": "json pipeline",
  "processors": [
    {
      "json": {
        "field": "child",
        "target_field": "child_obj"
      }
    }
  ]
}

PUT script_test/_doc/37?pipeline=json_pipe
{
  "name":"rename test",
  "age":108,
  "child":"{\"son\":\"datou\"}"
}

GET script_test/_doc/37

Response:
{
  "_index" : "script_test",
  "_type" : "_doc",
  "_id" : "37",
  "_version" : 1,
  "_seq_no" : 26,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "rename test",
    "child_obj" : {
      "son" : "datou"
    },
    "age" : 108,
    "child" : """{"son":"datou"}"""
  }
}



8. KV Processor: splits a field into key/value pairs using separators

This one looks fairly complex; like Logstash, it parses a log line into multiple fields. For example, ip=1.2.3.4 error=REFUSED is parsed into the two fields ip and error.

Example


PUT _ingest/pipeline/kv_pipe
{
  "description": "kv pipeline",
  "processors": [
    {
      "kv": {
        "field": "message",
        "field_split": " ",
        "value_split": "="
      }
    }
  ]
}
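
To see the splitting in action, a _simulate run with a made-up log line works well:

POST _ingest/pipeline/kv_pipe/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "ip=1.2.3.4 error=REFUSED"
      }
    }
  ]
}

The resulting document gains the fields "ip": "1.2.3.4" and "error": "REFUSED" alongside the original message.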


9. Split Processor: splits a delimiter-separated string into an array field

field: required. The field to split.

separator: required. A regex which matches the separator, e.g. , or \s+.

target_field: The field to assign the split value to; by default field is updated in-place.

ignore_missing: defaults to false. If true and field does not exist, the processor quietly exits without modifying the document.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/split
{
  "description": "split pipeline",
  "processors": [
    {
      "split": {
        "field": "my_field",
        "separator": "\\s+"
      }
    }
  ]
}
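
The pipeline can be tried out with _simulate (sample document made up for illustration):

POST _ingest/pipeline/split/_simulate
{
  "docs": [
    {
      "_source": {
        "my_field": "foo bar   baz"
      }
    }
  ]
}

my_field comes back as the array ["foo", "bar", "baz"], since the regex \s+ matches any run of whitespace.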

10. Lowercase Processor: converts a field's content to lowercase

field: required. The field to convert to lowercase.

target_field: The field to assign the converted value to; by default field is updated in-place.

ignore_missing: defaults to false. If true and field does not exist or is null, the processor quietly exits without modifying the document.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/lowercase_pipe
{
  "description": "lowercase pipeline",
  "processors": [
    {
      "lowercase": {
        "field": "name"
      }
    }
  ]
}
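
A quick _simulate check (sample document for illustration):

POST _ingest/pipeline/lowercase_pipe/_simulate
{
  "docs": [
    {
      "_source": {
        "name": "TengFei"
      }
    }
  ]
}

The returned document has "name": "tengfei".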

11. Uppercase Processor: similar to the Lowercase Processor, but converts text to uppercase

field: required. The field to convert to uppercase.

target_field: The field to assign the converted value to; by default field is updated in-place.

ignore_missing: defaults to false. If true and field does not exist or is null, the processor quietly exits without modifying the document.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

tag: An identifier for this processor. Useful for debugging and metrics.

Example


PUT _ingest/pipeline/uppercase_pipe
{
  "description": "uppercase pipeline",
  "processors": [
    {
      "uppercase": {
        "field": "name"
      }
    }
  ]
}

12. Convert Processor: converts a field's value to a different type

Example

PUT _ingest/pipeline/my-pipeline-id
{
  "description": "converts the content of the id field to an integer",
  "processors" : [
    {
      "convert" : {
        "field" : "id",
        "type": "integer"
      }
    }
  ]
}
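
A _simulate run (sample document for illustration) confirms the conversion; besides integer, type also accepts long, float, double, string, boolean, and auto:

POST _ingest/pipeline/my-pipeline-id/_simulate
{
  "docs": [
    {
      "_source": {
        "id": "42"
      }
    }
  ]
}

In the response, id is the integer 42 rather than the string "42".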

13. Date Index Name Processor: routes documents into daily or monthly indices based on a date field

field: required. The field containing the date to read the index name from. Supports template snippets.

index_name_prefix: a prefix to prepend to the printed date when building the index name. Supports template snippets.

date_rounding: required. How to round the date when formatting it into the index name: y (year), M (month), w (week), d (day), h (hour), m (minute) or s (second).

date_formats: an array of formats used to parse the date field; defaults to yyyy-MM-dd'T'HH:mm:ss.SSSXX.

if: Conditionally execute this processor.

on_failure: Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure: defaults to false. Ignore failures for this processor. See Handling Failures in Pipelines.

Example

PUT _ingest/pipeline/monthlyindex
{
  "description": "monthly date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "date1",
        "index_name_prefix" : "myindex-",
        "date_rounding" : "M"
      }
    }
  ]
}

PUT /myindex/_doc/1?pipeline=monthlyindex
{
  "date1" : "2016-04-25T12:02:01.789Z"
}

Response:
{
  "_index" : "myindex-2016-04-01",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 55,
  "_primary_term" : 1
}

Using the simulate API:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "monthly date-time index naming",
    "processors" : [
      {
        "date_index_name" : {
          "field" : "date1",
          "index_name_prefix" : "myindex-",
          "date_rounding" : "M"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "date1": "2016-04-25T12:02:01.789Z"
      }
    }
  ]
}

Response:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "<myindex-{2016-04-25||/M{yyyy-MM-dd|UTC}}>",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "date1" : "2016-04-25T12:02:01.789Z"
        },
        "_ingest" : {
          "timestamp" : "2020-10-27T06:30:58.273Z"
        }
      }
    }
  ]
}

The _index value "<myindex-{2016-04-25||/M{yyyy-MM-dd|UTC}}>" is a date-math expression; it rounds the date down to the month and resolves to myindex-2016-04-01.

14. Dot Expander Processor: usually used together with other processors; it expands a field with dots in its name into an object so that subsequent processors can access the nested fields

Example


PUT _ingest/pipeline/dot_pipeline
{
  "description": "dot expand pipeline",
  "processors": [
    {
      "dot_expander": {
        "field": "foo.bar"
      }
    }
  ]
}


PUT script_test/_doc/38?pipeline=dot_pipeline
{
  "foo.bar" : "value2",
  "foo" : {
    "bar" : "value1"
  }
}

GET script_test/_doc/38

Response:
"_source" : {
    "foo" : {
      "bar" : [
        "value1",
        "value2"
      ]
    }
  }

15. Fail Processor: quite simple; when its condition is met as a document passes through the pipeline, the configured error message is returned to the requester

Example


PUT _ingest/pipeline/fail_pipeline
{
  "description": "fail pipeline",
  "processors": [
    {
      "fail": {
        "if": "ctx.tags.contains('production') != true",
        "message": "The production tag is not present, found tags: {{tags}}"
      }
    }
  ]
}
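
The condition can be exercised without indexing anything by inlining the pipeline into _simulate (sample document made up for illustration):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "fail": {
          "if": "ctx.tags.contains('production') != true",
          "message": "The production tag is not present, found tags: {{tags}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "tags": ["dev"]
      }
    }
  ]
}

Because the tags array does not contain production, the condition is true and the simulate response reports the configured fail message instead of a transformed document.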

16. Foreach Processor: processes array fields by applying the same processor to every element of the array

Example


PUT _ingest/pipeline/foreach_pipeline
{
  "description": "foreach pipeline",
  "processors": [
    {
      "foreach": {
        "field": "persons",
        "processor": {
          "remove": {
            "field": "_ingest._value.id"
          }
        }
      }
    }
  ]
}


PUT foreach_test/_doc/2?pipeline=foreach_pipeline
{
  "persons" : [
    {
      "id" : "1",
      "name" : "John Doe"
    },
    {
      "id" : "2",
      "name" : "Jane Doe"
    }
  ]
}


GET foreach_test/_search
Response:

"_source" : {
          "persons" : [
            {
              "name" : "John Doe"
            },
            {
              "name" : "Jane Doe"
            }
          ]
        }

17. Pipeline Processor: executes another pipeline

Example

PUT _ingest/pipeline/pipelineA
{
  "description" : "inner pipeline",
  "processors" : [
    {
      "set" : {
        "field": "inner_pipeline_set",
        "value": "inner"
      }
    }
  ]
}
 

PUT _ingest/pipeline/pipelineB
{
  "description" : "outer pipeline",
  "processors" : [
    {
      "pipeline" : {
        "name": "pipelineA"
      }
    },
    {
      "set" : {
        "field": "outer_pipeline_set",
        "value": "outer"
      }
    }
  ]
}

PUT /myindex/_doc/1?pipeline=pipelineB
{
  "field": "value"
}

The stored document is:
{
  "field": "value",
  "inner_pipeline_set": "inner",
  "outer_pipeline_set": "outer"
}

18. Script Processor: processes documents with an Elasticsearch script; any field a script can access is accessible here

Scripting is covered in detail in its own chapter; it feels like scripts show up throughout the processors.

Example

PUT _ingest/pipeline/my_index
{
    "description": "use index:my_index and type:_doc",
    "processors": [
      {
        "script": {
          "source": """
            ctx._index = 'my_index';
            ctx._type = '_doc';
          """
        }
      }
    ]
}


PUT any_index/_doc/1?pipeline=my_index
{
  "message": "text"
}


19. Sort Processor: sorts the elements stored in an array field in ascending or descending order

Example


PUT _ingest/pipeline/sort_pipeline
{
  "description": "sort pipeline",
  "processors": [
    {
      "sort": {
        "field": "age_arr",
        "order": "desc"
      }
    }
  ]
}

PUT sort_test/_doc/1?pipeline=sort_pipeline
{
  "name":"age to be sort",
  "ages":[56,23,78,45,99],
  "age_arr":[56,23,78,45,99]
}

GET sort_test/_doc/1

Response:

"_source" : {
    "name" : "age to be sort",
    "ages" : [
      56,
      23,
      78,
      45,
      99
    ],
    "age_arr" : [
      99,
      78,
      56,
      45,
      23
    ]
  }

20. Trim Processor: trims whitespace from both ends of a string field

Example

PUT _ingest/pipeline/trim_pipe
{
  "description": "trim field",
  "processors": [
    {
      "trim": {
        "field": "foo"
      }
    }
  ]
}
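
A _simulate run (sample document for illustration):

POST _ingest/pipeline/trim_pipe/_simulate
{
  "docs": [
    {
      "_source": {
        "foo": "  hello world  "
      }
    }
  ]
}

foo is returned as "hello world". Note that trim only removes leading and trailing whitespace, and only works on string (or array-of-string) fields.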
