02. Elasticsearch bucket aggregation queries


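Elasticsearch's aggregation queries have grown steadily richer; at the time of writing there are four families.

  1. metric aggregation: computes a single statistic such as min, max, avg, sum, or percentile
  2. bucket aggregation: groups documents, much like GROUP BY in SQL
  3. matrix aggregation: computes over the values of several fields and produces a multi-dimensional matrix
  4. pipeline aggregation: post-processes the output of other aggregations to enrich the result

This article focuses on bucket aggregations. A bucket aggregation behaves like a GROUP BY query and, unlike a metric aggregation, it can carry sub-aggregations, that is, aggregations can be nested. A nested sub-aggregation can be either another bucket aggregation or a metric aggregation.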

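1. Overview of bucket aggregation types

Terms Aggregation: the typical GROUP BY. Documents are bucketed by the value of a field; if the field holds an array, the document is counted in several buckets
Range Aggregation: usually applied to a numeric field, splitting documents into buckets by the ranges you specify
Date Histogram Aggregation: buckets documents by time, at a calendar or fixed interval such as month
Date Range Aggregation: buckets documents by date ranges, analogous to the range aggregation
Filter Aggregation: a single filter, equivalent to a filter in a query
Filters Aggregation: several named filters, one bucket per filter
Histogram Aggregation: fixed-width buckets over a numeric field

Missing Aggregation: counts documents in which a field is missing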
Adjacency Matrix Aggregation
Auto-interval Date Histogram Aggregation
Children Aggregation
Composite Aggregation
Diversified Sampler Aggregation
Geo Distance Aggregation
GeoHash grid Aggregation
GeoTile Grid Aggregation
Global Aggregation
IP Range Aggregation
Nested Aggregation
Parent Aggregation
Reverse nested Aggregation
Sampler Aggregation
Significant Terms Aggregation
Significant Text Aggregation

2. Preparing the data

Ticket information for shows
GET seats1028/_search

{
"play" : "Auntie Jo",   # name of the show
"date" : "2018-11-6",  # date
"theatre" : "Skyline",  # venue
"sold" : false,      # whether this ticket has been sold
"actors" : [         # cast
	"Jo Hangum",
	"Jon Hittle",
	"Rob Kettleman",
	"Laura Conrad",
	"Simon Hower",
	"Nora Blue"
        ],
"datetime" : 1541497200000,
"price" : 8321,    # ticket price
"tip" : 17.5,      # discount
"time" : "5:40PM"
}

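The index holds a little over 30,000 documents like this one. The queries below also use the numeric fields row and number, which are not shown in this sample.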

3. Examples

1. Terms Aggregation:

The typical GROUP BY: documents are bucketed by the value of a field. If the field holds an array, the document is counted in several buckets.

1. A plain terms aggregation
GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "term_price":{
      "terms": {
        "field": "price",
        "min_doc_count": 13,
        "size": 50
      }
    }
  }
}

Response
"aggregations" : {
    "term_price" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 35384,
      "buckets" : [
        {
          "key" : 910,
          "doc_count" : 13
        },
        {
          "key" : 3273,
          "doc_count" : 13
        },
        {
          "key" : 3648,
          "doc_count" : 13
        }
      ]
    }
  }
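
To make the bucketing rule concrete, here is a minimal Python sketch of what a terms aggregation computes. It is an illustration only, not how Elasticsearch implements it; the function name terms_agg and the three sample documents are made up for this example, and the tie-breaking by ascending key is an assumption based on the response above.

```python
from collections import Counter

def terms_agg(docs, field, size=10, min_doc_count=1):
    """Count documents per distinct value of `field`, terms-aggregation style."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # a document without the field lands in no bucket
        values = value if isinstance(value, list) else [value]
        # An array-valued field puts the document into one bucket per distinct value.
        for v in set(values):
            counts[v] += 1
    buckets = [(k, c) for k, c in counts.items() if c >= min_doc_count]
    # Assumed ordering: doc_count descending, ties broken by key ascending.
    buckets.sort(key=lambda kc: (-kc[1], kc[0]))
    return [{"key": k, "doc_count": c} for k, c in buckets[:size]]

# Hypothetical documents, loosely modeled on the ticket data.
docs = [
    {"play": "A", "actors": ["Jo Hangum", "Nora Blue"]},
    {"play": "B", "actors": ["Nora Blue"]},
    {"play": "C"},
]
print(terms_agg(docs, "actors"))
```

The first document is counted in two buckets because its actors field holds two values.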

2. Nesting a metric aggregation as a sub-aggregation

Group by row, keep the 3 buckets with the most documents, and compute the maximum price within each bucket.


GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "term_price":{
      "terms": {
        "field": "row",
        "min_doc_count": 13,
        "size": 3,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "max_price": {
          "max": {
            "field": "price"
          }
        }
      }
    }
  }
}

Response

"aggregations" : {
    "term_price" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 13608,
      "buckets" : [
        {
          "key" : 2,
          "doc_count" : 5796,
          "max_price" : {
            "value" : 9998.0
          }
        },
        {
          "key" : 3,
          "doc_count" : 5796,
          "max_price" : {
            "value" : 9999.0
          }
        },
        {
          "key" : 1,
          "doc_count" : 5791,
          "max_price" : {
            "value" : 9999.0
          }
        }
      ]
    }
  }
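
The relationship between a bucket and its metric sub-aggregation can be sketched in a few lines of Python. This is an illustrative model under simplifying assumptions (every document has both fields); terms_with_max and the six sample seats are hypothetical names and data.

```python
from collections import defaultdict

def terms_with_max(docs, group_field, metric_field, size=3):
    """Group docs by `group_field`, then compute max(`metric_field`) inside each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    # Largest buckets first; ties broken by key ascending (an assumption).
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [
        {"key": key, "doc_count": len(prices), "max_price": {"value": float(max(prices))}}
        for key, prices in ordered[:size]
    ]

# Hypothetical seats; only the fields used by the aggregation are present.
seats = [
    {"row": 1, "price": 8321},
    {"row": 1, "price": 910},
    {"row": 2, "price": 9998},
    {"row": 2, "price": 3273},
    {"row": 2, "price": 3648},
    {"row": 3, "price": 500},
]
print(terms_with_max(seats, "row", "price", size=2))
```

The metric is evaluated once per bucket, over only the documents that fell into that bucket.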

3. Nesting a terms aggregation as a sub-aggregation

First bucket by row and keep the 3 rows with the most documents; then, within each bucket, bucket again by number and keep the 3 number values with the most documents.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "term_price":{
      "terms": {
        "field": "row",
        "min_doc_count": 13,
        "size": 3,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "number_term": {
          "terms": {
            "field": "number",
            "size": 3,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
      
    }
  }
}

Response
"aggregations" : {
    "term_price" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 13608,
      "buckets" : [
        {
          "key" : 2,
          "doc_count" : 5796,
          "number_term" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 4368,
            "buckets" : [
              {
                "key" : 1,
                "doc_count" : 476
              },
              {
                "key" : 2,
                "doc_count" : 476
              },
              {
                "key" : 3,
                "doc_count" : 476
              }
            ]
          }
        },
        {
          "key" : 3,
          "doc_count" : 5796,
          "number_term" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 4368,
            "buckets" : [
              {
                "key" : 1,
                "doc_count" : 476
              },
              {
                "key" : 2,
                "doc_count" : 476
              },
              {
                "key" : 3,
                "doc_count" : 476
              }
            ]
          }
        },
        {
          "key" : 1,
          "doc_count" : 5791,
          "number_term" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 4363,
            "buckets" : [
              {
                "key" : 5,
                "doc_count" : 476
              },
              {
                "key" : 6,
                "doc_count" : 476
              },
              {
                "key" : 7,
                "doc_count" : 476
              }
            ]
          }
        }
      ]
    }
  }

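2. Range Aggregation:

Usually applied to a numeric field, splitting documents into buckets by the ranges you specify. Each range includes its from value and excludes its to value.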

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "price_range": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "from": 5000,
            "to": 6000
          }
        ]
      }
    }
  }
}

Response
"aggregations" : {
    "price_range" : {
      "buckets" : [
        {
          "key" : "5000.0-6000.0",
          "from" : 5000.0,
          "to" : 6000.0,
          "doc_count" : 3646
        }
      ]
    }
  }
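
The half-open interval rule (from inclusive, to exclusive) is easy to get wrong at the boundaries, so here is a small Python sketch of it. The function range_agg and the four sample prices are hypothetical; the key format simply mimics the response above.

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        count = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        key = f"{'*' if lo is None else float(lo)}-{'*' if hi is None else float(hi)}"
        buckets.append({"key": key, "doc_count": count})
    return buckets

# Hypothetical prices chosen to sit exactly on the boundaries.
prices = [4999, 5000, 5999, 6000]
print(range_agg(prices, [{"from": 5000, "to": 6000}]))
```

A price of exactly 5000 is counted, while a price of exactly 6000 is not.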

3. Date Histogram Aggregation:

Buckets documents by time, here at calendar-month intervals.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "price_date_histogram": {
      "date_histogram": {
        "field": "datetime",
        "calendar_interval": "month"
      }
    }
  }
}

Response
  "aggregations" : {
    "price_date_histogram" : {
      "buckets" : [
        {
          "key_as_string" : "2018-03-01T00:00:00.000Z",
          "key" : 1519862400000,
          "doc_count" : 2310
        },
        {
          "key_as_string" : "2018-04-01T00:00:00.000Z",
          "key" : 1522540800000,
          "doc_count" : 3946
        },
        {
          "key_as_string" : "2018-05-01T00:00:00.000Z",
          "key" : 1525132800000,
          "doc_count" : 3948
        },
        {
          "key_as_string" : "2018-06-01T00:00:00.000Z",
          "key" : 1527811200000,
          "doc_count" : 3948
        },
        {
          "key_as_string" : "2018-07-01T00:00:00.000Z",
          "key" : 1530403200000,
          "doc_count" : 3948
        }
      ]
    }
  }
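
Each bucket key in the response is the epoch-millisecond timestamp of the first instant of a month. The Python sketch below reproduces that rounding, assuming UTC (no time_zone parameter is set in the request). The helper names are hypothetical, and unlike Elasticsearch this sketch does not emit empty buckets for months without documents.

```python
from collections import Counter
from datetime import datetime, timezone

def month_bucket_key(epoch_millis):
    """Return the epoch-millis key of the UTC calendar month containing the timestamp."""
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    month_start = datetime(dt.year, dt.month, 1, tzinfo=timezone.utc)
    return int(month_start.timestamp()) * 1000

def date_histogram_by_month(timestamps):
    """Count timestamps per calendar month, ordered by bucket key."""
    counts = Counter(month_bucket_key(ts) for ts in timestamps)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

# 1541497200000 is the datetime value of the sample document (2018-11-06 UTC).
print(month_bucket_key(1541497200000))
```

The sample document's datetime rounds down to 1541030400000, which is 2018-11-01T00:00:00Z.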

4. Date Range Aggregation

Buckets documents by date ranges, analogous to the range aggregation.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "price_date_histogram": {
      "date_range": {
        "field": "datetime",
        "ranges": [
          {
            "from": "2018-10-01T00:00:00.000Z",
            "to": "2018-11-01T00:00:00.000Z"
          }
        ]
      }
    }
  }
}

Response

"aggregations" : {
    "price_date_histogram" : {
      "buckets" : [
        {
          "key" : "2018-10-01T00:00:00.000Z-2018-11-01T00:00:00.000Z",
          "from" : 1.538352E12,
          "from_as_string" : "2018-10-01T00:00:00.000Z",
          "to" : 1.5410304E12,
          "to_as_string" : "2018-11-01T00:00:00.000Z",
          "doc_count" : 3948
        }
      ]
    }
  }


5. Filter Aggregation

A single filter, equivalent to a filter in a query.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "sold_filter": {
      "filter": {
        "range": {
          "tip": {
            "gte": 10,
            "lte": 20
          }
        }
      },
      "aggs": {
        "max_price": {
          "max": {
            "field": "price"
          }
        }
      }
    }
  }
}

Response
"aggregations" : {
    "sold_filter" : {
      "doc_count" : 6300, # document count after the filter is applied
      "max_price" : {
        "value" : 9996.0
      }
    }
  }

6. Filters Aggregation

Applies several filters; each filter's matching documents form a bucket, and the sub-aggregation runs on every bucket.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "sold_filter": {
      "filters": {
        "filters": {    # the syntax looks odd: the outer "filters" is the aggregation type, the inner one holds the named filters
          "tip_filter": {
            "range": {
              "tip": {
                "gte": 10,
                "lte": 20
              }
            }
          },
          "number_filter": {
            "range": {
              "number": {
                "gte": 5,
                "lte":10
              }
            }
          }
        }
      },
      "aggs": {
        "max_price": {
          "max": {
            "field": "price"
          }
        }
      }
    }
  }
}
Response

"aggregations" : {
    "sold_filter" : {
      "buckets" : {
        "number_filter" : {
          "doc_count" : 16072,
          "max_price" : {
            "value" : 9999.0
          }
        },
        "tip_filter" : {  
          "doc_count" : 6300,
          "max_price" : {
            "value" : 9996.0
          }
        }
      }
    }
  }

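As the response shows, each named filter produces its own bucket, and the sub-aggregation is computed separately for each one.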
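
A point worth spelling out is that the buckets of a filters aggregation are independent: one document can land in several of them. The following Python sketch models this with predicates; filters_agg, the lambda filters, and the three sample seats are hypothetical stand-ins for the range filters in the request.

```python
def filters_agg(docs, named_filters, metric_field):
    """One bucket per named filter; a max sub-aggregation is computed per bucket."""
    result = {}
    for name, predicate in named_filters.items():
        matched = [d for d in docs if predicate(d)]
        result[name] = {
            "doc_count": len(matched),
            "max_price": {
                "value": max((float(d[metric_field]) for d in matched), default=None)
            },
        }
    return result

# Hypothetical seats; the second one satisfies both filters.
seats = [
    {"tip": 17.5, "number": 3, "price": 8321},
    {"tip": 12.0, "number": 7, "price": 910},
    {"tip": 25.0, "number": 6, "price": 9999},
]
named = {
    "tip_filter": lambda d: 10 <= d["tip"] <= 20,        # gte 10, lte 20
    "number_filter": lambda d: 5 <= d["number"] <= 10,   # gte 5, lte 10
}
print(filters_agg(seats, named, "price"))
```

Because buckets overlap, the doc_count values of a filters aggregation do not have to add up to the total number of documents.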

7. Histogram Aggregation

Histogram bucketing. The field is normally numeric, which makes fixed-width grouping straightforward.

GET seats1028/_search
{
  "size": 0,
  "aggs": {
    "tip_histogram":{
      "histogram": {
        "field": "tip",
        "interval": 4
      }
    }
  }
}

Response

"aggregations" : {
    "tip_histogram" : {
      "buckets" : [
        {
          "key" : 16.0,
          "doc_count" : 4200
        },
        {
          "key" : 20.0,
          "doc_count" : 8400
        },
        {
          "key" : 24.0,
          "doc_count" : 17808
        },
        {
          "key" : 28.0,
          "doc_count" : 5794
        }
      ]
    }
  }
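
The bucket key of a histogram aggregation is the value rounded down to a multiple of the interval. A minimal Python sketch of that rounding follows; the function names are hypothetical, and the offset parameter is included only to show where it enters the formula.

```python
import math
from collections import Counter

def histogram_key(value, interval, offset=0.0):
    """Round `value` down to the start of its fixed-width bucket."""
    return math.floor((value - offset) / interval) * interval + offset

def histogram_agg(values, interval):
    """Count values per bucket, ordered by bucket key."""
    counts = Counter(histogram_key(v, interval) for v in values)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)]

# 17.5 is the tip of the sample document; with interval 4 it falls in bucket 16.0.
print(histogram_key(17.5, 4))
print(histogram_agg([17.5, 20, 23.9, 28], 4))
```

With interval 4, the bucket with key 16.0 covers values from 16 (inclusive) up to 20 (exclusive).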

8. Missing Aggregation: counts documents in which a field is missing

GET seats1028/_search
{
  "size":0,
  "aggs": {
    "miss_f": {
      "missing": {
        "field": "row"
      }
    }
  }
}

Response
"aggregations" : {
    "miss_f" : {
      "doc_count" : 1
    }
  }
  

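9. Nested aggregation: aggregates over nested documents, usually with a sub-aggregation that computes the actual statistic

Sample data: each class holds a list of students, and every student has age and name properties.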

GET nest_test/_mapping
Response
{
    "mappings" : {
      "properties" : {
        "c_name" : {
          "type" : "text"
        },
        "class" : {
          "type" : "nested",
          "properties" : {
            "students" : {
              "type" : "nested",
              "properties" : {
                "age" : {
                  "type" : "integer"
                },
                "name" : {
                  "type" : "text"
                }
              }
            }
          }
        }
      }
    }
  }


There are two documents:
"_source" : {
          "c_name" : "start_class",
          "class" : {
            "students" : [
              {
                "name" : "jack chen",
                "age" : 30
              },
              {
                "name" : "jack man",
                "age" : 20
              },
              {
                "name" : "pony wang",
                "age" : 60
              },
              {
                "name" : "gebi wang",
                "age" : 90
              }
            ]
          }
        }

"_source" : {
          "c_name" : "sun_class",
          "class" : {
            "students" : [
              {
                "name" : "lucy chen",
                "age" : 30
              },
              {
                "name" : "lucy man",
                "age" : 20
              },
              {
                "name" : "dong wang",
                "age" : 60
              },
              {
                "name" : "chess wang",
                "age" : 90
              }
            ]
          }
        }

The query:


GET nest_test/_search
{
  "size": 0,
  "aggs": {
    "nested_agg": {
      "nested": {
        "path": "class.students"
      },
      "aggs": {
        "min_age": {
          "min": {
            "field": "class.students.age"
          }
        }
      }
    }
  }
}

Response
 "aggregations" : {
    "nested_agg" : {
      "doc_count" : 8,
      "min_age" : {
        "value" : 20.0
      }
    }
  }
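
Note that doc_count is 8, not 2: the nested aggregation counts the nested student objects, not the two top-level class documents. The Python sketch below reproduces the numbers from the two documents above; nested_min_age is a hypothetical helper name.

```python
# The two class documents from above, reduced to the fields the aggregation uses.
classes = [
    {"c_name": "start_class", "class": {"students": [
        {"name": "jack chen", "age": 30}, {"name": "jack man", "age": 20},
        {"name": "pony wang", "age": 60}, {"name": "gebi wang", "age": 90}]}},
    {"c_name": "sun_class", "class": {"students": [
        {"name": "lucy chen", "age": 30}, {"name": "lucy man", "age": 20},
        {"name": "dong wang", "age": 60}, {"name": "chess wang", "age": 90}]}},
]

def nested_min_age(docs):
    """Flatten the nested students, then aggregate over them as separate documents."""
    students = [s for d in docs for s in d["class"]["students"]]
    return {
        "doc_count": len(students),
        "min_age": {"value": float(min(s["age"] for s in students))},
    }

print(nested_min_age(classes))
```

Flattening first and aggregating second is the mental model: every nested object behaves like its own hidden document.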

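10. Children aggregation: queries data modeled with the join field type

Data preparation: each classroom (class_room) offers several subjects (subject), and each student (student) is indexed as a child of a class_room, so class_room and student form a parent/child relation. With a join field, every child document points to exactly one parent.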


PUT join_class
{
  "mappings": {
    "properties": {
      "subject":{
        "type": "keyword"
      },
      "class_student":{
        "type": "join",
        "relations":{
          "class_room":"student"
        }
      }
    }
  }
}

PUT join_class/_doc/1
{
  "subject":["english","Chinese","Russia"],
  "class_student":{
    "name":"class_room"
  },
  "des":"this class room teach english, Chinese, Russia"
}

PUT join_class/_doc/2?routing=1
{
  "class_student":{
    "name":"student",
    "parent":1
  },
  "name":"jack"
}


PUT join_class/_doc/3?routing=1
{
  "class_student":{
    "name":"student",
    "parent":1
  },
  "name":"pony"
}

The following query finds which students belong to each subject.


GET join_class/_search
{
  "size":0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "subject_term": {
      "terms": {
        "field": "subject",
        "size": 10
      },
      "aggs": {
        "subject_student": {
          "children": {
            "type": "student"
          },
          "aggs": {
            "term_name": {
              "terms": {
                "field": "name.keyword",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

Response

 "aggregations" : {
    "subject_term" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Chinese",
          "doc_count" : 1,
          "subject_student" : {
            "doc_count" : 2,
            "term_name" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "jack",
                  "doc_count" : 1
                },
                {
                  "key" : "pony",
                  "doc_count" : 1
                }
              ]
            }
          }
        },
        {
          "key" : "Russia",
          "doc_count" : 1,
          "subject_student" : {
            "doc_count" : 2,
            "term_name" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "jack",
                  "doc_count" : 1
                },
                {
                  "key" : "pony",
                  "doc_count" : 1
                }
              ]
            }
          }
        },
        {
          "key" : "english",
          "doc_count" : 1,
          "subject_student" : {
            "doc_count" : 2,
            "term_name" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "jack",
                  "doc_count" : 1
                },
                {
                  "key" : "pony",
                  "doc_count" : 1
                }
              ]
            }
          }
        }
      ]
    }
  }
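
The response can look surprising at first: every subject lists both students, because both are children of the single class_room document that carries all three subjects. The Python sketch below models that join in memory; the dictionary layout and the function students_per_subject are hypothetical simplifications of the join field.

```python
from collections import Counter

# In-memory stand-in for the three indexed documents (ids are the keys).
docs = {
    "1": {"relation": "class_room", "subject": ["english", "Chinese", "Russia"]},
    "2": {"relation": "student", "parent": "1", "name": "jack"},
    "3": {"relation": "student", "parent": "1", "name": "pony"},
}

def students_per_subject(docs):
    """For each subject bucket, hop from the parent docs to their children and count names."""
    result = {}
    for doc_id, doc in docs.items():
        if doc["relation"] != "class_room":
            continue
        children = [
            c for c in docs.values()
            if c["relation"] == "student" and c.get("parent") == doc_id
        ]
        # The parent is in one bucket per subject, so its children appear under each.
        for subject in doc["subject"]:
            result.setdefault(subject, Counter()).update(c["name"] for c in children)
    return result

for subject, names in sorted(students_per_subject(docs).items()):
    print(subject, dict(names))
```

The parent aggregation in the next section performs the same hop in the opposite direction, from child buckets to their parents.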

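11. Parent aggregation: queries data modeled with the join field type

Continuing with the same data, the following request finds the subjects each student has chosen.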


GET join_class/_search
{
  "size":0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "student_term": {
      "terms": {
        "field": "name.keyword",
        "size": 10
      },
      "aggs": {
        "subject_student": {
          "parent": {
            "type": "student"
          },
          "aggs": {
            "choose_subject": {
              "terms": {
                "field": "subject",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}

Response

 "aggregations" : {
    "student_term" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "jack",
          "doc_count" : 1,
          "subject_student" : {
            "doc_count" : 1,
            "choose_subject" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "Chinese",
                  "doc_count" : 1
                },
                {
                  "key" : "Russia",
                  "doc_count" : 1
                },
                {
                  "key" : "english",
                  "doc_count" : 1
                }
              ]
            }
          }
        },
        {
          "key" : "pony",
          "doc_count" : 1,
          "subject_student" : {
            "doc_count" : 1,
            "choose_subject" : {
              "doc_count_error_upper_bound" : 0,
              "sum_other_doc_count" : 0,
              "buckets" : [
                {
                  "key" : "Chinese",
                  "doc_count" : 1
                },
                {
                  "key" : "Russia",
                  "doc_count" : 1
                },
                {
                  "key" : "english",
                  "doc_count" : 1
                }
              ]
            }
          }
        }
      ]
    }
  }

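12. Composite Aggregation: combines several sources such as terms. It resembles multi-level nested terms aggregations, but the result is flat rather than nested, much like GROUP BY over several columns in MySQL.

Initializing the data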


PUT composite_test
{
  "mappings": {
    "properties": {
      "area": {
        "type": "keyword"
      },
      "userid": {
        "type": "keyword"
      },
      "sendtime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
POST composite_test/_bulk
{ "index" : {} }
{"area":"33","userid":"400015","sendtime":"2019-01-17 00:00:00"}
{ "index" : {} }
{"area":"33","userid":"400015","sendtime":"2019-01-17 00:00:00"}
{ "index" : {} }
{"area":"35","userid":"400016","sendtime":"2019-01-18 00:00:00"}
{ "index" : {} }
{"area":"35","userid":"400016","sendtime":"2019-01-18 00:00:00"}
{ "index" : {} }
{"area":"33","userid":"400017","sendtime":"2019-01-17 00:00:00"}


The following query groups by the three fields area, userid, and sendtime.


GET composite_test/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          {
            "area": {
              "terms": {
                "field": "area"
              }
            }
          },
          {
            "userid": {
              "terms": {
                "field": "userid"
              }
            }
          },
          {
            "sendtime": {
              "date_histogram": {
                "field": "sendtime",
                "fixed_interval": "1d",
                "format": "yyyy-MM-dd"
              }
            }
          }
        ]
      }
    }
  }
}

Response

"aggregations" : {
    "my_buckets" : {
      "after_key" : {
        "area" : "35",
        "userid" : "400016",
        "sendtime" : "2019-01-18"
      },
      "buckets" : [
        {
          "key" : {
            "area" : "33",
            "userid" : "400015",
            "sendtime" : "2019-01-17"
          },
          "doc_count" : 2
        },
        {
          "key" : {
            "area" : "33",
            "userid" : "400017",
            "sendtime" : "2019-01-17"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "area" : "35",
            "userid" : "400016",
            "sendtime" : "2019-01-18"
          },
          "doc_count" : 2
        }
      ]
    }
  }
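
The composite aggregation can be pictured as counting distinct key tuples and returning them in sorted pages, with after_key marking where the next page starts. Below is a minimal Python sketch of that idea using the five documents above. composite_page is a hypothetical helper, truncating sendtime to its date stands in for the one-day date_histogram source, and plain string ordering of the tuples is a simplifying assumption.

```python
from collections import Counter

# The five bulk-indexed documents as (area, userid, sendtime) tuples.
rows = [
    ("33", "400015", "2019-01-17 00:00:00"),
    ("33", "400015", "2019-01-17 00:00:00"),
    ("35", "400016", "2019-01-18 00:00:00"),
    ("35", "400016", "2019-01-18 00:00:00"),
    ("33", "400017", "2019-01-17 00:00:00"),
]

def composite_page(rows, size=10, after=None):
    """Return one page of flat (area, userid, day) buckets plus the key to resume from."""
    counts = Counter((area, userid, sendtime[:10]) for area, userid, sendtime in rows)
    keys = sorted(k for k in counts if after is None or k > after)
    page = keys[:size]
    buckets = [
        {"key": {"area": a, "userid": u, "sendtime": d}, "doc_count": counts[(a, u, d)]}
        for a, u, d in page
    ]
    after_key = page[-1] if page else None
    return buckets, after_key

first_page, after_key = composite_page(rows, size=2)
second_page, _ = composite_page(rows, size=2, after=after_key)
print(first_page, second_page)
```

Passing the returned after_key back in retrieves the next page, which is how large result sets are usually walked with this aggregation.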

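13. Adjacency Matrix Aggregation

The composite aggregation above combines every value of several sources. The adjacency matrix aggregation is more limited: it builds the matrix only from the named filters you specify.
Using the same data, the following query returns the count of documents with area=33, the count with userid=400015, and the count with both area=33 and userid=400015.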


GET composite_test/_search
{
  "size": 0,
  "aggs": {
    "composite_two": {
      "adjacency_matrix": {
        "filters": {
          "area_filter":{
            "terms":{
              "area":["33"]
            }
          },
          "user_id_filter":{
            "terms":{
              "userid":["400015"]
            }
          }
        }
      }
    }
  }
}

Response

"aggregations" : {
    "composite_two" : {
      "buckets" : [
        {
          "key" : "area_filter",
          "doc_count" : 3
        },
        {
          "key" : "area_filter&user_id_filter",
          "doc_count" : 2
        },
        {
          "key" : "user_id_filter",
          "doc_count" : 2
        }
      ]
    }
  }
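
In other words, the aggregation produces one bucket per filter plus one bucket per pair of filters that share documents. The Python sketch below reproduces the three counts from the five composite_test documents; adjacency_matrix and the lambda filters are hypothetical, and omitting empty buckets is an assumption that matches the response above.

```python
from itertools import combinations

# The five composite_test documents, reduced to the two filtered fields.
docs = [
    {"area": "33", "userid": "400015"},
    {"area": "33", "userid": "400015"},
    {"area": "35", "userid": "400016"},
    {"area": "35", "userid": "400016"},
    {"area": "33", "userid": "400017"},
]

def adjacency_matrix(docs, named_filters, separator="&"):
    """Count docs per filter and per pairwise intersection; empty buckets are omitted."""
    matches = {
        name: {i for i, d in enumerate(docs) if predicate(d)}
        for name, predicate in named_filters.items()
    }
    buckets = {name: len(ids) for name, ids in matches.items() if ids}
    for a, b in combinations(sorted(matches), 2):
        both = matches[a] & matches[b]
        if both:
            buckets[f"{a}{separator}{b}"] = len(both)
    return dict(sorted(buckets.items()))

named = {
    "area_filter": lambda d: d["area"] in ["33"],
    "user_id_filter": lambda d: d["userid"] in ["400015"],
}
print(adjacency_matrix(docs, named))
```

With N filters there are up to N*(N-1)/2 pairwise buckets, so the number of filters is best kept small.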

14. Global aggregation: aggregates over all documents

It ignores the filtering done by the query and runs its sub-aggregations over every document in the index.

GET seats1028/_search
{
  "size": 0, 
  "query": {
    "term": {
      "row": {
        "value": 5
      }
    }
  },
  "aggs": {
    "global_row": {
      "global": {},
      "aggs": {
        "avg_row": {
          "avg": {
            "field": "row"
          }
        }
      }
    },
    "avg_row02":{
      "avg": {
        "field": "row"
      }
    }
  }
}

Response

"aggregations" : {
    "global_row" : {
      "doc_count" : 30992,
      "avg_row" : {
        "value" : 4.333871123874673   # computed over all documents
      }
    },
    "avg_row02" : {
      "value" : 5.0  # computed over the documents that match the query
    }
  }

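15. Significant Terms Aggregation: finds statistically significant terms automatically

It looks for terms in a keyword field that are significant for the current result set, that is, terms that occur noticeably more often in the matching documents than in the index as a whole.
An example explains it best. Take web news articles, where each article has an author, a title, a topic, and so on.
The data is set up as follows.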

PUT news
{
  "mappings": {
    "properties": {
      "published": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "author": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "topic": {
        "type": "keyword"
      },
      "views": {
        "type": "integer"
      }
    }
  }
}


POST news/_bulk
{"index":{"_index":"news"}}
{"author":"John Michael","published":"2018-07-08","title":"Tesla is flirting with its lowest close in over 1 1/2 years (TSLA)","topic":"automobile","views":"431"}
{"index":{"_index":"news"}}
{"author":"John Michael","published":"2018-07-22","title":"Tesla to end up like Lehman Brothers (TSLA)","topic":"automobile","views":"1921"}
{"index":{"_index":"news"}}
{"author":"John Michael","published":"2018-07-29","title":"Tesla (TSLA) official says that they are going to release a new self-driving car model in the coming year","topic":"automobile","views":"1849"}
{"index":{"_index":"news"}}
{"author":"John Michael","published":"2018-08-14","title":"Five ways Tesla uses AI and Big Data","topic":"ai","views":"871"}
{"index":{"_index":"news"}}
{"author":"John Michael","published":"2018-08-14","title":"Toyota partners with Tesla (TSLA) to improve the security of self-driving cars","topic":"automobile","views":"871"}
{"index":{"_index":"news"}}
{"author":"Robert Cann","published":"2018-08-25","title":"Is AI dangerous for humanity","topic":"ai","views":"981"}
{"index":{"_index":"news"}}
{"author":"Robert Cann","published":"2018-09-13","title":"Is AI dangerous for humanity","topic":"ai","views":"871"}
{"index":{"_index":"news"}}
{"author":"Robert Cann","published":"2018-09-27","title":"Introduction to Generative Adversarial Networks (GANs) in self-driving cars","topic":"automobile","views":"1183"}
{"index":{"_index":"news"}}
{"author":"Robert Cann","published":"2018-10-09","title":"Introduction to Natural Language Processing","topic":"ai","views":"786"}
{"index":{"_index":"news"}}
{"author":"Robert Cann","published":"2018-10-15","title":"New Distant Objects Found in the Fight for Planet X ","topic":"astronomy","views":"542"}


Find the topic each author focuses on most, that is, the topic on which the author has published disproportionately often.

GET news/_search
{
  "size": 0,
  "aggregations": {
    "authors": {
      "terms": {
        "field": "author"
      },
      "aggregations": {
        "significant_topic_types": {
          "significant_terms": {
            "field": "topic"
          }
        }
      }
    }
  }
}

Response

  "aggregations" : {
    "authors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "John Michael",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,
            "bg_count" : 10,
            "buckets" : [
              {
                "key" : "automobile",
                "doc_count" : 4,
                "score" : 0.4800000000000001,
                "bg_count" : 5
              }
            ]
          }
        },
        {
          "key" : "Robert Cann",
          "doc_count" : 5,
          "significant_topic_types" : {
            "doc_count" : 5,  # Robert Cann has 5 documents in total
            "bg_count" : 10,  # the index holds 10 documents in total
            "buckets" : [
              {
                "key" : "ai",
                "doc_count" : 3,  # 3 of Robert Cann's documents have topic ai
                "score" : 0.2999999999999999,
                "bg_count" : 4   # 4 documents in the whole index have topic ai
              }
            ]
          }
        }
      ]
    }
  }

The result shows that John Michael's most distinctive topic is automobile, while Robert Cann's is ai. See the comments in the response for the meaning of bg_count.
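
The score values in the response are consistent with the JLH heuristic, which is the default scoring of significant_terms as far as I know: the absolute change in frequency multiplied by the relative change. The Python sketch below recomputes the two scores from doc_count and bg_count; jlh_score is a hypothetical helper written for this illustration, not an Elasticsearch API.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """JLH-style score: (fg - bg) * (fg / bg), where fg and bg are term frequencies."""
    fg = subset_freq / subset_size      # share of the term in the foreground set
    bg = superset_freq / superset_size  # share of the term in the whole index
    if fg <= bg:
        return 0.0  # assumption: a term that is not over-represented scores zero
    return (fg - bg) * (fg / bg)

# John Michael: 4 of his 5 articles are "automobile"; 5 of all 10 articles are.
print(jlh_score(4, 5, 5, 10))
# Robert Cann: 3 of his 5 articles are "ai"; 4 of all 10 articles are.
print(jlh_score(3, 5, 4, 10))
```

The two results agree with the scores 0.48 and 0.3 in the response up to floating-point rounding.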

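16. Significant Text Aggregation: finds statistically significant terms automatically

Similar to the Significant Terms Aggregation above, except that it works on text fields and analyzes (tokenizes) their content.
Run the following query against the same data.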


GET news/_search
{
  "query": {
    "match": {
      "title": " AI "
    }
  },
  "size": 0,
  "aggs": {
    "significant_title": {
      "significant_text": {
        "field": "title"
      }
    }
  }
}



Response

"aggregations" : {
    "significant_title" : {
      "doc_count" : 3,
      "bg_count" : 10,
      "buckets" : [
        {
          "key" : "ai",
          "doc_count" : 3,
          "score" : 2.3333333333333335,
          "bg_count" : 3
        }
      ]
    }
  }

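17. Sampler Aggregation: aggregating over a sample

This is typically used together with significant_terms. When an index is very large, the aggregation can become slow; the sampler restricts the sub-aggregations to a sample of the top-scoring, and therefore most relevant, documents.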

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}

The shard_size parameter is the number of documents sampled on each shard; the default is 100.
Response

{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 200,
            "keywords": {
                "doc_count": 200,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "elasticsearch",
                        "doc_count": 150,
                        "score": 1.078125,
                        "bg_count": 200
                    },
                    {
                        "key": "logstash",
                        "doc_count": 50,
                        "score": 0.5625,
                        "bg_count": 50
                    }
                ]
            }
        }
    }
}

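18. Reverse nested Aggregation: aggregating on the parent document from inside a nested aggregation

The Reverse nested Aggregation lets an aggregation that is a sub-aggregation of a Nested Aggregation step back out of the nested type and aggregate on fields of the root document.
An example: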

PUT /issues
{
    "mappings": {
         "properties" : {
             "tags" : { "type" : "keyword" },
             "comments" : { 
                 "type" : "nested",
                 "properties" : {
                     "username" : { "type" : "keyword" },
                     "comment" : { "type" : "text" }
                 }
             }
         }
    }
}


PUT issues/_doc/1
{
  "tags": [
    "bug",
    "improve"
  ],
  "comments": [
    {
      "username": "jack",
      "comment": " this is a bug"
    },
    {
      "username": "pony",
      "comment": " this is a improve"
    }
  ]
}


PUT issues/_doc/2
{
  "tags": [
    "advice",
    "improve"
  ],
  "comments": [
    {
      "username": "jack",
      "comment": " this is a good job "
    },
    {
      "username": "nacy",
      "comment": " this is a improvement"
    }
  ]
}




Query

GET /issues/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "top_usernames": {
          "terms": {
            "field": "comments.username"
          },
          "aggs": {
            "comment_to_issue": {
              "reverse_nested": {},
              "aggs": {
                "top_tags_per_comment": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Response

"aggregations" : {
    "comments" : {
      "doc_count" : 4,
      "top_usernames" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "jack",
            "doc_count" : 2,
            "comment_to_issue" : {
              "doc_count" : 2,
              "top_tags_per_comment" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "improve",
                    "doc_count" : 2
                  },
                  {
                    "key" : "advice",
                    "doc_count" : 1
                  },
                  {
                    "key" : "bug",
                    "doc_count" : 1
                  }
                ]
              }
            }
          },
          {
            "key" : "nacy",
            "doc_count" : 1,
            "comment_to_issue" : {
              "doc_count" : 1,
              "top_tags_per_comment" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "advice",
                    "doc_count" : 1
                  },
                  {
                    "key" : "improve",
                    "doc_count" : 1
                  }
                ]
              }
            }
          },
          {
            "key" : "pony",
            "doc_count" : 1,
            "comment_to_issue" : {
              "doc_count" : 1,
              "top_tags_per_comment" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [
                  {
                    "key" : "bug",
                    "doc_count" : 1
                  },
                  {
                    "key" : "improve",
                    "doc_count" : 1
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }

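Under a Nested Aggregation, the sub-aggregations of a Reverse nested Aggregation operate on the root documents of the matching nested documents.
Given that purpose, it is an aggregation designed specifically to be a sub-aggregation of a Nested Aggregation; using it at the top level, or under a non-nested aggregation, is meaningless.
By default the Reverse nested Aggregation joins back to the root document; with multiple levels of nesting, the path parameter selects which ancestor level to join back to.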
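
The two hops in the query (root to nested comments, then back to the root) can be modeled in plain Python. The sketch below uses the two issue documents from above; tags_per_commenter is a hypothetical helper, and the output is a simplified shape of the response, not the real one.

```python
from collections import Counter, defaultdict

# The two issue documents from above (comment text omitted).
issues = [
    {"tags": ["bug", "improve"],
     "comments": [{"username": "jack"}, {"username": "pony"}]},
    {"tags": ["advice", "improve"],
     "comments": [{"username": "jack"}, {"username": "nacy"}]},
]

def tags_per_commenter(issues):
    """Bucket nested comments by username, then hop back to the root issues for tags."""
    roots_by_user = defaultdict(set)
    for idx, issue in enumerate(issues):
        for comment in issue["comments"]:        # nested: one entry per comment
            roots_by_user[comment["username"]].add(idx)
    result = {}
    for user, roots in roots_by_user.items():    # reverse_nested: back to root docs
        tag_counts = Counter()
        for idx in roots:
            tag_counts.update(set(issues[idx]["tags"]))
        result[user] = {"doc_count": len(roots), "tags": dict(tag_counts)}
    return result

print(tags_per_commenter(issues))
```

Each root issue is counted once per user even if that user commented on it several times, which is why the sketch collects root indices in a set.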
