注:version:elasticsearch-7.11.2
造测试数据:
PUT /t_shirt
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
},
"remark": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
POST /t_shirt/_bulk
{"index":{}}
{"price":"100","color":"red","brand":"lining","sold_date":"2021-01-01","remark":"国产李宁,性价比高"}
{"index":{}}
{"price":"120","color":"green","brand":"lining","sold_date":"2021-01-01","remark":"绿色李宁,青春时尚"}
{"index":{}}
{"price":"150","color":"blue","brand":"lining","sold_date":"2021-01-02","remark":""}
{"index":{}}
{"price":"200","color":"red","brand":"adidas","sold_date":"2 021-01-01","remark":"百年品牌,值得信赖"}
{"index":{}}
{"price":"300","color":"green","brand":"adidas","sold_date":"2021-01-02","remark":""}
{"index":{}}
{"price":"500","color":"blue","brand":"adidas","sold_date":"2021-01-02","remark":""}
{"index":{}}
{"price":"220","color":"red","brand":"nike","sold_date":"2021-01-02","remark":"大品牌"}
{"index":{}}
{"price":"280","color":"green","brand":"nike","sold_date":"2021-01-02","remark":""}
{"index":{}}
{"price":"600","color":"blue","brand":"nike","sold_date":"2021-01-03","remark":""}
{"index":{}}
{"price":"400","color":"write","brand":"nike","sold_date":"2021-01-04","remark":""}
一.bucket和metric概念简介
bucket 就是一个聚合搜索时的数据分组。如:销售部门有员工张三和李四,开发部门有员工王五和赵六。那么根据部门分组聚合得到结果就是两个bucket。销售部门bucket中有张三和李四,开发部门 bucket 中有王五和赵六。
metric 就是对一个bucket数据执行的统计分析。如上述案例中,开发部门有2个员工,销售部门有2个员工,这就是metric。
metric有多种统计,如:求和,最大值,最小值,平均值等。
用一个大家容易理解的SQL语法来解释,如:select count(*) from table group by column。那么group by column分组后的每组数据就是bucket。对每个分组执行的count(*)就是metric。
二.实例
1.根据color分组统计销售数量
#aggs 是 aggregations 的缩写,代表聚集查询
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"_count": "desc"
}
}
}
}
}
group_by_color 是别名,可以自己起
1).aggs 是 aggregations 的缩写,代表聚集查询
2).size:0 可以在查询结果列屏蔽原数据,如果不加这行,在查询结果的"hits"里会显示满足条件的所有数据。
类似于sql:select color,count(1) doc_count from t_shirt group by color order by doc_count desc ;
2.统计不同color的平均价格
本案例先根据color执行聚合分组,在此分组的基础上,对组内数据执行聚合统计,这个组内数据的聚合统计就是metric。同样可以执行排序,因为组内有聚合统计,且对统计数据给予了命名avg_by_price,所以可以根据这个聚合统计数据字段名执行排序逻辑。
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_by_price": "desc"
}
},
"aggs": {
"avg_by_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
avg_by_price 是别名,可以自己起。
类似于sql:
for (select color as c from t_shirt group by color) {
select color,avg(price) as avg_by_price from t_shirt where color = c order by avg_by_price desc;
}
3.根据color和brand分组统计平均价格
先根据color聚合分组,在组内根据brand再次聚合分组,这种操作可以称为下钻分析。
Aggs如果定义比较多,则会感觉语法格式混乱,aggs语法格式,有一个相对固定的结构,简单定义:aggs可以嵌套定义,可以水平定义。
嵌套定义称为下钻分析。水平定义就是平铺多个分组方式。
下钻分析格式:
GET /index_name/type_name/_search
{
"aggs": {
"定义分组名称(最外层)": {
"分组策略如:terms、avg、sum": {
"field": "根据哪一个字段分组",
"其他参数": ""
},
"aggs": {
"分组名称1": {},
"分组名称2": {}
}
}
}
}
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color",
"order": {
"avg_by_price_color": "desc"
}
},
"aggs": {
"avg_by_price_color": {
"avg": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand",
"order": {
"avg_by_price_brand": "desc"
}
},
"aggs": {
"avg_by_price_brand": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
4.统计不同color中的最大和最小价格、总价
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"max_price": {
"max": {
"field": "price"
}
},
"min_price": {
"min": {
"field": "price"
}
},
"sum_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
5.统计不同品牌中价格最高的颜色
在分组后,可能需要对组内的数据进行排序,并选择其中排名高的数据。那么可以使用 sort 来实现:top_hits 中的属性size代表取组内多少条数据(默认为10);sort代表组内使用什么字段什么规则排序(默认使用_doc的asc规则排序);_source代表结果中包含document中的那些字段(默认包含全部字段)
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"top_color": {
"top_hits": {
"size": 1,
"sort": [
{
"price": {
"order": "desc"
}
}
],
"_source": {
"includes": [
"color",
"price"
]
}
}
}
}
}
}
}
6.histogram 区间统计
histogram类似terms,也是进行bucket分组操作的,是根据一个field,实现数据区间分组。
如下实例,histogram_by_price 是别名可以随便起,histogram 代表bucket分组操作,field 指定要聚合分组的字段,interval 代表区间间隔,如果interval: 200,那么统计的间隔就是0~200,201~400,401~600
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"histogram_by_price": {
"histogram": {
"field": "price",
"interval": 200
},
"aggs": {
"avg_by_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
7.date_histogram区间分组
date_histogram 可以对date类型的field执行区间聚合分组,如每月销量,每年销量等。
如:以月为单位,统计不同日期的销售数量及销售总金额。这个时候可以使用date_histogram实现聚合分组,其中field来指定用于聚合分组的字段,interval指定区间范围(可选值有:year、quarter、month、week、day、hour、minute、second),format指定日期格式化,min_doc_count指定每个区间的最少document(如果不指定,默认为0,当区间范围内没有document时,也会显示bucket分组),extended_bounds 指定起始时间和结束时间
GET /t_shirt/_search
{
"size": 0,
"aggs": {
"histogram_by_date": {
"date_histogram": {
"field": "sold_date",
"calendar_interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 1,
"extended_bounds": {
"min": "2021-01-01",
"max": "2022-12-31"
}
},
"aggs": {
"sum_by_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
8.query搭配aggs一起使用
aggs负责聚合查询,与query查询同一层级一起使用。query负责筛选数据,aggs负责聚合。
例如查询nike的平均价格:
GET /t_shirt/_search
{
"query": {
"match": {
"brand": "nike"
}
},
"size": 0,
"aggs": {
"avg_price_nike": {
"avg": {
"field": "price"
}
}
}
}
9.global bucket
global 用于定义一个全局 bucket,这个bucket会忽略query的条件,检索所有document进行对应的聚合统计。
在对比部分数据和总体数据的时候用到。
例如查询nike的平均价格与总体的平均价格对比:
GET /t_shirt/_search
{
"size": 0,
"query": {
"match": {
"brand": "nike"
}
},
"aggs": {
"nike_avg_price": {
"avg": {
"field": "price"
}
},
"all_avg_price": {
"global": {},
"aggs": {
"all_price": {
"avg": {
"field": "price"
}
}
}
}
}
}