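Finally, we reach the last business requirement: letting managers run analytics on the employee directory. Elasticsearch has a feature called aggregations, which lets us generate fine-grained analytics over our data. Aggregations are similar to GROUP BY in SQL, but more powerful.
Basic aggregation
For example, let's find the most popular interests among employees: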
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}
Ignore the syntax for now and look at the result:
{
  ...
  "hits": { ... },
  "aggregations": {
    "all_interests": {
      "buckets": [
        {
          "key": "music",
          "doc_count": 2
        },
        {
          "key": "forestry",
          "doc_count": 1
        },
        {
          "key": "sports",
          "doc_count": 1
        }
      ]
    }
  }
}
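As you can see, two employees are interested in music, one in forestry, and one in sports. These aggregations are not precomputed; they are generated on the fly from the documents that match the current query. If we want to know the most popular interests among employees named Smith, we can simply combine the aggregation with an appropriate query, as shown later in this article.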
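To make the bucket counting concrete, here is a minimal in-memory sketch in plain Java of what a terms aggregation computes. It is only an illustration and not how Elasticsearch implements it; the class `TermsAggSketch`, the `Employee` type, and the helper names are hypothetical, and the three sample employees are taken from the example data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// In-memory illustration only; names in this class are hypothetical.
class TermsAggSketch {

    // Minimal employee document holding only the field we aggregate on.
    static final class Employee {
        final List<String> interests;

        Employee(String... interests) {
            this.interests = Arrays.asList(interests);
        }
    }

    // The three employees from the example: John Smith, Jane Smith, Douglas Fir.
    static List<Employee> sampleEmployees() {
        return Arrays.asList(
                new Employee("sports", "music"),
                new Employee("music"),
                new Employee("forestry"));
    }

    // One bucket per distinct interest; the value plays the role of doc_count.
    static Map<String, Long> countInterests(List<Employee> docs) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        for (Employee doc : docs) {
            // A document counts once per term, even if the term is repeated.
            for (String interest : new LinkedHashSet<>(doc.interests)) {
                buckets.merge(interest, 1L, Long::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        countInterests(sampleEmployees())
                .forEach((key, docCount) -> System.out.println(key + "=" + docCount));
    }
}
```

The sketch keeps buckets in first-seen order, whereas Elasticsearch returns terms buckets sorted by doc_count in descending order by default.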
Client program demo
Add a method:
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

/**
 * Finds the most popular interests among employees (search using aggregations).
 * @param client the Elasticsearch client
 */
private static void findInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
            );
    SearchResponse response = request.get();
    // Visit each top-level aggregation once and print its buckets.
    for (Aggregation agg : response.getAggregations()) {
        System.out.println("agg name=" + agg.getName());
        // A terms aggregation on a string field is returned as a Terms result.
        Terms terms = (Terms) agg;
        System.out.println("DocCountError=" + terms.getDocCountError());
        System.out.println("SumOfOtherDocCounts=" + terms.getSumOfOtherDocCounts());
        for (Terms.Bucket bucket : terms.getBuckets()) {
            System.out.println(bucket.getKeyAsString() + "=" + bucket.getDocCount());
        }
    }
}
Add the call to the main method:
// 8. Find the most popular interests among employees (search using aggregations)
findInterestHobby(client);
Running it fails with an error:
Caused by: RemoteTransportException[[111][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: IllegalArgumentException[Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.];
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default.
...
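fielddata
Let's take a look at fielddata:
Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, require a different access pattern from search.
Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".
Most fields can use index-time, on-disk doc_values for this access pattern, but text fields do not support doc_values.
Instead, text fields use an in-memory data structure called fielddata. It is built by reading the entire inverted index of each segment, inverting the term-to-document relationship, and storing the result in memory.
Because this consumes a lot of memory, fielddata is disabled by default; do not enable it unless you really need it.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html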
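The idea of "uninverting" can be sketched in a few lines of plain Java. This is a conceptual illustration under simplifying assumptions, not Elasticsearch's actual data structure: the class `UninvertSketch`, its methods, and the document ids 1 to 3 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual illustration only; names and document ids are hypothetical.
class UninvertSketch {

    // Inverted index: term -> ids of the documents that contain it.
    // It answers "which documents contain this term?".
    static Map<String, List<Integer>> sampleInvertedIndex() {
        Map<String, List<Integer>> index = new TreeMap<>();
        index.put("forestry", Arrays.asList(3));
        index.put("music", Arrays.asList(1, 2));
        index.put("sports", Arrays.asList(1));
        return index;
    }

    // Uninverted view: document id -> terms of the field in that document.
    // It answers "what is the value of this field for this document?".
    static Map<Integer, List<String>> uninvert(Map<String, List<Integer>> invertedIndex) {
        Map<Integer, List<String>> fielddata = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : invertedIndex.entrySet()) {
            for (Integer docId : entry.getValue()) {
                fielddata.computeIfAbsent(docId, id -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return fielddata;
    }

    public static void main(String[] args) {
        // prints {1=[music, sports], 2=[music], 3=[forestry]}
        System.out.println(uninvert(sampleInvertedIndex()));
    }
}
```

Building the second map requires walking every term of the field and holding the result in memory, which gives a feel for why the structure is costly on large text fields.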
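Here we set fielddata to true on the interests field. To enable fielddata on an existing text field, update the field's mapping with "fielddata": true, as described in the reference above.
Calling the method again produces: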
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=11
sports=8
forestry=2
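Head plugin example
The result is too long, so only the final aggregation result is shown; the data returned in hits is omitted. (The same applies below.)
Aggregation with a query condition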
GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}
The all_interests aggregation now covers only the documents that match the query:
...
"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2
    },
    {
      "key": "sports",
      "doc_count": 1
    }
  ]
}
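The effect of the query on the aggregation can be sketched in plain Java: filter first, then count buckets over what remains. This is an in-memory illustration only. The class `FilteredTermsAggSketch` and its members are hypothetical, and the case-insensitive comparison is a deliberate simplification of what a match query does on an analyzed field.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// In-memory illustration only; names in this class are hypothetical.
class FilteredTermsAggSketch {

    static final class Employee {
        final String lastName;
        final List<String> interests;

        Employee(String lastName, String... interests) {
            this.lastName = lastName;
            this.interests = Arrays.asList(interests);
        }
    }

    // The three employees from the example data.
    static List<Employee> sampleEmployees() {
        return Arrays.asList(
                new Employee("Smith", "sports", "music"),
                new Employee("Smith", "music"),
                new Employee("Fir", "forestry"));
    }

    // Counts interests only over documents whose last name matches.
    static Map<String, Long> countInterests(List<Employee> docs, String lastName) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        for (Employee doc : docs) {
            // Simplified stand-in for the match query on last_name.
            if (!doc.lastName.equalsIgnoreCase(lastName)) {
                continue;
            }
            for (String interest : new LinkedHashSet<>(doc.interests)) {
                buckets.merge(interest, 1L, Long::sum);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        countInterests(sampleEmployees(), "smith")
                .forEach((key, docCount) -> System.out.println(key + "=" + docCount));
    }
}
```

With the sample data, the forestry bucket disappears because Douglas Fir no longer matches, which mirrors the response above.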
Client program demo
We add a query condition to the request part of the previous method, just as we learned earlier:
SearchRequestBuilder request = client.prepareSearch("megacorp1")
        .setTypes("employee1")
        .setQuery(QueryBuilders.matchQuery("last_name", "Smith"))
        .addAggregation(
                AggregationBuilders.terms("agg1").field("interests")
        );
// The rest of the method is unchanged.
Result of the call:
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=2
sports=1
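Head plugin example
Aggregations support hierarchical rollups
Aggregations also support hierarchical rollups. For example, let's find the average age of the employees who share each interest: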
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" },
      "aggs": {
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}
The aggregation result is a bit more complex, but still easy to understand:
...
"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2,
      "avg_age": {
        "value": 28.5
      }
    },
    {
      "key": "forestry",
      "doc_count": 1,
      "avg_age": {
        "value": 35
      }
    },
    {
      "key": "sports",
      "doc_count": 1,
      "avg_age": {
        "value": 25
      }
    }
  ]
}
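The output is essentially an enriched version of the first aggregation. We still get a list of interests and their counts, but each interest now has an additional avg_age value: the average age of all employees who share that interest.
Even if you don't fully understand the syntax yet, it is easy to see how complex aggregations and groupings can be built with this feature, and there are few limits on the kinds of data you can extract.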
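As a rough picture of what the nested avg aggregation computes, the following plain-Java sketch keeps a running document count and age sum per interest bucket. It is an in-memory illustration, not Elasticsearch's implementation; the class `AvgAgeSubAggSketch`, the `Bucket` type, and the helper names are hypothetical, and the ages come from the three example employees.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// In-memory illustration only; names in this class are hypothetical.
class AvgAgeSubAggSketch {

    static final class Employee {
        final int age;
        final List<String> interests;

        Employee(int age, String... interests) {
            this.age = age;
            this.interests = Arrays.asList(interests);
        }
    }

    // Per-bucket state: document count plus the sum needed for the average.
    static final class Bucket {
        long docCount;
        long ageSum;

        double avgAge() {
            return (double) ageSum / docCount;
        }
    }

    // John Smith (25), Jane Smith (32), Douglas Fir (35) from the example data.
    static List<Employee> sampleEmployees() {
        return Arrays.asList(
                new Employee(25, "sports", "music"),
                new Employee(32, "music"),
                new Employee(35, "forestry"));
    }

    // Outer level: one bucket per interest. Inner level: average age per bucket.
    static Map<String, Bucket> avgAgeByInterest(List<Employee> docs) {
        Map<String, Bucket> buckets = new LinkedHashMap<>();
        for (Employee doc : docs) {
            for (String interest : new LinkedHashSet<>(doc.interests)) {
                Bucket bucket = buckets.computeIfAbsent(interest, key -> new Bucket());
                bucket.docCount++;
                bucket.ageSum += doc.age;
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        avgAgeByInterest(sampleEmployees()).forEach((key, bucket) ->
                System.out.println(key + ": doc_count=" + bucket.docCount
                        + ", avg_age=" + bucket.avgAge()));
    }
}
```

For the music bucket this gives (25 + 32) / 2 = 28.5, in line with the response shown above.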
Client program demo
For this part, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_structuring_aggregations.html
Put simply, you can nest one aggregation inside another.
Add a method:
/**
 * Sub-aggregation: average age per interest.
 * @param client the Elasticsearch client
 */
private static void findAvgInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
                            .subAggregation(AggregationBuilders.avg("avg_age").field("age"))
            );
    SearchResponse response = request.execute().actionGet();
    // For convenience, print the raw response; it could be parsed as in the first example.
    System.out.println(response.toString());
}
Add the call to the main method:
// 9. Sub-aggregation
findAvgInterestHobby(client);
Result:
{"took":8,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":13,"max_score":1.0,"hits":[{"_index":"megacorp1","_type":"employee1","_id":"5","_score":1.0,"_source":{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"8","_score":1.0,"_source":{"first_name":"Jane","last_name":"1 Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"9","_score":1.0,"_source":{"first_name":"John","last_name":"SmithSmithSmith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"10","_score":1.0,"_source":{"first_name":"John","last_name":"冬瓜核桃","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"12","_score":1.0,"_source":{"first_name":"John","last_name":"蜂蜜","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"2","_score":1.0,"_source":{"first_name":"Jane","last_name":"Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"4","_score":1.0,"_source":{"first_name":"Douglas1","last_name":"Fir","age":35,"about":"I like to build cabinets","interests":["forestry"]}},{"_index":"megacorp1","_type":"employee1","_id":"6","_score":1.0,"_source":{"first_name":"John","last_name":"Smith 1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"1","_score":1.0,"_source":{"first_name":"John","last_name":"Smith1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"7","_score":1.0,"_source":{"first_name":"Jane","last_name":"1Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}}]},"aggregations":{"agg1":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"music","doc_count":11,"avg_age":{"value":26.90909090909091}},{"key":"sports","doc_count":8,"avg_age":{"value":25.0}},{"key":"forestry","doc_count":2,"avg_age":{"value":35.0}}]}}}
Head plugin example