Data Aggregation
- Types of aggregations
- Implementing aggregations with DSL
- Implementing aggregations with the RestAPI
Types of Aggregations
Aggregations can compute statistics, analyses, and calculations over document data. There are three common categories:
- Bucket aggregations: group documents into buckets
  - Term aggregation: group by a field's value
  - Date Histogram: group by date ranges, e.g. one bucket per week or per month
- Metric aggregations: compute values such as the maximum, minimum, or average
  - Avg: average value
  - Max: maximum value
  - Min: minimum value
  - Stats: max, min, avg, sum, etc. in a single aggregation
- Pipeline aggregations: aggregate over the results of other aggregations
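Pipeline aggregations are not demonstrated later in these notes, so here is a minimal sketch (hypothetical, but using the hotel index and score field from the examples below): an avg_bucket pipeline that averages the per-brand average score.

```json
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAgg": {
      "terms": { "field": "brand", "size": 10 },
      "aggs": {
        "scoreAgg": { "avg": { "field": "score" } }
      }
    },
    "avgOfBrandAvg": {
      "avg_bucket": { "buckets_path": "brandAgg>scoreAgg" }
    }
  }
}
```

The buckets_path points at the scoreAgg metric inside each brandAgg bucket; the pipeline runs after those aggregations have produced their results.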
Summary:
What is an aggregation?
- Statistics, analysis, and computation over document data
What are the common kinds of aggregations?
- Bucket: groups documents and counts the documents in each group
- Metric: computes values over documents, e.g. avg
- Pipeline: aggregates over the results of other aggregations
The fields participating in an aggregation must be one of:
- keyword
- numeric
- date
- boolean
Implementing Bucket Aggregations with DSL
Suppose we want to count how many hotel brands appear in the data. We can aggregate on the brand name with a terms aggregation. DSL example:
```json
GET /hotel/_search
{
  "size": 0,      # size 0: the result contains no documents, only the aggregation results
  "aggs": {       # define the aggregations
    "brandAgg": { # a name for this aggregation
      "terms": {  # the aggregation type: terms, because we group by the brand value
        "field": "brand", # the field to aggregate on
        "size": 20        # the number of aggregation results to return
      }
    }
  }
}
```
Example result: the response contains a brandAgg entry whose buckets list each brand value together with its doc_count.
Bucket aggregations: custom sort order
Add an order property:
```json
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAggs": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "_count": "asc" # sort buckets by their document count, ascending
        }
      }
    }
  }
}
```
Bucket aggregations: limiting the aggregation scope
By default, a bucket aggregation runs over every document in the index. To restrict the set of documents being aggregated, just add a query clause:
```json
GET /hotel/_search
{
  "query": {
    "range": {
      "price": { "gte": 10, "lte": 200 }
    }
  },
  "size": 0,
  "aggs": {
    "brandAggs": {
      "terms": {
        "field": "brand",
        "size": 10,
        "order": { "_count": "desc" }
      }
    }
  }
}
```
Summary:
aggs sits at the same level as query. What is query's role here?
- It limits the set of documents the aggregation runs over
The three required elements of an aggregation:
- a name for the aggregation
- the aggregation type
- the field to aggregate on
Configurable aggregation properties:
- size: the number of aggregation results to return
- order: how the results are sorted
- field: the field to aggregate on
Aside:
Compound multi-condition queries:
Combine multiple conditions with bool.
```json
# multi-condition query
GET /hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "price": { "gte": 10, "lte": 200 }
          }
        },
        {
          "term": {
            "brand": { "value": "7天酒店" }
          }
        }
      ]
    }
  }
}
```
Implementing Metric Aggregations with DSL
Compute the max, min, avg, and sum of the review score for hotels grouped by brand, sorting the brand buckets by average score:
```json
GET /hotel/_search
{
  "size": 0,
  "aggs": {
    "brandAggs": {
      "terms": {
        "field": "brand",
        "size": 10,
        "order": {
          "scoreAggs.avg": "desc" # sort brand buckets by the avg computed in the sub-aggregation
        }
      },
      "aggs": {
        "scoreAggs": {
          "stats": {
            "field": "score"
          }
        }
      }
    }
  }
}
```
Example: data aggregation with filter conditions
Building the filter conditions:
```java
private SearchRequest extracted(RequestParams requestParams, SearchRequest searchRequest) {
    // 2.1 build the query
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    if (StringUtils.hasText(requestParams.getKey())) {
        boolQuery.must(QueryBuilders.matchQuery("all", requestParams.getKey()));
    } else {
        boolQuery.must(QueryBuilders.matchAllQuery());
    }
    // keyword term filter: brand
    if (StringUtils.hasText(requestParams.getBrand())) {
        boolQuery.filter(QueryBuilders.termQuery("brand", requestParams.getBrand()));
    }
    // keyword term filter: city
    if (StringUtils.hasText(requestParams.getCity())) {
        boolQuery.filter(QueryBuilders.termQuery("city", requestParams.getCity()));
    }
    // keyword term filter: star rating
    if (StringUtils.hasText(requestParams.getStarName())) {
        boolQuery.filter(QueryBuilders.termQuery("starName", requestParams.getStarName()));
    }
    // range filter: price (gte is >=, lte is <=)
    if (requestParams.getMinPrice() != null && requestParams.getMaxPrice() != null) {
        boolQuery.filter(QueryBuilders.rangeQuery("price")
                .gte(requestParams.getMinPrice())
                .lte(requestParams.getMaxPrice()));
    }
    // score control
    FunctionScoreQueryBuilder functionScoreQueryBuilder = QueryBuilders.functionScoreQuery(
            // the original query
            boolQuery,
            // the function-score array
            new FunctionScoreQueryBuilder.FilterFunctionBuilder[]{
                    new FunctionScoreQueryBuilder.FilterFunctionBuilder(
                            // filter condition
                            QueryBuilders.termQuery("isAD", true),
                            // scoring function
                            ScoreFunctionBuilders.weightFactorFunction(100)
                    )
            });
    // 2.2 pagination
    final int page = requestParams.getPage();
    final int size = requestParams.getSize();
    searchRequest.source().query(functionScoreQueryBuilder).from((page - 1) * size).size(size);
    // 2.3 sorting
    if (StringUtils.hasText(requestParams.getLocation())) {
        final String location = requestParams.getLocation();
        searchRequest.source().sort(SortBuilders
                .geoDistanceSort("location", new GeoPoint(location))
                .order(SortOrder.ASC)
                .unit(DistanceUnit.KILOMETERS));
    }
    return searchRequest;
}
```
Running the aggregation and parsing the result:
```java
@Override
public Map<String, List<String>> filters(RequestParams params) {
    try {
        // prepare the request
        SearchRequest searchRequest = new SearchRequest("hotel");
        // prepare the DSL: the query (which limits the aggregation scope) and the aggregations
        extracted(params, searchRequest);
        buildRequest(searchRequest);
        SearchResponse search = client.search(searchRequest, RequestOptions.DEFAULT);
        List<String> brands = getListByName(search, "brandAgg");
        List<String> cities = getListByName(search, "cityAgg");
        List<String> stars = getListByName(search, "starAgg");
        return Map.of(
                "brand", brands,
                "city", cities,
                "starName", stars
        );
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}

private List<String> getListByName(SearchResponse search, String key) {
    Aggregations aggregations = search.getAggregations();
    Terms terms = aggregations.get(key);
    List<? extends Terms.Bucket> buckets = terms.getBuckets();
    return buckets.stream()
            .map(MultiBucketsAggregation.Bucket::getKeyAsString)
            .collect(Collectors.toList());
}
```
Autocomplete
- The pinyin tokenizer
- Custom analyzers
- Completion queries
- Autocomplete for the hotel search box
Installing the pinyin tokenizer plugin
Unzip the plugin into the directory mounted as the ES plugins directory, then restart ES.
To find the container's mount directories:
```shell
docker inspect <container-id> | grep Mounts -A 20
```
Custom analyzers
An elasticsearch analyzer consists of three parts:
- character filters: process the text before the tokenizer, e.g. deleting or replacing characters
- tokenizer: splits the text into terms according to some rule, e.g. keyword (which keeps the whole text as a single term) or ik_smart
- token filters: further process the terms produced by the tokenizer, e.g. lowercasing, synonyms, or pinyin conversion
Use the pinyin analyzer when building the index. At search time, note that using the pinyin analyzer again would make queries match homophones incorrectly, so switch the search analyzer to ik_smart.
```json
# custom pinyin analyzer
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

POST /test/_doc/1
{
  "id": 1,
  "name": "私自"
}

POST /test/_doc/2
{
  "id": 2,
  "name": "四字"
}

GET /test/_search
{
  "query": {
    "match": {
      "name": "调入私自"
    }
  }
}
```
The search returns the wrong results, because the query text is also analyzed with the pinyin analyzer, so the homophone document matches as well.
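To see why, the _analyze API can be used to inspect the tokens the custom analyzer produces (a quick check against the test index defined above):

```json
POST /test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "私自"
}
```

私自 and 四字 both yield the same pinyin terms (the joined form sizi and the first-letter form sz), which is why a pinyin-analyzed query for one also matches the other.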
Searches should use ik_smart instead.
Set a search_analyzer when defining the index:
```json
# drop the old test index, then recreate it with a separate search analyzer
DELETE /test

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

POST /test/_doc/1
{
  "id": 1,
  "name": "私自"
}

POST /test/_doc/2
{
  "id": 2,
  "name": "四字"
}

GET /test/_search
{
  "query": {
    "match": {
      "name": "调入私自"
    }
  }
}
```
The query now matches only the document containing 私自, not its homophone.
Autocomplete
Completion Suggester queries
elasticsearch provides the Completion Suggester query to implement autocomplete. It matches and returns terms that start with what the user has typed. To make completion queries efficient, there are constraints on the field: it must be of type completion, and its content is typically an array of terms to be suggested.
Create a new index and add some documents:
```json
PUT comtest
{
  "mappings": {
    "properties": {
      "title": {
        "type": "completion"
      }
    }
  }
}

POST comtest/_doc
{ "title": ["Sony", "WH-1000XM3"] }

POST comtest/_doc
{ "title": ["SK-II", "PITERA"] }

POST comtest/_doc
{ "title": ["Nintendo", "switch"] }
```
Querying the index. Note: the parameter name is field, written all in lowercase (Kibana's autocompletion inserts an uppercase FIELD placeholder, which must be replaced):
```json
# autocomplete query
GET comtest/_search
{
  "suggest": {
    "titlesuggest": {            # a name for this suggestion
      "text": "s",               # the user's input
      "completion": {
        "field": "title",        # the field to complete from
        "skip_duplicates": true, # skip duplicate suggestions
        "size": 10               # return the first 10 results
      }
    }
  }
}
```
Example: autocomplete for the hotel data
Implementation steps:
1. Modify the hotel index structure to define the custom pinyin analyzers
2. Switch the index's name and all fields to the custom analyzer
3. Add a new suggestion field of type completion to the index, using the custom analyzer
4. Add a suggestion field to the HotelDoc class, containing brand and business
5. Re-import the data into the hotel index
Steps 1-3: rebuild the hotel index with the custom analyzers, the updated name and all fields, and the new completion-typed suggestion field:
```json
PUT /hotel
{
  "settings": {
    "analysis": {
      # custom analyzers
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        # pinyin token filter
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address": {
        "type": "keyword",
        "index": false
      },
      "price": {
        "type": "integer"
      },
      "score": {
        "type": "integer"
      },
      "brand": {
        "type": "keyword",
        "copy_to": "all"
      },
      "city": {
        "type": "keyword"
      },
      "starName": {
        "type": "keyword"
      },
      "business": {
        "type": "keyword",
        "copy_to": "all"
      },
      "location": {
        "type": "geo_point"
      },
      "pic": {
        "type": "keyword",
        "index": false
      },
      "all": {
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      # autocomplete field
      "suggestion": {
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}
```
Steps 4-5: add the suggestion field to HotelDoc, built from brand and business, and re-import the data into the hotel index:
```java
@Data
@NoArgsConstructor
public class HotelDoc {
    private Long id;
    private String name;
    private String address;
    private Integer price;
    private Integer score;
    private String brand;
    private String city;
    private String starName;
    private String business;
    private String location;
    private String pic;
    private Object distance;
    private String isAD;
    private List<String> suggestion;

    public HotelDoc(Hotel hotel) {
        this.id = hotel.getId();
        this.name = hotel.getName();
        this.address = hotel.getAddress();
        this.price = hotel.getPrice();
        this.score = hotel.getScore();
        this.brand = hotel.getBrand();
        this.city = hotel.getCity();
        this.starName = hotel.getStarName();
        this.business = hotel.getBusiness();
        this.location = hotel.getLatitude() + ", " + hotel.getLongitude();
        this.pic = hotel.getPic();
        // build the completion suggestions from the brand and the business areas
        if (this.business.contains("/")) {
            // a hotel may list several business areas separated by "/"
            String[] split = this.business.split("/");
            this.suggestion = new ArrayList<>();
            this.suggestion.add(this.brand);
            Collections.addAll(this.suggestion, split);
        } else {
            this.suggestion = Arrays.asList(this.brand, this.business);
        }
    }
}
```
Implementing Autocomplete with the RestAPI
First, building the request parameters:
Create a request object, build the suggest DSL on its source, and send it with the client:
```java
/**
 * Autocomplete query
 */
@Test
void testSuggest() throws IOException {
    SearchRequest searchRequest = new SearchRequest("hotel");
    searchRequest.source().suggest(new SuggestBuilder().addSuggestion(
            "mysuggestion",
            SuggestBuilders.completionSuggestion("suggestion")
                    .prefix("h")
                    .skipDuplicates(true)
                    .size(10)
    ));
    SearchResponse search = client.search(searchRequest, RequestOptions.DEFAULT);
    System.out.println(search);
}
```
Parsing the result:
Full RestAPI implementation:
```java
/**
 * Autocomplete query with result parsing
 */
@Test
void testSuggest() throws IOException {
    SearchRequest searchRequest = new SearchRequest("hotel");
    searchRequest.source().suggest(new SuggestBuilder().addSuggestion(
            "mysuggestion",
            SuggestBuilders.completionSuggestion("suggestion")
                    .prefix("h")
                    .skipDuplicates(true)
                    .size(10)
    ));
    SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
    // 4. process the result
    Suggest suggest = response.getSuggest();
    // 4.1 get the suggestion by the name used above
    CompletionSuggestion suggestion = suggest.getSuggestion("mysuggestion");
    // 4.2 get the options and iterate over them
    for (Suggest.Suggestion.Entry.Option option : suggestion.getOptions()) {
        // 4.3 the text of each option is one completed term
        System.out.println(option.getText().toString());
    }
}
```
Autocomplete for the hotel search page input box
Looking at the front-end page, typing in the input box fires an ajax request.
Write a server-side endpoint that accepts this request and returns the completions as a List<String>.
Data Synchronization
Analyzing the synchronization problem
The hotel data in elasticsearch comes from the mysql database, so whenever the mysql data changes, elasticsearch must change with it. This is data synchronization between elasticsearch and mysql.
Option 1: synchronous calls
Option 2: asynchronous notification
Option 3: listening to the binlog
Option 1: synchronous calls
- Pros: simple, if blunt, to implement
- Cons: tight coupling between services
Option 2: asynchronous notification
- Pros: low coupling, moderate implementation difficulty
- Cons: depends on the reliability of the message queue
Option 3: listening to the binlog
- Pros: completely decouples the services
- Cons: enabling the binlog adds load on the database, and the implementation is complex
Case study: syncing mysql and elasticsearch with MQ
Use the hotel-admin project from the course materials as the hotel-management microservice. Whenever hotel data is inserted, updated, or deleted, the same operation must be applied to the data in elasticsearch.
Steps:
- Import the hotel-admin project from the course materials, start it, and test hotel CRUD
- Declare the exchange, queues, and routing keys
- Send a message from hotel-admin's insert, update, and delete operations
- Listen for those messages in hotel-demo and update the elasticsearch data accordingly
- Start everything and test the synchronization
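The declaration step above needs agreed-upon names for the exchange, queues, and routing keys, shared by publisher and consumer. A minimal sketch of such a constants class (every name here is a hypothetical choice, not taken from the course materials):

```java
// Hypothetical exchange/queue/routing-key names shared by hotel-admin
// (the publisher) and hotel-demo (the consumer); adjust to your project.
public class MqConstants {
    // topic exchange that all hotel change events are published to
    public static final String HOTEL_EXCHANGE = "hotel.topic";
    // queue consumed for insert/update events
    public static final String HOTEL_INSERT_QUEUE = "hotel.insert.queue";
    // queue consumed for delete events
    public static final String HOTEL_DELETE_QUEUE = "hotel.delete.queue";
    // routing keys used when publishing
    public static final String HOTEL_INSERT_KEY = "hotel.insert";
    public static final String HOTEL_DELETE_KEY = "hotel.delete";
}
```

hotel-admin would publish the hotel id with HOTEL_INSERT_KEY or HOTEL_DELETE_KEY after each write; hotel-demo binds the two queues to the exchange and updates elasticsearch in its listeners.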