ES Study Notes (3): Full-Text Queries [many images ahead]

长弓成zerozero

Last modified 2022-01-31 00:22:04

Categories: es, linux, database. Tags: search engine, big data, elasticsearch

First published 2021-12-18 02:40:23

Article link: https://blog.csdn.net/changgongcheng_yq/article/details/121987756

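3. ES full-text queries

Full-text search is what sets ES apart. The biggest difference from the field-value lookups we are used to is analysis, that is, splitting text into terms. Anyone learning ES has met the inverted index: text analysis is the process that builds the inverted index, and it is how most full-text search engines work.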

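3.1 _analyze: analyzers

3.1.1 Introduction to analyzers

Analyzers apply only to fields of type [text]. An analyzer can be configured in two places: (1) on a text field when the index mapping is created, and (2) at search time.

ES ships with built-in analyzers such as standard, which handle English well. Other languages raise harder natural-language-processing problems, so most analyzers for them are installed as plugins into the plugins directory created when we set up the environment in part one [/root/docker/es/plugins]. The well-known Chinese analyzer IK is one example.

An analyzer generally consists of three parts. There is exactly one tokenizer; character filters and token filters are optional, and there may be several of each.

Character filters: [0/n] modify specific characters, for example stripping HTML tags or replacing & with and. This is the first step.

Tokenizer: [1] splits the text into tokens that can be worked with. This is the core step.

Token filters: [0/n] every filter is applied to each token. This is the supplementary step.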
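The three stages above can be imitated in a few lines of Python. This is only a sketch of the idea, not how ES implements analysis: the filter functions, the whitespace tokenizer and the tiny stop-word set are all made up for illustration.

```python
import re

def strip_html(text):
    # Character filter: remove anything that looks like an HTML tag.
    return re.sub(r"<[^>]+>", "", text)

def replace_ampersand(text):
    # Character filter: spell out "&" as "and".
    return text.replace("&", " and ")

def whitespace_tokenizer(text):
    # Tokenizer: exactly one per analyzer; here it splits on whitespace.
    return text.split()

def lowercase(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "an"})):
    # Token filter: drop tokens found in a (hypothetical) stop-word set.
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    # Step 1: zero or more character filters rewrite the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # Step 2: the single tokenizer splits the text into tokens.
    tokens = tokenizer(text)
    # Step 3: zero or more token filters post-process the tokens.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>The Quick & Brown fox</p>",
    [strip_html, replace_ampersand],
    whitespace_tokenizer,
    [lowercase, remove_stopwords],
)
print(tokens)  # ['quick', 'and', 'brown', 'fox']
```

The resulting tokens are what would be stored in the inverted index for that field.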

3.1.2 Configuring an analyzer and testing it with _analyze

// Configure analyzers when creating the index
PUT /my_index
{"settings":{"analysis":{"analyzer":{"std_english":{"type":"standard","stopwords":"_english_"}}}},"mappings":{"properties":{"my_text":{"type":"text","analyzer":"standard","fields":{"english":{"type":"text","analyzer":"std_english"}}}}}}

// Use the _analyze API to test what an analyzer produces
POST /my_index/_analyze
{"field":"my_text","text":"The old brown cow"}
POST /my_index/_analyze
{"field":"my_text.english","text":"The old brown cow"}

Both requests analyze the same sentence, "The old brown cow". The first uses the standard analyzer ["analyzer": "standard"]. The second uses the custom analyzer defined above ["analyzer": "std_english"], which is also of type standard but adds a stop-word list ["stopwords": "_english_"], so it drops the stop word "the".

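3.2 Full-text search

To make testing easier, we load one more sample data set, the logs one; its index holds more text data.
PS: honestly, this is simply what the material I copied from uses, and I was too lazy to write my own JSON queries against kibana_sample_data_flights ^_^

Doesn't this look much closer to how we normally search logs in Kibana?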

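3.2.1 match: single-field matching

POST /kibana_sample_data_logs/_search
{"query":{"match":{"message":"Chrome Firefox"}},"_source":"message"}
POST /kibana_sample_data_logs/_search
{"query":{"match":{"message":{"query":"Chrome FireFox","operator":"and"}}},"_source":"message"}

Without an explicit operator, match uses or, so a document matches as soon as it contains either term; the hits returned here are messages containing Chrome.

With the and operator nothing is found, because no message contains both chrome and firefox.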
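The difference between the two requests can be sketched in Python. This is a simplified model, assuming a lowercase-and-split analyzer and a made-up log line; real match queries also score the hits, which is left out here.

```python
def analyze(text):
    # Stand-in for the field's analyzer: lowercase, then split on whitespace.
    return text.lower().split()

def matches(message, query, operator="or"):
    # Both the stored text and the query text go through analysis.
    doc_terms = set(analyze(message))
    hits = [term in doc_terms for term in analyze(query)]
    # "or": any query term is enough; "and": every query term is required.
    return all(hits) if operator == "and" else any(hits)

# Hypothetical log message that mentions Chrome but not Firefox.
chrome_log = "Mozilla/5.0 AppleWebKit Chrome Safari"

print(matches(chrome_log, "Chrome Firefox"))                  # True
print(matches(chrome_log, "Chrome FireFox", operator="and"))  # False
```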

3.2.2 multi_match: matching across several fields

POST /kibana_sample_data_flights/_search
{"query":{"multi_match":{"query":"AT","fields":["DestCountry","OriginCountry"]}},"_source":["DestCountry","OriginCountry"]}

A document is returned if any one of the listed fields matches.

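3.2.3 match_phrase

PUT /my_index/_doc/1
{"title":"quick brown fox"}
GET /my_index/_search
{"query":{"match_phrase":{"title":{"query":"quick fox","slop":1}}}}

A phrase query on adjacent terms: every word in the query must be present and in the same order. slop defaults to 0 and states how far apart the terms may be and still count as a match. With slop 1, "quick fox" matches "quick brown fox" because fox is only one position away from where the phrase expects it.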

3.2.4 Fuzzy queries

GET /kibana_sample_data_logs/_search
{"query":{"fuzzy":{"message":{"value":"firefix","fuzziness":1}}}}

fuzziness is the edit distance. It can be thought of as text similarity: the larger the value, the looser the match.
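To make "edit distance" concrete, here is a plain Levenshtein distance in Python. It is a sketch for intuition only: it counts insertions, deletions and substitutions, whereas the ES fuzzy query by default also treats swapping two adjacent characters as a single edit.

```python
def edit_distance(a, b):
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

print(edit_distance("firefix", "firefox"))  # 1
print(edit_distance("firefit", "firefox"))  # 2
```

So "firefix" is within fuzziness 1 of the indexed term firefox, while "firefit" would need fuzziness 2.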

3.2.5 Spelling correction

POST /kibana_sample_data_logs/_search?filter_path=suggest
{"suggest":{"msg-suggest":{"text":"firefit chrom","term":{"field":"message"}}}}
POST /kibana_sample_data_logs/_search
{"suggest":{"msg-suggest":{"text":"firefix with chrime","phrase":{"field":"message"}}}}

3.2.6 Completion suggester

This is like the autocomplete in Baidu or IDEA; many text editors offer something similar.

// Create the index
PUT /articles
{"mappings":{"properties":{"author":{"type":"keyword"},"content":{"type":"text"},"suggestions":{"type":"completion"}}}}

// Index documents, setting the completion field and its weights
POST /articles/_doc/
{"author":"taylor","content":"an introduction of elastic stack and elasticsearch","suggestions":{"input":["elastic stack","elasticsearch"],"weight":10}}
POST /articles/_doc/
{"author":"taylor","content":"an introduction of elastic stack and elasticsearch","suggestions":[{"input":"elasticsearch","weight":30},{"input":"elastic stack","weight":1}]}

// Suggester: autocomplete
POST /articles/_search
{"_source":"suggest","suggest":{"article_suggestion":{"prefix":"ela","completion":{"field":"suggestions"}}}}

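3.3 Compound queries

3.3.1 bool compound query

Combines a set of Boolean clauses into one larger Boolean condition.

must: conditions the results must satisfy

should: conditions the results need not satisfy, but satisfying them raises the score

filter: conditions the results must satisfy, without affecting relevance

must_not: conditions the results must not satisfy, without affecting relevance

Relevance can be understood as how similar a document is to the query conditions. In terms of results, relevance decides whether a document is found at all and what score it gets [which decides the ordering].

POST /kibana_sample_data_logs/_search
{"query":{"bool":{"must":[{"match":{"message":"firefox"}}],"should":[{"term":{"geo.src":"CN"}},{"term":{"geo.dest":"CN"}}],"filter":[{"term":{"extension":"zip"}}]}},"sort":[{"_score":{"order":"desc"}}]}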

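3.3.2 dis_max compound query

The main difference from bool is how the relevance score is computed. As the name suggests, dis_max takes only the highest score among the subqueries and ignores the scores of the others. The tie_breaker parameter [0-1] is usually set as a coefficient that brings the other subqueries back into the calculation.

POST /kibana_sample_data_logs/_search
{"query":{"dis_max":{"queries":[{"match":{"message":"firefox"}},{"term":{"geo.src":"CN"}},{"term":{"geo.dest":"CN"}}],"tie_breaker":0.7}}}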
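The scoring rule described above (best subquery score, plus tie_breaker times the rest) can be written out as a small Python function. The function name and the sample scores are made up; only the arithmetic is the point.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    # Only subqueries that actually matched contribute a score.
    matching = [s for s in sub_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    # The best score counts in full; the others are scaled by tie_breaker.
    return best + tie_breaker * (sum(matching) - best)

# Hypothetical scores for the three subqueries of one document.
print(dis_max_score([2.0, 1.0, 0.5]))                   # 2.0
print(dis_max_score([2.0, 1.0, 0.5], tie_breaker=0.7))  # about 3.05
```

With tie_breaker 0 only the best subquery matters; with 1 the result is the plain sum of the matching subqueries.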

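3.3.3 constant_score compound query

As the name suggests, this assigns a fixed _score: every document that matches the filter gets the score given by boost.

POST /kibana_sample_data_logs/_search
{"query":{"constant_score":{"filter":{"match":{"geo.dest":"CN"}},"boost":1.5}}}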

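3.3.4 boosting query

positive: the condition to satisfy, similar to must in a bool query

negative: the condition to demote, similar to must_not in a bool query, except that matching documents are not removed from the results. Their score is multiplied by negative_boost, which lowers their relevance.

POST /kibana_sample_data_logs/_search
{"query":{"boosting":{"positive":{"term":{"geo.src":"US"}},"negative":{"term":{"geo.dest":"CN"}},"negative_boost":0.2}},"sort":[{"_score":"asc"}]}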

3.3.5 function_score query

Custom scoring functions

POST /kibana_sample_data_logs/_search
{"query":{"function_score":{"query":{"query_string":{"fields":["message"],"query":"(firefox 6.0a1) OR (chrome 11.0.696.50)"}},"functions":[{"weight":2},{"random_score":{}}],"score_mode":"max","boost_mode":"avg"}}}

field_value_factor: a scoring function

The request below uses AvgTicketPrice as the field that influences the score, with a factor of 0.001. modifier is the operation applied to the value, and missing is the value used in the calculation when a document lacks the field.

POST /kibana_sample_data_flights/_search
{"query":{"function_score":{"query":{"bool":{"must":[{"match":{"OriginCountry":"CN"}},{"match":{"DestCountry":"US"}}]}},"field_value_factor":{"field":"AvgTicketPrice","factor":0.001,"modifier":"reciprocal","missing":1000}}}}
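A rough Python model of what field_value_factor computes for one document: the field value is multiplied by factor, then passed through modifier. Only a few modifiers are modeled, and the prices are invented. By default the result is then multiplied with the query score.

```python
import math

# A subset of the modifiers, modeled as plain functions.
MODIFIERS = {
    "none": lambda v: v,
    "reciprocal": lambda v: 1.0 / v,
    "sqrt": math.sqrt,
    "square": lambda v: v * v,
}

def field_value_factor(value, factor=1.0, modifier="none", missing=None):
    # A document without the field falls back to `missing`;
    # factor and modifier are still applied to that fallback value.
    if value is None:
        if missing is None:
            raise ValueError("field is missing and no 'missing' value was given")
        value = missing
    return MODIFIERS[modifier](factor * value)

# Hypothetical ticket prices: with "reciprocal", cheaper tickets score higher.
cheap = field_value_factor(500, factor=0.001, modifier="reciprocal")
pricey = field_value_factor(1000, factor=0.001, modifier="reciprocal")
no_price = field_value_factor(None, factor=0.001, modifier="reciprocal", missing=1000)
print(cheap, pricey, no_price)
```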

Decay function: gauss

POST /kibana_sample_data_flights/_search
{"query":{"function_score":{"query":{"match":{"OriginCityName":"Beijing"}},"gauss":{"timestamp":{"origin":"2019-03-25","scale":"7d","offset":"1d","decay":0.3}}}}}
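How origin, scale, offset and decay interact is easier to see as a formula. The sketch below follows the Gaussian decay formula from the ES documentation, with distances expressed as plain numbers of days (the date arithmetic is left out).

```python
import math

def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    # sigma^2 is chosen so that the score equals `decay`
    # at a distance of offset + scale from the origin.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    # Within `offset` of the origin there is no penalty at all.
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))

# Parameters from the query above: scale 7d, offset 1d, decay 0.3.
for days in (0, 1, 8, 15):
    print(days, round(gauss_decay(days, scale=7, offset=1, decay=0.3), 4))
```

A flight on the origin date, or within one day of it, keeps a factor of 1.0; eight days away (offset plus scale) the factor has dropped to 0.3, and it keeps shrinking beyond that.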

3.4 Aggregations

3.4.1 Metric aggregations

// Average
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"query":{"match":{"DestCountry":"CN"}},"aggs":{"delay_avg":{"avg":{"field":"FlightDelayMin"}}}}

avg: the mean; weighted_avg: the weighted mean

// Counting
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"country_code":{"cardinality":{"field":"DestCountry"}},"total_country":{"value_count":{"field":"DestCountry"}}}}

value_count counts values, like count() in SQL; cardinality counts distinct values (approximately), like count(distinct DestCountry) in SQL

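// Extremes
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"max_price":{"max":{"field":"AvgTicketPrice"}},"min_price":{"min":{"field":"AvgTicketPrice"}}}}

This simply takes the maximum and the minimum.

// Statistics
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_stats":{"stats":{"field":"AvgTicketPrice"}}}}

This returns several metrics for one field in a single response.

stats: minimum, maximum, count, sum, average

extended_stats: adds a lot more statistics on top (sum of squares, variance, standard deviation and so on), which I did not dig into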

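// Percentiles
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_percentile":{"percentiles":{"field":"AvgTicketPrice","percents":[25,50,75,100]}},"price_percentile_rank":{"percentile_ranks":{"field":"AvgTicketPrice","values":[600,1200]}}}}

These describe how the values are distributed in percentage terms.

percentiles: for each requested percentage, the value below which that share of the values fall; for example, 25% of the prices are below 410

percentile_ranks: for each requested value, the percentage of values at or below it; for example, every price is below 1200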
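The two aggregations are inverses of each other, which a small exact computation makes visible. This is a sketch on a made-up price list using the nearest-rank method; ES itself computes approximate percentiles, so its numbers can differ slightly.

```python
import math

def percentile(values, percent):
    # Nearest-rank percentile: the smallest value such that at least
    # `percent` % of the values are less than or equal to it.
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100.0 * len(ordered)))
    return ordered[rank - 1]

def percentile_rank(values, threshold):
    # Share of the values (in %) that are at or below `threshold`.
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Hypothetical ticket prices.
prices = [100, 300, 410, 500, 700, 900, 1100, 1199]

print(percentile(prices, 25))         # 300
print(percentile(prices, 100))        # 1199
print(percentile_rank(prices, 600))   # 50.0
print(percentile_rank(prices, 1200))  # 100.0
```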

3.4.2 Range bucketing

Bucketing is like group by in MySQL: it groups documents.

There are three value-range bucket aggregations: range, date_range and ip_range

// range
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_ranges":{"range":{"field":"AvgTicketPrice","ranges":[{"to":300},{"from":300,"to":600},{"from":600,"to":900},{"from":900}]}}}}

// date_range: date fields only
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"mar_flights":{"date_range":{"field":"timestamp","ranges":[{"from":"2019-03-01","to":"2019-03-30"}],"format":"yyyy-MM-dd"}}}}

// ip_range: ip fields only
POST /kibana_sample_data_logs/_search?filter_path=aggregations
{"aggs":{"local":{"ip_range":{"field":"clientip","ranges":[{"from":"157.4.77.0","to":"157.4.77.255"},{"from":"105.32.127.0","to":"105.32.127.255"}]}}}}

There are also three interval bucket aggregations: histogram, date_histogram and auto_date_histogram

// histogram: group by a fixed interval
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_histo":{"histogram":{"field":"AvgTicketPrice","interval":100,"offset":50,"keyed":false,"order":{"_count":"asc"}}}}}
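Which bucket a value lands in depends on interval and offset. The sketch below uses the bucket-key formula from the ES histogram documentation; the sample prices are invented.

```python
import math

def bucket_key(value, interval, offset=0):
    # Each bucket covers [key, key + interval); offset shifts the bucket edges.
    return math.floor((value - offset) / interval) * interval + offset

# With interval 100 and offset 50 the edges are ..., -50, 50, 150, 250, ...
for price in (49, 120, 150, 249.99):
    print(price, "->", bucket_key(price, interval=100, offset=50))
```

Without the offset the edges would sit at multiples of 100, so a price of 120 would fall into the bucket keyed 100 instead of 50.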

// date_histogram: group by a calendar or fixed time interval
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"month_flights":{"date_histogram":{"field":"timestamp","calendar_interval":"month"}}}}

// auto_date_histogram: set the number of buckets first, then group
POST /kibana_sample_data_flights/_search?size=0
{"aggs":{"age_group":{"auto_date_histogram":{"field":"timestamp","buckets":6}}}}

Nested buckets, sub-aggregations, nested aggregations

This just means stacking several aggs so that documents are grouped first and computed on afterwards. The request below, for example, buckets by time first and then computes the average delay in each bucket.

POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"query":{"term":{"OriginCountry":"CN"}},"aggs":{"date_price_histogram":{"date_histogram":{"field":"timestamp","calendar_interval":"month"},"aggs":{"avg_price":{"avg":{"field":"FlightDelayMin"}}}}}}

3.4.3 Term bucketing

Range bucketing usually works only on numeric and date types; term bucketing can handle text values.

// terms: top terms, ranked by document count
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"country_terms":{"terms":{"field":"DestCountry","size":10}},"country_terms_count":{"cardinality":{"field":"DestCountry"}}}}

// significant_terms: top terms for a subset of documents; compares doc_count (in the subset) with bg_count (in the whole index) and lists the highest ratios first
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"query":{"term":{"OriginCountry":{"value":"IE"}}},"aggs":{"dest":{"significant_terms":{"field":"DestCountry"}}}}

// significant_text: an aggregation designed for text fields; it does not need fielddata enabled, but runs more slowly
POST /kibana_sample_data_logs/_search?filter_path=aggregations
{"query":{"term":{"response":{"value":"200"}}},"aggs":{"agent_term":{"significant_text":{"field":"message"}}}}

// Sampling (sampler, diversified_sampler): limits how many documents are sampled per shard for the sub-aggregations, e.g. 100
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"query":{"term":{"OriginCountry":{"value":"IE"}}},"aggs":{"sample_data":{"diversified_sampler":{"shard_size":100,"field":"AvgTicketPrice"},"aggs":{"dest_country":{"significant_terms":{"field":"DestCountry"}}}}}}

Here "field": "AvgTicketPrice" is used for de-duplication and can be left out.

3.4.4 Single-bucket aggregations and combining aggregations

The aggregations so far are multi-bucket: they return one result per group. A single-bucket aggregation produces exactly one bucket.

// filter: as the name suggests, keeps only the documents matching a condition in the bucket; filters (second request) builds one bucket per condition
POST /kibana_sample_data_flights/_search?size=0&filter_path=aggregations
{"aggs":{"origin_cn":{"filter":{"term":{"OriginCountry":"CN"}},"aggs":{"cn_ticket_price":{"avg":{"field":"AvgTicketPrice"}}}},"avg_price":{"avg":{"field":"AvgTicketPrice"}}}}
POST /kibana_sample_data_flights/_search?size=0&filter_path=aggregations
{"aggs":{"origin_cn_us":{"filters":{"filters":[{"term":{"OriginCountry":"CN"}},{"term":{"OriginCountry":"US"}}]},"aggs":{"avg_price":{"avg":{"field":"AvgTicketPrice"}}}}}}

// global: produces a bucket that is not affected by the query
POST /kibana_sample_data_flights/_search?size=0&filter_path=aggregations
{"query":{"term":{"Carrier":{"value":"Kibana Airlines"}}},"aggs":{"kibana_avg_delay":{"avg":{"field":"FlightDelayMin"}},"all_flights":{"global":{},"aggs":{"all_avg_delay":{"avg":{"field":"FlightDelayMin"}}}}}}

// missing: puts the documents lacking a given field into one bucket
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"no_price":{"missing":{"field":"AvgTicketPrice"}}}}

// composite: takes values from several sources and combines them like a Cartesian product; the second request uses after to fetch the next page
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_weather":{"composite":{"sources":[{"avg_price":{"histogram":{"field":"AvgTicketPrice","interval":500}}},{"weather":{"terms":{"field":"OriginWeather"}}}]}}}}
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"price_weather":{"composite":{"after":{"avg_price":500.0,"weather":"Cloudy"},"sources":[{"avg_price":{"histogram":{"field":"AvgTicketPrice","interval":500}}},{"weather":{"terms":{"field":"OriginWeather"}}}]}}}}

3.4.5 Pipeline aggregations

Sibling pipeline aggregations: avg_bucket, max_bucket, min_bucket, sum_bucket, stats_bucket, extended_stats_bucket and percentiles_bucket, seven in all. Each does what its name says.

// Sibling pipeline aggregation
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"carriers":{"terms":{"field":"Carrier","size":10},"aggs":{"carrier_stat":{"stats":{"field":"AvgTicketPrice"}}}},"all_stat":{"avg_bucket":{"buckets_path":"carriers>carrier_stat.avg"}}}}

Parent pipeline aggregations: moving_avg, moving_fn, bucket_script, bucket_selector, bucket_sort, derivative, cumulative_sum and serial_diff, eight in all. The first two, starting with moving, are sliding-window aggregations (moving_avg is deprecated in favor of moving_fn); the middle three, starting with bucket, operate on each bucket individually.

// moving_fn: sliding window
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"day_price":{"date_histogram":{"field":"timestamp","calendar_interval":"day"},"aggs":{"avg_price":{"avg":{"field":"AvgTicketPrice"}},"smooth_price":{"moving_fn":{"buckets_path":"avg_price","window":10,"script":"MovingFunctions.unweightedAvg(values)"}}}}}}

// Per-bucket operations: bucket_script, bucket_selector, bucket_sort
POST /kibana_sample_data_flights/_search?filter_path=aggregations
{"aggs":{"date_price_diff":{"date_histogram":{"field":"timestamp","fixed_interval":"1d"},"aggs":{"stat_price_day":{"stats":{"field":"AvgTicketPrice"}},"diff":{"bucket_script":{"buckets_path":{"max_price":"stat_price_day.max","min_price":"stat_price_day.min"},"script":"params.max_price - params.min_price"}},"gt990":{"bucket_selector":{"buckets_path":{"max_price":"stat_price_day.max","min_price":"stat_price_day.min"},"script":"params.max_price - params.min_price > 990"}},"sort_by":{"bucket_sort":{"sort":[{"diff":{"order":"desc"}}]}}}}}}
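The last request chains three per-bucket steps. In plain Python, the same idea looks roughly like this; the daily statistics are invented and the function is only an analogy for what the three pipeline aggregations do to the date_histogram buckets.

```python
def price_spread_report(daily_stats, threshold=990):
    rows = []
    for day, stats in daily_stats.items():
        # bucket_script: derive a new per-bucket metric from existing ones.
        diff = stats["max"] - stats["min"]
        # bucket_selector: keep the bucket only if the script returns true.
        if diff > threshold:
            rows.append((day, diff))
    # bucket_sort: order the surviving buckets by the derived metric.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Hypothetical per-day stats, shaped like the stat_price_day sub-aggregation.
daily_stats = {
    "2019-03-01": {"min": 100.0, "max": 1195.0},
    "2019-03-02": {"min": 300.0, "max": 1100.0},
    "2019-03-03": {"min": 101.0, "max": 1199.0},
}

print(price_spread_report(daily_stats))
# [('2019-03-03', 1098.0), ('2019-03-01', 1095.0)]
```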

3.4.6 Parent-child relationships

A parent-child relationship in ES is a relationship between documents inside a single index: the parent and child documents must belong to the same index and are linked through the parent document's id. A child must also be indexed with routing set to the parent id so that both land on the same shard, which is why the requests below pass routing=1.

// join: creating parent and child documents
PUT /employees
{"mappings":{"properties":{"management":{"type":"join","relations":{"manager":"member"}}}}}
PUT /employees/_doc/1
{"name":"tom","management":{"name":"manager"}}
PUT /employees/_doc/2?routing=1
{"name":"smith","management":{"name":"member","parent":"1"}}
PUT /employees/_doc/3?routing=1
{"name":"john","management":{"name":"member","parent":"1"}}

// has_child: find parent documents by their children
POST /employees/_search
{"query":{"has_child":{"type":"member","query":{"match":{"name":"smith"}}}}}

// has_parent: find child documents by their parent
POST /employees/_search
{"query":{"has_parent":{"parent_type":"manager","query":{"match":{"name":"tom"}}}}}

// parent_id: find child documents by the parent document's id
POST /employees/_search
{"query":{"parent_id":{"type":"member","id":1}}}

// children aggregation: search for parent documents, then aggregate over all child documents joined to them
POST /employees/_search?filter_path=aggregations
{"query":{"term":{"name":"tom"}},"aggs":{"members":{"children":{"type":"member"},"aggs":{"member_name":{"terms":{"field":"name.keyword","size":10}}}}}}

// parent aggregation: search for child documents, then aggregate over their parent documents
POST /employees/_search?filter_path=aggregations
{"query":{"match":{"name":"smith"}},"aggs":{"who_is_manager":{"parent":{"type":"member"},"aggs":{"manager_name":{"terms":{"field":"name.keyword","size":10}}}}}}

3.4.7 Nested type

The nested type exists to support arrays of objects: a separate document is created for each object in the array, so that each object's fields stay together and can be searched as a unit.

// Creating nested documents
PUT /colleges
{"mappings":{"properties":{"address":{"type":"nested"},"age":{"type":"integer"}}}}
PUT /colleges/_doc/1
{"address":{"country":"CN","city":"BJ"},"age":10}
PUT /colleges/_doc/2
{"address":[{"country":"CN","city":"BJ"},{"country":"US","city":"NY"}],"age":10}

// Querying nested documents
POST /colleges/_search
{"query":{"nested":{"path":"address","query":{"bool":{"must":[{"match":{"address.country":"CN"}},{"match":{"address.city":"NY"}}]}}}}}
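Why this matters shows up with document 2 and the query above (country CN and city NY). A plain object field flattens the array into independent lists of values, so the two conditions can be satisfied by different objects; nested keeps each object separate. The Python below is only a model of that difference, with function names made up for the illustration.

```python
# The address array of document 2 from the example above.
doc2_addresses = [
    {"country": "CN", "city": "BJ"},
    {"country": "US", "city": "NY"},
]

def flattened_match(addresses, country, city):
    # Plain object arrays behave like two independent multi-valued fields:
    # address.country = {CN, US}, address.city = {BJ, NY}.
    countries = {a["country"] for a in addresses}
    cities = {a["city"] for a in addresses}
    return country in countries and city in cities

def nested_match(addresses, country, city):
    # Nested: both conditions must hold within the same object.
    return any(a["country"] == country and a["city"] == city for a in addresses)

print(flattened_match(doc2_addresses, "CN", "NY"))  # True (a false positive)
print(nested_match(doc2_addresses, "CN", "NY"))     # False
print(nested_match(doc2_addresses, "US", "NY"))     # True
```

With the nested mapping used here, the CN plus NY query therefore returns no documents.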

// Aggregating over nested documents
POST /colleges/_search?filter_path=aggregations
{"aggs":{"nested_address":{"nested":{"path":"address"},"aggs":{"city_names":{"terms":{"field":"address.city.keyword","size":10}}}}}}

// reverse_nested aggregation: steps back up to the root documents; here it amounts to the average age per city
POST /colleges/_search?filter_path=aggregations
{"aggs":{"nested_address":{"nested":{"path":"address"},"aggs":{"city_names":{"terms":{"field":"address.city.keyword","size":10},"aggs":{"avg_age_in_city":{"reverse_nested":{},"aggs":{"avg_age":{"avg":{"field":"age"}}}}}}}}}}

3.4.8 SQL support

// Simple query
POST /_sql?format=txt
{"query":""" select DestCountry, OriginCountry, AvgTicketPrice from kibana_sample_data_flights where Carrier = 'Kibana Airlines' order by AvgTicketPrice desc """}

// Cursor-based paging
POST /_sql?format=json
{"query":""" select DestCountry, OriginCountry, AvgTicketPrice from kibana_sample_data_flights where Carrier = 'Kibana Airlines' order by AvgTicketPrice desc ""","fetch_size":2}

// describe: show information about an index
POST /_sql?format=txt
{"query":"describe kibana_sample_data_flights"}
// show
POST /_sql?format=txt
{"query":"show columns in kibana_sample_data_flights"}
POST /_sql?format=txt
{"query":"show functions"}
POST /_sql?format=txt
{"query":"show tables"}

// Full-text search support
POST /_sql?format=txt
{"query":""" select DestCountry, OriginCountry, AvgTicketPrice, score() from kibana_sample_data_flights where match(DestCountry,'CN') """}
POST /_sql?format=txt
{"query":""" select DestCountry, OriginCountry, AvgTicketPrice, score() from kibana_sample_data_flights where query('DestCountry:CN') """}