10. Elasticsearch (ES): Advanced Usage

Author: sunxj1222

Original article: https://blog.csdn.net/sunxj1222/article/details/106640200

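I. Term vector

1. Term vector: retrieves statistics about each term within a given field of a document.

Term information: term frequency in the field, term positions, start and end offsets, term payloads.
Term statistics (returned when term_statistics=true): total term frequency, i.e. how many times a term occurs across all documents; document frequency, i.e. how many documents contain the term.
Field statistics: document count, i.e. how many documents contain the field; sum of document frequencies, i.e. the sum of df over all terms in the field; sum of total term frequencies, i.e. the sum of ttf over all terms in the field.

2. Index-time and query-time

(1) Index-time: configure term_vector in the mapping, and the term and field statistics are generated and stored while documents are indexed.
(2) Query-time: no term vector information was stored in advance; when you request the term vectors, the statistics are computed on the fly and returned.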

PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",------>index-time
"store" : true,
"analyzer" : "fulltext_analyzer"
},
"fullname": {----->query-time
"type": "text",
"analyzer" : "fulltext_analyzer"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
}
}

3. Experiment

(1) Index some data

PUT /my_index/my_type/1
{
"fullname" : "Leo Li",
"text" : "hello test test test "
}

PUT /my_index/my_type/2
{
"fullname" : "Leo Li",
"text" : "other hello test ..."
}

(2) View term information

GET /my_index/my_type/1/_termvectors------------>view term statistics for a field
{
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}

(3) Explanation of the result

{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_version": 1,
"found": true,
"took": 10,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 6,-------------------->sum of doc_freq over all terms in this field
"doc_count": 2,--------------------->number of docs that contain this field
"sum_ttf": 8-------------------->sum of ttf over all terms in this field
},
"terms": {
"hello": {--------------------->the term
"doc_freq": 2,--------------------->number of docs that contain this term
"ttf": 2,--------------------->total number of occurrences of this term across all docs
"term_freq": 1,--------------------->number of occurrences of this term in this field of the current doc
"tokens": [--------------------->each occurrence of the term in this field of the current doc is one token
{
"position": 0,--------------------->position of the token within the field, counted in tokens
"start_offset": 0,--------------------->character offset at which the token starts
"end_offset": 5,--------------------->character offset at which the token ends
"payload": "d29yZA=="--------------------->base64-encoded payload of the token
}
]
},
"test": {
"doc_freq": 2,
"ttf": 4,
"term_freq": 3,
"tokens": [
{
"position": 1,
"start_offset": 6,
"end_offset": 10,
"payload": "d29yZA=="
},
{
"position": 2,
"start_offset": 11,
"end_offset": 15,
"payload": "d29yZA=="
},
{
"position": 3,
"start_offset": 16,
"end_offset": 20,
"payload": "d29yZA=="
}
]
}
}
}
}
}
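
The numbers in this response can be reproduced by hand. The following is a minimal Python sketch, not how ES computes them internally: it assumes the analyzer behaves like a plain whitespace split plus lowercasing, and the helper names (analyze, doc_freq, ttf) are made up for illustration. The payload appears to be the base64 encoding of the token type written by the type_as_payload filter.

```python
import base64
from collections import Counter

# The "text" field of the two sample documents indexed above.
docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}

def analyze(text):
    """Assumed stand-in for fulltext_analyzer: whitespace tokenizer + lowercase."""
    return text.lower().split()

# Per document: term -> number of occurrences (term_freq).
per_doc = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}

def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(1 for counts in per_doc.values() if term in counts)

def ttf(term):
    """Total number of occurrences of the term across all documents."""
    return sum(counts[term] for counts in per_doc.values())

all_terms = set().union(*per_doc.values())
field_statistics = {
    "doc_count": len(per_doc),
    "sum_doc_freq": sum(doc_freq(t) for t in all_terms),
    "sum_ttf": sum(ttf(t) for t in all_terms),
}

# Decode the payload shown in the response.
payload = base64.b64decode("d29yZA==")

print(field_statistics, doc_freq("test"), ttf("test"), per_doc["1"]["test"], payload)
```

Note that sum_doc_freq counts the tokens "other" and "..." from document 2 as well, which is why it is 6 rather than 4.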

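4. Term vectors for a manually supplied doc: pass an artificial document in the request, and ES computes statistics for its terms against the docs that already exist in the index.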

GET /my_index/my_type/_termvectors
{
"doc" : {
"fullname" : "Leo Li",
"text" : "hello test test test"
},
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}

5. Multi term vectors

GET _mtermvectors
{
"docs": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"term_statistics": true
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"fields": [
"text"
]
}
]
}

II. Highlight

1. The fields listed under highlight must line up one-to-one with the fields used in the query.

2. Example:

PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
}

Analyzer test:
GET _analyze
{
"text":"我发表的第一篇博客",
"analyzer":"ik_max_word"
}

PUT /blog_website/blogs/1
{
"title": "我的第一篇博客",
"content": "大家好，这是我写的第一篇博客，特别喜欢这个博客网站！！！"
}

GET /blog_website/blogs/_search
{
"query": {
"match": {
"title": "博客"
}
},
"highlight": {------->corresponds to the query above: highlights the search term in title
"fields": {
"title": {}
}
}
}
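
To make the effect concrete, here is a rough Python sketch of what a highlighter conceptually does: it wraps each matched term in a pair of tags. The `<em>` tags are the ES defaults; the function itself is a simplification for illustration (it matches raw substrings rather than analyzed tokens), not the actual ES implementation.

```python
import re

def highlight(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap every occurrence of the given terms in pre_tag/post_tag."""
    if not terms:
        return text
    # Try longer terms first so that overlapping terms prefer the longest match.
    pattern = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.sub(pattern, lambda m: pre_tag + m.group(0) + post_tag, text)

fragment = highlight("我的第一篇博客", ["博客"])
print(fragment)
```

Because only title appears under highlight.fields, only title fragments come back highlighted, which is why the highlight fields have to mirror the query fields.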

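3. Three kinds of highlighter

(1) Plain highlighter (the default)

(2) Postings highlighter

Faster than the plain highlighter, because the text being highlighted does not have to be re-analyzed.
Uses less disk space than the term vectors required by the fast vector highlighter.
Splits the text into sentences and highlights sentence by sentence, which gives better results.

PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"index_options": "offsets"--------->enables the postings highlighter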
}
}
}
}
}

(3) Fast vector highlighter

Performs better on large fields (larger than 1 MB).

PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"term_vector" : "with_positions_offsets"--------------->enables term vectors at index time
}
}
}
}
}

4. Forcing a particular highlighter

GET /blog_website/blogs/_search
{
"query": {
"match": {
"content": "博客"
}
},
"highlight": {
"fields": {
"content": {
"type": "plain"----------->forces the use of a particular highlighter
}
}
}
}

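5. Summary

In most cases the plain highlighter is enough and needs no extra configuration.
If highlighting performance matters a lot, try enabling the postings highlighter.
If the field values are very large (over 1 MB), use the fast vector highlighter.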

III. Search template

1. Notes

a. {{xx}}: a parameter placeholder
b. Keyword: inline

2. Example 1: getting started

GET /blog_website/blogs/_search/template
{
"inline" : {
"query": {
"match" : {
"{{field}}" : "{{value}}" --------->this is the template
}
}
},
"params" : {
"field" : "title",
"value" : "博客"
}
}
This is rendered into:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客"
}
}
}
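
The substitution step can be imitated in a few lines of Python. This is only a sketch of plain `{{name}}` replacement under the assumption that every parameter is a scalar; ES uses a real Mustache engine, and the render function below is a hypothetical helper.

```python
import json
import re

def render(template, params):
    """Replace every {{name}} in the serialized template with params[name]."""
    source = json.dumps(template, ensure_ascii=False)

    def substitute(match):
        value = str(params[match.group(1)])
        # JSON-escape the value (and drop the surrounding quotes) so the
        # rendered text is still valid JSON.
        return json.dumps(value, ensure_ascii=False)[1:-1]

    return json.loads(re.sub(r"\{\{(\w+)\}\}", substitute, source))

template = {"query": {"match": {"{{field}}": "{{value}}"}}}
rendered = render(template, {"field": "title", "value": "博客"})
print(rendered)
```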

3. Example 2: toJson

GET /blog_website/blogs/_search/template
{
"inline": "{\"query\": {\"match\": {{#toJson}}matchCondition{{/toJson}}}}",---->serializes a JSON object parameter into a JSON string
"params": {
"matchCondition": {
"title": "博客"
}
}
}
This is rendered into:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客"
}
}
}
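
A small Python sketch of the toJson step, assuming the section always wraps a single parameter name: the whole `{{#toJson}}...{{/toJson}}` section is replaced by the JSON serialization of that parameter. The helper name render_to_json is hypothetical.

```python
import json
import re

def render_to_json(source, params):
    """Replace {{#toJson}}name{{/toJson}} with the JSON form of params[name]."""
    def substitute(match):
        return json.dumps(params[match.group(1)], ensure_ascii=False)

    rendered = re.sub(r"\{\{#toJson\}\}(\w+)\{\{/toJson\}\}", substitute, source)
    return json.loads(rendered)

source = '{"query": {"match": {{#toJson}}matchCondition{{/toJson}}}}'
rendered = render_to_json(source, {"matchCondition": {"title": "博客"}})
print(rendered)
```

This also shows why the template has to be passed as a string here: before rendering, the text is not valid JSON.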

4. Example 3: join

GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"match": {
"title": "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"
}
}
},
"params": {
"titles": ["博客", "网站"]
}
}
This is rendered into:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客 网站"
}
}
}
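
The join section concatenates the array elements with the given delimiter, so with a single space as the delimiter the two titles end up separated by a space. A minimal Python sketch of that behaviour (render_join is a hypothetical helper, and the regular expression only covers the exact syntax used above):

```python
import re

JOIN_PATTERN = r"\{\{#join delimiter='([^']*)'\}\}(\w+)\{\{/join delimiter='[^']*'\}\}"

def render_join(source, params):
    """Replace a join section with the parameter's items joined by the delimiter."""
    def substitute(match):
        delimiter, name = match.group(1), match.group(2)
        return delimiter.join(str(item) for item in params[name])

    return re.sub(JOIN_PATTERN, substitute, source)

rendered = render_join(
    "{{#join delimiter=' '}}titles{{/join delimiter=' '}}",
    {"titles": ["博客", "网站"]},
)
print(rendered)
```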

5. Example 4: default value

GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"range": {
"views": {
"gte": "{{start}}",
"lte": "{{end}}{{^end}}20{{/end}}"--------->falls back to 20 when end is not supplied
}
}
}
},
"params": {
"start": 1,
"end": 10
}
}
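
The default works through a Mustache inverted section: `{{^end}}...{{/end}}` is rendered only when end is missing. The Python sketch below imitates just that rule, treating a parameter as missing when it is absent or None; render_with_default is a hypothetical helper, not an ES API.

```python
import re

def render_with_default(source, params):
    """Render {{name}} variables and {{^name}}...{{/name}} inverted sections."""
    def inverted(match):
        name, body = match.group(1), match.group(2)
        # The body of an inverted section is kept only when the parameter is missing.
        return body if params.get(name) is None else ""

    source = re.sub(r"\{\{\^(\w+)\}\}(.*?)\{\{/\1\}\}", inverted, source)

    def variable(match):
        value = params.get(match.group(1))
        return "" if value is None else str(value)

    return re.sub(r"\{\{(\w+)\}\}", variable, source)

lte_template = "{{end}}{{^end}}20{{/end}}"
with_end = render_with_default(lte_template, {"start": 1, "end": 10})
without_end = render_with_default(lte_template, {"start": 1})
print(with_end, without_end)
```

In the request above end is supplied as 10, so the default of 20 is not used there.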

6. Example 5: conditional

Save this more complex template in advance under the config/scripts directory of ES, with the file name conditional and the extension .mustache.

{
"query": {
"bool": {
"must": {
"match": {
"line": "{{text}}"
}
},
"filter": {
{{#line_no}} ------------------>the range filter is rendered only when line_no is supplied
"range": {
"line_no": {
{{#start}}
"gte": "{{start}}"
{{#end}},{{/end}}
{{/start}}
{{#end}}
"lte": "{{end}}"
{{/end}}
}
}
{{/line_no}}
}
}
}
}

GET /my_index/my_type/_search/template
{
"file": "conditional",
"params": {
"text": "博客",
"line_no": true,
"start": 1,
"end": 10
}
}
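
To see which query the conditional sections produce, here is a Python sketch that builds the equivalent request body directly. It is an illustration of the template's branching, assuming the sections behave like simple presence checks; build_query is a hypothetical helper.

```python
def build_query(text, line_no=False, start=None, end=None):
    """Build the body that the conditional template is expected to render."""
    query = {"query": {"bool": {"must": {"match": {"line": text}}, "filter": {}}}}
    if line_no:
        bounds = {}
        if start is not None:
            bounds["gte"] = str(start)  # the template quotes the values
        if end is not None:
            bounds["lte"] = str(end)
        query["query"]["bool"]["filter"] = {"range": {"line_no": bounds}}
    return query

full = build_query("博客", line_no=True, start=1, end=10)
bare = build_query("博客")
print(full)
print(bare)
```

The `{{#end}},{{/end}}` line inside the start section plays the role of the comma between gte and lte: it is emitted only when both bounds are present.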

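7. Use cases:
In a large team, different people often want to run similar searches.
The people responsible for low-level operations can wrap those searches in search templates and place them in the scripts directory of each ES node.
Other teams then no longer have to hand-write the same complex, general-purpose queries over and over; they call a search template and pass in a few parameters.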

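IV. Completion suggester

1. The completion suggester provides auto completion, also called search suggestions or search hints.

2. The completion type: ES implements it for very high performance. It builds neither an inverted index nor a forward index, but a dedicated data structure used purely for prefix search, and keeps all of it in memory, so prefix-based auto completion is very fast.

PUT /news_website
{
"mappings": {
"news" : {
"properties" : {
"title" : {
"type": "text",
"analyzer": "ik_max_word",
"fields": {
"suggest" : {
"type" : "completion",---------->provides prefix-based search suggestions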
"analyzer": "ik_max_word"
}
}
},
"content": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
}

Search:

GET /news_website/news/_search
{
"suggest": {
"my-suggest" : {
"prefix" : "大话西游",
"completion" : {------->prefix-based search suggestions
"field" : "title.suggest"
}
}
}
}
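
The idea of an in-memory structure built only for prefix lookups can be illustrated with a sorted list and binary search. This is a deliberately simplified stand-in: ES uses a finite state transducer rather than a sorted list, and the class name, the size parameter, and the sample titles below are all made up for the example.

```python
from bisect import bisect_left

class PrefixSuggester:
    """Keeps titles sorted in memory and answers prefix queries by binary search."""

    def __init__(self, titles):
        self._titles = sorted(set(titles))

    def suggest(self, prefix, size=5):
        # All strings sharing a prefix are contiguous in sorted order,
        # starting at the insertion point of the prefix itself.
        start = bisect_left(self._titles, prefix)
        results = []
        for title in self._titles[start:]:
            if len(results) == size or not title.startswith(prefix):
                break
            results.append(title)
        return results

# Hypothetical titles; the article does not index any sample news documents.
suggester = PrefixSuggester(["大话西游电影", "大话西游小说", "西游记", "大话西游手游"])
suggestions = suggester.suggest("大话西游")
print(suggestions)
```

A lookup touches only the entries that actually share the prefix, which is the property that makes this kind of structure fast for search-as-you-type.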

V. Dynamic mapping templates

1. ES maps fields automatically

PUT /my_index/my_type/1
{
"test_string": "hello world",
"test_number": 10
}

The default dynamic mapping that ES generates:
GET /my_index/_mapping/my_type
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"test_number": {
"type": "long"
},
"test_string": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}

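2. There are two kinds of dynamic mapping template. The first matches on the default data type detected for a newly added field. The second matches on the name of the newly added field, either against a fixed name or against a wildcard pattern. In both cases the mapping of the matching template is applied.

3. Matching templates by field type

test_number: if the value is a number, we want it mapped as integer by default.
test_string: if the value is a string, mapping it as text is fine, but the keyword sub-field should be named raw instead of keyword, still with type keyword, and keep up to 500 characters.

PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"integers": {------------------------->name of the template
"match_mapping_type": "long",----------->fields that ES would detect as long are mapped as integer instead
"mapping": {
"type": "integer"
}
}
},
{
"strings": {------------------------->name of the template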
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 500
}
}
}
}
}
]
}
}
}

4. Matching templates by field name

PUT /my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"string_as_integer": {------------------------->name of the template
"match_mapping_type": "string",
"match": "long_*",------------------------->match and unmatch filter on the field name
"unmatch": "*_text",
"mapping": {
"type": "integer"
}
}
}
]
}
}
}
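
How the two kinds of rule interact can be sketched in Python. The sketch assumes that templates are tried in order and the first match wins, approximates type detection with Python types, and uses fnmatch for the wildcard patterns (which is slightly more permissive than the simple `*` patterns ES supports). The helper names and the combined template list are for illustration only.

```python
from fnmatch import fnmatchcase

def detect_type(value):
    """Simplified version of the JSON type that dynamic mapping would detect."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

def resolve_mapping(field, value, templates):
    """Return (template name, mapping) of the first matching template, or None."""
    detected = detect_type(value)
    for template in templates:
        name, rule = next(iter(template.items()))
        if "match_mapping_type" in rule and rule["match_mapping_type"] != detected:
            continue
        if "match" in rule and not fnmatchcase(field, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field, rule["unmatch"]):
            continue
        return name, rule["mapping"]
    return None

# The templates from both examples above, combined into one list.
templates = [
    {"string_as_integer": {"match_mapping_type": "string", "match": "long_*",
                           "unmatch": "*_text", "mapping": {"type": "integer"}}},
    {"integers": {"match_mapping_type": "long", "mapping": {"type": "integer"}}},
    {"strings": {"match_mapping_type": "string", "mapping": {
        "type": "text",
        "fields": {"raw": {"type": "keyword", "ignore_above": 500}}}}},
]

print(resolve_mapping("long_id", "42", templates))
print(resolve_mapping("long_text", "hello", templates))
print(resolve_mapping("test_number", 10, templates))
```

A field named long_text passes the match pattern but is excluded by unmatch, so it falls through to the generic strings template.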

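VI. The geo_point geographic data type

A geo_point is a geographic coordinate made up of a longitude and a latitude, which together identify a unique point on the Earth.

1. Creating a mapping with the geo_point type

PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"------->a geographic coordinate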
}
}
}
}
}

2. Three ways to write a geo_point: as an object, as a "lat,lon" string, and as a [lon, lat] array

PUT my_index/my_type/1
{
"text": "Geo-point as an object",
"location": {
"lat": 41.12,------>latitude
"lon": -71.34-------->longitude
}
}

PUT my_index/my_type/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34"
}

PUT my_index/my_type/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ]
}
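
The three formats describe the same point, but the order of the coordinates differs: the string is "lat,lon" while the array is [lon, lat]. A small Python sketch that normalizes all three to a (lat, lon) tuple makes the difference explicit. parse_geo_point is a hypothetical helper, and it ignores the geohash string form that ES also accepts.

```python
def parse_geo_point(value):
    """Normalize an object, "lat,lon" string, or [lon, lat] array to (lat, lon)."""
    if isinstance(value, dict):
        return float(value["lat"]), float(value["lon"])
    if isinstance(value, str):
        lat, lon = value.split(",")  # string form: latitude first
        return float(lat), float(lon)
    lon, lat = value  # array form: longitude first
    return float(lat), float(lon)

as_object = parse_geo_point({"lat": 41.12, "lon": -71.34})
as_string = parse_geo_point("41.12,-71.34")
as_array = parse_geo_point([-71.34, 41.12])
print(as_object, as_string, as_array)
```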

3. geo_bounding_box query: finds the points inside a rectangle

GET /my_index/my_type/_search
{
"query": {
"geo_bounding_box": {------>finds the points inside a rectangle
"location": {--->field name
"top_left": {------>top-left corner
"lat": 42,
"lon": -72
},
"bottom_right": {------>bottom-right corner
"lat": 40,
"lon": -74
}
}
}
}
}

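VII. geo_point example

1. Background: given two locations, search for the hotels I want inside the rectangle formed by the Oriental Pearl Tower and Shanghai Road.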

PUT /hotel_app
{
"mappings": {
"hotels": {
"properties": {
"pin": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
}
}

PUT /hotel_app/hotels/1
{
"name": "喜来登大酒店",
"pin" : {
"location" : {
"lat" : 40.12,
"lon" : -71.34
}
}
}

2. geo_bounding_box-------->search within a rectangle

GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_bounding_box": {---------->rectangle
"pin.location": {------->field name
"top_left" : {
"lat" : 40.73,
"lon" : -74.1
},
"bottom_right" : {
"lat" : 40.01,
"lon" : -71.12
}
}
}
}
}
}
}
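
The rectangle test itself is two range checks. The Python sketch below assumes a box that does not cross the 180° meridian (so the top-left longitude is smaller than the bottom-right one); in_bounding_box is a hypothetical helper, not an ES function.

```python
def in_bounding_box(point, top_left, bottom_right):
    """Check whether a point lies inside a box given by two corners."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok

hotel = {"lat": 40.12, "lon": -71.34}
top_left = {"lat": 40.73, "lon": -74.1}
bottom_right = {"lat": 40.01, "lon": -71.12}
print(in_bounding_box(hotel, top_left, bottom_right))
```

With these corners the sample hotel falls inside the box, so the query above should return it.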

3. geo_polygon-------->search within a polygon

GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_polygon": {---------->polygon
"pin.location": {------->field name
"points": [
{"lat" : 40.73, "lon" : -74.1},
{"lat" : 40.01, "lon" : -71.12},
{"lat" : 50.56, "lon" : -90.58}
]
}
}
}
}
}
}
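
One common way to decide whether a point lies inside a polygon is ray casting: count how many polygon edges a horizontal ray from the point crosses, and treat an odd count as inside. The Python sketch below applies it to the coordinates above on a flat lat/lon plane. This is an approximation chosen for illustration and is not necessarily the algorithm ES uses.

```python
def point_in_polygon(point, polygon):
    """Ray casting on a flat lat/lon plane; polygon is a list of {"lat", "lon"}."""
    lat, lon = point["lat"], point["lon"]
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        lat_i, lon_i = polygon[i]["lat"], polygon[i]["lon"]
        lat_j, lon_j = polygon[j]["lat"], polygon[j]["lon"]
        # Does the edge (j -> i) straddle the point's latitude?
        if (lat_i > lat) != (lat_j > lat):
            # Longitude at which the edge crosses that latitude.
            crossing = (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i) + lon_i
            if lon < crossing:
                inside = not inside
        j = i
    return inside

polygon = [
    {"lat": 40.73, "lon": -74.1},
    {"lat": 40.01, "lon": -71.12},
    {"lat": 50.56, "lon": -90.58},
]
hotel = {"lat": 40.12, "lon": -71.34}
print(point_in_polygon(hotel, polygon))
```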

4. geo_distance-------->search for data within a radius of the current location

GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_distance": {--------->search for data within a radius of the current location
"distance": "200km",
"pin.location": {--------->current location
"lat": 40,
"lon": -70
}
}
}
}
}
}
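
The distance check can be approximated with the haversine formula. The Python sketch below is an estimate under the assumption of a spherical Earth with a mean radius of 6371 km; ES has its own distance implementations, so the exact figure may differ slightly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Current location from the query and the sample hotel indexed earlier.
distance = haversine_km(40, -70, 40.12, -71.34)
within_200km = distance <= 200
print(round(distance, 1), within_200km)
```

The sample hotel is a little over 100 km from the query origin, so it falls within the 200km radius.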

5. Aggregations based on geographic location

For example: how many hotels are within 0 to 100 of me, how many are between 100 and 300, and how many are beyond 300 (the query below measures distance in miles).

GET /hotel_app/hotels/_search
{
"size": 0,
"aggs": {
"agg_by_distance_range": {
"geo_distance": {
"field": "pin.location",
"origin": {-------------->current location
"lat": 40,
"lon": -70
},
"distance_type": "sloppy_arc",--->sloppy_arc (the default), arc (most accurate) and plane (fastest)
"unit": "mi",
"ranges": [
{
"to": 100
},
{
"from": 100,
"to": 300
},
{
"from": 300
}
]
}
}
}
}
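
The bucketing step of this aggregation can be sketched in Python. The sketch assumes that each range includes its from value and excludes its to value, as range aggregations in ES do; the distances are invented sample values in miles and bucket_by_distance is a hypothetical helper.

```python
def bucket_by_distance(distances, ranges):
    """Count how many distances fall into each [from, to) range."""
    counts = []
    for r in ranges:
        lower = r.get("from", 0)
        upper = r.get("to", float("inf"))
        counts.append(sum(1 for d in distances if lower <= d < upper))
    return counts

ranges = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]
# Hypothetical distances (in miles) from the origin to five hotels.
distances = [50, 100, 299.9, 300, 1200]
doc_counts = bucket_by_distance(distances, ranges)
print(doc_counts)
```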
