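I. term vector
1. term vector: returns statistics about the terms in a particular field of a document
term information: term frequency in the field, term positions, start and end offsets, term payloads
term statistics: enabled with term_statistics=true; total term frequency, the number of times a term occurs across all documents; document frequency, the number of documents that contain the term
field statistics: document count, the number of documents that contain this field; sum of document frequency, the sum of the df of every term in the field; sum of total term frequency, the sum of the ttf of every term in the field
2. index-time & query-time
(1) index-time: enable term_vector in the mapping, and the term and field statistics are generated and stored as each document is indexed
(2) query-time: no term vector information was stored beforehand; when you request the term vectors, the statistics are computed on the fly and returned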
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets_payloads",------>index-time
"store" : true,
"analyzer" : "fulltext_analyzer"
},
"fullname": {----->query-time
"type": "text",
"analyzer" : "fulltext_analyzer"
}
}
}
},
"settings" : {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
}
}
3. Experiment
(1) Index some data
PUT /my_index/my_type/1
{
"fullname" : "Leo Li",
"text" : "hello test test test "
}
PUT /my_index/my_type/2
{
"fullname" : "Leo Li",
"text" : "other hello test ..."
}
(2) View the term information
GET /my_index/my_type/1/_termvectors------------>view the term statistics of a given field
{
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
(3) The response explained
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"_version": 1,
"found": true,
"took": 10,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 6,-------------------->sum of doc_freq over every term of this field, across all docs
"doc_count": 2,--------------------->number of docs that contain this field
"sum_ttf": 8-------------------->sum of ttf over every term of this field, across all docs
},
"terms": {
"hello": {--------------------->the term
"doc_freq": 2,--------------------->number of docs that contain this term
"ttf": 2,--------------------->number of times this term occurs across all docs
"term_freq": 1,--------------------->number of times this term occurs in this field of the current doc
"tokens": [--------------------->each occurrence of the term in this field of the current doc is one token
{
"position": 0,--------------------->ordinal position of the token in this field (0 = first token)
"start_offset": 0,--------------------->character offset at which the token starts
"end_offset": 5,--------------------->character offset just past the token's last character
"payload": "d29yZA=="--------------------->base64-encoded payload; here it decodes to "word", the token type written by type_as_payload
}
]
},
"test": {
"doc_freq": 2,
"ttf": 4,
"term_freq": 3,
"tokens": [
{
"position": 1,
"start_offset": 6,
"end_offset": 10,
"payload": "d29yZA=="
},
{
"position": 2,
"start_offset": 11,
"end_offset": 15,
"payload": "d29yZA=="
},
{
"position": 3,
"start_offset": 16,
"end_offset": 20,
"payload": "d29yZA=="
}
]
}
}
}
}
}
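The numbers in the response above can be reproduced by hand. The following is a minimal sketch, not how ES computes them internally: it assumes the fulltext_analyzer behaves like a plain whitespace split plus lowercasing, and the helper name analyze is made up for illustration.

```python
import base64
from collections import Counter

# The "text" field of the two sample documents indexed above.
docs = {
    "1": "hello test test test ",
    "2": "other hello test ...",
}

def analyze(text):
    """Assumed stand-in for a whitespace tokenizer + lowercase filter."""
    tokens, cursor = [], 0
    for position, raw in enumerate(text.split()):
        start = text.index(raw, cursor)   # character offset where the token starts
        end = start + len(raw)            # offset just past its last character
        tokens.append({"term": raw.lower(), "position": position,
                       "start_offset": start, "end_offset": end})
        cursor = end
    return tokens

analyzed = {doc_id: analyze(text) for doc_id, text in docs.items()}

# term_freq: occurrences of each term inside one document
term_freq = {d: Counter(t["term"] for t in toks) for d, toks in analyzed.items()}
# ttf: occurrences across all docs; doc_freq: number of docs containing the term
ttf = sum(term_freq.values(), Counter())
doc_freq = Counter(term for tf in term_freq.values() for term in tf)

field_statistics = {
    "doc_count": sum(1 for toks in analyzed.values() if toks),
    "sum_doc_freq": sum(doc_freq.values()),
    "sum_ttf": sum(ttf.values()),
}
print(field_statistics)
print(term_freq["1"]["test"], doc_freq["test"], ttf["test"])
print(base64.b64decode("d29yZA==").decode())  # the payload is the token type
```

With the two sample documents this gives doc_count 2, sum_doc_freq 6 and sum_ttf 8, matching field_statistics in the response.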
4. Term vectors for a manually supplied doc---->pass in a document that is not in the index, and ES computes statistics for its terms against all existing docs
GET /my_index/my_type/_termvectors
{
"doc" : {
"fullname" : "Leo Li",
"text" : "hello test test test"
},
"fields" : ["text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
5. multi term vector
GET _mtermvectors
{
"docs": [
{
"_index": "my_index",
"_type": "my_type",
"_id": "2",
"term_statistics": true
},
{
"_index": "my_index",
"_type": "my_type",
"_id": "1",
"fields": [
"text"
]
}
]
}
II. highlight
1. The fields listed under highlight must line up one-to-one with the fields used in the query
2. Example:
PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
}
Analyzer test:
GET _analyze
{
"text":"我发表的第一篇博客",
"analyzer":"ik_max_word"
}
PUT /blog_website/blogs/1
{
"title": "我的第一篇博客",
"content": "大家好,这是我写的第一篇博客,特别喜欢这个博客网站!!!"
}
GET /blog_website/blogs/_search
{
"query": {
"match": {
"title": "博客"
}
},
"highlight": {------->matches the query above: highlights the search term inside title
"fields": {
"title": {}
}
}
}
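To make the effect of the request above concrete, here is a rough sketch of what highlighting produces. It is only an illustration: the real highlighters work on analyzed tokens and their offsets, whereas this hypothetical highlight helper does plain substring matching. The <em> tags are the default highlight tags.

```python
import re

def highlight(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Wrap every occurrence of the given terms in highlight tags."""
    if not terms:
        return text
    # Longest terms first so that overlapping terms prefer the longer match.
    pattern = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.sub(pattern, lambda m: pre_tag + m.group(0) + post_tag, text)

# The title of document 1, highlighted for the query term used above.
print(highlight("我的第一篇博客", ["博客"]))
```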
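3. The three highlighters
(1) plain highlight: the default
(2) posting highlight
Faster than plain highlight, because the text being highlighted does not have to be re-analyzed
Uses less disk space than the term vectors that fast vector highlight needs
Splits the text into sentences and highlights per sentence, which gives better snippets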
PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"index_options": "offsets"--------->posting highlight
}
}
}
}
}
(3) fast vector highlight
Performs better on large fields (over 1 MB)
PUT /blog_website
{
"mappings": {
"blogs": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word"
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"term_vector" : "with_positions_offsets"--------------->enables term vectors at index-time
}
}
}
}
}
4. Forcing a particular highlighter
GET /blog_website/blogs/_search
{
"query": {
"match": {
"content": "博客"
}
},
"highlight": {
"fields": {
"content": {
"type": "plain"----------->forces this highlighter
}
}
}
}
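5. Summary
In most cases plain highlight is enough and needs no extra settings
If highlighting performance matters a lot, try enabling posting highlight
If the field values are very large (over 1 MB), use fast vector highlight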
III. template
1. Notes
a. {{xx}}: a parameter name
b. keyword: inline
2. Example 1: getting started
GET /blog_website/blogs/_search/template
{
"inline" : {
"query": {
"match" : {
"{{field}}" : "{{value}}" --------->this is the template
}
}
},
"params" : {
"field" : "title",
"value" : "博客"
}
}
Which ES expands to:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客"
}
}
}
3. Example 2: toJson
GET /blog_website/blogs/_search/template
{
"inline": "{\"query\": {\"match\": {{#toJson}}matchCondition{{/toJson}}}}",---->renders a JSON object as a JSON string
"params": {
"matchCondition": {
"title": "博客"
}
}
}
Which ES expands to:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客"
}
}
}
4. Example 3: join
GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"match": {
"title": "{{#join delimiter=' '}}titles{{/join delimiter=' '}}"
}
}
},
"params": {
"titles": ["博客", "网站"]
}
}
Which ES expands to:
GET /blog_website/blogs/_search
{
"query": {
"match" : {
"title" : "博客 网站"
}
}
}
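The toJson and join examples boil down to two familiar operations: serializing a parameter to JSON, and joining a list with a delimiter. A small sketch of the equivalent steps follows; it only imitates the expansion and is not the Mustache engine ES uses.

```python
import json

# toJson: the matchCondition parameter is rendered as a JSON object.
match_condition = {"title": "博客"}
to_json_query = ('{"query": {"match": '
                 + json.dumps(match_condition, ensure_ascii=False)
                 + "}}")
print(to_json_query)

# join: the titles parameter is concatenated with the delimiter ' '.
titles = ["博客", "网站"]
joined = " ".join(titles)
join_query = {"query": {"match": {"title": joined}}}
print(json.dumps(join_query, ensure_ascii=False))
```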
5. Example 4: default value
GET /blog_website/blogs/_search/template
{
"inline": {
"query": {
"range": {
"views": {
"gte": "{{start}}",
"lte": "{{end}}{{^end}}20{{/end}}"--------->falls back to 20 when end is not supplied
}
}
}
},
"params": {
"start": 1,
"end": 10
}
}
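How the default value works is easier to see with a toy renderer. The sketch below implements only a tiny, assumed subset of Mustache ({{x}}, {{#x}}...{{/x}} and {{^x}}...{{/x}}, keyed on whether the parameter is present and truthy); the function name render is hypothetical and real Mustache has more rules than this.

```python
import re

def render(template, params):
    """Tiny subset of Mustache: inverted sections, sections, variables."""
    def inverted(m):   # kept only when the parameter is missing or falsy
        return "" if params.get(m.group(1)) else m.group(2)

    def section(m):    # kept only when the parameter is present and truthy
        return m.group(2) if params.get(m.group(1)) else ""

    out = re.sub(r"\{\{\^(\w+)\}\}(.*?)\{\{/\1\}\}", inverted, template, flags=re.S)
    out = re.sub(r"\{\{#(\w+)\}\}(.*?)\{\{/\1\}\}", section, out, flags=re.S)
    # Plain variables; a missing parameter renders as an empty string.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params.get(m.group(1), "")), out)

template = '{"range": {"views": {"gte": "{{start}}", "lte": "{{end}}{{^end}}20{{/end}}"}}}'
with_end = render(template, {"start": 1, "end": 10})
without_end = render(template, {"start": 1})
print(with_end)
print(without_end)
```

When end is passed, {{end}} renders and the inverted section is dropped; when it is omitted, {{end}} renders as nothing and the inverted section supplies 20.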
6. Example 5: conditional
Save this more complex template in advance under ES's config/scripts directory, with the .mustache extension and the file name conditional
{
"query": {
"bool": {
"must": {
"match": {
"line": "{{text}}"
}
},
"filter": {
{{#line_no}} ------------------>only rendered when line_no is supplied in params
"range": {
"line_no": {
{{#start}}
"gte": "{{start}}"
{{#end}},{{/end}}
{{/start}}
{{#end}}
"lte": "{{end}}"
{{/end}}
}
}
{{/line_no}}
}
}
}
}
GET /my_index/my_type/_search/template
{
"file": "conditional",
"params": {
"text": "博客",
"line_no": true,
"start": 1,
"end": 10
}
}
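7. Use cases:
In a large team, different people often need to run similar searches
The people who handle low-level operations can wrap those searches into search templates and place them in the scripts directory of every ES node
The other teams then no longer have to hand-write the same complex, general-purpose queries over and over; they just call a search template and pass in a few parameters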
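IV. completion suggest
1. suggest, completion suggest: auto-completion, search recommendations, search hints --> auto completion
2. The completion type: ES implements it for very high performance. It builds neither an inverted index nor a forward index, but a special data structure used purely for prefix lookups, and keeps all of it in memory, so the prefix suggestions behind auto completion are very fast
PUT /news_website
{
"mappings": {
"news" : {
"properties" : {
"title" : {
"type": "text",
"analyzer": "ik_max_word",
"fields": {
"suggest" : {
"type" : "completion",---------->prefix-based search suggestions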
"analyzer": "ik_max_word"
}
}
},
"content": {
"type": "text",
"analyzer": "ik_max_word"
}
}
}
}
}
Search:
GET /news_website/news/_search
{
"suggest": {
"my-suggest" : {
"prefix" : "大话西游",
"completion" : {------->prefix-based search suggestions
"field" : "title.suggest"
}
}
}
}
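The idea of a dedicated in-memory structure for prefix lookups can be sketched with a sorted list and binary search. This is only a stand-in chosen for brevity, not the structure ES builds, and it matches on the whole input string rather than on analyzed tokens. The class name PrefixSuggester and the sample titles are made up for illustration.

```python
import bisect

class PrefixSuggester:
    """In-memory prefix lookup over a sorted list of inputs."""

    def __init__(self, inputs):
        self._sorted = sorted(set(inputs))

    def suggest(self, prefix, size=5):
        # All strings sharing a prefix are contiguous in sorted order,
        # so binary search finds the first candidate directly.
        start = bisect.bisect_left(self._sorted, prefix)
        results = []
        for candidate in self._sorted[start:]:
            if len(results) == size or not candidate.startswith(prefix):
                break
            results.append(candidate)
        return results

# Hypothetical news titles, not taken from the notes above.
suggester = PrefixSuggester(["大话西游电影", "大话西游小说", "西游记"])
print(suggester.suggest("大话西游"))
```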
V. dynamic mapping template
1. ES maps fields automatically
PUT /my_index/my_type/1
{
"test_string": "hello world",
"test_number": 10
}
The default dynamic mapping that ES generates:
GET /my_index/_mapping/my_type
{
"my_index": {
"mappings": {
"my_type": {
"properties": {
"test_number": {
"type": "long"
},
"test_string": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
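2. There are two kinds of dynamic mapping template. The first matches on the data type that ES detects by default for a newly added field. The second matches on the new field's name, either against a predefined name or against a predefined wildcard pattern. In both cases a match applies the mapping predefined in that template
3. Matching a mapping template by field type
test_number: if the value is a number, we want it mapped as integer by default
test_string: if the value is a string, mapping it as text is fine, but we want the built-in sub-field to be named raw instead of keyword; its type is still keyword, and it keeps up to 500 characters
PUT my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"integers": {------------------------->name of the template
"match_mapping_type": "long",----------->fields that ES would detect as long are mapped as integer instead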
"mapping": {
"type": "integer"
}
}
},
{
"strings": {------------------------->name of the template
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 500
}
}
}
}
}
]
}
}
}
4. Matching a mapping template by field name
PUT /my_index
{
"mappings": {
"my_type": {
"dynamic_templates": [
{
"string_as_integer": {------------------------->name of the template
"match_mapping_type": "string",
"match": "long_*",------------------------->match and unmatch filter on the field name
"unmatch": "*_text",
"mapping": {
"type": "integer"
}
}
}
]
}
}
}
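The way match_mapping_type, match and unmatch combine can be sketched as a small selection function. This is an approximation under stated assumptions: the type detection is simplified, wildcard matching is delegated to Python's fnmatch, and the helper names detected_type and pick_mapping are hypothetical. As in ES, the first template that matches is the one applied.

```python
from fnmatch import fnmatchcase

def detected_type(value):
    """Simplified guess at the type dynamic mapping would detect for a JSON value."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "object"

templates = [
    {"name": "string_as_integer", "match_mapping_type": "string",
     "match": "long_*", "unmatch": "*_text", "mapping": {"type": "integer"}},
]

def pick_mapping(field, value, templates):
    """Return the mapping of the first template whose conditions all hold."""
    for t in templates:
        if "match_mapping_type" in t and t["match_mapping_type"] != detected_type(value):
            continue
        if "match" in t and not fnmatchcase(field, t["match"]):
            continue
        if "unmatch" in t and fnmatchcase(field, t["unmatch"]):
            continue
        return t["mapping"]
    return None  # no template applies: the default dynamic mapping is used

print(pick_mapping("long_num", "5", templates))     # matches long_*
print(pick_mapping("long_text", "abc", templates))  # excluded by *_text
```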
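VI. geo_point: the geographic location data type
geo_point is a geographic coordinate made up of a longitude and a latitude; together the two uniquely identify a point on the Earth
1. Create a mapping with a geo_point field
PUT /my_index
{
"mappings": {
"my_type": {
"properties": {
"location": {
"type": "geo_point"------->a geographic coordinate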
}
}
}
}
}
2. Three ways to write a geo_point (the string form is "lat,lon", while the array form is [lon, lat])
PUT my_index/my_type/1
{
"text": "Geo-point as an object",
"location": {
"lat": 41.12,------>latitude
"lon": -71.34-------->longitude
}
}
PUT my_index/my_type/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34"
}
PUT my_index/my_type/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ]
}
3. geo_bounding_box query: finds the points inside a rectangle
GET /my_index/my_type/_search
{
"query": {
"geo_bounding_box": {------>finds the points inside a rectangle
"location": {--->name of the field
"top_left": {------>top-left corner
"lat": 42,
"lon": -72
},
"bottom_right": {------>bottom-right corner
"lat": 40,
"lon": -74
}
}
}
}
}
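VII. geo_point example
1. Background: given two places, I want to search for hotels inside the rectangle formed by the Oriental Pearl Tower and Shanghai Road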
PUT /hotel_app
{
"mappings": {
"hotels": {
"properties": {
"pin": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
}
}
PUT /hotel_app/hotels/1
{
"name": "喜来登大酒店",
"pin" : {
"location" : {
"lat" : 40.12,
"lon" : -71.34
}
}
}
2. geo_bounding_box-------->search inside a rectangle
GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_bounding_box": {---------->rectangle
"pin.location": {------->name of the field
"top_left" : {
"lat" : 40.73,
"lon" : -74.1
},
"bottom_right" : {
"lat" : 40.01,
"lon" : -71.12
}
}
}
}
}
}
}
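The rectangle test in the filter above amounts to two range checks, one on latitude and one on longitude. A minimal sketch, assuming the box does not cross the 180° meridian (the function name in_bounding_box is made up for illustration):

```python
def in_bounding_box(point, top_left, bottom_right):
    """True if point lies inside the rectangle given by its two corners."""
    lat_ok = bottom_right["lat"] <= point["lat"] <= top_left["lat"]
    # Assumes top_left.lon <= bottom_right.lon, i.e. no wrap across 180 degrees.
    lon_ok = top_left["lon"] <= point["lon"] <= bottom_right["lon"]
    return lat_ok and lon_ok

hotel = {"lat": 40.12, "lon": -71.34}          # the hotel indexed above
top_left = {"lat": 40.73, "lon": -74.1}
bottom_right = {"lat": 40.01, "lon": -71.12}
print(in_bounding_box(hotel, top_left, bottom_right))
```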
3. geo_polygon-------->search inside a polygon
GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_polygon": {---------->polygon
"pin.location": {------->name of the field
"points": [
{"lat" : 40.73, "lon" : -74.1},
{"lat" : 40.01, "lon" : -71.12},
{"lat" : 50.56, "lon" : -90.58}
]
}
}
}
}
}
}
4. geo_distance-------->finds data within a radius of the current location
GET /hotel_app/hotels/_search
{
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_distance": {--------->finds data within a radius of the current location
"distance": "200km",
"pin.location": {--------->current location
"lat": 40,
"lon": -70
}
}
}
}
}
}
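Whether a point passes the 200km filter depends on its great-circle distance from the given location. The sketch below uses the haversine formula with an assumed mean Earth radius; it illustrates the idea and is not necessarily the formula ES applies internally.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def haversine_km(a, b):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

current = {"lat": 40, "lon": -70}        # the location in the query above
hotel = {"lat": 40.12, "lon": -71.34}    # the hotel indexed earlier
distance = haversine_km(current, hotel)
print(round(distance, 1), distance <= 200)
```

The hotel is roughly 115 km from the query location, so it falls inside the 200km radius.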
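5. Aggregations based on location
For example: how many hotels lie within 100 miles of me, how many between 100 and 300 miles, and how many beyond 300 miles (the request sets "unit": "mi", so the ranges are in miles)
GET /hotel_app/hotels/_search
{
"size": 0,
"aggs": {
"agg_by_distance_range": {
"geo_distance": {
"field": "pin.location",
"origin": {-------------->current location
"lat": 40,
"lon": -70
},
"distance_type": "sloppy_arc",--->sloppy_arc (the default), arc (most accurate) and plane (fastest)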
"unit": "mi",
"ranges": [
{
"to": 100
},
{
"from": 100,
"to": 300
},
{
"from": 300
}
]
}
}
}
}
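The bucketing done by the ranges above can be sketched as follows. The distances are hypothetical values in miles, not results from the index, and the function name bucket_counts is made up; each range includes its from value and excludes its to value.

```python
ranges = [{"to": 100}, {"from": 100, "to": 300}, {"from": 300}]

def bucket_counts(distances, ranges):
    """Count how many distances fall into each [from, to) range."""
    buckets = []
    for r in ranges:
        low = r.get("from", 0)
        high = r.get("to", float("inf"))
        count = sum(1 for d in distances if low <= d < high)
        buckets.append({"from": r.get("from"), "to": r.get("to"), "doc_count": count})
    return buckets

# Hypothetical hotel distances from the origin, in miles.
distances_mi = [71.3, 100.0, 250.0, 420.5]
for bucket in bucket_counts(distances_mi, ranges):
    print(bucket)
```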