November update!
- Custom analyzer (pinyin)
PUT user { "settings": { "analysis": { "analyzer": { "pinyin_analyzer":{ "tokenizer":"my_piniyin" } }, "tokenizer": { "my_piniyin":{ "type":"pinyin", "keep_full_pinyin":true, "keep_original":true, "limit_first_letter_length":16, "lowercase":true, "remove_duplicated_term":true, "keep_separate_first_letter":false } } } }, "mappings": { "properties": { "name":{ "type": "keyword", "fields": { "my_pinyin":{ "type":"text", "analyzer":"pinyin_analyzer" } } } } } }
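We first create an index with the settings above (this requires the pinyin analysis plugin to be installed). In `settings` we define a custom analyzer named `pinyin_analyzer` whose tokenizer is `my_piniyin` (spelled as in the request), and configure the options of the pinyin tokenizer. The one that matters most, I think, is `keep_full_pinyin: true`, which converts every Chinese character to its full pinyin; for the other options see the docs at https://github.com/medcl/elasticsearch-analysis-pinyin. In `mappings`, `name` is a keyword field with a `my_pinyin` sub-field that uses this analyzer. Next, let's run the analyzer on some text.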
{ "tokens" : [ { "token" : "liu", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 }, { "token" : "刘德华", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 }, { "token" : "ldh", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 }, { "token" : "de", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 }, { "token" : "hua", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 } ] }
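As you can see, the pinyin tokenizer has split 刘德华 into fairly detailed tokens: the pinyin of each character, the original text, and the first-letter abbreviation `ldh`. A `term` query against the inverted index on any of these tokens finds the document, which is pretty handy.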
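The request that produced the token list is not shown above. Here is a minimal sketch of what it might look like, assuming the `_analyze` API is called on the `user` index with the `pinyin_analyzer` defined earlier. The helper `build_analyze_request` is a hypothetical name used only for illustration, and nothing is sent to a cluster:

```python
import json


def build_analyze_request(index, analyzer, text):
    """Return the (endpoint, JSON body) of an _analyze call without sending it."""
    endpoint = f"GET {index}/_analyze"
    # ensure_ascii=False keeps the Chinese text readable in the body.
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return endpoint, body


endpoint, body = build_analyze_request("user", "pinyin_analyzer", "刘德华")
print(endpoint)
print(body)
```

Pasting the printed endpoint and body into the console should return a token list like the one above, provided the plugin is installed.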
- Index aliases
POST _aliases { "actions": [ { "add": { "index": "movies", "alias": "myindex2", "filter": { "range": { "year": { "gte": 1 } } } } } ] }
When adding an alias to an index you can attach a filter. Queries that go through the alias then only see the docs that pass the filter.
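A filtered alias behaves like a view over the index. The sketch below imitates that in plain Python with made-up sample documents; `search_alias` is a hypothetical helper, not an Elasticsearch API:

```python
def search_alias(docs, alias_filter, query=lambda doc: True):
    """Return the docs visible through a filtered alias that also match the query."""
    return [d for d in docs if alias_filter(d) and query(d)]


# Made-up sample data for illustration only.
movies = [
    {"title": "Old Dogs", "year": 2009},
    {"title": "Untitled", "year": 0},
]

# Mirrors the alias filter above: range on year with gte 1.
myindex2_filter = lambda doc: doc["year"] >= 1

visible = search_alias(movies, myindex2_filter)
print([d["title"] for d in visible])
```

The document with `year` 0 is still stored in `movies`; it is just invisible when searching through `myindex2`.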
- Compound queries
- Multiply the query score by a value derived from a field to raise its weight
POST movies/_search
{
  "explain": true,
  "size": 2,
  "query": {
    "function_score": {
      "query": {
        "multi_match": { "query": "Old", "fields": ["title", "genre.keyword"] }
      },
      "field_value_factor": {
        "field": "year",
        "modifier": "log2p", // apply a function to the field value: _score * log(2 + factor * year)
        "factor": 0.01 // scale the field value down so the boost stays bounded
      }
    }
  }
}
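The query above finds documents whose `title` contains "old" or whose `genre.keyword` equals "Old" and scores them for relevance. It then multiplies that score by log(2 + 0.01 * year), a value derived from the `year` field rather than the raw year, and sorts by the result.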
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 47, "relation" : "eq" }, "max_score" : 9.856819, "hits" : [ { "_shard" : "[movies][0]", "_node" : "JZoUKVAzQkuhCZV5j8r4Qg", "_index" : "movies", "_type" : "_doc", "_id" : "72696", "_score" : 9.856819, "_source" : { "year" : 2009, "genre" : [ "Comedy" ], "@version" : "1", "id" : "72696", "title" : "Old Dogs" }, "_explanation" : { "value" : 9.856819, "description" : "function score, product of:", "details" : [ { "value" : 7.3328753, "description" : "max of:", "details" : [ { "value" : 7.3328753, "description" : "weight(title:old in 14201) [PerFieldSimilarity], result of:", "details" : [ { "value" : 7.3328753, "description" : "score(freq=1.0), product of:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 6.3534727, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 47, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 27287, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.5246147, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 2.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 2.9695094, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] }, { "value" : 1.3441957, "description" : "min of:", "details" : [ { "value" : 1.3441957, "description" : "field value function: log2p(doc['year'].value * factor=0.01)", "details" : [ ] }, { "value" : 3.4028235E38, "description" : 
"maxBoost", "details" : [ ] } ] } ] } }, { "_shard" : "[movies][0]", "_node" : "JZoUKVAzQkuhCZV5j8r4Qg", "_index" : "movies", "_type" : "_doc", "_id" : "50259", "_score" : 9.852491, "_source" : { "year" : 2006, "genre" : [ "Drama" ], "@version" : "1", "id" : "50259", "title" : "Old Joy" }, "_explanation" : { "value" : 9.852491, "description" : "function score, product of:", "details" : [ { "value" : 7.3328753, "description" : "max of:", "details" : [ { "value" : 7.3328753, "description" : "weight(title:old in 11233) [PerFieldSimilarity], result of:", "details" : [ { "value" : 7.3328753, "description" : "score(freq=1.0), product of:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 6.3534727, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 47, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 27287, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.5246147, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 2.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 2.9695094, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] }, { "value" : 1.3436055, "description" : "min of:", "details" : [ { "value" : 1.3436055, "description" : "field value function: log2p(doc['year'].value * factor=0.01)", "details" : [ ] }, { "value" : 3.4028235E38, "description" : "maxBoost", "details" : [ ] } ] } ] } } ] } }
Looking at the scoring details, the final score is indeed _score * log(2 + factor * year).
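To check the arithmetic, the numbers in the explain output for "Old Dogs" can be reproduced by hand. This is a small sketch; `log2p_factor` is a hypothetical helper name, and the `log2p` modifier is taken to be the base-10 logarithm, which is what matches the explain output:

```python
import math


def log2p_factor(field_value, factor):
    """field_value_factor with modifier log2p: log10(2 + factor * field_value)."""
    return math.log10(2 + factor * field_value)


query_score = 7.3328753            # score of title:old, copied from the explain output
boost = log2p_factor(2009, 0.01)   # "Old Dogs" has year 2009
final_score = query_score * boost  # the default boost_mode multiplies the two

print(boost, final_score)
```

The two printed values agree with the 1.3441957 and 9.856819 reported in the response, up to floating-point rounding.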
Update, Nov 4
- Boosting the score: boost_mode
POST movies/_search { "explain": true, "size": 2, "query": { "function_score": { "query": { "multi_match": { "query": "Old", "fields": ["title","genre.keyword"] } }, "field_value_factor": { "field": "year" }, "boost_mode": "sum" } } }
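boost_mode has these commonly used modes (`avg` is also available):
- multiply: multiply the value from field_value_factor by the query's relevance score, then sort by the result (this is the default)
- sum: the relevance score plus the field value factor
- min/max: take the smaller/larger of the relevance score and the field value factor as the final score
- replace: use the field value factor in place of the relevance score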
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 47, "relation" : "eq" }, "max_score" : 2020.3269, "hits" : [ { "_shard" : "[movies][0]", "_node" : "JZoUKVAzQkuhCZV5j8r4Qg", "_index" : "movies", "_type" : "_doc", "_id" : "114250", "_score" : 2020.3269, "_source" : { "year" : 2014, "genre" : [ "Comedy", "Drama" ], "@version" : "1", "id" : "114250", "title" : "My Old Lady" }, "_explanation" : { "value" : 2020.3269, "description" : "sum of", "details" : [ { "value" : 6.3268967, "description" : "max of:", "details" : [ { "value" : 6.3268967, "description" : "weight(title:old in 23775) [PerFieldSimilarity], result of:", "details" : [ { "value" : 6.3268967, "description" : "score(freq=1.0), product of:", "details" : [ { "value" : 2.2, "description" : "boost", "details" : [ ] }, { "value" : 6.3534727, "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:", "details" : [ { "value" : 47, "description" : "n, number of documents containing term", "details" : [ ] }, { "value" : 27287, "description" : "N, total number of documents with field", "details" : [ ] } ] }, { "value" : 0.4526441, "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:", "details" : [ { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] }, { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] }, { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] }, { "value" : 3.0, "description" : "dl, length of field", "details" : [ ] }, { "value" : 2.9695094, "description" : "avgdl, average length of field", "details" : [ ] } ] } ] } ] } ] }, { "value" : 2014.0, "description" : "min of:", "details" : [ { "value" : 2014.0, "description" : "field value function: none(doc['year'].value * factor=1.0)", "details" : [ ] }, { "value" : 3.4028235E38, "description" : "maxBoost", 
"details" : [ ] } ] } ] } } ] } }
From the explanation, the relevance score is 6.3268967 and the field value factor is 2014, so the total is 2020.3269.
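The modes can be summarized in a few lines of Python. This is only an illustration of how the two numbers are combined, with `combine` as a hypothetical helper name; it is not how Elasticsearch implements it internally:

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query score and the function score the way boost_mode describes."""
    modes = {
        "multiply": lambda q, f: q * f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "min": min,
        "max": max,
        "replace": lambda q, f: f,
    }
    return modes[boost_mode](query_score, function_score)


# Values copied from the explain output for "My Old Lady".
print(combine(6.3268967, 2014.0, "sum"))
print(combine(6.3268967, 2014.0, "replace"))
```

With `sum`, the raw year dominates the result, which is why the earlier example scaled it down with `factor` and `log2p`.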
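- max_boost: the upper limit on the boost. This parameter caps the value contributed by the function (here the field value factor), so the contribution never exceeds the limit.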
POST movies/_search { "explain": true, "size": 1, "query": { "function_score": { "query": { "multi_match": { "query": "Old", "fields": ["title","genre.keyword"] } }, "field_value_factor": { "field": "year" }, "boost_mode": "sum", "max_boost": 10 } } }
In the query above, the value of field_value_factor is capped at 10 (max_boost). Since boost_mode is sum, the result is the query's relevance score plus this capped field value factor.
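As a rough sketch of the capping step (the helper `capped_sum` is hypothetical, and the relevance score is reused from the earlier "My Old Lady" explain output as an assumed input):

```python
def capped_sum(query_score, function_score, max_boost):
    """boost_mode=sum with the function score limited to max_boost."""
    return query_score + min(function_score, max_boost)


# year = 2014 would add 2014 to the score; max_boost = 10 limits it to 10.
print(capped_sum(6.3268967, 2014.0, max_boost=10))
```

This matches the "min of: field value function, maxBoost" step that appears in the explain output.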
- random_score: a consistent random function
GET movies/_search { "explain": true, "size": 1, "query": { "function_score": { "query": { "term": { "title": { "value": "love" } } }, "random_score": { "seed": 314159265359, "field":"_seq_no" } } } }
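Since 7.0, random_score needs a `field` when a `seed` is set, otherwise the request fails. The consistent random function derives its random values from the seed: as long as the seed is the same, the random results are the same too.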
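The idea of a consistent random score can be sketched as a hash of the seed and a per-document field value. The hashing scheme below is an assumption chosen for illustration and is not the one Elasticsearch uses:

```python
import hashlib


def random_score(seed, field_value):
    """Map (seed, field value) to a reproducible pseudo-random number in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{field_value}".encode()).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


seq_nos = [0, 1, 2, 3]  # stand-ins for each document's _seq_no
first = [random_score(314159265359, n) for n in seq_nos]
second = [random_score(314159265359, n) for n in seq_nos]
print(first == second)
```

Changing the seed reshuffles the order, while repeating a query with the same seed keeps each document at the same position.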
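- suggest: the suggester module. It works by breaking the query text into tokens and returning similar terms looked up in the index's term dictionary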
GET movies/_search { "size": 1, "query": { "term": { "title": { "value": "lover" } } }, "suggest": { "my_suggest": { "text": "lover", "term": { "field": "title", "suggest_mode":"popular" } } } }
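suggest_mode has a few commonly used values:
- missing: only suggest for terms that are not in the index; if the term (here `lover`) already exists, no suggestions are returned
- popular: only suggest terms that occur more frequently than the original term
- always: return suggestions whether or not the term exists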
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 12, "relation" : "eq" }, "max_score" : 8.87367, "hits" : [ { "_index" : "movies", "_type" : "_doc", "_id" : "2586", "_score" : 8.87367, "_source" : { "year" : 1999, "genre" : [ "Comedy", "Crime", "Thriller" ], "@version" : "1", "id" : "2586", "title" : "Goodbye Lover" } } ] }, "suggest" : { "my_suggest" : [ { "text" : "lover", "offset" : 0, "length" : 5, "options" : [ { "text" : "lovers", "score" : 0.8, "freq" : 25 }, { "text" : "loved", "score" : 0.8, "freq" : 14 }, { "text" : "love", "score" : 0.75, "freq" : 355 }, { "text" : "lives", "score" : 0.6, "freq" : 40 }, { "text" : "live", "score" : 0.5, "freq" : 72 } ] } ] } }
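The suggestions come back in an array under the name we chose (`my_suggest`), each with a score and a frequency, so you can pick whichever you need.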
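The three modes can be imitated with an edit-distance lookup over a small term dictionary. The frequencies below are copied from the response above (12 is the hit count for `lover`); the ranking is deliberately simplified and does not reproduce Elasticsearch's similarity score, and `suggest` is a hypothetical helper:

```python
TERM_FREQ = {"lover": 12, "lovers": 25, "loved": 14, "love": 355, "lives": 40, "live": 72}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def suggest(text, mode="missing", max_edits=2):
    """Return similar terms, closest first, then most frequent first."""
    if mode == "missing" and text in TERM_FREQ:
        return []  # the term exists, so nothing is suggested
    own_freq = TERM_FREQ.get(text, 0)
    candidates = []
    for term, freq in TERM_FREQ.items():
        if term == text:
            continue
        dist = edit_distance(text, term)
        if dist > max_edits:
            continue
        if mode == "popular" and freq <= own_freq:
            continue  # popular keeps only terms more frequent than the input
        candidates.append((dist, -freq, term))
    return [term for _, _, term in sorted(candidates)]


print(suggest("lover", "missing"))
print(suggest("lover", "popular"))
```

In `popular` mode all five terms from the response qualify, because each of them occurs more often than `lover`.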
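- A quick aside on a problem I just ran into: production ES returned an error because a query went past 10,000 results
- First, the setting involved is index.max_result_window, which limits from + size. It can be applied to all indices or set on a single index, and defaults to 10,000.
- What was the source of the query that caused the error? A script that fetches data in a while loop, 20 docs at a time, with no exit condition. Normally this script does not trigger the error because the data volume is nowhere near Double Eleven levels. During these days of the sales event the volume kept climbing, so the query went past the configured limit.
- How do we fix it? There are a few options. First, since this is a script and not a real-time front-end query, some latency is acceptable, so we can use an ES scroll query. Scroll is not meant for real-time requests: it queries ES repeatedly against a snapshot, tracked by a scroll_id, and we can specify how long the snapshot is kept alive between requests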
curl -XGET 'localhost:9200/index/_search?scroll=1m' -H 'Content-Type: application/json' -d '
{
  "query": { "match_phrase": { "title": "elasticsearch" } }
}'
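We set scroll=1m, meaning the search context is kept alive for at most one minute until the next request; after that it expires. Besides the data, the first request also returns a scroll_id to use in the next request, which looks like this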
curl -XGET 'localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d '
{
  "scroll": "1m",
  "scroll_id": "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}'
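The scroll keeps walking through the results of the original query. The client stops requesting once it has the data it needs, once a request returns no hits, or once the context times out. Plain scroll does have a cost, though: it sorts the results, and in the worst case that is a global sort.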
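The difference between the failing script and a scroll loop can be sketched with an in-memory stand-in. `FakeIndex` and `fetch_all` are hypothetical names and this is not a real Elasticsearch client; the sketch only models the from + size limit and the "stop on an empty page" exit condition:

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


class FakeIndex:
    """In-memory stand-in for an index; not a real Elasticsearch client."""

    def __init__(self, n_docs):
        self.docs = list(range(n_docs))
        self.cursors = {}

    def search(self, from_, size):
        # from/size paging is rejected once it reaches past the window.
        if from_ + size > MAX_RESULT_WINDOW:
            raise ValueError("Result window is too large")
        return self.docs[from_:from_ + size]

    def open_scroll(self, size):
        scroll_id = f"scroll-{len(self.cursors)}"
        self.cursors[scroll_id] = 0
        return scroll_id, self.scroll(scroll_id, size)

    def scroll(self, scroll_id, size):
        start = self.cursors[scroll_id]
        self.cursors[scroll_id] = start + size
        return self.docs[start:start + size]


def fetch_all(index, size=20):
    """Read every doc through a scroll, stopping when a page comes back empty."""
    scroll_id, page = index.open_scroll(size)
    fetched = []
    while page:  # the exit condition the original script was missing
        fetched.extend(page)
        page = index.scroll(scroll_id, size)
    return fetched


index = FakeIndex(10_050)
try:
    index.search(10_000, 20)
except ValueError as err:
    print("from/size paging failed:", err)
print(len(fetch_all(index)))
```

With 20 docs per page, from/size paging breaks on the 501st page, while the scroll loop reads all the documents and then stops on its own.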
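- Sometimes, when paging deep, we only want the data and do not care about the order. On old versions we could add the scan parameter
GET /old_index/_search?search_type=scan&scroll=1m
{ "query": { "match_all": {}}, "size": 1000 }
Adding search_type=scan disables sorting and so avoids the global sort. Note that search_type=scan was deprecated in 2.1 and removed in 5.0, so it does not work on a 7.x cluster. The alternative is to sort by _doc, which is the most efficient order for a scroll. The results then come back in index order rather than by relevance, which suits the case where we just want to fetch all the data. For example
GET /old_index/_search?scroll=1m
{ "query": { "match_all": {}}, "size": 1000, "sort": [ "_doc" ] }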
- Another optimization: when using scroll, clear the scroll as soon as you are done with it, which reduces the load on ES
DELETE 127.0.0.1:9200/_search/scroll { "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB" }
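One way to make sure the scroll is always cleared, even when the script crashes halfway, is to wrap it in a context manager. This is a sketch against a hypothetical `FakeClient`; with a real client the two methods would be replaced by the corresponding search and clear-scroll calls:

```python
from contextlib import contextmanager


class FakeClient:
    """Tracks open scroll contexts in memory; not a real Elasticsearch client."""

    def __init__(self):
        self.open_contexts = set()
        self._next_id = 0

    def open_scroll(self):
        scroll_id = f"scroll-{self._next_id}"
        self._next_id += 1
        self.open_contexts.add(scroll_id)
        return scroll_id

    def clear_scroll(self, scroll_id):
        self.open_contexts.discard(scroll_id)


@contextmanager
def scroll_session(client):
    """Open a scroll and guarantee it is cleared when the block exits."""
    scroll_id = client.open_scroll()
    try:
        yield scroll_id
    finally:
        client.clear_scroll(scroll_id)


client = FakeClient()
try:
    with scroll_session(client) as scroll_id:
        raise RuntimeError("script crashed while reading pages")
except RuntimeError:
    pass
print(client.open_contexts)
```

The scroll context is released even though the body raised, so nothing is left waiting on the cluster for the timeout to expire.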
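Back to our ES study. The above was just a small interlude; once the sales event is over, I will make some optimizations for the problem that came up today.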