ES 7.x: custom analyzer + scroll queries

PHPerJiang

Category: elasticsearch. Tags: custom analyzer, pinyin+ik, scroll

Article link: https://blog.csdn.net/qq_36558538/article/details/102860179

It's November!

Custom analyzer

PUT user
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_analyzer":{
          "tokenizer":"my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin":{
          "type":"pinyin",
          "keep_full_pinyin":true,
          "keep_original":true,
          "limit_first_letter_length":16,
          "lowercase":true,
          "remove_duplicated_term":true,
          "keep_separate_first_letter":false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":{
        "type": "keyword",
        "fields": {
          "my_pinyin":{
            "type":"text",
            "analyzer":"pinyin_analyzer"
          }
        }
      }
    }
  }
}

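First we create an index with the settings above. In settings we define a custom analyzer named pinyin_analyzer, backed by a custom tokenizer of type pinyin, and configure that tokenizer's options. The one that seems most important is keep_full_pinyin: true, which converts every Chinese character to its full pinyin; the rest are described in the plugin docs at https://github.com/medcl/elasticsearch-analysis-pinyin. In mappings, the name field is a keyword with a text sub-field my_pinyin that uses this analyzer. Next, let's run the analyzer on 刘德华 and look at the tokens: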

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}

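As you can see, the pinyin analyzer has split 刘德华 into fairly detailed tokens: the full pinyin of each character (liu, de, hua), the original text, and the first-letter abbreviation ldh. A term query against the inverted index on any of these tokens finds the document, which is quite handy.

Index aliases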
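
The post shows only the analyzer output, not the requests around it. Below is a minimal sketch of what the `_analyze` body and a follow-up term query might look like, written as Python dicts so they can be inspected without a cluster. The helper names are hypothetical; the analyzer `pinyin_analyzer` and the sub-field `name.my_pinyin` come from the index definition above, and `ldh` is just an example token.

```python
import json

def analyze_body(text, analyzer="pinyin_analyzer"):
    """Body for GET user/_analyze: run `text` through the named analyzer."""
    return {"analyzer": analyzer, "text": text}

def pinyin_term_query(token, field="name.my_pinyin"):
    """Body for GET user/_search: exact term lookup on the pinyin sub-field."""
    return {"query": {"term": {field: {"value": token}}}}

if __name__ == "__main__":
    # ensure_ascii=False keeps the Chinese text readable in the printed JSON
    print(json.dumps(analyze_body("刘德华"), ensure_ascii=False))
    print(json.dumps(pinyin_term_query("ldh")))
```

A term query is not analyzed, so the token has to match an indexed token exactly (for example `ldh`, `liu`, or `刘德华`).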

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies",
        "alias": "myindex2",
        "filter": {
          "range": {
            "year": {
              "gte": 1
            }
          }
        }
      }
    }
  ]
}

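When adding an alias to an index you can attach a filter; queries that go through the alias then see only the docs that pass the filter.

Compound queries

Multiply the query's relevance score by a value derived from a field to boost the weight.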

POST movies/_search
{
  "explain": true, 
  "size": 2, 
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field":"year",
        "modifier": "log2p",    // apply a function to the field value: _score * log10(2 + factor * year)
        "factor": 0.01          // scale the field value down so the function's contribution stays small
      }
    }
  }
}

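The query above matches documents whose title or genre contains "old", scores them for relevance, multiplies that score by the value derived from the year field (here log10(2 + 0.01 * year), not the raw year), and sorts by the result.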

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 47,
      "relation" : "eq"
    },
    "max_score" : 9.856819,
    "hits" : [
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "72696",
        "_score" : 9.856819,
        "_source" : {
          "year" : 2009,
          "genre" : [
            "Comedy"
          ],
          "@version" : "1",
          "id" : "72696",
          "title" : "Old Dogs"
        },
        "_explanation" : {
          "value" : 9.856819,
          "description" : "function score, product of:",
          "details" : [
            {
              "value" : 7.3328753,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "weight(title:old in 14201) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.5246147,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3441957,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 1.3441957,
                  "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "50259",
        "_score" : 9.852491,
        "_source" : {
          "year" : 2006,
          "genre" : [
            "Drama"
          ],
          "@version" : "1",
          "id" : "50259",
          "title" : "Old Joy"
        },
        "_explanation" : {
          "value" : 9.852491,
          "description" : "function score, product of:",
          "details" : [
            {
              "value" : 7.3328753,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 7.3328753,
                  "description" : "weight(title:old in 11233) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 7.3328753,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.5246147,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 1.3436055,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 1.3436055,
                  "description" : "field value function: log2p(doc['year'].value * factor=0.01)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

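Looking at the scoring details, the final score is _score * log10(2 + factor * year). For "Old Dogs" that is 7.3328753 * 1.3441957 ≈ 9.856819.

Update, Nov 4

Combining scores: boost_mode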
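
To double-check the arithmetic in the explain output, here is a small sketch that recomputes both scores. The function names are made up for illustration; the formula follows the `log2p` modifier (common logarithm of 2 plus the scaled field value), and the inputs are taken from the two hits above.

```python
import math

def log2p_factor(field_value, factor=1.0):
    """field_value_factor with modifier log2p: log10(2 + factor * value)."""
    return math.log10(2 + factor * field_value)

def multiplied_score(query_score, field_value, factor=1.0):
    """Default boost_mode (multiply): query score times the function value."""
    return query_score * log2p_factor(field_value, factor)

# Query score 7.3328753 is the BM25 score reported for both hits.
old_dogs = multiplied_score(7.3328753, 2009, factor=0.01)
old_joy = multiplied_score(7.3328753, 2006, factor=0.01)
print(round(old_dogs, 4), round(old_joy, 4))
```

The results agree with the reported `_score` values (9.856819 and 9.852491) up to float rounding, which also confirms that the logarithm is base 10.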

POST movies/_search
{
  "explain": true, 
  "size": 2, 
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field": "year"
      }, 
      "boost_mode": "sum"
    }
  }
}

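boost_mode decides how the query score and the function value are combined. The modes are:

multiply: multiply the query's relevance score by the value from field_value_factor, then sort by the product (this is the default)
sum: add the relevance score and the field value factor
avg: take the average of the relevance score and the field value factor
min/max: take the minimum/maximum of the relevance score and the field value factor as the final score
replace: use the field value factor in place of the relevance score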
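
The modes can be sketched as a small lookup table. This is an illustration of the arithmetic only, not Elasticsearch code; the function name `combine` is hypothetical, and the sample numbers (query score 6.3268967, function value 2014) are the ones reported for "My Old Lady" in this post.

```python
def combine(query_score, function_score, boost_mode="multiply"):
    """Combine the query score with the function value the way boost_mode describes."""
    modes = {
        "multiply": lambda q, f: q * f,
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "min": lambda q, f: min(q, f),
        "max": lambda q, f: max(q, f),
        "replace": lambda q, f: f,
    }
    if boost_mode not in modes:
        raise ValueError(f"unknown boost_mode: {boost_mode}")
    return modes[boost_mode](query_score, function_score)

for mode in ("multiply", "sum", "avg", "min", "max", "replace"):
    print(mode, combine(6.3268967, 2014.0, mode))
```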

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 47,
      "relation" : "eq"
    },
    "max_score" : 2020.3269,
    "hits" : [
      {
        "_shard" : "[movies][0]",
        "_node" : "JZoUKVAzQkuhCZV5j8r4Qg",
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "114250",
        "_score" : 2020.3269,
        "_source" : {
          "year" : 2014,
          "genre" : [
            "Comedy",
            "Drama"
          ],
          "@version" : "1",
          "id" : "114250",
          "title" : "My Old Lady"
        },
        "_explanation" : {
          "value" : 2020.3269,
          "description" : "sum of",
          "details" : [
            {
              "value" : 6.3268967,
              "description" : "max of:",
              "details" : [
                {
                  "value" : 6.3268967,
                  "description" : "weight(title:old in 23775) [PerFieldSimilarity], result of:",
                  "details" : [
                    {
                      "value" : 6.3268967,
                      "description" : "score(freq=1.0), product of:",
                      "details" : [
                        {
                          "value" : 2.2,
                          "description" : "boost",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.3534727,
                          "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                          "details" : [
                            {
                              "value" : 47,
                              "description" : "n, number of documents containing term",
                              "details" : [ ]
                            },
                            {
                              "value" : 27287,
                              "description" : "N, total number of documents with field",
                              "details" : [ ]
                            }
                          ]
                        },
                        {
                          "value" : 0.4526441,
                          "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                          "details" : [
                            {
                              "value" : 1.0,
                              "description" : "freq, occurrences of term within document",
                              "details" : [ ]
                            },
                            {
                              "value" : 1.2,
                              "description" : "k1, term saturation parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 0.75,
                              "description" : "b, length normalization parameter",
                              "details" : [ ]
                            },
                            {
                              "value" : 3.0,
                              "description" : "dl, length of field",
                              "details" : [ ]
                            },
                            {
                              "value" : 2.9695094,
                              "description" : "avgdl, average length of field",
                              "details" : [ ]
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 2014.0,
              "description" : "min of:",
              "details" : [
                {
                  "value" : 2014.0,
                  "description" : "field value function: none(doc['year'].value * factor=1.0)",
                  "details" : [ ]
                },
                {
                  "value" : 3.4028235E38,
                  "description" : "maxBoost",
                  "details" : [ ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

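From the explanation, the relevance score is 6.3268967 and the field value factor is 2014, so the total is 2020.3269.

max_boost: the upper limit of the boost. This parameter caps the value contributed by the function; whatever the function computes is limited to this ceiling.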

POST movies/_search
{
  "explain": true, 
  "size": 1, 
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Old",
          "fields": ["title","genre.keyword"]
        }
      },
      "field_value_factor": {
        "field": "year"
      }, 
      "boost_mode": "sum",
      "max_boost": 10
    }
  }
}

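In the query above, for example, the value from field_value_factor is capped at 10 (max_boost). Because boost_mode is sum, the result is the query's relevance score plus at most 10.

random_score: consistent random scoring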
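
The cap corresponds to the "min of: ... maxBoost" entry in the explain output. A minimal sketch of the arithmetic, with a hypothetical helper name and the "My Old Lady" numbers from this post as input:

```python
def capped_sum(query_score, function_score, max_boost):
    """boost_mode=sum with max_boost: cap the function value first, then add."""
    capped = min(function_score, max_boost)
    return query_score + capped

# year=2014 is capped to 10 before being added to the relevance score
print(capped_sum(6.3268967, 2014.0, max_boost=10))
```

Without the cap the year dominates the score (2020.3269); with max_boost set to 10 the relevance score matters again.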

GET movies/_search
{
  "explain": true, 
  "size": 1, 
  "query": {
    "function_score": {
      "query": {
        "term": {
          "title": {
            "value": "love"
          }
        }
      },
      "random_score": {
        "seed": 314159265359,
        "field":"_seq_no"
      }
    }
  }
}

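From 7.0 on, random_score needs field set alongside seed, otherwise an error is reported. The consistent random function randomizes based on the seed: with the same seed, the random results are the same.

suggest: the suggestion module. It works by breaking the input text into tokens and looking up similar terms in the index's term dictionary.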
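
The "same seed, same result" property can be illustrated with a tiny sketch. This is not the hash Elasticsearch uses internally; it is an assumption-laden stand-in (SHA-256 over the seed and a per-document value such as `_seq_no`) that only demonstrates the behavior, and the function name is hypothetical.

```python
import hashlib

def consistent_random(seed, field_value):
    """Deterministic pseudo-random float in [0, 1) derived from seed + field value."""
    digest = hashlib.sha256(f"{seed}:{field_value}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and scale it into [0, 1)
    return int.from_bytes(digest[:8], "big") / 2**64

# Same seed and same document value always give the same score
print(consistent_random(314159265359, 42) == consistent_random(314159265359, 42))
```

Changing the seed reshuffles the ordering, while keeping it fixed gives each user a stable "random" ranking across requests.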

GET movies/_search
{
  "size": 1, 
  "query": {
    "term": {
      "title": {
        "value": "lover"
      }
    }
  },
  "suggest": {
    "my_suggest": {
      "text": "lover",
      "term": {
        "field": "title",
        "suggest_mode":"popular"
      }
    }
  }
}

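suggest_mode has a few commonly used values:

missing: only suggest for terms that are not in the index; if the term (here lover) already exists, no suggestions are returned
popular: only suggest terms that occur in more documents than the original term
always: suggest regardless of whether the term exists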
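
A simplified sketch of how the three modes filter candidates. It ignores edit distance and scoring entirely and is not how the term suggester is implemented. The function name is hypothetical, the candidate frequencies are the ones returned in this post, the frequency 12 for `lover` is assumed from the hit count, and the `loner` entry is invented to show what `popular` drops.

```python
def apply_suggest_mode(mode, original_freq, candidates):
    """Filter (text, doc_freq) candidates according to suggest_mode."""
    if mode == "missing":
        # Only suggest when the original term is absent from the index
        return [] if original_freq > 0 else list(candidates)
    if mode == "popular":
        # Only keep candidates that are more frequent than the original term
        return [c for c in candidates if c[1] > original_freq]
    if mode == "always":
        return list(candidates)
    raise ValueError(f"unknown suggest_mode: {mode}")

candidates = [("lovers", 25), ("loved", 14), ("love", 355),
              ("lives", 40), ("live", 72), ("loner", 3)]  # "loner" is made up
print(apply_suggest_mode("popular", 12, candidates))
```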

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 12,
      "relation" : "eq"
    },
    "max_score" : 8.87367,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "2586",
        "_score" : 8.87367,
        "_source" : {
          "year" : 1999,
          "genre" : [
            "Comedy",
            "Crime",
            "Thriller"
          ],
          "@version" : "1",
          "id" : "2586",
          "title" : "Goodbye Lover"
        }
      }
    ]
  },
  "suggest" : {
    "my_suggest" : [
      {
        "text" : "lover",
        "offset" : 0,
        "length" : 5,
        "options" : [
          {
            "text" : "lovers",
            "score" : 0.8,
            "freq" : 25
          },
          {
            "text" : "loved",
            "score" : 0.8,
            "freq" : 14
          },
          {
            "text" : "love",
            "score" : 0.75,
            "freq" : 355
          },
          {
            "text" : "lives",
            "score" : 0.6,
            "freq" : 40
          },
          {
            "text" : "live",
            "score" : 0.5,
            "freq" : 72
          }
        ]
      }
    ]
  }
}

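The suggestions are returned in an array under the name we chose (my_suggest), each with a score and a frequency; pick whichever you need.

A quick aside on a problem I just ran into: production ES raised an error because a query went past 10,000 results.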

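First, the relevant setting is index.max_result_window. It can be applied to all indices or set per index, and defaults to 10,000: a search whose from + size goes beyond it is rejected.
What triggered the error in production? A script that pulls data in a while loop, 20 docs at a time, with no exit condition. Normally the script doesn't cause errors because the data volume is nowhere near Double Eleven levels; during the sales event these past few days the volume kept climbing until the paging ran past the configured limit.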
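
As a sketch of the missing exit condition, the generator below yields only the (from, size) pairs that stay inside the window. The function name is hypothetical; the page size of 20 and the default limit of 10,000 are from the text above.

```python
def from_size_pages(size=20, max_result_window=10_000):
    """Yield (from, size) pairs with from + size <= max_result_window."""
    start = 0
    while start + size <= max_result_window:
        yield start, size
        start += size

pages = list(from_size_pages())
print(len(pages), pages[-1])
```

With the defaults the last allowed page starts at 9980; anything deeper needs scroll (or a similar cursor) rather than a larger from.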
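How do we fix it? There are a few options. First, since this is a script rather than a real-time front-end query, some latency is acceptable, so we can use ES scroll queries. Scroll isn't meant for real-time requests: it issues multiple requests to ES and tracks progress with a scroll_id plus a snapshot of the index, and we specify how long the search context stays alive between requests.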
```
curl -XGET 'localhost:9200/index/_search?scroll=1m' -H 'Content-Type: application/json' -d '
{
    "query": {
        "match_phrase" : {
            "title" : "elasticsearch"
        }
    }
}'
```
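We set scroll=1m, meaning the search context is kept alive for at most 1 minute until the next request; after that it expires. Besides the data, the first response also returns a scroll_id to use in the next request, which looks like this: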
```
curl -XGET 'localhost:9200/_search/scroll' -H 'Content-Type: application/json' -d '
{
    "scroll" : "1m",
    "scroll_id" : "c2Nhbjs2OzM0NDg1ODpzRlBLc0FXNlNyNm5JWUc1"
}'
```
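Scroll keeps advancing through the result set of the original query; the requests stop once there are no more hits or the context times out. But plain scroll has a cost: the results are sorted, in the worst case with a global sort.
So when we page deep and only want the data without any sorting, we can add the scan search type, which is available on older clusters (ES 1.x/2.x):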
```
GET /old_index/_search?search_type=scan&scroll=1m 
{
"query": { "match_all": {}},
"size": 1000
}
```
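As above, adding search_type=scan disables sorting and so avoids a global sort. That search type was deprecated in 2.1 and removed in 5.0, though, so on 7.x use the other way: sort by _doc. It is the most efficient order to scroll in, but the results come back in no meaningful order, so it suits cases where you just want to fetch all the data. For example: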
```
GET /old_index/_search?scroll=1m 
{
  "query": { "match_all": {} },
  "size": 1000,
  "sort": ["_doc"]
}
```

Another optimization: when using a scroll cursor, clear the scroll as soon as you're done with it, which reduces the load on ES.

DELETE /_search/scroll
{
    "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAdsMqFmVkZTBJalJWUmp5UmI3V0FYc2lQbVEAAAAAAHbDKRZlZGUwSWpSVlJqeVJiN1dBWHNpUG1RAAAAAABpX2sWclBEekhiRVpSRktHWXFudnVaQ3dIQQAAAAAAaV9qFnJQRHpIYkVaUkZLR1lxbnZ1WkN3SEEAAAAAAGlfaRZyUER6SGJFWlJGS0dZcW52dVpDd0hB"
}
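
Putting the scroll steps together, here is a minimal sketch of the loop with a proper exit condition and cleanup. It deliberately takes three callables instead of a real client, so nothing here is an actual Elasticsearch client API: `scroll_all` and `FakeScroll` are hypothetical names, and the fake backend only mimics the `_scroll_id` and `hits.hits` fields of the real responses.

```python
def scroll_all(start, fetch_next, clear):
    """Yield every hit: first search, then scroll until a page is empty, then clear."""
    page = start()
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:          # exit condition: empty page
            yield from page["hits"]["hits"]
            page = fetch_next(scroll_id)
            scroll_id = page["_scroll_id"]   # always use the latest scroll_id
    finally:
        clear(scroll_id)                     # release the search context

class FakeScroll:
    """In-memory stand-in for the search / scroll / clear-scroll requests."""
    def __init__(self, docs, size):
        self.docs, self.size, self.pos, self.cleared = docs, size, 0, []

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += len(hits)
        return {"_scroll_id": "fake-id", "hits": {"hits": hits}}

    def start(self):
        return self._page()

    def fetch_next(self, scroll_id):
        return self._page()

    def clear(self, scroll_id):
        self.cleared.append(scroll_id)

backend = FakeScroll([{"_id": i} for i in range(45)], size=20)
docs = list(scroll_all(backend.start, backend.fetch_next, backend.clear))
print(len(docs), backend.cleared)
```

In a real script the three callables would wrap the initial `_search?scroll=1m` request, the `_search/scroll` request, and the `DELETE /_search/scroll` request shown above.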

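Back to our ES study. The above was just a small interlude; once the sales event is over, I'll do some optimization around the problem that came up today.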
