003-ElasticSearch搜索技术深入分析

SunriseYin

已于 2023-06-28 23:21:45 修改

阅读量112

点赞数

分类专栏： elasticsearch 文章标签： elasticsearch 算法

于 2023-06-27 23:20:31 首次发布

本文链接：https://blog.csdn.net/yin10484039/article/details/131427003

版权

elasticsearch 专栏收录该内容

6 篇文章 0 订阅

订阅专栏

搜索技术深入分析

算分

TF-IDF

TF-IDF（term frequency–inverse document frequency）是一种用于信息检索与数据挖掘的常用加权技术。
在这里插入图片描述

TF是词频(Term Frequency)
检索词在文档中出现的频率越高，相关性也越高。
IDF是逆向文本频率(Inverse Document Frequency)
每个检索词在索引中出现的频率，频率越高，相关性越低。
字段长度归一值（ field-length norm）
字段的长度是多少？字段越短，字段的权重越高。检索词出现在一个内容短的 title 要比同样的词出现在一个内容长的 content 字段权重更大。

以上三个因素——词频（term frequency）、逆向文档频率（inverse document frequency）和字段长度归一值（field-length norm）——是在索引时计算并存储的，最后将它们结合在一起计算单个词在特定文档中的权重。

BM25

BM25 就是对 TF-IDF 算法的改进，对于 TF-IDF 算法，TF(t) 部分的值越大，整个公式返回的值就会越大。
BM25 就针对这点进行来优化，随着TF(t) 的逐步加大，该算法的返回值会趋于一个数值。
在这里插入图片描述

查看算分计划

GET /test_score/_search
{
  "explain": true, 
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}

Boosting

Boosting是控制相关度的一种手段
因为公式中 * Boosting得到算分
所以
当boost > 1时，打分的权重相对性提升
当0 < boost <1时，打分的权重相对性降低
当boost <0时，贡献负分

GET /test_score/_search
{
  "query": {
    "boosting": {
      "positive": {
        "term": {
          "content": "elasticsearch"
        }
      },
      "negative": {
         "term": {
            "content": "like"
          }
      },
      "negative_boost": 0.2
    }
  }
}

布尔查询bool Query

一个bool查询,是一个或者多个查询子句的组合
有4个参数
1. must: 相当于&& ，必须匹配，贡献算分
2. should: 相当于|| ，选择性匹配，贡献算分
3. must_not: 相当于! ，必须不能匹配，不贡献算分
4. filter: 必须匹配，不贡献算法

语法

子查询可以任意顺序出现
可以嵌套多个查询

解决结构化查询“包含而不是相等”的问题

增加count字段，使用bool查询解决

从业务角度，按需改进Elasticsearch数据模型
POST /employee/_bulk
{“index”:{“_id”:1}}
{“name”:“小明”,“interest”:[“跑步”,“篮球”],“interest_count”:2}
{“index”:{“_id”:2}}
{“name”:“小红”,“interest”:[“跑步”],“interest_count”:1}
{“index”:{“_id”:3}}
{“name”:“小丽”,“interest”:[“跳舞”,“唱歌”,“跑步”],“interest_count”:3}
使用bool查询

# must 算分
POST /employee/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interest.keyword": {
              "value": "跑步"
            }
          }
        },
        {
          "term": {
            "interest_count": {
              "value": 1
            }
          }
        }
      ]
    }
  }
}
# filter不算分
POST /employee/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "interest.keyword": {
              "value": "跑步"
            }
          }
        },
        {
          "term": {
            "interest_count": {
              "value": 1
            }
          }
        }
      ]
    }
  }
}

利用bool嵌套实现should not逻辑

GET /es_db/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "remark": "java developer"
        }
      },
      "should": [
        {
          "bool": {
            "must_not": [
              {
                "term": {
                  "sex": 1
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

单字符串多字段查询

三种场景

最佳字段(Best Fields)
当字段之间相互竞争，又相互关联。例如，对于博客的 title和 body这样的字段，评分来自最匹配字段
多数字段(Most Fields)
处理英文内容时的一种常见的手段是，在主字段( English Analyzer)，抽取词干，加入同义词，以
匹配更多的文档。相同的文本，加入子字段（Standard Analyzer），以提供更加精确的匹配。其他字段作为匹配文档提高相关度的信号，匹配字段越多则越好。
混合字段(Cross Field)
对于某些实体，例如人名，地址，图书信息。需要在多个字段中确定信息，单个字段只能作为整体的一部分。希望在任何这些列出的字段中找到尽可能多的词

最佳字段

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}

通过tie_breaker参数调整

Tier Breaker是一个介于0-1之间的浮点数。0代表使用最佳匹配;1代表所有语句同等重要。

获得最佳匹配语句的评分_score 。
将其他匹配语句的评分与tie_breaker相乘
对以上评分求和并规范化

POST /blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}


POST /blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}

最佳字段

POST /blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2
    }
  }
}

使用多数字段

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "std": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }

# 结果与预期不匹配
GET /titles/_search
{
  "query": {
    "match": {
      "title": "barking dogs"
    }
  }
}

跨字段搜索

PUT /address/_bulk
{ "index": { "_id": "1"} }
{"province": "湖南","city": "长沙"}
{ "index": { "_id": "2"} }
{"province": "湖南","city": "常德"}
{ "index": { "_id": "3"} }
{"province": "广东","city": "广州"}
{ "index": { "_id": "4"} }
{"province": "湖南","city": "邵阳"}

#使用most_fields的方式结果不符合预期，不支持operator
GET /address/_search
{
  "query": {
    "multi_match": {
      "query": "湖南常德",
      "type": "most_fields",
      "fields": ["province","city"]
    }
  }
}

ElasticSearch聚合操作

Elasticsearch除搜索以外，提供了针对ES 数据进行统计分析的功能。
聚合(aggregations)可以让我们极其方便的实现对数据的统计、分析、运算。
例如：
什么品牌的手机最受欢迎？
这些手机的平均价格、最高价格、最低价格？
这些手机每月的销售情况如何？
语法：

aggs" : {  #和query同级的关键词
    "<aggregation_name>" : { #自定义的聚合名字
        "<aggregation_type>" : { #聚合的定义： 不同的type+body
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?  #子聚合查询
    }
    [,"<aggregation_name_2>" : { ... } ]*  #可以包含多个同级的聚合查询
}