Things About Elasticsearch (Part 5)

Computer_hello

Article link: https://blog.csdn.net/computer_hello/article/details/107672883

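Term-Based and Full-Text Search

1. Term-Based Queries

Why terms matter
- A term is the smallest unit that carries meaning. Both search and natural language processing built on statistical language models operate on terms.
Characteristics
- Term-level queries: Term Query / Range Query / Exists Query / Prefix Query / Wildcard Query
- In ES, a term query does not analyze its input. The input is treated as a single unit and looked up as an exact term in the inverted index, and every document containing that term is scored with the relevance formula. For example, "Apple Store" is looked up as one term.
- Wrapping the query in Constant Score turns it into a filter, which skips scoring and can use the cache, improving performance.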

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }

What does each of the following queries return?
If a query finds nothing, why?
How can it be fixed?

GET /products

POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone" // no hits: the text field indexes the lowercased token
        "value": "iphone" // matches
      }
    }
  }
}

POST /products/_search
{
  "query": {
    "term": {
      "desc.keyword": {
        "value": "iPhone" // matches: the keyword sub-field stores the original value
        //"value": "iphone" // no hits
      }
    }
  }
}


POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3" // no hits
        //"value": "xhdk" // matches, because the analyzer produced this token
        //"value": "xhdk-a-1293-#fJ3" // no hits
      }
    }
  }
}

POST /products/_search
{
  //"explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3" // matches
      }
    }
  }
}

// Inspect how the standard analyzer tokenizes the value
POST /_analyze
{
 "analyzer": "standard",
 "text": ["XHDK-A-1293-#fJ3"]
}

//res
{
  "tokens" : [
    {
      "token" : "xhdk",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "1293",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "fj3",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
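
To see why the first productID query misses, the analysis step can be imitated in a few lines of Python. This is only a rough sketch: `simple_analyze` is a hypothetical stand-in that splits on non-alphanumeric characters and lowercases, which happens to reproduce the four tokens above but is not the real standard analyzer, and the two dictionaries are toy inverted indexes for the text field and its keyword sub-field.

```python
import re
from collections import defaultdict

# Sample product IDs from the bulk request above, keyed by document ID.
DOCS = {
    1: "XHDK-A-1293-#fJ3",
    2: "KDKE-B-9947-#kL5",
    3: "JODL-X-1937-#pV7",
}


def simple_analyze(text):
    """Hypothetical stand-in for the standard analyzer: split on
    non-alphanumeric characters and lowercase every token."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_indexes(docs):
    """Build toy inverted indexes for a text field and its keyword sub-field."""
    text_index = defaultdict(set)
    keyword_index = defaultdict(set)
    for doc_id, value in docs.items():
        for token in simple_analyze(value):
            text_index[token].add(doc_id)
        keyword_index[value].add(doc_id)  # stored verbatim, never analyzed
    return text_index, keyword_index


def term_lookup(index, value):
    """A term query looks the input up as-is, without analyzing it."""
    return sorted(index.get(value, set()))


text_index, keyword_index = build_indexes(DOCS)
print(simple_analyze("XHDK-A-1293-#fJ3"))              # ['xhdk', 'a', '1293', 'fj3']
print(term_lookup(text_index, "XHDK-A-1293-#fJ3"))     # []
print(term_lookup(text_index, "xhdk"))                 # [1]
print(term_lookup(keyword_index, "XHDK-A-1293-#fJ3"))  # [1]
```

The lookup never touches the query value, so only the index side decides whether an exact string can be found.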

Multi-Field Mappings and Term Queries

GET products/_mapping

//res
{
  "products" : {
    "mappings" : {
      "properties" : {
        "desc" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "productID" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

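Querying the keyword sub-field gives a strict, exact match.
A term query still returns a relevance score.

Compound Query: Using Constant Score as a Filter

Converting the query into a filter skips the TF-IDF calculation and avoids the cost of relevance scoring.
Filters can make effective use of the cache.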

POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

//res
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_shard" : "[products][0]",
        "_node" : "BsfHcVuGT8-7CROZ1odZUg",
        "_index" : "products",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "productID" : "XHDK-A-1293-#fJ3",
          "desc" : "iPhone"
        },
        "_explanation" : {
          "value" : 1.0,
          "description" : "ConstantScore(productID.keyword:XHDK-A-1293-#fJ3)",
          "details" : [ ]
        }
      }
    ]
  }
}
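
When request bodies are assembled in client code, the wrapping step can be factored into a small helper. The sketch below only builds the JSON body and sends nothing; `as_constant_score` is a hypothetical helper name, and the optional `boost` argument assumes the standard `boost` parameter of `constant_score`.

```python
import json


def as_constant_score(query, boost=None):
    """Wrap a query clause in constant_score so it runs as a non-scoring filter."""
    constant_score = {"filter": query}
    if boost is not None:
        constant_score["boost"] = boost  # every hit receives this score
    return {"query": {"constant_score": constant_score}}


body = as_constant_score({"term": {"productID.keyword": "XHDK-A-1293-#fJ3"}})
print(json.dumps(body, indent=2))
```

The printed body has the same shape as the request shown above, minus the `explain` flag.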

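Full-Text Queries

Full-text search
- Match Query / Match Phrase Query / Query String Query
Characteristics
- Text is analyzed at both index time and search time. The query string is first passed to a suitable analyzer, which produces the list of terms to search for.
- At query time the input is analyzed first, then a low-level query runs for each term, and the results are merged. A score is computed for every matching document.

For example, a search for "Matrix reloaded" returns every document that contains matrix or reloaded.

Match Query Result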

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload" // or
      }
    }
  }
}

//res
"hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "2571",
        "_score" : 9.095142, // the relevance score is returned
        "_source" : {
          "genre" : [
            "Action",
            "Sci-Fi",
            "Thriller"
          ],
          "title" : "Matrix, The",
          "year" : 1999,
          "@version" : "1",
          "id" : "2571"
        }
      }
    ]

Operator

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload"
        , "operator": "and" // every term must match
      }
    }
  }
}

//res
"profile" : {
    "shards" : [
      {
        "id" : "[QG8Co41UQGKuwzGrkvpzOA][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "+title:matrix +title:reload", // both terms are required
                "time_in_nanos" : 2900408,

Minimum_should_match

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match": {
      "title": {
        "query": "Matrix reload",
        "minimum_should_match": 2
      }
    }
  }
}

//res
"profile" : {
    "shards" : [
      {
        "id" : "[BsfHcVuGT8-7CROZ1odZUg][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "(title:matrix title:reload)~2",
                "time_in_nanos" : 5050509,

Match Phrase Query

POST /movies/_search
{
  "profile": "true", 
  "query": {
    "match_phrase": {
      "title": {
        "query": "Matrix reload",
        "slop": 1
      }
    }
  }
}

//res
"profile" : {
    "shards" : [
      {
        "id" : "[BsfHcVuGT8-7CROZ1odZUg][movies][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "PhraseQuery",
                "description" : """title:"matrix reload"~1""",

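How a Match Query Is Processed

As with the other full-text queries (Match Query / Match Phrase Query / Query String Query), processing follows these steps:
- The query string is passed to a suitable analyzer, which produces the list of terms to search for. The same analysis was applied to the text at index time.
- A low-level query runs for each term, the results are merged, and a score is computed for every matching document.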
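
The steps can be traced with a toy model. This is a minimal sketch under stated assumptions, not how Elasticsearch is implemented: `analyze` is a simplified tokenizer, the three titles are hypothetical sample data, and the count of matched terms stands in for the real relevance score.

```python
import re

# Hypothetical titles used only for this illustration.
TITLES = {
    1: "Matrix, The",
    2: "Matrix Reloaded, The",
    3: "Matrix Revolutions, The",
}


def analyze(text):
    """Simplified analysis: split on non-alphanumerics and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def match(docs, query, operator="or", minimum_should_match=1):
    """Analyze the query, count matching terms per document, then merge."""
    terms = analyze(query)
    required = len(terms) if operator == "and" else minimum_should_match
    hits = {}
    for doc_id, title in docs.items():
        doc_terms = set(analyze(title))
        matched = sum(1 for term in terms if term in doc_terms)
        if matched >= required and matched > 0:
            hits[doc_id] = matched  # crude stand-in for a relevance score
    # Documents that match more terms are listed first.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


print(match(TITLES, "Matrix reloaded"))                          # [2, 1, 3]
print(match(TITLES, "Matrix reloaded", operator="and"))          # [2]
print(match(TITLES, "Matrix reloaded", minimum_should_match=2))  # [2]
```

With two query terms, `operator: and` and `minimum_should_match: 2` select the same documents, which mirrors the `+title:... +title:...` and `(...)~2` descriptions in the profile output.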

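2. Structured Search

Structured search is about querying data that has inherent structure. Dates, times, and numbers are all structured: they have a precise format, and logical operations can be applied to them.

Common operations include comparing ranges of numbers or dates, or determining which of two values is larger.

Text can be structured too. A box of colored pens has a discrete set of colors: red, green, blue. A blog post may be tagged with the keywords distributed and search.

Products on an e-commerce site have UPCs (Universal Product Codes) or some other unique identifier, all of which must follow a strict, structured format.

With structured queries, the answer is always yes or no: a document is either in the set or outside it. Structured queries do not care about document relevance or scoring; they simply include or exclude documents.

This makes sense logically: a number cannot be more in a range than any other number that falls in the same range. It is either in the range or it is not. Likewise, for structured text a value is either equal or it is not. There is no notion of "more similar".

When looking for exact values, use filters. Filters matter because they are fast: they do not compute relevance (the whole scoring phase is skipped) and they are easily cached, so use filters as often as you can.

Term queries on numbers

We start with the most commonly used query, term, which can handle numbers, Booleans, dates, and text. First, create and index some documents representing products, each with a `price` and a `productID` field: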

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

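Usually, when looking for an exact value, we do not want the query to be scored. We only want to include or exclude documents, so we use a constant_score query to run the term query in non-scoring mode and apply a uniform score of 1.

The final combination is a constant_score query that contains a term query: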

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "price" : 10
                }
            }
        }
    }
}

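constant_score turns the term query into a filter. The query returns what we expect: only document 1 is a hit, because it is the only document with a price of 10.

Term queries on text

A term query matches strings just as easily as numbers. For example, search for the product ID XHDK-A-1293-#fJ3, which should return document 1: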

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
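}

This query does not return the expected result. Why? The problem lies not in the term query but in how the data was indexed. First, check how productID is analyzed: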

GET /my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
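}

The response shows that "XHDK-A-1293-#fJ3" is split into four tokens. When the term query looks for the exact value XHDK-A-1293-#fJ3, it finds no documents, because that value does not exist in the inverted index.

Clearly this is not how we want ID codes or any other exact values to be handled.

To avoid the problem, we need to tell Elasticsearch that the field holds an exact value by mapping it as not_analyzed.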

DELETE /my_store 

PUT /my_store 
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" 
                }
            }
        }
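    }
}

Note: this mapping uses Elasticsearch 2.x syntax. Since 5.0, string has been replaced by text (analyzed) and keyword (not analyzed), and index accepts only true or false. string with not_analyzed was still accepted in 5.x for backward compatibility but was removed in 6.0, so on newer versions productID should be mapped as keyword. Mapping types such as products in the request paths are likewise deprecated in 7.x and removed in 8.0.

The index has to be deleted first because the mapping of an existing field cannot be changed.

Once the index is deleted, we can create a new one with the custom mapping.

The mapping tells Elasticsearch not to analyze productID at all.

Now the documents can be reindexed: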

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

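Check again how productID is analyzed: this time XHDK-A-1293-#fJ3 is kept as a single token and is not analyzed.

Running the query for product ID XHDK-A-1293-#fJ3 again now succeeds.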

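Finding multiple exact values

The term query is useful for finding a single value, but often we want to search for several. What if we want documents whose price is $20 or $30?

Instead of several term queries, we can use a single terms query (note the trailing s), which is simply the plural form of term.

It is used almost exactly like term: instead of a single price, we pass an array of values.

Like the term query, it is placed inside the filter clause of a constant_score query: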

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "terms" : { 
                    "price" : [20, 30]
                }
            }
        }
    }
}

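The query returns the second, third, and fourth documents.

It is important to understand that term and terms are contains operations, not equals. What does that mean?

A term filter such as { "term" : { "tags" : "search" } } matches both of the following documents, even though the second one holds an additional tag:

{ "tags" : ["search"] }
{ "tags" : ["search", "open_source"] }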
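
The difference between contains and equals can be shown on a multi-valued field. The sketch below is a plain-Python illustration with hypothetical documents; `term_filter`, `terms_filter`, and `equals_filter` are made-up helper names, and `equals_filter` exists only for contrast since the term query has no such mode.

```python
# Hypothetical documents: the tags field is multi-valued.
DOCS = {
    1: {"tags": ["search"]},
    2: {"tags": ["search", "open_source"]},
    3: {"tags": ["distributed"]},
}


def term_filter(docs, field, value):
    """term behaves like 'contains': the value only has to be one of the field's terms."""
    return [doc_id for doc_id, doc in docs.items() if value in doc[field]]


def terms_filter(docs, field, values):
    """terms matches when the field contains at least one of the given values."""
    wanted = set(values)
    return [doc_id for doc_id, doc in docs.items() if wanted & set(doc[field])]


def equals_filter(docs, field, value):
    """Strict equality, shown for contrast: the field must hold exactly this one value."""
    return [doc_id for doc_id, doc in docs.items() if doc[field] == [value]]


print(term_filter(DOCS, "tags", "search"))                         # [1, 2]
print(terms_filter(DOCS, "tags", ["open_source", "distributed"]))  # [2, 3]
print(equals_filter(DOCS, "tags", "search"))                       # [1]
```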

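Range queries

In practice, filtering on a numeric range is often more useful. For example, we may want all products priced above $20 and below $40.

In SQL, such a range query can be expressed as:

SELECT document
FROM   products
WHERE  price > 20 AND price < 40

Elasticsearch has a range query which, unsurprisingly, finds documents that fall within a range.

The range query supports both inclusive and exclusive bounds. The available options are:

gt: > greater than
lt: < less than
gte: >= greater than or equal to
lte: <= less than or equal to

Here is an example of a range query: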

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lt"  : 40
                    }
                }
            }
        }
    }
}

To leave one side of the range open (for example, > 20), simply omit that bound:

"range" : {
    "price" : {
        "gt" : 20
    }
}
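
The four options map directly onto ordinary comparisons. As a rough sketch of the semantics only (it says nothing about how Elasticsearch evaluates ranges internally), the hypothetical `range_filter` below applies them to the prices of the four sample documents:

```python
import operator

# Prices from the sample documents above, keyed by document ID.
PRICES = {1: 10, 2: 20, 3: 30, 4: 30}

# Each range option corresponds to an ordinary comparison.
COMPARATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def range_filter(values, **bounds):
    """Return the IDs whose value satisfies every supplied bound."""
    return [
        doc_id
        for doc_id, value in values.items()
        if all(COMPARATORS[name](value, limit) for name, limit in bounds.items())
    ]


print(range_filter(PRICES, gte=20, lt=40))  # [2, 3, 4]
print(range_filter(PRICES, gt=20))          # [3, 4]
```

Omitting a bound simply removes one comparison, which is why an open-ended range needs no special syntax.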

Date ranges

The range query also works on date fields:

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}

When used on date fields, the range query supports date math. For example, to find all documents with a timestamp within the last hour:

"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
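}

This filter always finds documents whose timestamp falls within the last hour, so it acts as a sliding window over time.

Date math can also be applied to a specific date rather than only to a placeholder such as now. Append a double pipe (||) to the date, followed by a date math expression: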

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-01 00:00:00||+1M" 
    }
}

The upper bound is earlier than January 1, 2014 plus one month (February 1, 2014, 00:00).
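
To make the arithmetic concrete, here is a small Python sketch that resolves the two expression shapes used above. It is a deliberately reduced model and an assumption-laden one: the hypothetical `resolve` function understands only the h, d, and M units, expects dates in the `yyyy-MM-dd HH:mm:ss` form used in this section, and clamps the day when the target month is shorter.

```python
import re
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%d %H:%M:%S"


def add_months(moment, months):
    """Shift by whole months, clamping the day to the target month's length."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    # Last day of the target month: first day of the following month minus one day.
    following = index + 1
    first_of_following = datetime(following // 12, following % 12 + 1, 1)
    last_day = (first_of_following - timedelta(days=1)).day
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def resolve(expression, now=None):
    """Resolve 'now-1h' or '<date>||+1M' style expressions (h, d and M units only)."""
    anchor, _, math = expression.partition("||")
    if anchor.startswith("now"):
        moment, math = (now or datetime.now()), anchor[3:]
    else:
        moment = datetime.strptime(anchor, FORMAT)
    for sign, amount, unit in re.findall(r"([+-])(\d+)([hdM])", math):
        amount = int(amount) * (1 if sign == "+" else -1)
        if unit == "h":
            moment += timedelta(hours=amount)
        elif unit == "d":
            moment += timedelta(days=amount)
        else:
            moment = add_months(moment, amount)
    return moment


print(resolve("2014-01-01 00:00:00||+1M"))                    # 2014-02-01 00:00:00
print(resolve("now-1h", now=datetime(2014, 1, 1, 12, 0, 0)))  # 2014-01-01 11:00:00
```

Passing `now` explicitly keeps the example reproducible; in a real query the server substitutes the current time.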
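
String ranges

The range query can also handle string fields. String ranges are evaluated lexicographically or alphabetically. For example, the following strings are sorted in lexicographic order:

5, 50, 6, B, C, a, ab, abb, abc, b

Terms in the inverted index are stored in lexicographic order, which is why string ranges use this ordering.

To find the strings from a up to, but not including, b, use the same range syntax: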

"range" : {
    "title" : {
        "gte" : "a",
        "lt" :  "b"
    }
}
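
The ordering is easy to check in Python, whose default string comparison works by code point and, for these ASCII values, produces the same sequence as the list above. This is only an illustration of the ordering, not of how the index stores terms.

```python
TERMS = ["abc", "b", "5", "C", "a", "50", "abb", "B", "6", "ab"]

# Code point order: digits first, then uppercase, then lowercase letters.
ordered = sorted(TERMS)
print(ordered)  # ['5', '50', '6', 'B', 'C', 'a', 'ab', 'abb', 'abc', 'b']

# Equivalent of {"gte": "a", "lt": "b"} applied to the sorted terms.
in_range = [term for term in ordered if "a" <= term < "b"]
print(in_range)  # ['a', 'ab', 'abb', 'abc']
```

Note that uppercase B and C fall outside the range even though they look alphabetically close, because they sort before every lowercase letter.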
