Elasticsearch --- (11) A First Look at Search Engines, Part 1

梦里梦见梦不见的

Published 2020-05-21 17:02:18


Article link: https://blog.csdn.net/weixin_43240792/article/details/106256851



1. Search results in depth (the search timeout mechanism)

(1) What each field in the search response means

GET /_search

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "max_score": 1,
    "hits": [
      {
        "_index": ".kibana",
        "_type": "config",
        "_id": "5.2.0",
        "_score": 1,
        "_source": {
          "buildNum": 14695
        }
      }
    ]
  }
}

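took: how many milliseconds the whole search request took.
_shards: how many shards the request was sent to, and how many succeeded or failed. A shard counts as failed only when its primary and all of its replicas are down, and a failed shard does not affect the other shards. By default a search request goes to every primary shard of the index; because each primary shard may have one or more replica shards, the request for a given shard can also be served by one of its replicas.
hits.total: how many documents matched this search (the total, not only the ones returned on this page).
hits.max_score: the highest relevance score among the matched documents. Each document gets a _score for how relevant it is to the search; the more relevant it is, the higher the _score and the higher it ranks.
hits.hits: the matched documents themselves. By default the first 10 are returned in full, sorted by _score in descending order.
timeout: there is no timeout by default. A timeout trades completeness for latency: when you set one by hand, each shard returns whatever hits it has collected when the time is up instead of running to completion, and the response reports timed_out as true.

Timeout values carry a unit, for example timeout=10ms, timeout=1s, timeout=1m:
GET /_search?timeout=10m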
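
To make the field descriptions above concrete, here is a minimal Python sketch that reads a response shaped like the one shown earlier and reports whether the result may be incomplete. The function name `summarize_search_response` and the `partial` flag are illustrative assumptions, not part of Elasticsearch, and the sketch assumes `hits.total` is a plain integer as in the response above.

```python
def summarize_search_response(response: dict) -> dict:
    """Pull the headline numbers out of a search response dict."""
    shards = response["_shards"]
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        # Results may be incomplete if the search timed out or a shard failed.
        "partial": response["timed_out"] or shards["failed"] > 0,
        "total_matches": hits["total"],
        "returned": len(hits["hits"]),
        "max_score": hits["max_score"],
    }


# The response from this section, trimmed to one hit.
sample = {
    "took": 6,
    "timed_out": False,
    "_shards": {"total": 6, "successful": 6, "failed": 0},
    "hits": {
        "total": 10,
        "max_score": 1,
        "hits": [{"_index": ".kibana", "_type": "config", "_id": "5.2.0",
                  "_score": 1, "_source": {"buildNum": 14695}}],
    },
}
summary = summarize_search_response(sample)
```

A caller that cares about completeness would check `summary["partial"]` before trusting `total_matches`.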

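2. Multi-index and multi-type search (searching several indices and types in one request) and a first look at how search works

(1) Multi-index and multi-type search patterns

/_search: search all data in every type of every index
/index1/_search: search all types in one index
/index1,index2/_search: search two indices at once
/*1,*2/_search: match several indices by wildcard
/index1/type1/_search: search one type in one index
/index1/type1,type2/_search: search several types in one index
/index1,index2/type1,type2/_search: search several types across several indices
/_all/type1,type2/_search: _all stands for every index, so this searches the given types in all indices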
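
The patterns above all follow one rule: a comma-separated index part, then an optional comma-separated type part, then `_search`. A small Python sketch of that rule follows; `build_search_path` is a hypothetical helper written for this article, not a client-library function.

```python
def build_search_path(indices=(), types=()):
    """Build a search path from index and type names (wildcards allowed)."""
    index_part = ",".join(indices)
    type_part = ",".join(types)
    # Types without an index need the _all placeholder in the index slot.
    if type_part and not index_part:
        index_part = "_all"
    parts = [part for part in (index_part, type_part) if part]
    return "/" + "/".join(parts + ["_search"])


paths = [
    build_search_path(),
    build_search_path(["index1", "index2"]),
    build_search_path(["*1", "*2"]),
    build_search_path(["index1"], ["type1", "type2"]),
    build_search_path(types=["type1", "type2"]),
]
```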

(2) A first illustrated look at how a simple search works

3. Paginated search and the deep paging performance problem

(1) Paginating in ES with size and from

GET /_search?size=10
GET /_search?size=10&from=0
GET /_search?size=10&from=20
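
In the requests above, from is the number of hits to skip, so page n with page size s starts at from = (n - 1) * s. A minimal sketch of that arithmetic, with `search_page_url` as a hypothetical helper name and pages numbered from 1 by assumption:

```python
def search_page_url(page: int, size: int = 10) -> str:
    """Return the search URL for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size  # number of hits to skip
    return f"/_search?size={size}&from={offset}"


first_page = search_page_url(1)
third_page = search_page_url(3)
```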

(2) The deep paging performance problem: avoid deep paging
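
The usual explanation of the cost is that no single shard knows the global ranking, so each shard has to produce its own top from + size candidates and the coordinating node merges all of them before discarding everything except the requested page. The sketch below only does that arithmetic; the function name and the shard count of 5 in the example are assumptions for illustration.

```python
def deep_paging_cost(from_: int, size: int, shards: int) -> dict:
    """Estimate how many candidates are handled to serve one page."""
    per_shard = from_ + size  # each shard returns its own top from+size
    return {
        "per_shard_candidates": per_shard,
        "coordinator_candidates": per_shard * shards,
        "returned": size,
    }


shallow = deep_paging_cost(from_=0, size=10, shards=5)
deep = deep_paging_cost(from_=10000, size=10, shards=5)
```

With these assumed numbers the deep request makes the coordinating node merge 50050 candidates to return 10 documents, while the first page needs only 50.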

4. query string search syntax and how the _all metadata field works

(1) query string syntax

GET /test_index/test_type/_search?q=test_field:test
GET /test_index/test_type/_search?q=+test_field:test
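The two requests above are actually equivalent (+ means the keyword must be present)

// - means the keyword must not be present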
GET /test_index/test_type/_search?q=-test_field:test
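
One practical detail when these requests are sent from code rather than typed into a console: in a URL query string a raw + is conventionally decoded as a space, so it is usually percent-encoded as %2B. The sketch below builds the three requests with the standard library; `query_string_url` and its `must` parameter are hypothetical names introduced here.

```python
from urllib.parse import quote


def query_string_url(index, doc_type, field, keyword, must=None):
    """Build a q= search URL; must=True adds '+', must=False adds '-'."""
    prefix = {None: "", True: "+", False: "-"}[must]
    # Keep ':' readable, but percent-encode '+' so it is not read as a space.
    q = quote(f"{prefix}{field}:{keyword}", safe=":")
    return f"/{index}/{doc_type}/_search?q={q}"


plain = query_string_url("test_index", "test_type", "test_field", "test")
required = query_string_url("test_index", "test_type", "test_field", "test", must=True)
excluded = query_string_url("test_index", "test_type", "test_field", "test", must=False)
```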

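(2) How the _all metadata field works and what it is for

GET /test_index/test_type/_search?q=test searches all fields directly: a document is returned if any one of its fields contains the keyword.

Does that mean ES searches every field of the document one by one? No.

When a document with several fields is indexed, ES automatically concatenates the values of all its fields into one long string, stores that string as the value of the _all field, and indexes it. Later, when a search does not name a field, ES searches the _all field by default, and that field contains the values of every field.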
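
The concatenation step can be sketched in a few lines of Python. This is a simplified model of the idea described above, not Elasticsearch's implementation; `build_all_value` and the sample document are made up for illustration.

```python
def build_all_value(document: dict) -> str:
    """Join every field value, as a string, into one _all-style value."""
    return " ".join(str(value) for value in document.values())


# Hypothetical document used only for this sketch.
doc = {"name": "tom", "age": 30, "city": "shanghai"}
all_value = build_all_value(doc)  # "tom 30 shanghai"
```

Because the joined string is analyzed like any other full-text value, a search for tom or shanghai without a field name can find this document.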

For example:

{
"name": "jack",
"age": 26,
"email": "jack@sina.com",
"address": "guangzhou"
}

"jack 26 jack@sina.com guangzhou" becomes the value of this document's _all field, which is then analyzed and added to the inverted index

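5. What is a mapping

A mapping is the data structure and related configuration that is created, automatically or by hand, for a type in an index.

With dynamic mapping, ES automatically creates the index, the type, and the type's mapping for us. The mapping records the data type of each field, how the field is analyzed, and similar settings. You can also create the index, the type, and the type's mapping by hand before inserting any data.

1. Insert a few documents and let ES create the index for us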

PUT /website/article/1
{
  "post_date": "2017-01-01",
  "title": "my first article",
  "content": "this is my first article in this website",
  "author_id": 11400
}

PUT /website/article/2
{
  "post_date": "2017-01-02",
  "title": "my second article",
  "content": "this is my second article in this website",
  "author_id": 11400
}

PUT /website/article/3
{
  "post_date": "2017-01-03",
  "title": "my third article",
  "content": "this is my third article in this website",
  "author_id": 11400
}

2. Try various searches

GET /website/article/_search?q=2017                    3 results
GET /website/article/_search?q=2017-01-01              3 results
GET /website/article/_search?q=post_date:2017-01-01    1 result
GET /website/article/_search?q=post_date:2017          1 result

3. GET /website/_mapping/article

{
  "website": {
    "mappings": {
      "article": {
        "properties": {
          "author_id": {
            "type": "long"
          },
          "content": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "post_date": {
            "type": "date"
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

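Why do the results differ? When ES builds the mapping automatically, it assigns different data types to different fields, and different data types are analyzed and searched differently. That is why a search against the _all field and a search against the post_date field behave so differently.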
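
The first three result counts can be reproduced with a toy model in Python. It assumes a very simplified tokenizer (runs of letters and digits, lowercased) standing in for full-text analysis, treats _all as the joined field values, and treats post_date as a whole-value comparison. All names are illustrative, and the post_date:2017 case is left out because it depends on how ES parses partial dates.

```python
import re

DOCS = {
    1: {"post_date": "2017-01-01", "title": "my first article", "author_id": 11400},
    2: {"post_date": "2017-01-02", "title": "my second article", "author_id": 11400},
    3: {"post_date": "2017-01-03", "title": "my third article", "author_id": 11400},
}


def tokenize(text) -> list:
    """Simplified stand-in for full-text analysis."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def search_all(docs: dict, query: str) -> list:
    """Full-text style: a doc matches if it shares any term with the query."""
    terms = set(tokenize(query))
    matches = []
    for doc_id, doc in docs.items():
        all_tokens = set(tokenize(" ".join(str(v) for v in doc.values())))
        if terms & all_tokens:
            matches.append(doc_id)
    return matches


def search_exact(docs: dict, field: str, value: str) -> list:
    """Exact-value style: the whole field value must equal the query."""
    return [doc_id for doc_id, doc in docs.items() if doc[field] == value]


by_year = search_all(DOCS, "2017")
by_date_text = search_all(DOCS, "2017-01-01")
by_date_exact = search_exact(DOCS, "post_date", "2017-01-01")
```

In this model 2017-01-01 is split into 2017 and 01, and every document contains both, so the unqualified search matches all three documents while the exact comparison matches only one.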

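6. Exact value matching versus full-text search

(1) exact value

For example, to find 2017-01-01 with an exact value search you must enter exactly 2017-01-01; entering only 01 finds nothing.

(2) full text

Abbreviation vs. full name: cn vs. china; searching for cn can also find china.
Inflected forms: like, liked, likes; searching for like can also find likes.
Case: Tom vs. tom; searching for tom can also find Tom.
Synonyms: like vs. love; searching for love can also find like.

Full-text search does not simply match one complete value. It splits the value into terms (analysis) and matches on those, and it can also match through abbreviations, tenses, case, synonyms, and so on.

7. Core principles of the inverted index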

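8. What an analyzer is made of, and an introduction to the built-in analyzers

(1) What is an analyzer? It processes a piece of text in several steps, and only the final result is used to build the inverted index

a. Splitting text into terms

b. Normalization (improves recall, that is, increases the number of results a search can find)

c. An analyzer has three parts

character filter: preprocesses the text before it is tokenized. The most common examples are stripping HTML tags (<span>hello</span> --> hello) and mapping & --> and (I&you --> I and you)
tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
token filter: lowercase, stop word, synonym, and similar filters (tense conversion, singular/plural conversion, etc.): dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little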
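
The three stages can be chained in a short Python sketch. This is a toy pipeline that only mirrors the examples listed above, not how Elasticsearch implements analysis: the stop word list, the replacement table, and all function names are assumptions, and real stemming and synonym handling are far more involved than a lookup table.

```python
import re

STOP_WORDS = {"a", "the", "an"}
# Toy table covering only the stemming and synonym examples from the text.
REPLACEMENTS = {"dogs": "dog", "liked": "like", "mother": "mom", "small": "little"}


def char_filter(text: str) -> str:
    """Strip HTML tags and spell out '&' before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenizer(text: str) -> list:
    """Split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens: list) -> list:
    """Lowercase, drop stop words, then apply the replacement table."""
    lowered = [token.lower() for token in tokens]
    kept = [token for token in lowered if token not in STOP_WORDS]
    return [REPLACEMENTS.get(token, token) for token in kept]


def analyze(text: str) -> list:
    return token_filter(tokenizer(char_filter(text)))


terms = analyze("<span>Tom</span> liked the small dogs & mother")
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and the token filters see one token at a time.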

(2) The built-in analyzers (four kinds)

Example: Set the shape to semi-transparent by calling set_trans(5)

standard analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (standard is the default)
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (language-specific, for example english, the English analyzer): set, shape, semi, transpar, call, set_tran, 5
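
Two of these four are simple enough to approximate in Python, which helps show why their outputs differ. The sketch assumes ASCII input; the standard and english analyzers depend on Unicode word segmentation and stemming and are not reproduced here. The function names are illustrative.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_analyzer(text: str) -> list:
    """Approximation: break on anything that is not a letter, then lowercase."""
    return re.findall(r"[a-z]+", text.lower())


def whitespace_analyzer(text: str) -> list:
    """Approximation: break on whitespace only, leave tokens untouched."""
    return text.split()


simple_terms = simple_analyzer(TEXT)
whitespace_terms = whitespace_analyzer(TEXT)
```

The digit 5 disappears from the simple output because it is not a letter, while the whitespace output keeps case, hyphens, and punctuation exactly as written.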

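9. How the query string is analyzed, and the open question from the mapping example

(1) Analysis of the query string

The query string must be analyzed with the same analyzer that was used when the index was built
The query string is treated differently for exact value and full text fields:

post_date, date: exact value
_all: full text, analyzed and normalized

(2) The open question from the mapping example

(3) Testing an analyzer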


GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

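10. Mapping revisited

(1) When you insert data directly into ES, it automatically creates the index, the type, and the corresponding mapping
(2) The mapping automatically defines the data type of each field
(3) Depending on its data type (for example text versus date), a field is either exact value or full text
(4) An exact value field is indexed as a single term: the whole value goes into the inverted index as is. A full text field goes through analysis and normalization (tense, synonym, and case conversion) before it goes into the inverted index
(5) Whether a field is exact value or full text also determines how a search against it behaves, and that behavior matches how the inverted index was built. An exact value search matches the whole value directly; for a full text field, the query string is also analyzed and normalized before it is looked up in the inverted index
(6) You can let ES dynamic mapping create the mapping automatically, including the data types, or you can create the mapping for the index and type by hand ahead of time and configure each field yourself: data type, indexing behavior, analyzer, and so on

A mapping is the metadata of a type in an index. Every type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave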

11. Core mapping data types and dynamic mapping

(1) Core mapping data types

                string
                byte, short, integer, long
                float, double
                boolean
                date

(2) dynamic mapping

               true or false   -->   boolean
               123       -->   long
               123.45       -->   double
               2017-01-01   -->   date
               "hello world"   -->   string/text

(3) Viewing the mapping

GET /index/_mapping/type
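
The dynamic mapping table in (2) can be sketched as a type-inference function in Python. It follows only the five rules listed there; the date check accepts just the yyyy-MM-dd shape used in this article, which is a simplifying assumption, and `infer_dynamic_type` is a hypothetical name.

```python
from datetime import datetime


def infer_dynamic_type(value) -> str:
    """Guess the mapping type for a JSON value, per the table in (2)."""
    # bool must be tested first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date shape
            return "date"
        except ValueError:
            return "text"
    raise TypeError(f"no rule for {type(value).__name__}")


guessed = {
    field: infer_dynamic_type(value)
    for field, value in {
        "post_date": "2017-01-01",
        "title": "my first article",
        "author_id": 11400,
    }.items()
}
```

The guessed types for the website/article fields agree with the mapping returned in section 5.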

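12. Creating and modifying a mapping by hand, and controlling whether string data is analyzed

(1) Creating and modifying a mapping by hand

You can define a mapping by hand only when creating the index, or by adding a mapping for a new field; the mapping of an existing field cannot be updated

------------------------ Create by hand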
PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "author_id": {
          "type": "long"
        },
        "title": {
          "type": "text",
          "analyzer": "english"
        },
        "content": {
          "type": "text"
        },
        "post_date": {
          "type": "date"
        },
        "publisher_id": {
          "type": "keyword"
        }
      }
    }
  }
}

---------------------------- Modify
PUT /website
{
  "mappings": {
    "article": {
      "properties": {
        "author_id": {
          "type": "text"
        }
      }
    }
  }
}

The request fails, because PUT /website tries to create an index that already exists:
{
  "error": {
    "root_cause": [
      {
        "type": "index_already_exists_exception",
        "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
        "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
        "index": "website"
      }
    ],
    "type": "index_already_exists_exception",
    "reason": "index [website/co1dgJ-uTYGBEEOOL8GsQQ] already exists",
    "index_uuid": "co1dgJ-uTYGBEEOOL8GsQQ",
    "index": "website"
  },
  "status": 400
}


--------------------- To change the mapping, you can only add new fields
PUT /website/_mapping/article
{
  "properties" : {
    "new_field" : {
      "type" :    "keyword"
    }
  }
}
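
The rule from this section (new fields may be added, existing field mappings may not be changed) can be modelled as a small merge function. This is a deliberately simplified model of the rule as stated in this article, not the checks Elasticsearch actually performs; `add_field_mappings` is a hypothetical name.

```python
def add_field_mappings(existing: dict, new_properties: dict) -> dict:
    """Return the merged properties, refusing to change an existing field."""
    merged = dict(existing)
    for field, definition in new_properties.items():
        if field in merged and merged[field] != definition:
            raise ValueError(f"mapping for existing field [{field}] cannot be changed")
        merged[field] = definition
    return merged


properties = {"author_id": {"type": "long"}, "post_date": {"type": "date"}}

# Adding a new field is accepted.
properties = add_field_mappings(properties, {"new_field": {"type": "keyword"}})

# Changing the type of an existing field is rejected.
try:
    add_field_mappings(properties, {"author_id": {"type": "text"}})
    rejected = False
except ValueError:
    rejected = True
```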

(2) Testing the mapping

GET /website/_analyze
{
  "field": "content",
  "text": "my-dogs" 
}

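13. Complex mapping data types and how object data is stored internally

(1) multivalue field

{ "tags": [ "tag1", "tag2" ]} is indexed the same way as a single string; the values in one array must not mix data types

(2) empty field

null, [], [null]

(3) object field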

PUT /company/employee/1
{
  "address": {
    "country": "china",
    "province": "guangdong",
    "city": "guangzhou"
  },
  "name": "jack",
  "age": 27,
  "join_date": "2017-01-01"
}

Here address is of type object

----------------- How an object field is stored internally
{
    "name":            [jack],
    "age":          [27],
    "join_date":      [2017-01-01],
    "address.country":         [china],
    "address.province":   [guangdong],
    "address.city":  [guangzhou]
}


{
    "authors": [
        { "age": 26, "name": "Jack White"},
        { "age": 55, "name": "Tom Jones"},
        { "age": 39, "name": "Kitty Smith"}
    ]
}
// the corresponding internal structure
{
    "authors.age":    [26, 55, 39],
    "authors.name":   [jack, white, tom, jones, kitty, smith]
}
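
The flattening shown above can be sketched as a recursive Python function: nested objects become dotted field names, arrays are merged into one value list per field, and string values are split into lowercase terms. The whitespace split is a simplified stand-in for real analysis, and the function names are illustrative.

```python
def flatten(document: dict) -> dict:
    """Flatten a nested document into dotted field names with value lists."""
    flat = {}
    for key, value in document.items():
        _collect(key, value, flat)
    return flat


def _collect(path: str, value, flat: dict) -> None:
    if isinstance(value, dict):
        # Nested object: extend the dotted path with each inner field name.
        for key, inner in value.items():
            _collect(f"{path}.{key}", inner, flat)
    elif isinstance(value, list):
        # Array: every element contributes to the same field.
        for item in value:
            _collect(path, item, flat)
    elif isinstance(value, str):
        flat.setdefault(path, []).extend(value.lower().split())
    else:
        flat.setdefault(path, []).append(value)


authors = flatten({
    "authors": [
        {"age": 26, "name": "Jack White"},
        {"age": 55, "name": "Tom Jones"},
        {"age": 39, "name": "Kitty Smith"},
    ]
})
```

Note what the flat form loses: nothing records that 26 belongs with jack and white rather than with tom or kitty, because each field is just one merged list.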
