A Deeper Dive into ElasticSearch

Preface

Suppose a MySQL database has a table shaped like the one in the figure below.

Now a requirement comes in: find every row containing "傻逼" (roughly, "idiot"). The obvious answer is a fuzzy match: LIKE '%傻逼%'.

A fuzzy match does answer the question, but a leading-wildcard LIKE defeats the index entirely; the index simply can't be used.

In short, it's far too slow. With millions, tens of millions, or billions of rows, you can't fuzzy-match your way through that much data; you'd be waiting until dark.

So!!! Time for a hero to make an entrance: the inverted index!!!


What is an inverted index?

As the figure shows, each row of data is tokenized; for every token we record how often it appears, where it appears in the data, and which documents it points to.

Data structures

Term Dictionary

We've all studied the trie (dictionary tree): walk the characters of every term and build a prefix tree out of them.

But the trie's weakness is just as obvious: it shares prefixes, not suffixes. If the terms a→b→c and a→c both exist, the trailing c is already in the tree but can't be reused; a new branch must be opened. At large scale that wastes a lot of space, so ES uses an FST (finite state transducer), a structure that can share suffixes as well. (FST: to be studied later...)

Posting List

For example, once "傻逼" in an article is tokenized, walking the term dictionary leads to a slot in the posting list.

That posting records, for the term "傻逼": the IDs of the articles it appears in, its frequency, its positions within the article, and its start/end offsets.

Document IDs: 1, 2, 3

Term frequency (TF): 3

Position: 4

Offsets: in "我是大傻逼", the term occupies (4, 5).

Document IDs are stored as ints. At big-data scale with millions or billions of documents, if 5,000,000 documents contain "傻逼", the posting list must record all 5,000,000 IDs, which eats a lot of space. Two compression algorithms address this.

The FOR algorithm (Frame of Reference)

现在"傻逼"一次关联的 文档Id有6个,分别是1,8,96,250,3600,48000。

Uncompressed they occupy 6 ints × 4 bytes × 8 bits = 192 bits.

First delta-encode: V(n) = V(n) − V(n−1), giving 1, 7, 88, 154, 3350, 44400. The largest delta is 44400, and since 2^16 = 65536 holds 44400 (while 2^15 = 32768 does not), each delta needs 16 bits. Add a small header to record the bit width (say one byte) and you get 6 × 16 + 8 = 104 bits, roughly half of the original 192.

The RBM algorithm (RoaringBitmap)

If the IDs form a sparse sequence like 100, 1000, 10000, 196658, FOR doesn't work well, because the deltas stay large. Hence the RBM (RoaringBitmap) algorithm.

100 in binary is 0000 0000 0000 0000 0000 0000 0110 0100; the high 16 bits are 0 in decimal and the low 16 bits are 100, giving the pair (0, 100).

196658 in binary is 0000 0000 0000 0011 0000 0000 0011 0010; the high 16 bits are 3 in decimal and the low 16 bits are 50, giving (3, 50).
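
In general, for a document ID n the pair is key = n >> 16 = ⌊n / 65536⌋ and value = n & 0xFFFF = n mod 65536. Checking against 196658: 196658 = 3 × 65536 + 50, so the pair is (3, 50), matching the bit layout above.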

The low 16 bits go into a container; one container type is a bitmap, and Java's bitmap implementations sit on a long[] array underneath, where 1 long = 8 bytes = 64 bits.


What is a node?

ES 7.0 supports the node roles master, coordinate, data, and ingest, and one node can take on several roles at once.

master (master node)

A cluster can have multiple master-eligible nodes. The master handles index creation and deletion, allocates shards to data nodes, and keeps track of the other nodes in the cluster. Document-level operations do not go through the master node.

coordinate (coordinating node)

A cluster can have multiple coordinating nodes. They forward requests, then gather and merge the responses; in plain terms, load balancing.

data (data node)

A cluster can have multiple data nodes. Data nodes store the shards, replicas, and documents; document-level operations are processed on data nodes.

ingest (ingest node)

ES also supports this pre-processing node role. An ingest node mostly stays out of the cluster interactions above; its job is to pre-process documents as they are indexed, for example rewriting default field names or stamping an ingest timestamp.


What are shards and replicas?

When an index is created, the master node decides which data nodes its shards and replicas are placed on. One shard corresponds to one Lucene instance.

A replica is a copy of a primary shard, used for data redundancy.
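
A minimal sketch (my_index is a placeholder): create an index with 3 primary shards, 1 replica each, then check which nodes the shards were allocated to:

PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

GET _cat/shards/my_index?v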

How document writes work

A data node is made up of shards; each shard is a Lucene index instance, and every Lucene index contains multiple segments. Documents are stored inside the segments.

When a document is written, the flow is:

1. The coordinating node routes the write to a data node.

2. Once the document reaches the data node, ES writes it to the translog file and the in-memory buffer at the same time; at this point the document is not yet searchable.

3. Every 1 second by default, the buffer is refreshed: its documents are written into a newly created segment, and that segment enters the OS cache. From then on the document is searchable.

4. Every 30 minutes by default, or when the translog grows large enough, a flush is triggered. A flush: 1) force-refreshes whatever is left in the buffer; 2) writes a commit point (the translog records every ES operation; each commit point names its segments); 3) fsyncs the cached segments to disk; 4) deletes the old translog and creates a new one. Both refresh and flush can also be driven by hand, as sketched below.
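
A minimal sketch of the related APIs (my_index is a placeholder): trigger refresh and flush manually, and lengthen the default 1s refresh interval:

POST my_index/_refresh
POST my_index/_flush

PUT my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}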

Additional notes:

Segments also get merged: a search has to visit every segment, which wastes resources, so once segments reach a certain number and size they are merged. When the merge completes, the old segments and the .del file are deleted.

A delete in ES only marks the document as deleted, recorded in the .del file; the document is truly removed only during a merge. Updates work the same way: once a document is first written, its segment is fixed and cannot be modified, so an edit marks the old document as deleted, writes a fresh copy, and bumps the version field.
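
Merging can also be forced by hand; a sketch (my_index is a placeholder) using the force-merge API, best reserved for indices that no longer receive writes:

POST my_index/_forcemerge?max_num_segments=1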


What is an analyzer?

A standard analyzer has three parts: character filters, tokenizers, and token filters.

ES lets you assemble a custom analyzer out of these building blocks.

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For example, a character filter could convert Hindu-Arabic numerals (٠١٢٣٤٥٦٧٨٩) into their Arabic-Latin equivalents (0123456789), or strip HTML elements such as <b> from the stream.

An analyzer may have zero or more character filters, applied in order.

Tokenizers

A tokenizer receives the character stream, breaks it into individual tokens (usually single words), and outputs a stream of tokens. For example, the whitespace tokenizer splits whenever it sees whitespace: it turns the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording each term's order/position and the start and end character offsets of the original word it represents.

An analyzer must have exactly one tokenizer.

Token filters

A token filter receives the token stream and can add, remove, or change tokens. For example, the lowercase token filter lowercases every token, the stop token filter removes common stopwords such as the, and the synonym token filter injects synonyms into the stream.

Token filters are not allowed to change a token's position or character offsets.

An analyzer may have zero or more token filters, applied in order.

Analyzers that ship with ES

Standard

Behavior: lowercases; splits on word boundaries

GET _analyze
{
  "analyzer":"standard",
  "text":"1 I am the hero, GG"
}

#result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "i",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "am",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "the",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "hero",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gg",
      "start_offset" : 17,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

Simple

Behavior: strips everything that isn't a letter; lowercases

GET _analyze
{
  "analyzer":"simple",
  "text":"1 I am the-hero"
}

#result
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "the",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "hero",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}

Whitespace

Behavior: splits on whitespace

GET _analyze
{
  "analyzer":"whitespace",
  "text":"1 I am the-hero"
}

#result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "I",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "am",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "the-hero",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}

Stop

Behavior: removes stopwords such as the and a; strips all non-letter characters

GET _analyze
{
  "analyzer":"stop",
  "text":"1 I am a the-hero"
}

#result
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hero",
      "start_offset" : 11,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    }
  ]
}

Keyword

Behavior: no tokenization; the whole input becomes a single token

GET _analyze
{
  "analyzer":"keyword",
  "text":"1 I am a the-hero"
}
 
#result
{
  "tokens" : [
    {
      "token" : "1 I am a the-hero",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}

Pattern

Behavior: splits on a regular expression (non-word characters by default); lowercases

GET _analyze
{
  "analyzer":"pattern",
  "text":"1 I am a the-hero"
}
 
#result
{
  "tokens" : [
    {
      "token" : "1",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "i",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "am",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "a",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "hero",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 5
    }
  ]
}

Chinese analyzers

#install the analysis-icu plugin, run from ES/bin
./elasticsearch-plugin install analysis-icu

GET _analyze
{
  "analyzer":"icu_analyzer",
  "text":"你说得十分有理"
}
 
#result
{
  "tokens" : [
    {
      "token" : "你",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "说得",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "十分",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "有理",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}

Custom analyzers

PUT /user
{
 "settings": {
   "analysis": {
     "analyzer": {
       "my_custom_analyzer":{
         "type":"custom",
         "char_filter":["emoticons"],
         "tokenizer":"punctuation",
         "filter":["lowercase","english_stop"]
       }
     },
     "tokenizer": {
       "punctuation":{
         "type":"pattern",
         "pattern":"[ .,!?]"
       }
     },
     "char_filter": {
       "emoticons":{
         "type":"mapping",
         "mappings":[":) => _happy_",":( => _sad_"]
       }
     },
     "filter": {
       "english_stop":{
         "type":"stop",
         "stopwords":"_english_"
       }
     }
   }
 }
}
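
The original jumps straight to the result; judging from the token offsets, the _analyze request that produced it was along these lines (the text "I am :)" is inferred, not from the original):

GET /user/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I am :)"
}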

#result
{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "_happy_",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 2
    }
  ]
}

What is a Mapping?

Setting a mapping manually

PUT /user
{
  "mappings":{
    "properties": {
      "firstname":{
        "type":"keyword", # keyword和text 区别在于keyword不会分词 text会分词
        "copy_to": "fullname" #类似于临时表 可以创建临时字段进行查询
      },
      "lastname":{
        "type":"keyword",
        "copy_to": "fullname"
      },
      "address":{
        "type":"text"
      },
      "phone":{
        "type":"long",
        "index":false, # 设置此项 条件查询无法查询到,因为不会建立索引
         "null_value":"NULL" #允许值为null 查询null的时候就查询得出来
      }
    }
  }
}
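
A quick sketch of what copy_to buys you (the sample values are made up): after indexing a document, both name parts match through the virtual fullname field:

PUT /user/_doc/1
{
  "firstname": "John",
  "lastname": "Smith"
}

GET /user/_search
{
  "query": {
    "match": { "fullname": "John Smith" }
  }
}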

What are CRUD and Bulk?

Create an index

PUT my_index

Delete an index

DELETE my_index

Query indices

#list all indices
GET _cat/indices

#show the index's mapping and settings in detail
GET my_index

Update an index

Covered later.

Create and update documents

#PUT requires an explicit id (if the document already exists, this is an update)
PUT my_index/_doc/01
{
  "desc":"test"  
}

#POST generates the id automatically (each call creates a new document)
POST my_index/_doc 
{
  "desc":"test"  
}

Delete a document

DELETE my_index/_doc/01

Bulk operations

#create
POST my_index/_bulk
{"create":{"_id":"03"}}
{"desc":"哈哈哈哈哈啊哈"}
{"create":{"_id":"04"}}
{"desc":"哇擦擦擦"}

#index
POST my_index/_bulk
{"index":{"_id":"03"}}
{"desc":"哈哈哈哈哈啊11哈"}
{"index":{"_id":"04"}}
{"desc":"哇擦擦擦11"}

#update
POST my_index/_bulk
{"update":{"_id":"03"}}
{"doc":{"desc":"哈哈哈哈哈啊哈22222"}}
{"update":{"_id":"04"}}
{"doc":{"desc":"哇擦擦擦22222"}}

#delete
POST my_index/_bulk
{"delete":{"_id":"03"}}
{"delete":{"_id":"04"}}

What are the basic queries?

ES supports querying straight from the URL, but complex conditions get awkward that way, hence the Query DSL.

term、terms

A term query does not analyze the search keyword.

1.text = "I love money"
2.ES自动分类为text, text类型会分词为[i,love,money]
3.查询关键词为 "I love money",term不会对查询关键词分词,也就是说拿着"I love money"去匹配[i,love,money]
4.解决办法 1.使用.keyword 2.查询关键词为 i,love,money]中任一一个

Example:
PUT /test/_doc/01
{
    "title": "love China",
    "content": "love China",
    "tags": ["China", "love"]
}

PUT /test/_doc/02
{
    "title": "love HuBei",
    "content": "people very love HuBei",
    "tags": ["HuBei", "love"]
}

Option 1: keyword
GET /test/_search
{
  "query":{
    "term": {
      "title.keyword": "love HuBei"
    }
  }
}

Option 2: query a single analyzed term
GET /test/_search
{
  "query":{
    "term": {
      "title": "love"
     }
  }
}

Option 3: bool
GET /test/_search
{
  "query":{
    "bool": {
      "must": [
        {"term":{"title":"love"}},
        {"term":{"title":"HuBei"}}
      ]
    }
  }
}

A terms query matches several terms at once (OR relation).

If we want documents containing either man or love, for example:

PUT /test/_doc/03
{
    "title": "Spider Man",
    "content": "xxxxxx",
    "tags": ["x", "x"]
}

GET /test/_search
{
  "query":{
    "terms": {
      "title": ["love","man"]
     }
  }
}

match_all, match, match_phrase, multi_match

match_all is simply the unconditional match-everything query.

Example:
PUT /test/_doc/01
{
    "title": "love China",
    "content": "love China",
    "tags": ["China", "love"]
}

PUT /test/_doc/02
{
    "title": "love HuBei",
    "content": "people very love HuBei",
    "tags": ["HuBei", "love"]
}

GET /test/_search
{
  "query":{
    "match_all":{}
  }
}

A match query analyzes the search keyword.

1.text = "I love money"
2.ES自动分类为text, text类型会分词为[i,love,money]
3.查询关键词为 "I love money",match分词为[i,love,money],关系为or,只要满足任意一个条件就会被查询出来
例子如下

GET /test/_search
{
  "query":{
    "match":{
      "title":{
        "query":"love hubei"
      }
    }
  }
}

To make the relation AND instead, use operator, or query the keyword sub-field.
Examples:
GET /test/_search
{
  "query":{
    "match":{
      "title":{
        "query":"love hubei",
        "operator": "AND"
      }
    }
  }
}

GET /test/_search
{
  "query":{
    "match":{
      "title.keyword":{
        "query":"love HuBei"
      }
    }
  }
}

For match_phrase the default relation is AND (the terms must also appear adjacent and in order).

GET /test/_search
{
  "query":{
    "match_phrase": {
      "title": "love hubei"
    }
  }
}

multi_match matches the keyword against multiple fields.

Example:
GET /test/_search
{
  "query":{
    "multi_match": {
      "query": "love china",
      "fields": ["title","china"]
    }
  }
}

#multi_match has a type property, e.g. "best_fields" and "most_fields"; with "best_fields" the highest-scoring field becomes the final score, the same effect as a dis_max query
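
For example, best_fields with a tie_breaker that folds the losing fields back in at a discount (the 0.3 is illustrative):

GET /test/_search
{
  "query":{
    "multi_match": {
      "query": "love china",
      "type": "best_fields",
      "fields": ["title","content"],
      "tie_breaker": 0.3
    }
  }
}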

Paged queries

GET movies/_search
{
  "query":{
    "match_all":{}
  },
  "from":0,
  "size":2
}

#result
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.0,
        "_source" : {
          "id" : "13",
          "@version" : "1",
          "genre" : [
            "Adventure",
            "Animation",
            "Children"
          ],
          "year" : 1995,
          "title" : "Balto"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "257",
        "_score" : 1.0,
        "_source" : {
          "id" : "257",
          "@version" : "1",
          "genre" : [
            "Mystery",
            "Thriller"
          ],
          "year" : 1995,
          "title" : "Just Cause"
        }
      }
    ]
  }
}

Sorted queries

GET movies/_search
{
  "query":{
    "match_all":{}
  },
  "from":0,
  "size":2,
  "sort":[{"year":"desc"}]
}
 
#result
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "199237",
        "_score" : null,
        "_source" : {
          "id" : "199237",
          "@version" : "1",
          "genre" : [
            "Drama"
          ],
          "year" : 2019,
          "title" : "Paddleton"
        },
        "sort" : [
          2019
        ]
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "199255",
        "_score" : null,
        "_source" : {
          "id" : "199255",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Western"
          ],
          "year" : 2019,
          "title" : "The Kid"
        },
        "sort" : [
          2019
        ]
      }
    ]
  }
}

Source filtering (selecting fields)

GET movies/_search
{
  "query":{
    "match_all":{}
  },
  "from":0,
  "size":2,
  "sort":[{"year":"desc"}],
  "_source": ["year"]
}

#result
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "143345",
        "_score" : null,
        "_source" : {
          "year" : 2019
        },
        "sort" : [
          2019
        ]
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "196223",
        "_score" : null,
        "_source" : {
          "year" : 2019
        },
        "sort" : [
          2019
        ]
      }
    ]
  }
}

Script fields

GET movies/_search
{
  "script_fields":{
    "new_field":{
      "script":{
        "lang":"painless",
        "source":"doc['year'].value + '年'"
      }
    }
  },
  "query":{
    "match_all":{}
  },
  "from":0,
  "size":2
}

#result
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.0,
        "fields" : {
          "new_field" : [
            "1995年"
          ]
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "257",
        "_score" : 1.0,
        "fields" : {
          "new_field" : [
            "1995年"
          ]
        }
      }
    ]
  }
}

Conditional queries

GET movies/_search
{
  "query":{
    "match":{
      "title":"one love" #使用match, 查询条件是or, 只要title中存在one或者love,就匹配得到
    }
  }
}

#result
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1273,
      "relation" : "eq"
    },
    "max_score" : 10.874638,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141583",
        "_score" : 10.874638,
        "_source" : {
          "id" : "141583",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2011,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "176591",
        "_score" : 10.676434,
        "_source" : {
          "id" : "176591",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2003,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "113829",
        "_score" : 8.255679,
        "_source" : {
          "id" : "113829",
          "@version" : "1",
          "genre" : [
            "Comedy",
            "Drama",
            "Romance"
          ],
          "year" : 2014,
          "title" : "One I Love, The"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "146620",
        "_score" : 8.255679,
        "_source" : {
          "id" : "146620",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 1972,
          "title" : "For Love One Dies"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "124607",
        "_score" : 8.114714,
        "_source" : {
          "id" : "124607",
          "@version" : "1",
          "genre" : [
            "Romance"
          ],
          "year" : 1934,
          "title" : "One Night of Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "112060",
        "_score" : 7.214248,
        "_source" : {
          "id" : "112060",
          "@version" : "1",
          "genre" : [
            "Drama"
          ],
          "year" : 1977,
          "title" : "One on One"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "140994",
        "_score" : 7.214248,
        "_source" : {
          "id" : "140994",
          "@version" : "1",
          "genre" : [
            "(no genres listed)"
          ],
          "year" : 2011,
          "title" : "One. Two. One"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141608",
        "_score" : 7.0476203,
        "_source" : {
          "id" : "141608",
          "@version" : "1",
          "genre" : [
            "Drama"
          ],
          "year" : 2014,
          "title" : "One on One"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "206933",
        "_score" : 6.6144905,
        "_source" : {
          "id" : "206933",
          "@version" : "1",
          "genre" : [
            "(no genres listed)"
          ],
          "year" : 2013,
          "title" : "Love. Love. Love."
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "197013",
        "_score" : 6.2997913,
        "_source" : {
          "id" : "197013",
          "@version" : "1",
          "genre" : [
            "Drama"
          ],
          "year" : 2018,
          "title" : "One Nation, One King"
        }
      }
    ]
  }
}
GET movies/_search
{
  "query":{
    "match":{
      "title":{
        "query":"one love", #想查询同时包含one love的写法
        "operator":"AND"
      }
    }
  }
}

#result
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 10.874638,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141583",
        "_score" : 10.874638,
        "_source" : {
          "id" : "141583",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2011,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "176591",
        "_score" : 10.676434,
        "_source" : {
          "id" : "176591",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2003,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "113829",
        "_score" : 8.255679,
        "_source" : {
          "id" : "113829",
          "@version" : "1",
          "genre" : [
            "Comedy",
            "Drama",
            "Romance"
          ],
          "year" : 2014,
          "title" : "One I Love, The"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "146620",
        "_score" : 8.255679,
        "_source" : {
          "id" : "146620",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 1972,
          "title" : "For Love One Dies"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "124607",
        "_score" : 8.114714,
        "_source" : {
          "id" : "124607",
          "@version" : "1",
          "genre" : [
            "Romance"
          ],
          "year" : 1934,
          "title" : "One Night of Love"
        }
      }
    ]
  }
}

 

Phrase queries

GET movies/_search
{
  "query":{
    "match_phrase":{
      "title":{
        "query":"one love"
      }
    }
  }
}

#result
{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 10.874638,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141583",
        "_score" : 10.874638,
        "_source" : {
          "id" : "141583",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2011,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "176591",
        "_score" : 10.676434,
        "_source" : {
          "id" : "176591",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2003,
          "title" : "One Love"
        }
      }
    ]
  }
}
GET movies/_search
{
  "query":{
    "match_phrase":{
      "title":{
        "query":"one love",
        "slop":1 #允许词汇中有几个其他词汇的写法
      }
    }
  }
}

#result
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 10.874638,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141583",
        "_score" : 10.874638,
        "_source" : {
          "id" : "141583",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2011,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "176591",
        "_score" : 10.676434,
        "_source" : {
          "id" : "176591",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2003,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "113829",
        "_score" : 5.155677,
        "_source" : {
          "id" : "113829",
          "@version" : "1",
          "genre" : [
            "Comedy",
            "Drama",
            "Romance"
          ],
          "year" : 2014,
          "title" : "One I Love, The"
        }
      }
    ]
  }
}

 

QueryString

GET movies/_search
{
  "query":{
    "query_string": {
      "default_field": "title",
      "query": "one AND love"
    }
  }
}

#result
{
  "took" : 31,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 10.874638,
    "hits" : [
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "141583",
        "_score" : 10.874638,
        "_source" : {
          "id" : "141583",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2011,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "176591",
        "_score" : 10.676434,
        "_source" : {
          "id" : "176591",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 2003,
          "title" : "One Love"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "113829",
        "_score" : 8.255679,
        "_source" : {
          "id" : "113829",
          "@version" : "1",
          "genre" : [
            "Comedy",
            "Drama",
            "Romance"
          ],
          "year" : 2014,
          "title" : "One I Love, The"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "146620",
        "_score" : 8.255679,
        "_source" : {
          "id" : "146620",
          "@version" : "1",
          "genre" : [
            "Drama",
            "Romance"
          ],
          "year" : 1972,
          "title" : "For Love One Dies"
        }
      },
      {
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "124607",
        "_score" : 8.114714,
        "_source" : {
          "id" : "124607",
          "@version" : "1",
          "genre" : [
            "Romance"
          ],
          "year" : 1934,
          "title" : "One Night of Love"
        }
      }
    ]
  }
}


What is relevance scoring?

bool

ES offers two query modes: Query Context (computes a relevance score for each match) and Filter Context (no relevance scoring, and able to use caches for better performance).

A bool query supports four clauses: must, should, must_not, and filter.

must and should contribute to the score; must_not and filter do not.

Examples:
GET /test/_search
{
  "query":{
    "bool": {
      "must": [
        {"term": {"title": "love"}},
        {"term":{"title":"hubei"}}
      ]
    }
  }
}

GET /test/_search
{
  "query":{
    "bool": {
      "should": [
        {"term": {"title": "love"}},
        {"term":{"title":"hubei"}}
      ]
    }
  }
}

GET /test/_search
{
  "query":{
    "bool": {
      "must_not": [
        {"term":{"title":"hubei"}}
      ]
    }
  }
}

GET /test/_search
{
  "query":{
    "bool": {
      "filter": [
        {"term": {"title": "spider"}}
      ]
    }
  }
}

boost

Combining bool with boost

Borrowing an example from another blogger: we are hotel-hunting with three wants, a pool, a garden, and wifi,
ranked pool > garden > wifi.
A pool is mandatory; a garden or wifi is optional.

Example:
PUT /hotel/_doc/01
{
    "desc": "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden,of course, wifi"
}

PUT /hotel/_doc/02
{
    "desc": "we have garden,garden is very big, we belive you will love our garden,we also have pool, so big pool,of course, wifi"
}

PUT /hotel/_doc/03
{
    "desc": "we have wifi,wifi is very big, we belive you will love our wifi,we also have pool, so big pool,of course, garden"
}

Without boost, all three hotels get exactly the same score; if we want the hotels that best fit our preferences ranked higher, that won't do.
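
(The un-boosted query isn't shown in the original; presumably it was the same bool query as below, minus the boost values:)

GET hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"desc": {"value": "pool"}}}
      ],
      "should": [
        {"term": {"desc": {"value": "garden"}}},
        {"term": {"desc": {"value": "wifi"}}}
      ]
    }
  }
}

#result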
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.5269721,
    "hits" : [
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "01",
        "_score" : 0.5269721,
        "_source" : {
          "desc" : "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "02",
        "_score" : 0.5269721,
        "_source" : {
          "desc" : "we have garden,garden is very big, we belive you will love our garden,we also have pool, so big pool,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "03",
        "_score" : 0.5269721,
        "_source" : {
          "desc" : "we have wifi,wifi is very big, we belive you will love our wifi,we also have pool, so big pool,of course, garden"
        }
      }
    ]
  }
}

Adding the boost keyword:
GET hotel/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {
          "desc": {
            "value": "pool",
            "boost": 3
          }
        }}
      ],
      "should": [
        {"term": {
          "desc": {
            "value": "garden"
            , "boost": 2
          }
        }},
        {"term": {
          "desc": {
            "value": "wifi"
            , "boost": 1
          }
        }}
      ]
    }
  }
}

#result
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.1302478,
    "hits" : [
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "01",
        "_score" : 1.1302478,
        "_source" : {
          "desc" : "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "02",
        "_score" : 1.1040184,
        "_source" : {
          "desc" : "we have garden,garden is very big, we belive you will love our garden,we also have pool, so big pool,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "03",
        "_score" : 1.0277148,
        "_source" : {
          "desc" : "we have wifi,wifi is very big, we belive you will love our wifi,we also have pool, so big pool,of course, garden"
        }
      }
    ]
  }
}

boosting query

Now our requirement is: a pool is a must, and we'd rather not see wifi.

Insert one more document:
PUT /hotel/_doc/04
{
    "desc": "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden"
}

Using the boosting query:
GET hotel/_search
{
  "query":{
    "boosting": {
      "positive": {
        "term": {
          "desc": {
            "value": "pool"
          }
        }
      },
      "negative": {
        "term": {
          "desc": {
            "value": "wifi"
          }
        }
      },
      "negative_boost": 0.5
    }
  }
}

#result
{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4,
      "relation" : "eq"
    },
    "max_score" : 0.16907263,
    "hits" : [
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "04",
        "_score" : 0.16907263,
        "_source" : {
          "desc" : "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "01",
        "_score" : 0.08221495,
        "_source" : {
          "desc" : "we have pool,pool is very big, we belive you will love our pool,we also have garden, so big garden,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "02",
        "_score" : 0.07178409,
        "_source" : {
          "desc" : "we have garden,garden is very big, we belive you will love our garden,we also have pool, so big pool,of course, wifi"
        }
      },
      {
        "_index" : "hotel",
        "_type" : "_doc",
        "_id" : "03",
        "_score" : 0.07178409,
        "_source" : {
          "desc" : "we have wifi,wifi is very big, we belive you will love our wifi,we also have pool, so big pool,of course, garden"
        }
      }
    ]
  }
}

dis_max

Regular scoring combines contributions from every matching field; dis_max instead takes the highest single field score as the final score.

1. We search for "banana cookie" and want the closest match ranked first
2. With a bool should of two match clauses, document 01 actually ends up with the higher score
3. Reason: in document 01, cookie scores in title and scores again in desc; two contributing fields beat document 02, where only title matches
4. Boosting can't fix this either: we still want every field searched, just with the most precise single-field match coming out on top

Example:
POST test/_bulk
{"create":{"_id":"01"}}
{"title":"I love cookie", "desc":" cookie is wonderful"}
{"create":{"_id":"02"}}
{"title":"I love banana cookie", "desc":"it is wonderful"}

GET test/_search
{
  "query":{
    "bool": {
      "should": [
        {"match":{"title": "banana cookie"}},
        {"match":{"desc":"banana cookie"}}
      ]
    }
  }
}

Using dis_max:
GET test/_search
{
  "query":{
    "dis_max": {
      "queries": [
        {"match":{"title": "banana cookie"}},
        {"match":{"desc":"banana cookie"}}
      ]
    }
  }
}
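
dis_max also accepts a tie_breaker, which adds the non-best field scores back in at a discount instead of discarding them entirely (the 0.3 is illustrative):

GET test/_search
{
  "query":{
    "dis_max": {
      "queries": [
        {"match":{"title": "banana cookie"}},
        {"match":{"desc":"banana cookie"}}
      ],
      "tie_breaker": 0.3
    }
  }
}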

constant_score

constant_score gives every match the same fixed score; use it when you don't need relevance ranking.

GET test/_search
{
  "query":{
    "constant_score": {
      "filter": {
        "term": {
          "desc": "wonderful"
        }
      }
    }
  }
}

function_score

When ES runs a full-text search, results are ranked by document relevance by default. To change that you can sort on one or more fields, but sort is absolute: it throws the document's relevance away entirely.

Often that isn't good enough; you want several signals weighed together into one final ranking. That is what the function_score query is for: in Elasticsearch, function_score is the DSL for reshaping document scores. After the query matches, it re-scores every matching document through a series of functions and sorts by the final score produced.

POST test/_bulk
{"create":{"_id":"01"}}
{"desc":"Java入门","type":"哈哈","count":5}
{"create":{"_id":"02"}}
{"desc":"Java入门到不会","type":"哈呵呵","count":50}

#to rank articles about "Java入门" (Java basics) by their view count (the count field)
GET test/_search
{
  "query":{
    "function_score": {
      "query": {
        "match": {
          "desc": "java入门"
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "params": {
                "access_num_ratio": 2.5
              },
          "lang": "painless",
          "source": "doc['count'].value * params.access_num_ratio "
            }
          }
        }
      ]
    }
  }
}

What are aggregation queries?

ES aggregations come in three families: bucket aggregations, metrics aggregations, and pipeline aggregations.

Create some practice mappings and documents.

PUT food
{
  "mappings": {
    "date_detection": false, 
    "properties": {
      "CreateTime":{
        "type":"date",
        "format": "yyyy-MM-dd HH:mm:ss" 
      },
      "Desc":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword", 
            "ignore_above":256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "Level":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "Name":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        },
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "Price":{
        "type": "float"
      },
      "Tags":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      },
      "Type":{
        "type": "text",
        "fields": {
          "keyword":{
            "type":"keyword",
            "ignore_above":256
          }
        }
      }
    }
  }
}

PUT food/_doc/1
{
  "CreateTime":"2022-06-06 11:11:11",
  "Desc":"青菜 yyds 营养价值很高,很好吃",
  "Level":"普通蔬菜",
  "Name":"青菜",
  "Price":11.11,
  "Tags":["性价比","营养","绿色蔬菜"],
  "Type":"蔬菜"
}

PUT food/_doc/2
{
  "CreateTime":"2022-06-06 13:11:11",
  "Desc":"大白菜 好吃 便宜 水分多",
  "Level":"普通蔬菜",
  "Name":"大白菜",
  "Price":12.11,
  "Tags":["便宜","好吃","白色蔬菜"],
  "Type":"蔬菜"
}

PUT food/_doc/3
{
  "CreateTime":"2022-06-07 13:11:11",
  "Desc":"芦笋来自国外进口的蔬菜,西餐标配",
  "Level":"中等蔬菜",
  "Name":"芦笋",
  "Price":66.11,
  "Tags":["有点贵","国外","绿色蔬菜","营养价值高"],
  "Type":"蔬菜"
}

PUT food/_doc/4
{
  "CreateTime":"2022-07-07 13:11:11",
  "Desc":"苹果 yyds 好吃 便宜 水分多 营养",
  "Level":"普通水果",
  "Name":"苹果",
  "Price":11.11,
  "Tags":["性价比","易种植","水果","营养"],
  "Type":"水果"
}

PUT food/_doc/5
{
  "CreateTime":"2022-07-09 13:11:11",
  "Desc":"榴莲 非常好吃 很贵 吃一个相当于吃一只老母鸡",
  "Level":"高级水果",
  "Name":"榴莲",
  "Price":100.11,
  "Tags":["贵","水果","营养"],
  "Type":"水果"
}

PUT food/_doc/6
{
  "CreateTime":"2022-07-08 13:11:11",
  "Desc":"猫砂王榴莲 榴莲中的战斗机",
  "Level":"高级水果",
  "Name":"猫砂王榴莲",
  "Price":300.11,
  "Tags":["超级贵","进口","水果","非常好吃"],
  "Type":"水果"
}

PUT student
{
  "mappings": {
    "properties": {
      "gender":{
        "type": "keyword"
      },
      "name":{
        "type":"text",
        "fields": {
          "keyword":{
            "type":"keyword"
          }
        }
      }
    }
  }
}

POST student/_bulk
{"create":{"_id":"01"}}
{"gender":"male","name":"Jack Lennon"}
{"create":{"_id":"02"}}
{"gender":"male","name":"Jimmy Page"}
{"create":{"_id":"03"}}
{"gender":"male","name":"Andrew Lennon"}
{"create":{"_id":"04"}}
{"gender":"male","name":"Paul McCartney"}
{"create":{"_id":"05"}}
{"gender":"female","name":"Angline X"}
{"create":{"_id":"06"}}
{"gender":"female","name":"Angel Y"}
{"create":{"_id":"07"}}
{"gender":"female","name":"Jonnan X"}
{"create":{"_id":"08"}}
{"gender":"female","name":"Monica X"}

Bucket aggregations

Bucket aggregations are roughly MySQL's GROUP BY.

#plain ordering
#food: bucket by tag, order buckets by document count, descending
GET food/_search
{
  "size": 0,
  "aggs": {
    "tag_aggs": {
      "terms": {
        "field": "Tags.keyword",
        "size": 30,
        "order": {
          "_count": "desc"
        }
      }
    }
  }
}

#students: multiple filter buckets in one query
GET student/_search
{ 
  "size": 0, 
  "aggs": {
    "a": {
      "filters": {
        "filters": {
          "male": {"match": {
            "name": "Lennon"
          }},
          "female":{"match": {
            "name": "X"
          }}
        }
      }
    }
  }
}

#average price of fruit alongside the average price of all food (via a global bucket)
GET food/_search
{ 
  "size":0,
  "query": {"match": {"Type.keyword": "水果"}},
  "aggs": {
    "all": {
      "global": {},
      "aggs": {
        "a": {
          "avg": {"field": "Price"}
        }
      }
    },
    "fruit":{
      "avg": {
        "field": "Price"
      }
    } 
  }
}

Metrics aggregations

Metrics aggregations compute statistics over documents, such as a field's max, min, sum, or average; the output is a computed value attached to the response rather than a list of documents.

GET food/_search
{
  "size": 0, 
  "aggs": {
    "max_price":{
      "max": {
        "field": "Price"
      }
    },
    "min_price":{
      "min": {
        "field": "Price"
      }
    },
    "avg_price":{
      "avg": {
        "field": "Price"
      }
    },
    "sum_price":{
      "sum":{
        "field": "Price"
      }
    }
  }
}
#stats returns max, min, avg, and sum in one go
GET food/_search
{
  "size": 0, 
  "aggs": {
    "price_stats": {
      "stats": {
        "field": "Price"
      }
    }
  }
}

Pipeline aggregations

Pipeline aggregations aggregate the output of other aggregations (buckets or bucket metrics) rather than documents; they are follow-up computations over the buckets that enrich the output with derived values.

#for each food type, then each level within it, find the lowest price
GET food/_search
{
  "size": 0, 
  "aggs": {
    "foodtype": {
      "terms": {
        "field": "Type.keyword",
        "size": 100
      },
      "aggs": {
        "foodlevel": {
          "terms": {
            "field": "Level.keyword",
            "size": 100
          },
          "aggs": {
            "minprice": {
              "min": {
                "field": "Price"
              }
            }
          }
        }
      }
    }
  }
}
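
Strictly speaking, the request above is a nested bucket + metric aggregation. A true pipeline aggregation consumes another aggregation's output through buckets_path; a sketch using min_bucket to find the type with the lowest average price:

GET food/_search
{
  "size": 0,
  "aggs": {
    "types": {
      "terms": { "field": "Type.keyword" },
      "aggs": {
        "avg_price": { "avg": { "field": "Price" } }
      }
    },
    "min_avg_price": {
      "min_bucket": { "buckets_path": "types>avg_price" }
    }
  }
}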

What is deep pagination?

Suppose an index holds 30,000 documents across three nodes, one shard each storing 10,000. Documents land on shards in no particular order, yet search results must be ranked by relevance.

Now we want the documents at positions 5000-5100. ES proceeds as follows.

1. The coordinating node sends the query for positions 5000-5100 to all three shards.

2. Each shard sorts by relevance internally and returns its top 5100 documents.

3. The coordinating node now holds 3 × 5100 = 15300 documents, re-sorts them by relevance, and returns positions 5000-5100.

Deep pagination causes two problems: first, deep pages move a large amount of data around; second, because each shard sorts locally before the coordinating node re-sorts, results can be slightly inaccurate.

By default ES caps from + size at 10,000; going past that raises an error.

GET food/_search
{
  "from": 10000,
  "size":1
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "food",
        "node" : "JkhchXCrR1SwISI528QMWA",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
      }
    }
  },
  "status" : 400
}

For paging beyond 10,000, ES provides search_after and scroll.

search_after

The search_after parameter is stateless: it is always resolved against the latest version of the searcher, so as documents are updated or deleted, the sort order may change while you page through.

search_after requires from = 0.
Here I sort on _id as the unique sort key.
Take the sort value from the last hit of a page and pass it to search_after for the next page.

GET food/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {"Type.keyword": {"value": "蔬菜"}}
        }
      ]
    }
  },
  "from":0,
  "size":1,
 
  "sort": [
    {
      "_id":{"order":"desc"}
    }
  ]
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "food",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_source" : {
          "CreateTime" : "2022-06-07 13:11:11",
          "Desc" : "芦笋来自国外进口的蔬菜,西餐标配",
          "Level" : "中等蔬菜",
          "Name" : "芦笋",
          "Price" : 66.11,
          "Tags" : [
            "有点贵",
            "国外",
            "绿色蔬菜",
            "营养价值高"
          ],
          "Type" : "蔬菜"
        },
        "sort" : [
          "3" #使用该值作为pagenum传入下次查询
        ]
      }
    ]
  }
}

The second query:
GET food/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {"Type.keyword": {"value": "蔬菜"}}
        }
      ]
    }
  },
  "from":0,
  "size":1,
  "search_after":["3"],
  "sort": [
    {
      "_id":{"order":"desc"}
    }
  ]
}

scroll

scroll takes a snapshot of the index at the time of the first query, so documents inserted afterwards are not visible: it is not real-time. It keeps a cursor that subsequent requests resume from.
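
The scroll_id below comes from an initial search that opens the scroll context; a sketch of that first request (keeping a 1-minute keep-alive):

GET food/_search?scroll=1m
{
  "size": 1,
  "query": { "match_all": {} }
}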

GET _search/scroll
{
  "scroll":"1m",
  "scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFlpqWHF2eTFWUWVhUlBMZmFrY0ZRc2cAAAAAAABo9BZKa2hjaFhDclIxU3dJU0k1MjhRTVdB"
}

What is concurrency control?

ES provides two optimistic-locking mechanisms for concurrency safety.

Internal versioning

  • if_seq_no + if_primary_term

External versioning (when another database is the primary data store)

  • version + version_type = external
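
A sketch of both flavors (the document IDs and numbers are made up):

#internal: apply the write only if seq_no/primary_term still match what we last read
PUT my_index/_doc/1?if_seq_no=10&if_primary_term=1
{
  "name": "jimmy"
}

#external: the write wins only if the supplied version is higher than the stored one
PUT my_index/_doc/1?version=5&version_type=external
{
  "name": "jimmy"
}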


What are nested and parent-child objects?

Nested objects

In the figure below, an array of objects is indexed; ES flattens it into author.country: [a, b] and author.name: [托尼斯塔克, 孙悟空].

As a result, a query that mixes one object's name with another object's country surprisingly matches, which it shouldn't.

To prevent this, map the field as the nested type up front; ES will then index each object in the array as its own unit.
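
A sketch of the nested mapping and query (the index name books is a placeholder; the sample values mirror the figure):

PUT books
{
  "mappings": {
    "properties": {
      "author": {
        "type": "nested",
        "properties": {
          "name": { "type": "keyword" },
          "country": { "type": "keyword" }
        }
      }
    }
  }
}

PUT books/_doc/1
{
  "author": [
    { "name": "托尼斯塔克", "country": "a" },
    { "name": "孙悟空", "country": "b" }
  ]
}

#with nested, this cross-object combination no longer matches
GET books/_search
{
  "query": {
    "nested": {
      "path": "author",
      "query": {
        "bool": {
          "must": [
            { "term": { "author.name": "托尼斯塔克" } },
            { "term": { "author.country": "b" } }
          ]
        }
      }
    }
  }
}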

Parent-child documents

#create the index
PUT mytest
{
  "mappings": {
    "properties": {
      "father_son_relation":{
        "type":"join", #the join type defines a parent-child relation
        "relations":{
          "father":"son" # parent : child
        }
      },
      "name":{
        "type":"keyword"
      }
    }
  }
}

#create the parent document
PUT mytest/_doc/01?routing=1 #routing pins the document to a specific shard
{
  "name":"father",
  "father_son_relation":{
    "name":"father" #mark this document as a parent
  }
}

#create the child document
PUT mytest/_doc/02?routing=1 #must use the same routing as the parent
{
  "name":"son",
  "father_son_relation":{
    "parent":"01", #id of the parent document
    "name":"son" #mark this document as a child
  }
}

#find child documents by querying their parent
GET mytest/_search
{
  "query":{
    "has_parent": {
      "parent_type": "father",
      "query": {
        "match_all": {}
      }
    }
  }
}

#find parent documents by querying their children
GET mytest/_search
{
  "query":{
    "has_child": {
      "type": "son",
      "query": {
        "match_all": {}
      }
    }
  }
}

What are index templates?

Exactly what it sounds like: define settings and mappings as a template in advance; when an index is created, ES matches templates against the index name.

PUT _template/my_template #if the new index defines its own mappings/settings, those take priority
{
   "index_patterns": ["*"], #index-name patterns this template applies to
   "order": 1, #if several templates match an index, the higher order wins
   "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 2
   }
}

#dynamic template
PUT my_index
{
  "mappings":{
    "dynamic_templates":[
      {
        "full_name":{
          "path_match":"*.name",
          "mapping":{
            "type":"text",
            "copy_to":"full_name"
          }
        }
      }
    ]
  }
}

What is an ingest pipeline?

Ingest pipelines are ES's pre-processing mechanism. The ingest node role mentioned earlier is what runs these pipelines, massaging data before it is finally written to the data nodes; ingest nodes don't otherwise take a real part in the cluster work described above.

The point of an ingest pipeline is to clean data up front so that only the final shape is written: for example, converting a date written as 2023/02/22 into 2023-02-22, or adding new fields on top of the original data.

#create an ingest pipeline
PUT _ingest/pipeline/my_ingest
{
  "description": "my test ingest",
  "processors": [
    {
      "set": { #set: add a field
        "field": "country",
        "value": "china"
      }
    },
    {
      "lowercase": { #lowercase this field
        "field": "name"
      }
    },
    {
      "drop": { #drop the whole document
        "if": "ctx.age > 35" #when this condition holds
      }
    }
  ]
}

#index data through the pipeline
PUT ingest_test/_bulk?pipeline=my_ingest
{"create":{"_id":"01"}}
{"name":"jimmy","age":23}
{"create":{"_id":"02"}}
{"name":"page","age":46}

What are Reindex and UpdateByQuery?

reindex is the API ES provides for rebuilding an index.

Once an index is created, its mapping cannot be changed in place: adding primary shards, changing a field's type, or renaming a field is impossible on the existing index.

reindex means creating a new index and migrating every document from the source index into the target. Note that reindex requires the source index to have _source enabled.

The steps are:

1. Create the source index, with age mapped as keyword and 3 primary shards.

2. Create the target index.

3. Run the reindex (see the sketch below).
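
A sketch of the three steps (the index names are placeholders):

#1. source index: age as keyword, 3 primary shards
PUT source_index
{
  "settings": { "number_of_shards": 3 },
  "mappings": {
    "properties": {
      "age": { "type": "keyword" }
    }
  }
}

#2. target index: suppose we now want age as an integer
PUT target_index
{
  "settings": { "number_of_shards": 3 },
  "mappings": {
    "properties": {
      "age": { "type": "integer" }
    }
  }
}

#3. migrate the documents
POST _reindex
{
  "source": { "index": "source_index" },
  "dest": { "index": "target_index" }
}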

The reindexing process itself can be optimized (settings sketched after this list).

1. During the bulk write we don't need reads right away, so drop the replicas entirely: a write to a primary shard must also be written to every replica, which is wasted effort here. The replica count can be changed at any time.

2. Disable refresh: refresh moves buffered documents into the cache, which is cheap individually, but the 1s default interval is far too frequent for a bulk migration.

3. Under the hood, reindex reads the source index with a scroll query; slices splits that scroll into several tasks that run in parallel, and the per-batch size can be set as well.

4. reindex's version_type supports internal and external: internal simply overwrites target documents with the same id and type; external copies only documents that are missing or whose version is lower than the source's. With op_type: create, only missing documents are copied.
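
A hedged sketch of those optimizations (the values are illustrative):

#before reindexing: drop replicas and disable refresh on the target
PUT target_index/_settings
{
  "number_of_replicas": 0,
  "refresh_interval": "-1"
}

#split the underlying scroll into 2 parallel slices, 5000 docs per batch
POST _reindex?slices=2
{
  "source": { "index": "source_index", "size": 5000 },
  "dest": { "index": "target_index", "version_type": "external" }
}

#afterwards: restore replicas and refresh
PUT target_index/_settings
{
  "number_of_replicas": 1,
  "refresh_interval": "1s"
}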

updateByQuery

If we add a new sub-field to a mapping and want to search documents that predate it, the old documents must be updated to pick up the new field first, otherwise they won't match.

#mapping with an extra sub-field
PUT test 
{
  "mappings": {
    "properties": {
      "student":{
        "type":"nested",
         "properties": {
            "name":{
              "type":"text",
              "fields":{
                "otherField1":{
                  "type":"text"
                }
              }
          }
        }
      }
    }
  }
}

PUT test/_doc/01
{
  "name":"jimmy"
}

#add a second sub-field
PUT test/_mapping
{
   "properties": {
      "student":{
        "type":"nested",
         "properties": {
            "name":{
              "type":"text",
              "fields":{
                "otherField1":{
                  "type":"text"
                },
                "otherField2":{
                  "type":"text"
                }
              }
          }
        }
      }
    }
}

PUT test/_doc/02
{
  "name":"spider man"
}

#update old documents in place so they pick up the new sub-field
POST test/_update_by_query

GET test/_search
{
  "query": {
    "nested": {
      "path": "student",
      "query": {
        "bool": {
          "filter": {
            "term": {
              "student.name.otherField2": "jimmy"
            }
          }
        }
      }
    }
  }
}

What is a suggester?

A suggester offers suggestions and completions based on the search term.

term

suggest_mode can be missing, always, or popular.

missing: the default; suggestions are offered only when the term is absent from the index.

always: suggestions are offered whether or not the term exists.

popular: only suggests terms that occur more frequently than the input.

GET test/_search
{
  "suggest": {
    "YOUR_SUGGESTION": {
      "text": "abouts shits",
      "term": {
        "suggest_mode":"missing",
        "field": "author"
      }
    }
  }
}

phrase

GET test/_search
{
  "suggest": {
    "YOUR_SUGGESTION": {
      "text": "abouts shits",
      "phrase": {
        "field": "author"
      }
    }
  }
}

Completion

#auto-completion requires the field to be mapped as the completion type
PUT test
{
  "mappings": {
    "properties": {
      "author":{
        "type": "completion"
      }
    }
  }
}

GET test/_search
{
  "suggest": {
    "YOUR_SUGGESTION": {
      "text": "about",
      "completion": {
        "field": "author"
      }
    }
  }
}

How do you optimize reads?

  • If you don't need relevance sorting, use filter context

  • Prefer term over match_phrase where possible

  • If you never need the whole document, disable _source and use store to keep only the fields you need


How do you optimize writes?

  • Increase the translog flush interval to lower IOPS and write blocking, at the cost of durability (see the settings sketch below)

  • Increase the index refresh interval: besides less I/O, this mainly reduces segment merge frequency

  • Tune bulk requests (batching)

  • Balance load across disks: spread shards evenly over each machine's physical disks

  • Balance load across nodes: send work as evenly as possible to all nodes

  • Streamline Lucene's index-building work to cut CPU usage, e.g. disable the _all field
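
A hedged sketch of the first two bullets as index settings (values are illustrative; my_index is a placeholder, settings applied at index creation):

PUT my_index
{
  "settings": {
    "index": {
      "refresh_interval": "30s",
      "translog": {
        "durability": "async",
        "sync_interval": "60s",
        "flush_threshold_size": "1gb"
      }
    }
  }
}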


What is an alias?

#an index can be given one or more aliases
PUT test
{
  "aliases":{ 
    "哈哈":{},
    "呵呵":{}
  }
}

PUT test/_doc/01
{
  "name":"jimmy"
}

GET 呵呵/_search
{
  
}

What is the hot & warm architecture?

Indices that are queried and written frequently hold hot data, so we route them to hot nodes.

#tag the node type in elasticsearch.yml (the attribute name must match the allocation setting below)
node.attr.my_node_type: hot

Indexing on hot nodes is demanding on both CPU and I/O, so use high-spec machines with fast storage; SSDs are recommended.

PUT logs-2021-01-01
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0,
    "index.routing.allocation.require.my_node_type": "hot"
  }
}

Warm nodes can hold older data that is rarely queried and never updated, a good fit for cold treatment; large-capacity, low-spec machines are enough.

PUT logs-2021-05-01/_settings
{
  "index.routing.allocation.require.my_node_type": "warm"
}

ES nodes may sit in different racks, and a rack losing power can take out several nodes at once. If an index's primary shard and its replica sit in the same rack, data could be lost. The Rack Awareness mechanism makes ES avoid allocating a primary and its replica to nodes in the same rack.

#tag node1 as rack1
bin/elasticsearch -E node.name=node1 -E cluster.name=test -E path.data=node1_data -E node.attr.my_rack_id=rack1
#tag node2 as rack2
bin/elasticsearch -E node.name=node2 -E cluster.name=test -E path.data=node2_data -E node.attr.my_rack_id=rack2

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.alocation.awareness.attributes": "my_rack_id"
  }
}

PUT my_index1
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 2
  }
}
