2. es --- search engine

Author: sunxj1222

Article link: https://blog.csdn.net/sunxj1222/article/details/106208213

I. GET /_search result details

1. Result

GET /_search

{
"took": 6,
"timed_out": false,
"_shards": {
"total": 6,
"successful": 6,
"failed": 0
},
"hits": {
"total": 10,
"max_score": 1,
"hits": [
{
"_index": ".kibana",
"_type": "config",
"_id": "5.2.0",
"_score": 1,
"_source": {
"buildNum": 14695
}
}
]
}
}

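2. Explanation

took: how many milliseconds the whole search request took
timed_out: whether the search hit its timeout before all shards had finished

hits.total: the total number of documents that matched this search
hits.max_score: the highest relevance score among all results of this search. Every document gets a _score for its relevance to the search; the more relevant the document, the higher its _score and the higher it ranks
hits.hits: the full data of the matching documents, by default the first 10, sorted by _score in descending order

_shards: a shard counts as failed only when its primary and all of its replicas are down, and a failed shard does not affect the other shards. By default a search request is sent to all primary shards of an index; since each primary shard may have one or more replica shards, the request can also be served by one of a primary shard's replicas instead.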
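
As a rough illustration of how a client might read these fields, here is a small Python sketch. It only parses a trimmed copy of the response shown above; summarize is a hypothetical helper, and it assumes hits.total is a plain number as in that response.

```python
import json

# Trimmed copy of the response shown above.
RESPONSE = """
{
  "took": 6,
  "timed_out": false,
  "_shards": {"total": 6, "successful": 6, "failed": 0},
  "hits": {
    "total": 10,
    "max_score": 1,
    "hits": [
      {"_index": ".kibana", "_type": "config", "_id": "5.2.0",
       "_score": 1, "_source": {"buildNum": 14695}}
    ]
  }
}
"""

def summarize(response: dict) -> dict:
    """Pull out the fields discussed above (hypothetical helper)."""
    shards = response["_shards"]
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        # Every shard answered and none failed.
        "all_shards_ok": shards["failed"] == 0
        and shards["successful"] == shards["total"],
        "total_matches": hits["total"],
        "max_score": hits["max_score"],
        # Only the documents actually returned in this page.
        "returned_ids": [hit["_id"] for hit in hits["hits"]],
    }

summary = summarize(json.loads(RESPONSE))
print(summary)
```

Note that total_matches (10) and the number of returned documents (1 in this trimmed copy) are separate things.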

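3. The timeout mechanism

a. Setting a timeout: GET /_search?timeout=10m
b. How it works: each shard may only search for the length of the timeout, then returns whatever it has found so far (possibly partial, possibly everything) straight to the client, instead of making the client wait until all data has been searched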

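II. Multi-index and multi-type search patterns

1. How to search the data of several indices and several types in one request

2. Examples:

/_search: search all data in all indices and all types
/index1/_search: search all types of one index
/index1,index2/_search: search two indices at once
/*1,*2/_search: match several indices with wildcards
/index1/type1/_search: search one type of one index
/index1/type1,type2/_search: search several types of one index
/index1,index2/type1,type2/_search: search several types across several indices
/_all/type1,type2/_search: _all stands for all indices, so this searches the given types in every index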
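
The path patterns above follow one rule: comma-joined index names, then comma-joined type names, then _search. A minimal Python sketch of that rule, where search_path is a hypothetical helper and not part of any client library:

```python
def search_path(indices=None, types=None):
    """Build a _search path from optional index and type names (hypothetical helper)."""
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # Types were requested without an index: _all stands for every index.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)

print(search_path())
print(search_path(["index1", "index2"]))
print(search_path(["index1"], ["type1", "type2"]))
print(search_path(types=["type1", "type2"]))
```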

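III. Pagination syntax in ES

1. Example: GET /_search?size=10&from=20 (size is the number of results per page, from is the offset of the first result, so this returns results 21 to 30)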
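
A page number usually has to be turned into from and size first. A small sketch of that arithmetic, assuming 1-based page numbers; page_params is a hypothetical helper:

```python
def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into from/size (hypothetical helper)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}

params = page_params(page=3, page_size=10)
print(f"GET /_search?size={params['size']}&from={params['from']}")
```

Page 3 with 10 results per page gives the request from the example above.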

IV. Query string

1. Basic syntax

GET /test_index/test_type/_search?q=test_field:test
GET /test_index/test_type/_search?q=+test_field:test
GET /test_index/test_type/_search?q=-test_field:test

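Two things to master here: the q=field:search content syntax, and the meaning of + and - (+ means the field must contain the term, - means it must not)

2. How the _all metadata field works and what it is for

a. When a document is indexed, it usually contains several fields. ES automatically concatenates the values of all of those fields into one long string, stores it as the value of the _all field, and indexes it. A query string search that names no field runs against _all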

b. Example:

{
"name": "jack",
"age": 26,
"email": "jack@sina.com",
"address": "guangzhou"
}

---------> _all field = jack 26 jack@sina.com guangzhou
"jack 26 jack@sina.com guangzhou" becomes the value of this document's _all field; it is then analyzed and added to the inverted index

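The concatenation behind _all can be pictured with a few lines of Python. This is only a sketch of the idea (join every field value with a space), not how ES implements it; build_all_field is a hypothetical name.

```python
def build_all_field(doc: dict) -> str:
    """Join every field value into one space-separated string."""
    return " ".join(str(value) for value in doc.values())

doc = {"name": "jack", "age": 26, "email": "jack@sina.com", "address": "guangzhou"}
all_field = build_all_field(doc)
print(all_field)  # jack 26 jack@sina.com guangzhou
```
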
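V. Mapping

1. Dynamic mapping: ES automatically creates the index, the type, and the mapping for that type. The mapping holds the data type of each field and settings such as how the field is analyzed

2. Viewing the mapping: GET /index/_mapping/type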

{
"website": {
"mappings": {
"article": {
"properties": {
"author_id": {
"type": "long"
},
"post_date": {
"type": "date"
}
}
}
}
}
}

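3. Searching different fields in ES can return different results, because dynamic mapping assigns different data types to different fields, and data types differ in how they are analyzed and searched. That is why a search on the _all field and a search on the post_date field behave completely differently.

VI. Exact value vs. full text

1. Exact match: exact value

A document is found only when the field value matches the search keyword exactly

2. Full-text search: full text

Matching is not limited to the complete value: the value is split into terms (analysis) and matched term by term, and abbreviations, word forms, letter case, and synonyms can be matched as well

For example:

(1) Abbreviation vs. full name: cn vs. china
(2) Word forms: like, liked, likes
(3) Case: Tom vs. tom
(4) Synonyms: like vs. love

VII. How the inverted index works

1. Split the content into terms ---> build the initial inverted index ---> normalization

2. Normalization: each extracted term is processed so that later searches are more likely to find related documents, for example by converting between singular and plural, mapping synonyms, and unifying letter case

VIII. Analyzers

1. What is an analyzer?

It splits text into terms and normalizes them (which improves recall)

2. Analysis steps

a. character filter: preprocesses the text before it is tokenized; the most common cases are stripping HTML tags (<span>hello</span> --> hello) and replacing & with and (I&you --> I and you)
b. tokenizer: splits the text into tokens, hello you and me --> hello, you, and, me
c. token filter: lowercase, stop words, synonyms; dogs --> dog, liked --> like, Tom --> tom, a/the/an --> removed, mother --> mom, small --> little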
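
The three steps above can be sketched as a toy pipeline in Python. Everything here is an assumption made for illustration: the stop word set, the synonym table, and the small STEMS lookup that stands in for a real stemmer are hand-written, and real ES analyzers are far more thorough.

```python
import re

STOP_WORDS = {"a", "an", "the"}
SYNONYMS = {"mother": "mom", "small": "little"}
STEMS = {"dogs": "dog", "liked": "like"}  # tiny lookup standing in for a real stemmer

def char_filter(text: str) -> str:
    """Step a: preprocess the raw text before tokenizing."""
    text = re.sub(r"<[^>]+>", "", text)   # strip HTML tags
    return text.replace("&", " and ")     # & --> and

def tokenize(text: str) -> list:
    """Step b: split the text into tokens on whitespace."""
    return text.split()

def token_filter(tokens: list) -> list:
    """Step c: normalize each token."""
    out = []
    for token in tokens:
        token = token.lower()               # Tom --> tom
        if token in STOP_WORDS:             # a/the/an --> dropped
            continue
        token = STEMS.get(token, token)     # dogs --> dog, liked --> like
        token = SYNONYMS.get(token, token)  # mother --> mom, small --> little
        out.append(token)
    return out

def analyze(text: str) -> list:
    return token_filter(tokenize(char_filter(text)))

print(analyze("<span>Tom</span> liked the small dogs & a mother"))
```

The order matters: the character filter sees raw text, the tokenizer sees cleaned text, and the token filters only ever see individual tokens.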

3. Types of analyzers

Set the shape to semi-transparent by calling set_trans(5)

standard analyzer (default): set, the, shape, to, semi, transparent, by, calling, set_trans, 5
simple analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans
whitespace analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
language analyzer (an analyzer for a specific language, for example english): set, shape, semi, transpar, call, set_tran, 5
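
Two of these analyzers are simple enough to imitate in a few lines of Python. The sketch below approximates the whitespace and simple analyzers for ASCII text only (an assumption; the real ones handle all Unicode letters) and reproduces the token lists above. The standard and language analyzers need proper word-boundary and stemming rules and are not attempted here.

```python
import re

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

def whitespace_analyzer(text: str) -> list:
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()

def simple_analyzer(text: str) -> list:
    """Split on anything that is not a letter, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z]+", text)]

print(whitespace_analyzer(TEXT))
print(simple_analyzer(TEXT))
```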

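IX. Query string

1. The content being searched for is the query string

2. By default, ES analyzes the query string with the same analyzer that was used to build the inverted index of the field being searched

a. Every field gets its own inverted index; _all is a special field

b. Depending on its type, a field is either full text or exact value

date: exact value
_all: full text

3. The query string is treated differently for exact value and full text fields: against an exact value field it is matched as one whole value, against a full text field it is analyzed first

4. Testing an analyzer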

GET /_analyze
{
"analyzer": "standard",
"text": "Text to analyze"
}

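Result: {tokens: [{token1}, {token2}]} ---------> each token is one term produced by the analysis

X. Summary

(1) When data is inserted directly into ES, ES automatically creates the index, the type, and the corresponding mapping
(2) The mapping automatically defines the data type of every field
(3) Depending on the data type (for example text vs. date), a field is either exact value or full text
(4) For an exact value field, the whole value goes into the inverted index as a single term; a full text field goes through analysis and normalization (tense conversion, synonym mapping, case folding) before its terms go into the inverted index
(5) Whether a field is exact value or full text also decides how a search against it behaves, and that behavior mirrors how the inverted index was built: a search on an exact value field matches the whole value, while the query string of a full text search is analyzed and normalized before the inverted index is consulted
(6) You can rely on dynamic mapping and let ES create the mapping and choose the data types, or create the mapping for the index and type by hand in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on

A mapping is the metadata of a type in an index. Each type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave

XI. Mapping data types

1. Core data types
string/text
byte, short, integer, long
float, double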
boolean
date

2. Dynamic mapping
true or false   -->   boolean
123       -->   long
123.45       -->   double
2017-01-01   -->   date
"hello world"   -->   string/text

3. Viewing the mapping
GET /index/_mapping/type
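
The dynamic mapping rules listed in item 2 above can be mimicked with a small type-guessing function. This is a simplified sketch: infer_type is a hypothetical name, and the date check only recognizes the yyyy-MM-dd shape used in the example, whereas ES accepts more date formats.

```python
import re

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # only the yyyy-MM-dd shape used above

def infer_type(value) -> str:
    """Guess the field type the way the dynamic mapping list describes (simplified)."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "date" if DATE_PATTERN.match(value) else "string/text"
    raise TypeError(f"unsupported value: {value!r}")

for sample in [True, 123, 123.45, "2017-01-01", "hello world"]:
    print(repr(sample), "-->", infer_type(sample))
```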

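XII. Whether string data is analyzed

1. Index options

analyzed
not_analyzed ------> the field type is keyword; a keyword field is not analyzed
no ----> the field cannot be searched

2. Creating a mapping

Note: a mapping can only be created by hand when the index is created, or extended by adding a mapping for a new field; the mapping of an existing field cannot be updated

a. Create the mapping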

PUT /website
{
"mappings": {
"article": {
"properties": {
"author_id": {
"type": "long"
},
"title": {
"type": "text",
"analyzer": "english"
},
"content": {
"type": "text"
},
"post_date": {
"type": "date"
},
"publisher_id": {
"type": "keyword"
}
}
}
}
}

b. Add a field

PUT /website/_mapping/article
{
"properties" : {
"new_field" : {
"type" : "keyword"
}
}
}

3. Testing the mapping

GET /website/_analyze
{
"field": "content",
"text": "my-dogs"
}

Result: {tokens: [{token1}, {token2}]} ---------> each token is one term produced by the analysis

XIII. Special field types

1. Multi-value field
{ "tags": [ "tag1", "tag2" ]}
Indexed the same way as a single string value; the values in one array must all have the same data type

2. Empty field
null, [], [null]

3. Object field
PUT /company/employee/1
{
"address": {
"country": "china",
"province": "guangdong",
"city": "guangzhou"
},
"name": "jack",
"age": 27,
"join_date": "2017-01-01"
}

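XIV. Query DSL

1. Syntax

bool: combines several conditions; it can contain must, should, must_not, and filter

must: has to match
should: may or may not match
must_not: must not match

match_all: returns all documents
match: matches on a condition
multi_match: matches against several fields
range query: range search
term query: the search text is not analyzed (it is matched as one whole value)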
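
To show how these pieces nest, here is a Python sketch that assembles a bool query body as a dict and prints it as JSON. bool_query is a hypothetical helper; the field names come from the website/article mapping used earlier, and the search values are made up for illustration.

```python
import json

def bool_query(must=(), should=(), must_not=(), filter_=()) -> dict:
    """Assemble a bool query body (hypothetical helper)."""
    clauses = {"must": must, "should": should, "must_not": must_not, "filter": filter_}
    # Leave out empty clause lists so the body stays minimal.
    return {"query": {"bool": {name: list(c) for name, c in clauses.items() if c}}}

body = bool_query(
    must=[{"match": {"title": "elasticsearch"}}],                # has to match
    should=[{"match": {"content": "search"}}],                   # may or may not match
    must_not=[{"term": {"author_id": 111}}],                     # must not match
    filter_=[{"range": {"post_date": {"gte": "2017-01-01"}}}],   # filtered, not scored
)
print(json.dumps(body, indent=2))
```

The printed JSON is the kind of body that would be sent with GET /website/article/_search.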

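2. Filter performance

filter: no relevance score is computed and results are not sorted by relevance

query: the opposite; a relevance score is computed and results are sorted by it

3. Validating a search request

GET /index/type/_validate/query?explain

4. Custom sort order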

GET /company/employee/_search
{
"query": {
"constant_score": {
"filter": {
"range": {
"age": {
"gte": 30
}
}
}
}
},
"sort": [
{
"join_date": {
"order": "asc"
}
}
]
}

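5. Sorting on a string field usually gives inaccurate results, because after analysis the field consists of several terms and sorting on them is not what we want

Solution: index the string field twice, once analyzed for searching and once not analyzed for sorting

For example:

Create the mapping:

PUT /website
{
"mappings": {
"article": {
"properties": {
"title": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}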
},
"fielddata": true
},
"content": {
"type": "text"
},
"post_date": {
"type": "date"
},
"author_id": {
"type": "long"
}
}
}
}
}

Query:

GET /website/article/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"title.raw": {
"order": "desc"
}
}
]
}

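XV. The TF/IDF algorithm

Elasticsearch scores relevance with the term frequency/inverse document frequency algorithm, TF/IDF for short (from version 5.0 on the default similarity is BM25, which builds on the same factors)

1. Term frequency: how often each term of the search text appears in the field text; the more often, the more relevant

2. Inverse document frequency: how often each term of the search text appears across all documents of the index; the more often, the less relevant

3. Field-length norm: the length of the field; the longer the field, the weaker the relevance

4. How _score is computed: GET /test_index/test_type/_search?explain

5. How a document came to match: GET /test_index/test_type/6/_explain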
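
The three factors above can be combined in a toy scorer. This Python sketch is a simplification: the tf, idf, and norm formulas are modelled on Lucene's classic similarity as it is commonly described, so treat them as an assumption rather than the exact computation ES runs, and the three documents are made up (doc3 is hypothetical).

```python
import math

DOCS = {
    "doc1": "hello world you and me",
    "doc2": "hi world how are you",
    "doc3": "hello hello hello",
}

def score(query: str, doc_id: str, docs: dict) -> float:
    """Sum tf * idf * norm over the query terms (simplified scoring)."""
    terms = docs[doc_id].split()
    norm = 1 / math.sqrt(len(terms))              # field-length norm: longer field, lower weight
    total = 0.0
    for term in query.split():
        freq = terms.count(term)
        if freq == 0:
            continue
        tf = math.sqrt(freq)                      # term frequency: more occurrences, higher score
        df = sum(term in d.split() for d in docs.values())
        idf = 1 + math.log(len(docs) / (df + 1))  # inverse document frequency: rarer term, higher score
        total += tf * idf * norm
    return total

ranked = sorted(DOCS, key=lambda d: score("hello world", d, DOCS), reverse=True)
for doc_id in ranked:
    print(doc_id, round(score("hello world", doc_id, DOCS), 3))
```

In this toy data the short document that repeats hello ranks first, the document containing both terms second, and the document containing only world last.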

XVI. Forward index (doc values) and inverted index

1. Searching relies on the inverted index; sorting relies on the forward index, which is simply doc values

2. Inverted index

doc1: hello world you and me
doc2: hi, world, how are you

word    doc1    doc2

hello   *
world   *       *
you     *       *
and     *
me      *
hi              *
how             *
are             *

3. Forward index

doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }

document    name    age

doc1        jack    27
doc2        tom     30
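
Both structures can be built in a few lines of Python from the example documents above. This is a sketch of the data layout only (a dict of term to document ids, and a dict of document id to field value); the function names are hypothetical and Lucene's on-disk formats look nothing like this.

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """term -> set of ids of the documents containing it (used for searching)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.replace(",", " ").split():
            index[term].add(doc_id)
    return dict(index)

def build_doc_values(docs: dict, field: str) -> dict:
    """doc id -> value of one field, stored as a column (used for sorting)."""
    return {doc_id: source[field] for doc_id, source in docs.items()}

texts = {"doc1": "hello world you and me", "doc2": "hi, world, how are you"}
inverted = build_inverted_index(texts)
print(sorted(inverted["world"]))  # ['doc1', 'doc2']
print(sorted(inverted["hello"]))  # ['doc1']

people = {"doc1": {"name": "jack", "age": 27}, "doc2": {"name": "tom", "age": 30}}
ages = build_doc_values(people, "age")
by_age_desc = sorted(ages, key=ages.get, reverse=True)
print(by_age_desc)  # ['doc2', 'doc1']
```

Searching asks "which documents contain this term", which the inverted index answers directly; sorting asks "what is this document's value", which needs the doc-to-value direction.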

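XVII. Search-related parameters

1. preference: decides which shards are used to execute the search
Available values: _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, _shards:2,3

2. timeout: limits the search to a given time and returns the partial data gathered so far, so a query cannot run for too long

3. routing: document routing. The default routing value is the document _id; with routing=user_id, all data of the same user goes to the same shard

4. search_type

default: query_then_fetch
dfs_query_then_fetch: can improve the accuracy of relevance sorting

5. The bouncing results problem: two documents have the same value in the sort field; different shard copies may order them differently; requests are round-robined across the replica shards; so the result order seen on the page changes from request to request. These are bouncing results.

Solution: set preference to a string such as the user_id, so that every search by the same user is executed on the same replica shard and the results no longer bounce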
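
Routing in item 3 and the preference fix in item 5 rest on the same idea: a stable hash of a key always selects the same slot. The Python sketch below shows only that idea. It uses crc32 as a stand-in hash (an assumption for illustration; ES uses its own hash function), and the shard count is a hypothetical setting.

```python
import zlib

def pick(key: str, count: int) -> int:
    """Map a key to one of `count` slots with a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % count

NUM_PRIMARY_SHARDS = 5  # hypothetical index setting

# The same routing value always resolves to the same shard number.
for user in ["user_1", "user_2", "user_42", "user_42"]:
    print(user, "-> shard", pick(user, NUM_PRIMARY_SHARDS))
```

Because the result depends only on the key and the slot count, repeated requests with the same user_id land in the same place, which is what keeps one user's data together and one user's result order stable.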

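XVIII. Scroll

1. A scroll search fetches one batch of data, then the next batch, and so on, until all the data has been retrieved

2. On the first request, a scroll search saves a snapshot of the data as it is at that moment; all later batches are served from that snapshot, so changes made in the meantime are not visible to the user. Sorting by _doc gives the best performance

3. Every scroll request also carries a scroll parameter that sets a time window during which the search context is kept alive; the next request only has to arrive within that window

4. For example: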

GET /test_index/test_type/_search?scroll=1m
{
"query": {
"match_all": {}
},
"sort": [ "_doc" ],
"size": 3
}

size is the number of documents returned per batch.

The result contains a _scroll_id; the next request passes this _scroll_id to fetch the next batch

GET /_search/scroll
{
"scroll": "1m",
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAACxeFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYBY0b25zVFlWWlRqR3ZJajlfc3BXejJ3AAAAAAAALF8WNG9uc1RZVlpUakd2SWo5X3NwV3oydwAAAAAAACxhFjRvbnNUWVZaVGpHdklqOV9zcFd6MncAAAAAAAAsYhY0b25zVFlWWlRqR3ZJajlfc3BXejJ3"
}
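
The snapshot behaviour of scroll can be imitated with a Python generator. This is a toy model under obvious assumptions: a list stands in for the index, the documents are made up, and there is no _scroll_id or time window.

```python
def scroll(index: list, size: int):
    """Yield batches from a snapshot taken when the first batch is requested (toy model)."""
    snapshot = list(index)  # later changes to `index` are not visible
    for start in range(0, len(snapshot), size):
        yield snapshot[start:start + size]

index = [f"doc{i}" for i in range(1, 8)]  # 7 hypothetical documents
batches = scroll(index, size=3)

first = next(batches)    # the snapshot is taken here
index.append("doc8")     # written after the scroll started
rest = list(batches)     # keep fetching until no batch is left

print(first)
print(rest)
```

doc8 never shows up in any batch, because every batch is served from the snapshot taken at the first request.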
