Elasticsearch 7.8: Index Creation / Data Retrieval

张耘华

Original article: https://segmentfault.com/a/1190000018661035?utm_source=tag-newest

Published 2019-03-26

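Since ES 6.0, putting multiple types under one index is discouraged: an index created in 6.x can hold only a single type, types are deprecated in 7.x, and they are removed entirely in 8.0. Types are going away because of the field-type conflicts and the loss of search efficiency they cause. (5.x to 6.x)
The _all field has also been dropped; use copy_to to build a custom combined field instead. (5.x to 6.x)
type: text/keyword decides whether a field is analyzed (tokenized), and index: true/false decides whether it is indexed. (2.x to 5.x)
analyzer sets the analyzer for an individual field. (2.x to 5.x)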

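Creating the index

Install the ik analysis plugin first, then restart the service.

# Install with elasticsearch-plugin
# The plugin release must match your Elasticsearch version exactly; the URL below is the 6.6.2 build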
elasticsearch-plugin install \
https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.6.2/elasticsearch-analysis-ik-6.6.2.zip

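Field data type reference:
https://www.elastic.co/guide/...

Reference for other mapping parameters (different field types may have their own specific parameters):
https://www.elastic.co/guide/...

We create an index named news:

Set the default analyzer to ik to handle Chinese text
Use the default type name _doc (in 7.x the mapping itself no longer names a type)
Deliberately disable _source (to verify the store option)
title is not stored, author is not analyzed, content is stored

For what the _source field means, see this post: https://blog.csdn.net/napoay/...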


PUT /news
{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "index": {
            "analysis.analyzer.default.type" : "ik_smart"
        }
    },
    "mappings": {
        "_source": {
            "enabled": false
        },
        "properties": {
            "news_id": {
                "type": "integer",
                "index": true
            },
            "title": {
                "type": "text",
                "store": false
            },
            "author": {
                "type": "keyword"
            },
            "content": {
                "type": "text",
                "store": true
            },
            "created_at": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
            }
        }
    }
}
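# In the author's test the news index above did not tokenize as expected; setting the analyzer explicitly per field, as below, did
PUT /testuser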
{
  "settings":{
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": { 
    "properties": {
      "user": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      },
      "desc": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_max_word"
      }
    }
  }
}

Verifying that the analyzer works

# Check that the analysis plugin is active
GET /_analyze
{
    "analyzer": "ik_smart",
    "text": "我热爱祖国"
}
GET /_analyze
{
    "analyzer": "ik_max_word",
    "text": "我热爱祖国"
}

# The index's default analyzer
GET /news/_analyze
{
    "text": "我热爱祖国！"
}

# With a field specified, the text is analyzed according to that field's mapping
# author is a keyword field, so it is not tokenized
GET /news/_analyze
{
    "field": "author",
    "text": "我热爱祖国！"
}
# Analysis result for title
GET /news/_analyze
{
    "field": "title",
    "text": "我热爱祖国！"
}

Adding documents

These are for demonstration; the queries below use them as examples.

POST /news/_doc
{
    "news_id": 1,
    "title": "我们一起学旺叫",
    "author": "才华横溢王大猫",
    "content": "我们一起学旺叫，一起旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}

POST /news/_doc
{
    "news_id": 2,
    "title": "我们一起学猫叫",
    "author": "王大猫不会被分词",
    "content": "我们一起学猫叫，还是旺旺旺旺旺，在你面撒个娇，哎呦旺旺旺旺旺，我的尾巴可劲儿摇",
    "created_at": "2019-03-26 11:55:20"
}

POST /news/_doc
{
    "news_id": 3,
    "title": "实在编不出来了",
    "author": "王大猫",
    "content": "实在编不出来了，随便写点数据做测试吧，旺旺旺",
    "created_at": "2019-03-26 11:55:20"
}

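Retrieving data

GET /news/_search is the search endpoint for the news index (the typed form GET /news/_doc/_search is deprecated in 7.x). We demonstrate with the REST API plus the query DSL.

match_all

Fetches all documents with no search condition.

# Unconditional paged search, sorted by news_id
GET /news/_search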
{
    "query": {
        "match_all": {}
    },
    "from": 0,
    "size": 2,
    "sort": {
        "news_id": "desc"
    }
}
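
The from/size pair above is plain offset pagination: ES sorts the hits and returns the slice that starts at from. The arithmetic can be sketched in Python as follows. The helper names page_params and search_body are hypothetical, the page number is assumed to be 1-based, and the 10000 default mirrors the default value of the index.max_result_window setting.

```python
def page_params(page, size, max_result_window=10000):
    """Translate a 1-based page number into the from/size pair of a search body."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    # ES rejects requests where from + size exceeds index.max_result_window.
    if offset + size > max_result_window:
        raise ValueError("from + size exceeds max_result_window")
    return {"from": offset, "size": size}


def search_body(page, size):
    """Build the match_all body shown above for an arbitrary page."""
    body = {"query": {"match_all": {}}, "sort": {"news_id": "desc"}}
    body.update(page_params(page, size))
    return body


print(search_body(2, 2))
```

With the three demo documents, page 1 of size 2 would cover news_id 3 and 2, and page 2 would cover news_id 1.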

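Because we disabled the _source field, ES only builds the inverted index for the data and does not store the original document, so the results contain no original document content. We disabled it mainly to demonstrate the highlight mechanism.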

Elasticsearch keyword queries: implementing LIKE-style search

Create the index es_test_index

# The template-only keys (order, index_patterns), the custom type name and _all are not accepted when creating an index in 7.x, so they are left out
PUT /es_test_index
{
    "settings": {
        "index": {
            "max_result_window": "30000",
            "refresh_interval": "60s",
            "number_of_shards": "3",
            "number_of_replicas": "1"
        }
    },
    "mappings": {
        "properties": {
            "search_word": {
                "type": "keyword"
            }
        }
    }
}

Approach 1

{
    "profile":true,
    "from":0,
    "size":100,
    "query":{
        "query_string":{
            "query":"search_word:(*中国* NOT *美国* AND *VIP* AND *经济* OR *金融*)",
            "default_operator":"and"
        }
    }
}

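The * wildcards make this equivalent to a wildcard query, except that query_string can take several keywords and join them with AND, OR and NOT, which makes it more flexible. The single-keyword wildcard form looks like this: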

{
    "query": {
        "wildcard" : { "search_word" : "*中国*" }
    }
}
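
query_string's AND / OR / NOT operators do not follow the usual precedence rules, so mixed expressions like the one above can be hard to reason about without parentheses. As an alternative (not the approach used in this article), the same kind of LIKE filtering can be spelled out as a bool query of wildcard clauses. The sketch below only builds the request body; the helper name like_query is hypothetical.

```python
import json


def like_query(field, include=(), exclude=()):
    """Build a bool query: every `include` keyword must occur inside the
    field's term, and no `exclude` keyword may."""
    def contains(keyword):
        # Keywords are inserted verbatim, so a * or ? inside a keyword
        # keeps its wildcard meaning.
        return {"wildcard": {field: "*" + keyword + "*"}}

    return {
        "query": {
            "bool": {
                "must": [contains(k) for k in include],
                "must_not": [contains(k) for k in exclude],
            }
        }
    }


body = like_query("search_word", include=["中国", "VIP"], exclude=["美国"])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of a search request against es_test_index. Leading wildcards still force a scan over the field's terms, so this form is clearer rather than faster.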

match

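Ordinary search. Many articles say that a match query analyzes the query text, which is not entirely accurate: it depends on the type of the field being searched. If the field is a keyword (not_analyzed in older versions), match behaves the same as a term query.

We can ask the analyzer how the text would be processed for a given field:

GET /news/_analyze
{
    "field": "title",
    "text": "我会被如何处理呢？分词？不分词？"
}

Query

GET /news/_search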
{
    "query": {
        "match": {
            "title": "我们会被分词"
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

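With highlight, the matched keywords are returned in context, wrapped in highlight tags. If _source is disabled, the field's store option has to be enabled so that its original value is kept; without the original content there is nothing to highlight.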

multi_match

Searches several fields at once. For example, to find documents whose title or content contains the keyword 我们:

GET /news/_search
{
    "query": {
        "multi_match": {
            "query": "我们是好人",
            "fields": ["title", "content"]
        }
    },
    "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

match_phrase

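This one deserves careful attention. match_phrase is a phrase query. Put simply, the document field must contain every term produced by analyzing the query text, and the relative positions of those terms in the document must be within the threshold set by slop. slop is the number of position moves allowed to line the query terms up with their positions in the document; if slop is large enough, the document matches even when the terms are scattered far apart.

content: i love china
match_phrase: i china
slop: 0 // no match: china has to move 1 position, giving i - china, before the phrase lines up
slop: 1 // match

Test example

# First see how the query text is analyzed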
GET /news/_analyze
{
    "field": "title",
    "text": "我们学"
}
# response
{
    "tokens": [
        {
            "token": "我们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "学",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        }
    ]
}

# Then see how a document's title is turned into index terms
GET /news/_analyze
{
    "field": "title",
    "text": "我们一起学旺叫"
}
# response
{
    "tokens": [
        {
            "token": "我们",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "一起",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "学",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 2
        },
        ...
    ]
}

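Note the position field. A document matches only when slop is at least the number of positions the query terms have to be moved to line up with their positions in the document.

The query is analyzed into ["我们", "学"] at positions 0 and 1, while in the document the two terms sit at positions 0 and 2 with one term between them. 学 has to move 1 position, so slop must be greater than or equal to 1 for this document to be found.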
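
The position arithmetic can be checked with a small Python sketch. It is a simplified model, not the actual Lucene implementation: each query term's shift is its document position minus its query position, and the slop a document needs is the spread between the largest and smallest shift. The function name required_slop is hypothetical, and the model assumes the query does not repeat a term.

```python
from itertools import product


def required_slop(query_tokens, doc_tokens):
    """Smallest slop under which doc_tokens matches the phrase,
    or None if some query term is missing from the document."""
    occurrences = []
    for term in query_tokens:
        positions = [i for i, tok in enumerate(doc_tokens) if tok == term]
        if not positions:
            return None
        occurrences.append(positions)

    best = None
    # Try every combination of occurrences and keep the tightest one.
    for combo in product(*occurrences):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        spread = max(shifts) - min(shifts)
        if best is None or spread < best:
            best = spread
    return best


print(required_slop(["i", "china"], ["i", "love", "china"]))
print(required_slop(["我们", "学"], ["我们", "一起", "学"]))
```

Both calls agree with the examples in the text: one intervening term means a slop of 1 is needed. In the same model, swapping two adjacent terms needs a slop of 2.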

Using the phrase query:

GET /news/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "我们学",
                "slop": 1
            }
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Query result:

{
            ...
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "if-CuGkBddO9SrfVBoil",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>我们</em>一起<em>学</em>猫叫"
                    ]
                }
            },
            {
                "_index": "news",
                "_type": "_doc",
                "_id": "iP-AuGkBddO9SrfVOIg3",
                "_score": 0.37229446,
                "highlight": {
                    "title": [
                        "<em>我们</em>一起<em>学</em>旺叫"
                    ]
                }
            }
            ...
}

term

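The thing to understand about term is that it only skips analysis of the query text and looks it up in the index as a single term. Whether the field itself was analyzed at index time is decided by the mapping. For a text field the index may contain the two terms ["我们", "一起"] but no term ["我们一起"], so the query below finds nothing. A keyword field is not analyzed at index time and is indexed as one whole term; the query text is not analyzed either, so the match is exact.

GET /news/_search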
{
    "query": {
        "term": {
           "title": "我们一起" 
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}
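
Why the term query above comes back empty can be illustrated with a toy inverted index in Python. This is only a sketch of the idea: the title tokens are the ones shown in the _analyze response earlier, the author values come from the demo documents, and build_index and term_query are hypothetical helper names.

```python
def build_index(field_terms):
    """Map each indexed term to the ids of the documents that contain it."""
    index = {}
    for doc_id, terms in field_terms.items():
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def term_query(index, value):
    """A term query looks the value up as-is; the value is never analyzed."""
    return index.get(value, set())


# text field: one index entry per token produced by the analyzer
title_index = build_index({1: ["我们", "一起", "学"]})
# keyword field: the whole value is indexed as a single term
author_index = build_index({1: ["才华横溢王大猫"], 3: ["王大猫"]})

print(term_query(title_index, "我们一起"))  # no such term was indexed
print(term_query(title_index, "我们"))
print(term_query(author_index, "王大猫"))
```

The first lookup finds nothing because 我们一起 was split into two terms at index time, while the keyword lookup succeeds only on an exact whole-value match.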

terms

terms takes several keywords at once, a bit like tokenizing by hand.

{
    "query": {
        "terms": {
           "title": ["我们", "一起"]
        }
    },
    "highlight": {
        "fields": {
            "title": {}
        }
    }
}

Any document containing at least one of ["我们", "一起"] is returned.

wildcard

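Shell-style wildcard query: ? matches exactly one character and * matches any number of characters (including none). The pattern is matched against the terms in the inverted index.

Find documents that have a two-character term: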

{
   "query": {
       "wildcard": {
               "title": "??"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

prefix

Prefix query: finds terms in the inverted index that start with the given prefix.

{
   "query": {
       "prefix": {
               "title": "我"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}

regexp

Regular expression query: finds terms in the inverted index that match the pattern.

Find documents that have a term of 2 to 3 characters:

{
   "query": {
       "regexp": {
               "title": ".{2,3}"
       }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
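
wildcard, prefix and regexp all test individual index terms, not the full field text. The Python sketch below imitates that against a hand-written term list (an assumption for illustration, not real ik output). Python's fnmatch and re only approximate the Lucene pattern syntaxes, so treat it as a model of the matching rule rather than of the exact grammar.

```python
import re
from fnmatch import fnmatchcase

# Terms of a hypothetical analyzed title field.
terms = ["我们", "一起", "学", "实在"]


def wildcard(pattern):
    # fnmatchcase also understands [...] classes, which the ES wildcard
    # query does not; the patterns used here only contain ? and *.
    return [t for t in terms if fnmatchcase(t, pattern)]


def prefix(value):
    return [t for t in terms if t.startswith(value)]


def regexp(pattern):
    # Lucene regular expressions are anchored to the whole term,
    # hence fullmatch rather than search.
    return [t for t in terms if re.fullmatch(pattern, t)]


print(wildcard("??"))
print(prefix("我"))
print(regexp(".{2,3}"))
```

The single-character term 学 is skipped by both the ?? wildcard and the .{2,3} regexp, which is why those queries are described as finding documents that contain a term of the given length.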

bool

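A bool query combines several queries:
must: every clause must match
must_not: no clause may match
should: at least one clause should match; when must or filter is also present, should clauses become optional by default and only raise the score (set minimum_should_match to require them)
filter: every clause must match, without contributing to the score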

{
   "query": {
        "bool": {
            "must": {
                "match": {
                    "title": "绝对要有我们"
                }
            },
            "must_not": {
                "term": {
                    "title": "绝对不能有我"
                }
            },
            "should": [
                {
                    "match": {
                        "content": "我们"
                    }
                },
                {
                    "multi_match": {
                        "query": "满足",
                        "fields": ["title", "content"]
                    }
                },
                {
                    "match_phrase": {
                        "title": "一个即可"
                    }
                }
            ],
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2019-01-05 12:00:00"
                    }
                }
            }
        }
   },
   "highlight": {
        "fields": {
            "title": {},
            "content": {}
        }
    }
}
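
How the four clause types combine can be modelled in a few lines of Python. This sketch ignores scoring and treats every clause as a yes/no predicate; bool_matches and the sample predicates are hypothetical names. The default for minimum_should_match follows the documented ES rule: 1 when there are should clauses but no must or filter, otherwise 0.

```python
def bool_matches(doc, must=(), must_not=(), should=(), filters=(),
                 minimum_should_match=None):
    """Evaluate bool-query matching; each clause is a predicate doc -> bool."""
    if minimum_should_match is None:
        minimum_should_match = 1 if should and not must and not filters else 0
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filters):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match


doc = {"title_terms": {"我们", "一起", "学"}, "news_id": 2}
has_we = lambda d: "我们" in d["title_terms"]
has_cat = lambda d: "猫" in d["title_terms"]
id_at_least_2 = lambda d: d["news_id"] >= 2

print(bool_matches(doc, must=[has_we], should=[has_cat]))  # should is optional here
print(bool_matches(doc, should=[has_cat]))                 # should is required here
print(bool_matches(doc, must=[has_we], must_not=[id_at_least_2]))
```

In the query above, must and filter are both present, so under this rule the three should clauses affect ranking but do not exclude documents.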

filter

filter is usually used together with match and similar queries to filter the documents that satisfy the query.

{
   "query": {
        "bool": {
            "must": {
                "match_all": {}
            },
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
        }
   }
}

Or use it on its own:

{
   "query": {
       "constant_score" : {
            "filter": {
                "range": {
                    "created_at": {
                        "lt": "2020-12-05 12:00:00",
                        "gt": "2017-12-05 12:00:00"
                    }
                }
            }
       }
   }
}
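
The difference between gt/lt (exclusive) and gte/lte (inclusive) is easy to get wrong at the boundaries. The Python sketch below applies range bounds to a timestamp string; in_range is a hypothetical helper, and the strptime pattern is assumed to correspond to a 24-hour yyyy-MM-dd HH:mm:ss date format.

```python
from datetime import datetime

FORMAT = "%Y-%m-%d %H:%M:%S"  # 24-hour clock


def in_range(value, gt=None, gte=None, lt=None, lte=None):
    """Apply range-query style bounds to a timestamp string."""
    parse = lambda s: datetime.strptime(s, FORMAT)
    v = parse(value)
    if gt is not None and not v > parse(gt):
        return False
    if gte is not None and not v >= parse(gte):
        return False
    if lt is not None and not v < parse(lt):
        return False
    if lte is not None and not v <= parse(lte):
        return False
    return True


created_at = "2019-03-26 11:55:20"
print(in_range(created_at, gt="2017-12-05 12:00:00", lt="2020-12-05 12:00:00"))
# A value exactly on the lower bound passes gte but not gt.
print(in_range("2017-12-05 12:00:00", gt="2017-12-05 12:00:00"))
print(in_range("2017-12-05 12:00:00", gte="2017-12-05 12:00:00"))
```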

Multiple filter conditions: 2017-12-05 12:00:00 < created_at < 2020-12-05 12:00:00 and news_id >= 2

{
   "query": {
       "constant_score" : {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "created_at": {
                                    "lt": "2020-12-05 12:00:00",
                                    "gt": "2017-12-05 12:00:00"
                                }
                            }
                        },
                        {
                            "range": {
                                "news_id": {
                                    "gte": 2
                                }
                            }
                        }
                    ]
                }
            }
       }
   }