Elasticsearch: The Definitive Guide, collected notes

luoxingyu500

Article link: https://blog.csdn.net/luoxingyu500/article/details/89466800

Lucene

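1. Why is search so fast?

Inverted index:
Apache Lucene organizes everything written to an index into a structure called an inverted index.
This structure maps terms to documents, which is different from how a traditional relational database works:
you can think of an inverted index as term-oriented rather than document-oriented.

(Each term in a document is recorded in the index together with where it occurs, how often it occurs, and so on, so matching documents can be found quickly.)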
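
The term-to-document mapping described above can be sketched in a few lines of Python. This is only a toy model and not how Lucene stores its index: the function names and the sample documents are made up for illustration, and "analysis" is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Toy analysis step: lowercase, then split on whitespace.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, term):
    """Return the ids of the documents that contain the term, sorted."""
    return sorted(index.get(term.lower(), {}))


docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene uses an inverted index",
    3: "An inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(search(index, "lucene"))    # ids of documents 1 and 2
print(search(index, "inverted"))  # ids of documents 2 and 3
```

A lookup is a single dictionary access on the term, which is why the cost does not depend on scanning every document.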

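2. Analyzers

Analysis is the process that turns a document into inverted-index entries and turns a query string into terms that can be searched.
It is carried out by an analyzer, which is made up of a tokenizer, token filters, and character filters (character mappers).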


---- IK analyzer ---- syntax ---- setting the analyzer of a field in the mapping -----
{
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type":"text",
                    "analyzer": "ik_smart",
                    "search_analyzer": "ik_smart"    
                }
            }
        }
    }
}
--------------------- For example -------------

PUT mytest_ik
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "test":{
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}
-- Testing highlighting
----------------------- Syntax --------------
  "highlight": {
    "pre_tags": ["<span style = 'color:red'>"],
    "post_tags": ["</span>"],
    "fields": {"content": {}}
  }
-----------------------------------------------  
  
GET /mytest_ik/test/_search
{
  "query": {
    "match": {
      "content": "上海"
    }
  },
  "highlight": {
    "pre_tags": ["<span style = 'color:red'>"],
    "post_tags": ["</span>"],
    "fields": {"content": {}}
  }
}

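Analyzer types and how they differ
ik_max_word splits the text at the finest granularity;
ik_smart splits the text at the coarsest granularity.

For example, analyzing the same sentence “这是一个对分词器的测试” ("this is a test of the analyzer") gives different results depending on the analyzer:
ik_max_word: 这是/一个/一/个/对分/分词器/分词/词/器/测试
ik_smart: 这是/一个/分词器/测试
standard: 这/是/一/个/对/分/词/器/的/测/试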


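----- ngram tokenizer syntax (settings and mappings are supplied when an index is created, so this is a PUT on the index; mytest_ngram is a placeholder name) -----
PUT mytest_ngram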
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer"
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
        "test": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "ngram_analyzer"
                }
            }
        }
    }
}

------------ Difference: ngram is a tokenizer for fuzzy (substring) matching ------
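
Why n-grams give substring matching becomes clearer when the tokens are listed. The sketch below approximates an ngram tokenizer restricted to letters and digits; it is an illustration and not the Elasticsearch implementation, and the function name is made up.

```python
import re


def ngram_tokens(text, min_gram=1, max_gram=30):
    """Emit every substring of length min_gram..max_gram from each run of letters/digits."""
    tokens = []
    # [^\W_]+ matches runs of letters and digits only, approximating
    # token_chars = ["letter", "digit"]; everything else splits the text.
    for word in re.findall(r"[^\W_]+", text):
        for start in range(len(word)):
            for size in range(min_gram, max_gram + 1):
                if start + size > len(word):
                    break
                tokens.append(word[start:start + size])
    return tokens


print(ngram_tokens("es 6", min_gram=1, max_gram=2))
print(ngram_tokens("上海", min_gram=1, max_gram=2))
```

Because every substring is indexed as its own term, a query for a fragment such as 海 finds the document. The price is a much larger index when max_gram is large.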

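3. Operators

AND
OR
NOT
+  : the term after + must appear in the document
-  : the term after - must not appear in the document
For example, +lucene -elasticsearch matches documents that contain lucene but not elasticsearch.
?  matches any single character
*  matches zero or more characters
  // for performance reasons, a wildcard may not be the first character of a term (the default behavior)
^  : boosts a term to indicate its importance; the default boost is 1
[} : for example, price:[10.00 TO 15.00} means greater than or equal to 10.00 and less than 15.00
\  : escape character
  ------ Fuzzy and proximity queries ----
  fuzzy
  proximity
  ~  followed by a number: the maximum distance allowed between the original term and a similar term, or the number of extra terms allowed between the terms of a phrase
      after a term,   e.g. write~2 also matches writes
      after a phrase, e.g. title:"write china"~2 also matches write in china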
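
The "distance" in write~2 is an edit distance. A minimal sketch using plain Levenshtein distance is shown below. Treat it as an approximation: Lucene's fuzzy matching additionally counts a swap of two adjacent characters as a single edit, which this sketch does not. The helper names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def fuzzy_match(term, candidate, max_edits=2):
    """True when candidate is within max_edits edits of term, as in term~max_edits."""
    return edit_distance(term, candidate) <= max_edits


print(edit_distance("write", "writes"))
print(fuzzy_match("write", "writes"))
```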

elasticsearch

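a. How nodes work:
When a node starts, it takes the cluster name from its configuration file and broadcasts to find the other nodes of that cluster, then connects to them.
b. Document scoring
   -- The rarer a matched term is, the higher the document scores.
   -- The shorter the field (the fewer terms it contains), the higher the score.
   -- The higher the boost, the higher the score.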
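
The three scoring rules above can be illustrated with a simplified formula. The factors below loosely follow the shape of Lucene's classic TF-IDF similarity, but the real scoring has more terms, so treat the exact formulas and the function names as assumptions made for this sketch.

```python
import math


def tf(freq):
    """Term frequency factor: more occurrences help, with diminishing returns."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms found in fewer documents weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Field-length normalization: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, num_docs, doc_freq, num_terms, boost=1.0):
    """Simplified score of one term in one field."""
    return tf(freq) * idf(num_docs, doc_freq) * length_norm(num_terms) * boost


rare = score(freq=1, num_docs=1000, doc_freq=10, num_terms=10)
common = score(freq=1, num_docs=1000, doc_freq=500, num_terms=10)
print(rare > common)  # the rarer term scores higher
```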

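000. Configuration

1. Raising the maximum number of results a query can reach

Problem: a query fails once from + size goes beyond 10,000.

Raise the index.max_result_window setting (a dynamic index setting, changed through the settings API rather than in a file):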
PUT _settings
{
  "index":{
    "max_result_window":100000
  }
}

I. Batch operations

1. Fetching multiple documents with _mget

GET /_mget 
{
    "docs":[
         {
             "_index":"a_fire_ceshi",
             "_type":"ceshi",
             "_id":1
         },
         {
             "_index":"a_fire_ceshi2",
             "_type":"ceshi2",
             "_id":2,
             "_source":"title"   // return only this field (pass an array [...] for several fields)
         },{}....
    ]
}
GET /library/books/_mget
{
    "ids":["1","2","3"]
}

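2. Bulk operations with _bulk (the best request size depends on the documents, the cluster, and the hardware; requests that are too large hurt performance). It is best to keep a single bulk request body under about 15 MB.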

POST /library/books/_bulk
{"index":{"_id":1}}
{"title":"ceshi","price":5}
{"index":{"_id":2}}
{"title":"ceshi2","price":25}
.......
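
The bulk body is newline-delimited JSON: one action line followed by one source line per document, each ending with a newline. A hedged sketch of building such bodies in Python and splitting them at a size limit is shown below. It only produces strings and does not talk to a cluster; the function names are made up, and the 15 MB default simply mirrors the guideline above.

```python
import json


def bulk_lines(docs):
    """Yield (action_line, source_line) NDJSON pairs for 'index' actions."""
    for doc_id, source in docs:
        yield (json.dumps({"index": {"_id": doc_id}}) + "\n",
               json.dumps(source, ensure_ascii=False) + "\n")


def bulk_chunks(docs, max_bytes=15 * 1024 * 1024):
    """Group action/source pairs into request bodies of at most max_bytes.

    A single pair larger than max_bytes still goes out alone in its own body.
    """
    chunk, size = [], 0
    for action, source in bulk_lines(docs):
        pair = action + source
        pair_size = len(pair.encode("utf-8"))
        if chunk and size + pair_size > max_bytes:
            yield "".join(chunk)
            chunk, size = [], 0
        chunk.append(pair)
        size += pair_size
    if chunk:
        yield "".join(chunk)


docs = [(1, {"title": "ceshi", "price": 5}), (2, {"title": "ceshi2", "price": 25})]
for body in bulk_chunks(docs):
    print(body, end="")
```

Each yielded string could then be sent as the body of one POST to the _bulk endpoint.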

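3. Index mapping

Attribute          Description                                                   Applies to
store              yes/no: store the original value or not; default no           all
index              analyzed/not_analyzed/no; default analyzed                    string
null_value         default value to use when the field is null                   all
boost              field boost; default 1.0                                      all
index_analyzer     analyzer used at index time                                   all
search_analyzer    analyzer used at search time                                  all
analyzer           analyzer for both indexing and searching; default standard,   all
                   others include whitespace, simple, english
include_in_all
index_name
norms
...............................

<2> Dynamic mapping: decides whether a field mapping is added automatically when a new field is encountered
Add "dynamic": true to the mapping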

4. Simple commands

## List nodes: http://slavel:9200/_cat/nodes?v
## List all indices: http://slavel:9200/_cat/indices?v
## Check cluster health: http://slavel:9200/_cat/health?v

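II. Basic operations

1. Updating a document

 curl -XPOST http://localhost:9200/blog/article/1/_update -d '
{
    "script":"ctx._source.content = \"new content\""
}'
2. Adding a new field to a document
curl -XPOST http://localhost:9200/blog/article/1/_update -d '
{
    "script":"ctx._source.counter +=1",
    "upsert":{"counter":0}
}'

If the document does not exist yet, the upsert body is indexed instead, which creates it with a counter field set to 0.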

## Bulk update

// update several documents at once with _update_by_query
POST a_fire_zbkb/zbkb/_update_by_query
{
  "script": {
    "source": "ctx._source.FBSJ = '2018-09-25 00:00:00';ctx._source.GXSJ = '2018-09-25 00:00:00'",
        "lang":"painless"
    },
  "query":{
    "bool": {
      "must": [
        {
          "prefix": {
            "SZDXZQH.XZQHBH": {
              "value": "3202"
            }
          }
        }
      ]
    }
  }
}
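// Chongqing-specific bulk fix (change CLZT.ID from 99 to 6 and VALUE to test); works around a bug where the vehicle count was wrong because records with ID 99 were also counted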
POST a_fire_clxx/clxx/_update_by_query
{
  "script": {
    "source": "ctx._source.CLZT.ID = '6';ctx._source.CLZT.VALUE = 'test'",
        "lang":"painless"
    },
  "query":{
    "bool": {
      "must": [
        {
    "match": {
      "CLZT.ID": "999"
    }
          
        }
      ]
    }
  }
}

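2. Deleting a document

curl -XDELETE http://localhost:9200/blog/article/1

3. Version control
Elasticsearch provides versioning in the form of optimistic locking, which prevents conflicting writes when several clients work on the same document.

curl -XDELETE 'localhost:9200/library/book/1?version=1'

The delete is carried out only if the document's current version is 1.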
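
The version check behind optimistic locking can be sketched with a small in-memory store. This is a stand-in written for illustration and not Elasticsearch client code; the class and exception names are made up.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is not the current one."""


class VersionedStore:
    """In-memory stand-in for an index that tracks one version per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source):
        """Create or overwrite a document; every write increments the version."""
        version = self._docs.get(doc_id, (0, None))[0] + 1
        self._docs[doc_id] = (version, source)
        return version

    def delete(self, doc_id, version):
        """Delete only if the stored version equals the expected version."""
        current = self._docs[doc_id][0]
        if current != version:
            raise VersionConflict(
                f"current version is {current}, request expected {version}")
        del self._docs[doc_id]


store = VersionedStore()
store.index("1", {"title": "first draft"})   # stored as version 1
store.index("1", {"title": "second draft"})  # another writer bumps it to 2
try:
    store.delete("1", version=1)             # stale version: rejected
except VersionConflict as error:
    print("conflict:", error)
```

Nothing is locked while a client holds a document. The conflict is detected only at write time, and the client is expected to re-read and retry.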

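3. Field types

  Core types:
       string: text
       number: numeric
       date: date
       boolean: boolean
       binary: binary
(1) Common attributes
   index: can be set to analyzed or no.
         With analyzed the field is indexed and searchable; with no it cannot be searched. Default: analyzed.
       String fields can also be set to not_analyzed: indexed as-is, without analysis.
   store: (yes/no, default no) whether the field's original value is written to the index.
   boost: (default 1) the importance of the field in the document; the higher the value, the more important.
   null_value:
   copy_to:
   include_in_all:
(2) String
"contents":{"type":"string","store":"no","index":"analyzed"}
In addition to the common attributes:
term_vector:
   .......
(3) Numeric
   byte  short  integer  long  float  double
(4) Boolean
"allowed" : { "type" : "boolean", "store": "yes" }
(5) Binary
"image":{"type":"binary"}
(6) Date
"published" : { "type" : "date", "store" : "yes", "format" : "yyyy-MM-dd" }
In addition to the common attributes:
format: the date format
  ......
  
(7) IP address
"address" : { "type" : "ip", "store" : "yes" }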

4. Multi-fields
The same value is indexed in two fields, for example one for searching and one for sorting.
"name": {
"type": "string",
"fields": {
"facet": { "type" : "string", "index": "not_analyzed" }
}
}


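5. Analyzers: many predefined analyzers are available, and custom analyzers can be defined as well.

6. Similarity models

7. Bulk indexing to speed up indexing

8. Metadata (identifier) fields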

_type:
_all:
_source
_index:
_size:
_timestamp:

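II: Searching

1 :match query
2 :term
 ** Difference between term and match: match runs the query text through the appropriate analyzer for the field, while term performs no analysis and looks up the exact term. (match is therefore the more forgiving of the two.)
 
3 :prefix
4 :wildcard   wildcard query (? matches any single character)
5 :fuzzy      fuzzy query
6 :range      range query
7 :query_string
8 :text
9 :missing    filter that matches documents in which the field has no value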
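
The difference between term and match noted in the list above comes down to whether the query text is analyzed before the lookup. The toy model below shows it; the lowercase-and-split analyzer, the function names and the sample documents are all assumptions for illustration, not Elasticsearch behavior in detail.

```python
def analyze(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs, field):
    """Index the analyzed tokens of one field: token -> set of doc ids."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Look the value up exactly as given; no analysis."""
    return sorted(index.get(value, set()))


def match_query(index, text):
    """Analyze the text first, then match documents containing any resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


docs = {1: {"title": "Quick Brown Fox"}, 2: {"title": "lazy dog"}}
index = build_index(docs, "title")
print(term_query(index, "Quick"))       # nothing: the indexed token is "quick"
print(match_query(index, "Quick Dog"))  # both documents
```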

III: Cluster status red: fixing missing (unassigned) shards


POST  _cluster/reroute
{
  "commands": [
    {
      "allocate": {
        "index": "ezview_ssd_90_2016",
        "shard": 5,
        "node": "45.18.51.18"
        , "allow_primary": true
      }
    }
  ]
}

shard: set this to the number of the shard that is shown in gray
index: the index in whose row the gray shard appears

IV: Query commands

List all indices
#  GET _cat/indices

a. Prefix query
------------ Find documents whose name starts with j ---------------
GET /a_fire_test/test/_search
{
    "query":{
        "prefix":{
            "name":{
                "value":"j",
                "rewrite":"constant_score_boolean"
            }
        }
    }
}
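------- Aside: the available rewrite methods
scoring_boolean         turns each matching term into a clause of a boolean query and keeps the scores
constant_score_boolean  like the above but cheaper on CPU (every match gets the same constant score)
constant_score_filter   faster (walks the terms and builds a private filter)
top_terms_N             (N is the number of top-scoring terms to keep)
top_terms_boost_N       fastest
Rule of thumb: use the boolean rewrites for high precision (often at lower performance) and the top_terms_N rewrites for lower precision (higher performance).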

b. Updating
----------- Conditional update with a script -----------------
POST /library/book/1/_update 
{
  "script" : "if(ctx._source.year == start_date) ctx._source.year = new_date; else ctx._source.year = alt_date;",
   "params" : {
          "start_date" : 1935,
          "new_date" : 1936,
          "alt_date" : 1934
     }
}


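<1> _search queries

1. View the mapping of an index
#  GET /ceshi_index/_mapping

2. Search one field within an index/type
# GET /ceshi_index/ceshi_type/_search?q=title:ceshiziduan
# Finds every document of that index and type whose title contains ceshiziduan (_score in the result is Elasticsearch's relevance score for the hit; the higher, the more relevant)

3. Search one field within an index
# GET /ceshi_index/_search?q=title:ceshiziduan

4. Search without specifying an index or a type
# GET /_search?q=title:ceshiziduan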

5. A simple example
GET /index/_search
{
    "query":{
        "match_all":{}
    },
    "size":1                     // defaults to 10 when omitted
}

<2> term query
1. Find documents whose field contains a given term
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "term":{
            "preview":"关键词"
        }
    }
}
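----- Finds documents whose preview field contains the term
2. Searching for several terms (terms query)
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "terms":{
            "preview":["关键词1","关键词2"],
            "minimum_match":2
        }
    }
}
-- With minimum_match: 1 at least one of the two terms must be present; with minimum_match: 2 both must be. (minimum_match is an option of the terms query in older Elasticsearch releases.)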

<3> Controlling how many results are returned
GET /ceshi_index/ceshi_type/_search
{
    "from":0,
    "size":3,
    "query":{
        "term":{
            "preview":"关键词"
        }
    }
}
-- Returns three results, starting from the first matching document
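
Paging with from and size runs into the max_result_window limit from the configuration section once from + size passes the configured value (10,000 by default). Below is a small sketch of a helper that builds the paging part of a request body and refuses pages beyond the window. The helper name and its error message are made up for illustration.

```python
MAX_RESULT_WINDOW = 10000  # default value of index.max_result_window


def page_body(page, page_size, max_result_window=MAX_RESULT_WINDOW):
    """Build the from/size part of a search body for a zero-based page number."""
    start = page * page_size
    if start + page_size > max_result_window:
        raise ValueError(
            "from + size exceeds max_result_window; raise the setting "
            "or switch to scroll / search_after for deep paging")
    return {"from": start, "size": page_size}


print(page_body(0, 3))     # the first three hits
print(page_body(999, 10))  # the last page that still fits the default window
```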

<4> Returning the version number
GET /ceshi_index/ceshi_type/_search
{
    
    "version":true,
    "query":{
        "term":{
            "preview":"关键词"
        }
    }
}
-- Just add "version":true to the request body

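<5> match query (accepts text, numbers, dates, and other data types)
# Difference between match and term: match applies the appropriate analyzer for the given field; term does not.
1. match query (a match clause takes a single field, so two fields are combined with bool)
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "bool":{
            "must":[
                {"match":{"preview":"关键词"}},
                {"match":{"price":15}}
            ]
        }
    }
}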

2. match_all: match every document
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "match_all":{}
    }
}

3. match_phrase: phrase query; slop sets how many other words may sit between the query words
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "match_phrase":{
            "preview":{
                "query":"关键词1 关键词2",
                "slop":2
            }
        }
    }
}
4. multi_match: query several fields at once
GET /ceshi_index/ceshi_type/_search
{
    "query":{
        "multi_match":{
            "query":"关键词",
            "fields":["preview","price"]
        }
    }
}
-- Finds documents in which the keyword appears in the preview field or in the price field (a match in either field is enough)

<6> Choosing which fields are returned
1. Returning specific fields
GET /ceshi_index/ceshi_type/_search
{
    "fields":["preview","price"],
    "query":{
        "match":{
            "preview":"关键词"
        }
    }
}
-- Only the preview and price fields of each hit are returned

2. Controlling the loaded fields with partial_fields
GET /ceshi_index/ceshi_type/_search
{
    "partial_fields":{
        "partial":{
            "include":["preview"],
            "exclude":["title","price"]
        }
    },
    "query":{
        "match_all":{}
    }
}
-- Matches everything, loading the preview field and leaving out title and price.
## The wildcard * can also be used
            "include":["pre*"],
            "exclude":["tit*"]
3. bool query
GET /a_fire_shxx/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "DWLB.ID": {
              "value": "百万吨以上"
            }
          }
        }
      ]
    }
  }
}

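<7> Fuzzy queries
# fuzzy           fuzzy query
# value           the term to search for
# boost           boost of the query
# min_similarity  minimum similarity required for a match; strings: 0-1, numbers: may be greater than 1, dates: 1d, 2d, ... for one day, two days

GET /library/book/_search
{
    "query":{
        "fuzzy":{
            "preview":{
                "value":"测试关键字",
                "min_similarity":0.5
            }
        }
    }
}

-- Querying on several fields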
GET a_fire_zbdt/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "ZBRQ": {
              "value": "2018-07-11"
            }
          }
        },
        {
          "term": {
            "SZDXFJG.XFJGBH": {
              "value": "55c99876ec0f425aac6925a92cceeb17"
            }
          }
        }
      ]
    }
  }
}
-- Fuzzy matching with a wildcard query
GET a_fire_xfdw/xfdw/_search
{
  
  "query": {
   "bool": {
     "must": [
       {
         "term": {
           "DWJB": {
             "value": "3"
           }
         }
       }
     ],
     "must_not": [
       {
         "wildcard": {
           "DWMC": {
             "value": "*专职*队"
           }
         }
       },
       {
         "term": {
           "JLZT": {
             "value": "0"
           }
         }
       }
     ]
   }
  }
  , "size": 500
  , "sort": [
    {
      "DWBH": {
        "order": "desc"
      }
    }
  ]
}
-- "Not equal": filtering on several conditions
GET a_fire_zqxx/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "LASJ": {
              "gte": "2018-01-01 00:00:00",
              "lte": "2018-07-10 21:08:00"
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "ZQBS": {
              "value": "0"
            }
          }
        }
      ]
    }
  }
}

-- should query (should expresses an OR relationship)
{
    "bool":{
        "should":[
            {
                "term":{
                    "test.name":"China"
                }
            },
            {
                "term":{
                    "test.name":"Japan"
                }
            }
        ]
    }
}
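
How must, must_not and should combine in the bool queries above can be modeled with plain predicates. The sketch below is a simplification that ignores scoring: it assumes that should clauses are required only when there is no must clause, and every name and sample document in it is made up.

```python
def term(field, value):
    """Return a predicate that is true when doc[field] equals value exactly."""
    return lambda doc: doc.get(field) == value


def bool_query(must=(), should=(), must_not=()):
    """Combine predicates: all of must, none of must_not, and,
    when there is no must clause, at least one of should."""
    def matches(doc):
        if not all(clause(doc) for clause in must):
            return False
        if any(clause(doc) for clause in must_not):
            return False
        if should and not must:
            return any(clause(doc) for clause in should)
        return True
    return matches


docs = [
    {"name": "China", "status": "1"},
    {"name": "Japan", "status": "0"},
    {"name": "Korea", "status": "1"},
]
either = bool_query(should=[term("name", "China"), term("name", "Japan")])
print([d["name"] for d in docs if either(d)])   # OR: China, Japan

active = bool_query(must=[term("status", "1")], must_not=[term("name", "Korea")])
print([d["name"] for d in docs if active(d)])   # AND NOT: China
```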

            
<8> Sorting
-- sort  (asc, desc)

"sort" : [
{ "title" : "asc" }
]

Recommended form:
{
    "query":{
        "match_all":{ }
    },
    "sort":[
    {
        "title.age":"asc"
    }
    ]
}

-- Sorting when some documents lack the sort field
{
    "query":{
        "match_all":{ }
    },
    "sort":[
      {
          "section":{
              "order":"asc",
              "missing":"_last"   // or _first
          }
      }
    ]
}
Documents that lack the field are then placed last (or first).
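
The effect of missing on the sort order can be reproduced on a plain list. This is a sketch of the behavior only; the function name and the sample data are made up.

```python
def sort_hits(docs, field, order="asc", missing="_last"):
    """Sort docs by field; docs without the field go last or first as requested."""
    present = [d for d in docs if d.get(field) is not None]
    absent = [d for d in docs if d.get(field) is None]
    present.sort(key=lambda d: d[field], reverse=(order == "desc"))
    return absent + present if missing == "_first" else present + absent


docs = [{"id": 1, "section": 3}, {"id": 2}, {"id": 3, "section": 1}]
print([d["id"] for d in sort_hits(docs, "section")])
print([d["id"] for d in sort_hits(docs, "section", order="desc", missing="_first")])
```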

II. Deleting all documents of an index

POST a_fire_jzxx/jzxx/_delete_by_query
{
  "query":{
    "match_all":{}
  }
}

III. Aggregations

1. Metric aggregations
-- Find the minimum year
{
    "aggs":{
        "min_year":{
            "min":{
                "field":"year"
            }
        }
    }
}
-- Find the minimum year and subtract 1000
{
    "aggs":{
        "min_year":{
            "min":{
                "script":"doc['year'].value - 1000"
            }
        }
    }
}


2. Bucket aggregations

{
  "aggs": {
    "availability": {
       "terms": {
       "field": "copies",
       "size": 40,
       "order": { 
       "_term": "asc" 
       }
      }
    }
  }
}
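
The two kinds of aggregation differ in what they return: a metric aggregation reduces the matching documents to one number, while a bucket aggregation groups them and counts each group. A rough Python equivalent over an in-memory list is shown below; the function names and the sample documents are made up, and the response shape is simplified.

```python
from collections import Counter


def min_agg(docs, field):
    """Metric aggregation: the smallest value of the field, or None if no doc has it."""
    values = [d[field] for d in docs if field in d]
    return min(values) if values else None


def terms_agg(docs, field, size=10):
    """Bucket aggregation: one bucket per distinct value, ordered by key ascending,
    keeping at most size buckets."""
    counts = Counter(d[field] for d in docs if field in d)
    return [{"key": key, "doc_count": counts[key]} for key in sorted(counts)][:size]


docs = [
    {"year": 1936, "copies": 1},
    {"year": 1886, "copies": 0},
    {"year": 1961, "copies": 1},
]
print(min_agg(docs, "year"))
print(terms_agg(docs, "copies", size=40))
```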
