Learning Elasticsearch (Part 1): Elasticsearch Basics

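This article introduces the basic concepts of Elasticsearch, including documents, indices, nodes, and clusters. It then looks at inverted indexes and analysis (tokenization), how to define custom analyzers, and how to configure mappings. It also covers using the RESTful API and processing documents in bulk.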

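1. Getting Started with Elasticsearch

1. Terminology

Document: a unit of user data stored in es
Index: a collection of documents that share the same fields
Node: a running instance of Elasticsearch, the building block of a cluster
Cluster: one or more nodes that together provide the service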

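2. Documents

A document is a JSON object made up of fields. Common field data types:

String: text, keyword
Numeric: long, integer, short, byte, double, float, half_float, scaled_float
Boolean: boolean
Date: date
Binary: binary
Range: integer_range, float_range, long_range, double_range, date_range

Every document has a unique id, which is either:

specified by the user
generated automatically by es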
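Document Metadata

Metadata that describes the document itself

_index: name of the index the document belongs to
_type: name of the type the document belongs to
_id: unique id of the document
_uid: composite id built from _type and _id (from 6.x on, _type no longer plays a role, so _uid is the same as _id)
_source: the original JSON of the document; the content of every field can be read from here
_all: concatenates the content of all fields into one field; disabled by default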

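3. Indices

An index stores documents that share the same structure
- Each index has its own mapping, which defines field names and types
A cluster can hold many indices, for example:
- nginx logs can be stored in one index per day
  - nginx-log-2017-01-01
  - nginx-log-2017-01-02
  - nginx-log-2017-01-03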

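4. REST API

An Elasticsearch cluster exposes a RESTful API
- REST: REpresentational State Transfer
- The URI identifies the resource, such as an index or a document
- The HTTP method specifies the operation, such as GET, POST, PUT, or DELETE
Two common ways to talk to it
- curl on the command line
- Kibana DevTools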

5. Index API

es has a dedicated Index API for creating, updating, and deleting index configuration

    Thpffcj:elasticsearch-6.5.4 thpffcj$ bin/elasticsearch

    Thpffcj:kibana-6.5.4-darwin-x86_64 thpffcj$ bin/kibana

Kibana's UI is available at localhost:5601, where Kibana DevTools can be used to call the REST API

    PUT /test_index

    #! Deprecation: the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template
    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "test_index"
    }

    GET _cat/indices

    yellow open test_index xbR-XQYDT7C-4GZZ1vRjfA 5 1 0 0 1.1kb 1.1kb

    DELETE /test_index

    {
      "acknowledged" : true
    }

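6. Document API

es has a dedicated Document API for:

creating documents
retrieving documents
updating documents
deleting documents

Create a document with id 1 and type doc in the index test_index (later versions no longer have the concept of a type)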

    PUT /test_index/doc/1
    {
      "username":"thpffcj",
      "age":22
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }

Create a document without specifying an id

    POST /test_index/doc
    {
      "username":"tom",
      "age":20
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "yfg31mwBWHG_wS6wM641",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }

Retrieve a document by id

    GET /test_index/doc/1

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source" : {
        "username" : "thpffcj",
        "age" : 22
      }
    }

Search all documents with _search

    GET /test_index/doc/_search

    {
      "took" : 12,  // query time in ms
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,  // total number of matching documents
        "max_score" : 1.0,
        "hits" : [  // the matching documents; the first 10 by default
          {
            "_index" : "test_index",  // index name
            "_type" : "doc",
            "_id" : "yfg31mwBWHG_wS6wM641",  // document id
            "_score" : 1.0,  // relevance score of the document
            "_source" : {  // document content
              "username" : "tom",
              "age" : 20
            }
          },
          {
            "_index" : "test_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "username" : "thpffcj",
              "age" : 22
            }
          }
        ]
      }
    }

Search with a query condition

    GET /test_index/doc/_search
    {
      "query": {
        "term": {
          "_id":"1"
        }
      }
    }

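es can process many documents in a single request, which cuts network overhead and raises write throughput
- the _bulk endpoint runs multiple index, delete, and update actions in one request
- the _mget endpoint fetches multiple documents in one request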

    POST _bulk
    {"index":{"_index":"test_index","_type":"doc","_id":"3"}}
    {"username":"lilei","age":10}
    {"delete":{"_index":"test_index","_type":"doc","_id":"1"}}
    {"update":{"_index":"test_index","_type":"doc","_id":"2"}}
    {"doc":{"age":100}}

    GET _mget
    {
      "docs": [
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "1"
        },
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "2"
        }
      ]
    }
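
The _bulk body shown earlier is newline-delimited JSON: one action line, optionally followed by one payload line. As a minimal sketch of how a client might assemble such a body, the Python below serializes the same three actions. The helper name `build_bulk_body` is hypothetical, and the snippet only builds the text; it does not send anything to a cluster.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, payload) triples into a _bulk request body."""
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # a delete action has no payload line
            lines.append(json.dumps(payload))
    # Newline-delimited JSON; the last line must also end with a newline.
    return "\n".join(lines) + "\n"


actions = [
    ("index", {"_index": "test_index", "_type": "doc", "_id": "3"},
     {"username": "lilei", "age": 10}),
    ("delete", {"_index": "test_index", "_type": "doc", "_id": "1"}, None),
    ("update", {"_index": "test_index", "_type": "doc", "_id": "2"},
     {"doc": {"age": 100}}),
]

bulk_body = build_bulk_body(actions)
print(bulk_body, end="")
```

Each action is applied independently, so one failing item (for example, an update to a document that does not exist) does not stop the others.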

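2. Inverted Index and Analysis in Elasticsearch

1. A book's table of contents and its index

How do you find the pages that mention the keyword "ACID"?

Books and search engines

The table of contents corresponds to a forward index
The index pages at the back correspond to an inverted index

2. Forward and inverted indexes

Forward index
- Maps a document id to the document content and its terms
Inverted index
- Maps a term to the ids of the documents that contain it

Inverted index query flow

To find the documents containing "search engine":
- Look up "search engine" in the inverted index to get the matching document ids, 1 and 3
- Fetch the full content of documents 1 and 3 through the forward index
- Return the final result to the user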
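
The three steps above can be sketched in a few lines of Python. This is only an illustration with made-up documents and whitespace tokenization, not how es stores its indexes; the names `forward_index`, `inverted_index`, and `search` are assumptions for the example.

```python
from collections import defaultdict

# Forward index: document id -> document content (sample data for illustration).
forward_index = {
    1: "elasticsearch is a search engine",
    2: "lucene is a java library",
    3: "a search engine uses an inverted index",
}

# Inverted index: term -> set of ids of the documents that contain the term.
inverted_index = defaultdict(set)
for doc_id, text in forward_index.items():
    for term in text.split():
        inverted_index[term].add(doc_id)


def search(term):
    """Look up doc ids in the inverted index, then fetch content from the forward index."""
    doc_ids = sorted(inverted_index.get(term, ()))
    return [(doc_id, forward_index[doc_id]) for doc_id in doc_ids]


for doc_id, content in search("search"):
    print(doc_id, content)
```

Here the term "search" resolves to document ids 1 and 3, and the forward index supplies their full content.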

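3. Inverted index in detail

The inverted index is the core of a search engine and has two main parts:
- Term Dictionary
  - A key component of the inverted index
  - Records every term that appears in the documents, so it is usually large
  - Records the mapping from each term to its posting list
  - Usually implemented with a B+ tree
- Posting List
  - Records the set of documents that contain a term; it is made up of postings
  - Each posting mainly holds:
    - Document id, used to fetch the original content
    - Term frequency (TF), the number of times the term occurs in the document, used later for relevance scoring
    - Position, the token position(s) of the term within the document, used for phrase queries
    - Offset, the start and end character offsets of the term in the document, used for highlighting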
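
To make the contents of a posting concrete, here is a small Python sketch that records document id, term frequency, positions, and offsets for each term. It assumes a naive tokenizer (lowercase, split on `\w+`) and uses a plain dict as the term dictionary rather than a B+ tree; the function name `build_postings` is hypothetical.

```python
import re
from collections import defaultdict


def build_postings(docs):
    """Build term -> {doc_id: posting} with tf, positions, and character offsets."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text.lower())):
            posting = index[match.group()].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []}
            )
            posting["tf"] += 1
            posting["positions"].append(position)
            posting["offsets"].append((match.start(), match.end()))
    return index


docs = {1: "to be or not to be", 2: "be quick"}
index = build_postings(docs)

print(sorted(index))   # the term dictionary, in sorted order
print(index["be"][1])  # the posting of "be" in document 1
```

In document 1 the term "be" occurs twice, at token positions 1 and 5, covering characters 3 to 5 and 16 to 18.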

es stores JSON documents that contain multiple fields, and each field has its own inverted index

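4. Analysis

Tokenization is the process of converting text into a sequence of terms (tokens). It is also called text analysis, and es calls it Analysis
An analyzer is the es component that performs analysis. It is made up of:
- Character Filters: preprocess the raw text, for example stripping HTML markup
- Tokenizer: splits the raw text into terms according to some rule
- Token Filters: post-process the terms produced by the tokenizer, for example lowercasing, removing, or adding terms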
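
The order of the three stages matters: character filters run first, then exactly one tokenizer, then the token filters in sequence. The Python sketch below imitates that pipeline. Every function here is a rough stand-in written for this example (the regex-based `html_strip` in particular is far cruder than the real HTML Strip filter), so treat it as an illustration of the data flow only.

```python
import html
import re


def html_strip(text):
    """Character filter: drop tags and decode entities (crude approximation)."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def word_tokenizer(text):
    """Tokenizer: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def stop(tokens, stopwords=frozenset({"a", "an", "the"})):
    """Token filter: remove stop words (tiny illustrative list)."""
    return [token for token in tokens if token not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "The <b>Quick</b> &amp; brown fox",
    char_filters=[html_strip],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, stop],
)
print(tokens)  # ['quick', 'brown', 'fox']
```

Placing `lowercase` before `stop` is deliberate: the stop list is lowercase, so "The" is only removed once it has been lowercased.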

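5. Analyze API

es provides an API for testing analysis so you can verify how text is tokenized; the endpoint is _analyze

You can test a named analyzer directly: analyzer selects the analyzer and text is the text to analyze

    POST _analyze
    {
      "analyzer": "standard",
      "text": "hello world!"
    }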

    {
      "tokens" : [
        {
          "token" : "hello",  // the resulting token
          "start_offset" : 0,  // start offset
          "end_offset" : 5,  // end offset
          "type" : "<ALPHANUM>",
          "position" : 0  // token position
        },
        {
          "token" : "world",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "<ALPHANUM>",
          "position" : 1
        }
      ]
    }

You can test against a field of an index, which uses that field's analyzer

    POST test_index/_analyze
    {
      "field": "username", 
      "text": "hello world!"
    }

You can also test a custom combination of tokenizer and filters

    POST _analyze
    {
      "tokenizer": "standard", 
      "filter": ["lowercase"], 
      "text": "hello world!"
    }

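6. Built-in analyzers

es ships with the following analyzers

Standard
- The default analyzer
- Splits text into words; supports many languages
Simple
- Splits on non-letter characters
Whitespace
- Splits on whitespace
Stop
- Simple Analyzer plus stop word removal
Keyword
- No tokenization; emits the whole input as a single term
Pattern
- Uses a regular expression as the separator
- The default is \W+, i.e. any run of non-word characters is a separator
Language
- Analyzers for 30+ common languages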
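
To see how these analyzers differ on the same input, the sketch below approximates four of them in Python. These are simplified imitations written for this example, not the real implementations: the Simple stand-in only knows ASCII letters, and both the Simple and Pattern stand-ins lowercase their output, which matches the default behavior of those analyzers.

```python
import re


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Split on anything that is not an ASCII letter, then lowercase."""
    return [part.lower() for part in re.split(r"[^a-zA-Z]+", text) if part]


def keyword_analyzer(text):
    """Emit the whole input as a single term."""
    return [text]


def pattern_analyzer(text, pattern=r"\W+"):
    """Split on a regular expression (default \\W+), then lowercase."""
    return [part.lower() for part in re.split(pattern, text) if part]


text = "The 2 QUICK Brown-Foxes jumped!"
for analyzer in (whitespace_analyzer, simple_analyzer, keyword_analyzer, pattern_analyzer):
    print(analyzer.__name__, analyzer(text))
```

Note how the digit 2 survives the `\W+` split but is dropped by the letter-based split, and how only the whitespace variant keeps "Brown-Foxes" and "jumped!" intact.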

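7. Chinese word segmentation

Difficulties
- Chinese word segmentation means splitting a sequence of Chinese characters into individual words. English uses spaces as a natural delimiter between words, but Chinese has no formal delimiter
- The same text can be segmented very differently depending on context
Common segmentation systems
- IK
  - Segments both Chinese and English words
  - Supports custom dictionaries and hot reloading of the dictionary
- jieba
  - The most popular segmentation library in Python; supports segmentation and part-of-speech tagging
  - Supports Traditional Chinese, custom dictionaries, parallel segmentation, and more
- Hanlp
  - A Java toolkit made up of a set of models and algorithms, aiming to bring natural language processing into production use
- THULAC
  - THU Lexical Analyzer for Chinese, a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University, providing word segmentation and part-of-speech tagging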

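8. Custom analysis: Character Filter

When the built-in analyzers do not meet your needs, you can define your own
- by combining a custom Character Filter, Tokenizer, and Token Filter

Character Filter

Processes the raw text before the Tokenizer, for example adding, removing, or replacing characters
Built-in character filters:
- HTML Strip removes HTML tags and decodes HTML entities
- Mapping replaces characters
- Pattern Replace replaces regular expression matches
Character filters affect the position and offset information that the tokenizer computes afterwards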

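9. Custom analysis: Tokenizer

Splits the raw text into terms (tokens) according to some rule
Built-in tokenizers:
- standard: splits on word boundaries
- letter: splits on non-letter characters
- whitespace: splits on whitespace
- UAX URL Email: like standard, but keeps email addresses and URLs intact
- NGram and Edge NGram: split into n-grams
- Path Hierarchy: splits along a file path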
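10. Custom analysis: Token Filter

Adds, removes, or modifies the terms produced by the Tokenizer
Built-in token filters:
- lowercase: converts all terms to lowercase
- stop: removes stop words
- NGram and Edge NGram: split into n-grams
- Synonym: adds synonym terms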

11. Custom analyzers

A custom analyzer is defined in the index settings

    PUT test_index_1
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              
              "char_filter": [
                "html_strip"
              ],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    }

Test it

    POST test_index_1/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Is this <b>a box</b>?"
    }

    {
      "tokens" : [
        {
          "token" : "is",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "this",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "a",
          "start_offset" : 11,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "box",
          "start_offset" : 13,
          "end_offset" : 20,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }

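12. When analysis is used

Analysis runs at two points:
- Index time: when a document is created or updated, its fields are analyzed
- Search time: the query string is analyzed
You usually do not need to set a separate search-time analyzer; use the index-time analyzer, because mismatched analyzers can produce terms that never match

Recommendations

Decide whether each field needs analysis. Set the type of fields that do not need it to keyword, which saves space and improves write performance
Make good use of the _analyze API to inspect how a document is actually tokenized
Test things hands-on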

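3. Mapping Settings in Elasticsearch

1. Introduction to mappings

A mapping is similar to a table schema in a database. Its main purposes are:
- Defining the field names of an index
- Defining the field types, such as numeric, string, or boolean
- Defining inverted index settings, such as whether a field is indexed and whether positions are recorded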

    GET /test_index/_mapping

    {
      "test_index" : {
        "mappings" : {
          "doc" : {
            "properties" : {
              "age" : {
                "type" : "long"
              },
              "username" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }

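2. Custom mappings

Once a field's type is set in the mapping, it cannot be changed directly, because:
- the inverted index built by Lucene cannot be modified after it is generated
To change a field type, create a new index and then reindex
Adding new fields is allowed
- The dynamic parameter controls whether fields are added automatically
  - true (default): new fields are added automatically
  - false: new fields are not added to the mapping; the document is still written, but the new fields cannot be searched
  - strict: the document is rejected with an error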
3. Mapping demo

    PUT my_index
    {
      "mappings": {
        "doc": {
          "dynamic": false,
          "properties": {
            "title": {
              "type": "text"
            },
            "name": {
              "type": "keyword"
            },
            "age": {
              "type": "integer"
            }
          }
        }
      }
    }

Write a document

    PUT my_index/doc/1
    {
      "title": "hello world",
      "desc": "nothing here"
    }

    {
      "_index" : "my_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }

Query the data

    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "title": "hello"
        }
      }
    }

    {
      "took" : 10,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "title" : "hello world",
              "desc" : "nothing here"
            }
          }
        ]
      }
    }

Searching on desc, a field that is not in the mapping, returns no hits because dynamic is false

    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "desc": "here"
        }
      }
    }

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 0,
        "max_score" : null,
        "hits" : [ ]
      }
    }

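4. The copy_to parameter

Copies the value of a field into a target field, giving an effect similar to _all
The target field does not appear in _source and is only used for searching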
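
As an illustration, the Python snippet below builds a 6.x-style mapping body in which two fields are copied into a third. The field names `first_name`, `last_name`, and `full_name` are hypothetical, as is the idea of sending the printed JSON as the body of a `PUT` request for a new index; the snippet itself only prints the JSON.

```python
import json

# Hypothetical mapping: both name fields are copied into full_name at index time.
mapping = {
    "mappings": {
        "doc": {
            "properties": {
                "first_name": {"type": "text", "copy_to": "full_name"},
                "last_name": {"type": "text", "copy_to": "full_name"},
                "full_name": {"type": "text"},
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

A match query on full_name could then find a document by either name, even though full_name never appears in the document's _source.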

5. The index parameter

Controls whether the field is indexed. The default is true; with false the field is not indexed and therefore cannot be searched

    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index": false
            }
          }
        }
      }
    }

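6. The index_options parameter

index_options controls what the inverted index records. There are 4 settings:
- docs records only doc ids
- freqs records doc ids and term frequencies
- positions records doc ids, term frequencies, and term positions
- offsets records doc ids, term frequencies, term positions, and character offsets
The default is positions for text fields and docs for other types
The more that is recorded, the more space is used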

    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index_options": "offsets"
            }
          }
        }
      }
    }

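null_value
Defines how a null value in the field is handled. The default is null, meaning es ignores the value; setting null_value supplies a default value to index in its place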

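7. Data types

Core data types
- String: text, keyword
- Numeric: long, integer, short, byte, double, float, half_float, scaled_float
- Date: date
- Boolean: boolean
- Binary: binary
- Range: integer_range, float_range, long_range, double_range, date_range
Complex data types
- Array: array
- Object: object
- Nested: nested object
Geo data types
- geo_point
- geo_shape
Specialized types
- ip: stores IP addresses
- completion: provides autocomplete
- token_count: stores the number of tokens
- murmur3: stores a hash of the string
- percolator
- join
Multi-fields
- Allow the same field to be indexed with different settings, such as different analyzers. A common example is pinyin search on a person's name, which only requires adding a pinyin sub-field to the name field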

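8. Dynamic mapping

es can detect the type of document fields automatically, which lowers the barrier to use
- es infers the field type from the JSON type of the value

9. Dynamic date and numeric detection

The date formats used for automatic date detection are configurable to fit different needs
A string that contains a number is not detected as a numeric type by default, because numbers inside strings are entirely legitimate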
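
The detection rules can be approximated with a short Python function. This is a simplified model, not the actual es logic: the `date_formats` tuple uses Python strptime patterns as stand-ins for the configurable es date formats, the function name `infer_field_type` is hypothetical, and the floating-point type chosen for numeric strings may differ between es versions.

```python
import re
from datetime import datetime


def _is_date(value, date_formats):
    """True if the string parses with any of the given strptime formats."""
    for fmt in date_formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_field_type(value, date_detection=True, numeric_detection=False,
                     date_formats=("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")):
    """Guess the mapped type for a parsed JSON value (simplified model)."""
    if value is None:
        return None  # null adds no field
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):  # arrays take the type of the first non-null value
        first = next((item for item in value if item is not None), None)
        return infer_field_type(first, date_detection, numeric_detection, date_formats)
    if date_detection and _is_date(value, date_formats):
        return "date"
    if numeric_detection:  # off by default, so "123" normally stays a string
        if re.fullmatch(r"[+-]?\d+", value):
            return "long"
        if re.fullmatch(r"[+-]?\d+\.\d+", value):
            return "float"  # assumption: the exact type may vary by version
    return "text"  # es also adds a keyword sub-field


print(infer_field_type("2017-01-01"))                  # date
print(infer_field_type("123"))                         # text
print(infer_field_type("123", numeric_detection=True)) # long
```

Numeric detection being opt-in is what keeps values such as postal codes or order numbers from being silently turned into numbers.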

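10. Dynamic templates

Dynamic templates set field types based on the data type es detects, the field name, and so on. For example:
- Map all strings to keyword, i.e. not analyzed by default
- Map all fields whose names start with message to text, i.e. analyzed
- Map all fields whose names start with long_ to long
- Map all fields detected as double to float to save space
Matching rules generally use the following parameters:
- match_mapping_type matches the type es detected, such as boolean, long, or string
- match and unmatch match the field name
- path_match and path_unmatch match the field path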
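
Templates are checked in order and the first one that matches wins, which is why a specific rule has to come before a catch-all. The Python sketch below models that resolution for match_mapping_type, match, and unmatch only (path_match and path_unmatch are left out). The helper names are hypothetical and the wildcard handling is a simplification that only understands `*`.

```python
import re


def wildcard_match(pattern, name):
    """Match a name against a pattern where * stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, name) is not None


def resolve_template(templates, field_name, detected_type):
    """Return (template name, mapping) of the first matching template, or None."""
    for template in templates:
        name, rule = next(iter(template.items()))
        if "match_mapping_type" in rule and rule["match_mapping_type"] != detected_type:
            continue
        if "match" in rule and not wildcard_match(rule["match"], field_name):
            continue
        if "unmatch" in rule and wildcard_match(rule["unmatch"], field_name):
            continue
        return name, rule["mapping"]
    return None


templates = [
    {"message_as_text": {"match_mapping_type": "string", "match": "message",
                         "mapping": {"type": "text"}}},
    {"string_as_keywords": {"match_mapping_type": "string",
                            "mapping": {"type": "keyword"}}},
]

print(resolve_template(templates, "message", "string"))
print(resolve_template(templates, "name", "string"))
print(resolve_template(templates, "age", "long"))
```

If the two templates were swapped, the catch-all string rule would match first and message would become a keyword field.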

    PUT test_index
    {
      "mappings": {
        "doc": {
          "dynamic_templates": [
            {
              "message_as_text": {
                "match_mapping_type": "string",
                "match": "message",
                "mapping": {
                  "type": "text"
                }
              } 
            },
            {
              "string_as_keywords": {
                "match_mapping_type": "string",
                "mapping": {
                  "type": "keyword"
                }
              } 
            }
          ]
        }
      }
    }

    PUT test_index/doc/1
    {
      "name": "Thpffcj",
      "message": "hello world"
    }

View the index mapping

    GET test_index/_mapping

    {
      "test_index" : {
        "mappings" : {
          "doc" : {
            "dynamic_templates" : [
              {
                "message_as_text" : {
                  "match" : "message",
                  "match_mapping_type" : "string",
                  "mapping" : {
                    "type" : "text"
                  }
                }
              },
              {
                "string_as_keywords" : {
                  "match_mapping_type" : "string",
                  "mapping" : {
                    "type" : "keyword"
                  }
                }
              }
            ],
            "properties" : {
              "message" : {
                "type" : "text"
              },
              "name" : {
                "type" : "keyword"
              }
            }
          }
        }
      }
    }