Elasticsearch Basics 4: Getting Started, Part 1

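I. Basic Concepts

1.Documents

《1.》Elasticsearch is document-oriented; a document is the smallest unit of searchable data.

e.g. ① A log entry in a log file

② The details of a movie | the details of a record album

③ A song on an MP3 player | the contents of a PDF document

《2.》Documents are serialized as JSON and stored in Elasticsearch

① A JSON object is made up of fields

② Every field has a field type (string | numeric | boolean | date | binary | range)

《3.》Every document has a unique ID

① You can specify the ID yourself

② Or let Elasticsearch generate one automatically

2.JSON documents

《1.》A document contains a set of fields, much like a row in a database table

《2.》JSON documents are flexible; no schema has to be defined in advance

① Field types can be specified explicitly or inferred automatically by Elasticsearch

② Arrays and nested objects are supported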
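To make the points above concrete, here is a small Python sketch that serializes a document to JSON. The field names and values are invented for illustration and do not come from any index used in this post.

```python
import json

# A hypothetical movie document: plain fields, an array, and a nested object.
movie = {
    "title": "Example Movie",                      # string
    "year": 1999,                                  # number
    "in_stock": True,                              # boolean
    "released": "1999-03-31",                      # date, written as an ISO-8601 string
    "genres": ["drama", "sci-fi"],                 # array
    "director": {"first": "Jane", "last": "Doe"},  # nested object
}

# This JSON text is the kind of request body that would be sent when indexing.
body = json.dumps(movie, ensure_ascii=False)
print(body)

# Serialization is lossless for these types: parsing gives back the same dict.
assert json.loads(body) == movie
```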

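3.Document metadata

4.Index

5.The different meanings of "index"

6.Abstractions and analogies

Why multiple types under a single index are no longer supported: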

https://www.elastic.co/cn/blog/moving-from-types-to-typeless-apis-in-elasticsearch-7-0

7.REST API

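It can be called easily from almost any programming language.

8.Some basic APIs

When we installed Kibana earlier, we imported several sample data sets: e-commerce orders, flight monitoring data, and web log data.

Together with the movie test data imported while setting up Logstash, these indices can all be seen in Kibana's Index Management.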

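//Show information about the e-commerce orders index
GET kibana_sample_data_ecommerce

//Show the total number of documents in the e-commerce orders index
GET kibana_sample_data_ecommerce/_count

//Show the first 10 documents of the e-commerce orders index to see the document format
POST kibana_sample_data_ecommerce/_search

//List indices whose names start with kibana (wildcard match)
GET /_cat/indices/kibana*?v&s=index

//List indices whose health status is green
GET /_cat/indices?v&health=green

//Sort indices by document count
GET /_cat/indices?v&s=docs.count:desc

//Show the memory used by each index
GET /_cat/indices?v&h=i,tm&s=tm:desc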

CAT Indices:

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/cat-indices.html

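II. Cluster | Node | Shard | Replica

1.Availability and scalability of distributed systems

《1.》High availability

Service availability: some nodes may stop serving without taking the service down

Data availability: losing some nodes does not lose data

《2.》Scalability

Handles growing request volume and ever-growing data (data is distributed across all nodes)

2.Distributed characteristics

《1.》Benefits of Elasticsearch's distributed architecture

Storage scales horizontally

Higher availability: if some nodes stop, the cluster as a whole keeps serving

《2.》Elasticsearch's distributed architecture

Clusters are distinguished by name; the default name is "elasticsearch"

The name can be changed in the configuration file or set on the command line with -E cluster.name=beehive

A cluster can have one or more nodes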

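3.Nodes

《1.》A node is one Elasticsearch instance

In essence it is a Java process

Several Elasticsearch processes can run on one machine, but in production it is generally recommended to run only one instance per machine

《2.》Every node has a name, set in the configuration file or at startup with -E node.name=node1

《3.》After a node starts, it is assigned a UID, which is stored in the data directory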

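4.Master-eligible nodes and the Master Node

《1.》By default, every node is master-eligible once it starts

This can be disabled by setting node.master: false

《2.》Master-eligible nodes can take part in master election and become the Master node

《3.》When the first node starts, it elects itself as the Master node.

《4.》Every node holds a copy of the cluster state, but only the Master node can modify it

《5.》The cluster state maintains the essential information about a cluster

① Information about all nodes

② All indices with their Mapping and Setting information

③ Shard routing information

Note: only the Master node may modify the cluster state; if any node could modify it, the data would become inconsistent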

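5.Data Node & Coordinating Node

《1.》Data Node

A node that can hold data is called a Data Node. It stores shard data and plays a crucial role in scaling data out

《2.》Coordinating Node

① Receives client requests, forwards them to the appropriate nodes, and finally gathers the results together

② By default every node acts as a Coordinating Node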
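The "gather" step of a Coordinating Node can be pictured as merging several already-sorted result lists. The sketch below is a simplified illustration only: the (score, doc_id) tuples and the function name merge_shard_hits are made up, and the real search flow involves more phases than this.

```python
import heapq
from itertools import islice

def merge_shard_hits(shard_results, size):
    """Merge per-shard hit lists (each sorted by score, descending) into a global top `size`."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))

# Hypothetical (score, doc_id) hits returned by three shards.
shard_results = [
    [(9.1, "a"), (3.2, "b")],
    [(7.5, "c"), (7.0, "d")],
    [(8.4, "e")],
]
top3 = merge_shard_hits(shard_results, size=3)
print(top3)  # [(9.1, 'a'), (8.4, 'e'), (7.5, 'c')]
```

Because each shard only returns its own best hits, the merge stays cheap even when the index is spread over many nodes.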

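6.Other node types

《1.》Hot & Warm Node

Data Nodes with different hardware configurations, used to build a Hot & Warm architecture and reduce the cost of a cluster deployment.

For example, log processing setups use hot and cold nodes, with the hot nodes given better disks and higher throughput.

《2.》Machine Learning Node

Runs machine learning jobs, used for anomaly detection

《3.》Tribe Node (may be phased out in the future)

A Tribe Node connects to several Elasticsearch clusters and lets them be treated as a single cluster.

From 5.3 onward, Cross Cluster Search is used instead.

7.Configuring node types

《1.》In a development environment, one node can take on several roles

《2.》In production, nodes can be given a single role (dedicated node)

This gives better performance and clear responsibilities, and different node types can run on differently configured machines.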

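8.Shards (Primary Shard & Replica Shard)

《1.》Primary shards solve the problem of scaling data horizontally. With primary shards, data can be distributed across all nodes in the cluster.

① A shard is a running Lucene instance

② The number of primary shards is set when the index is created and cannot be changed afterwards, except by reindexing

《2.》Replica shards solve the problem of data high availability. A replica shard is a copy of a primary shard

① The number of replica shards can be adjusted dynamically

② Adding replicas can also, to some extent, improve service availability (read throughput)

《3.》Shard distribution of the blogs index in a three-node cluster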
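One way to see why the primary shard count is fixed at index creation: conceptually, a document is routed with hash(routing) % number_of_primary_shards, where the routing value defaults to the document ID. The sketch below illustrates that idea under stated assumptions. CRC32 is used only as a stand-in hash (Elasticsearch uses its own hash function), and pick_shard and the document IDs are hypothetical.

```python
import zlib

def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Return the primary shard number for a routing value (the document ID by default)."""
    # Assumption: CRC32 stands in for the hash function Elasticsearch really uses.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(10)]
with_3 = {d: pick_shard(d, 3) for d in doc_ids}
with_4 = {d: pick_shard(d, 4) for d in doc_ids}

# Every document lands on a valid shard, and the mapping is deterministic.
assert all(0 <= s < 3 for s in with_3.values())
assert with_3 == {d: pick_shard(d, 3) for d in doc_ids}

# Changing the shard count changes the modulus, so existing documents
# would no longer be found on the shard they were written to.
moved = [d for d in doc_ids if with_3[d] != with_4[d]]
print(f"{len(moved)} of {len(doc_ids)} documents would map to a different shard")
```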

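9.Shard settings

Shard settings in production require capacity planning up front

① Too few shards

* Nodes cannot be added later to scale horizontally

* Each shard holds too much data, so relocating data takes a long time

② Too many shards

* Since 7.0 the default number of primary shards is 1, which addresses the over-sharding problem

* Affects the relevance scoring of search results and the accuracy of statistical results

* Too many shards on a single node waste resources and also hurt performance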

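10.Checking cluster health

http://localhost:9200/_cluster/health

《1.》Green - all primary and replica shards are allocated

《2.》Yellow - all primary shards are allocated, but some replica shards are not

《3.》Red - some primary shards are not allocated

e.g. a new index is created while the server's disk usage is above 85%.

《4.》Related requests:

//Check cluster health
GET _cluster/health

//Show node information
GET _cat/nodes

//Show shard information
GET _cat/shards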

CAT Nodes API : https://www.elastic.co/guide/en/elasticsearch/reference/7.1/cat-nodes.html

Cluster APIS : https://www.elastic.co/guide/en/elasticsearch/reference/7.1/cluster.html

CAT Shards API : https://www.elastic.co/guide/en/elasticsearch/reference/7.1/cat-shards.html
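The Green / Yellow / Red rules above amount to a small decision function. The following sketch assumes each shard copy is described by a hypothetical (is_primary, is_assigned) pair; it models the rule only and is not how the health API is implemented.

```python
def cluster_health(shards):
    """Derive a green/yellow/red status from (is_primary, is_assigned) pairs."""
    if any(is_primary and not assigned for is_primary, assigned in shards):
        return "red"      # at least one primary shard is unassigned
    if any(not assigned for _, assigned in shards):
        return "yellow"   # all primaries assigned, some replica is not
    return "green"        # every primary and replica is assigned

# Hypothetical single-node cluster with one replica configured:
# a replica is never placed on the same node as its primary, so it stays unassigned.
single_node = [(True, True), (False, False)]
print(cluster_health(single_node))  # yellow
```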

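III. Basic Document Operations

1.CRUD

2.Creating a document

《1.》Document IDs can be generated automatically or specified explicitly

《2.》Calling "POST users/_doc" makes the system generate the document ID automatically

《3.》When creating with HTTP PUT users/_create/1, _create is specified explicitly in the URI; if a document with that ID already exists, the operation fails

《4.》Example: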

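3.Getting a document

《1.》If the document is found, HTTP 200 is returned

Document metadata:

① _index/_type/

② Version information: for documents with the same ID, the version number keeps increasing even after the document has been deleted

③ By default _source contains all of the document's original content

《2.》If the document is not found, HTTP 404 is returned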

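4.Indexing a document

《1.》How Index differs from Create: if the document does not exist, the new document is indexed.

Otherwise the existing document is deleted and the new document is indexed, with the version incremented by 1

5.Updating a document

《1.》Update does not delete the original document; it applies a genuine partial update to the data

《2.》Note that the POST method is used, and the content to update goes inside doc.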
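The difference between Create, Index, and Update is easier to see side by side. The class below is a toy in-memory model written for this comparison; ToyIndex and its methods are hypothetical, the status codes mirror the HTTP codes mentioned above, and the version bookkeeping is deliberately simplified.

```python
class ToyIndex:
    """In-memory stand-in for one index; not how Elasticsearch stores data."""

    def __init__(self):
        self.docs = {}      # doc_id -> source dict
        self.versions = {}  # doc_id -> last version (kept even after delete)

    def _bump(self, doc_id):
        self.versions[doc_id] = self.versions.get(doc_id, 0) + 1
        return self.versions[doc_id]

    def create(self, doc_id, source):
        if doc_id in self.docs:                   # Create refuses to overwrite
            return {"status": 409, "error": "version_conflict"}
        self.docs[doc_id] = dict(source)
        return {"status": 201, "_version": self._bump(doc_id)}

    def index(self, doc_id, source):
        existed = doc_id in self.docs
        self.docs[doc_id] = dict(source)          # whole document replaced
        return {"status": 200 if existed else 201, "_version": self._bump(doc_id)}

    def update(self, doc_id, doc):
        if doc_id not in self.docs:
            return {"status": 404}
        self.docs[doc_id].update(doc)             # partial update: fields merged
        return {"status": 200, "_version": self._bump(doc_id)}

    def get(self, doc_id):
        if doc_id not in self.docs:
            return {"status": 404, "found": False}
        return {"status": 200, "found": True,
                "_version": self.versions[doc_id], "_source": self.docs[doc_id]}

    def delete(self, doc_id):
        if self.docs.pop(doc_id, None) is None:
            return {"status": 404}
        return {"status": 200, "_version": self._bump(doc_id)}


users = ToyIndex()
print(users.create("1", {"user": "mike"}))   # status 201, _version 1
print(users.create("1", {"user": "jack"}))   # status 409, the ID already exists
print(users.index("1", {"user": "jack"}))    # status 200, _version 2
print(users.update("1", {"age": 30}))        # status 200, _version 3
print(users.get("1")["_source"])             # {'user': 'jack', 'age': 30}
```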

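6.Bulk API

《1.》Supports operating on different indices in a single API call

《2.》Supports four types of operations

Index Create Update Delete

《3.》The failure of a single operation does not affect the others

《4.》The response contains the result of every individual operation

eg.:https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docs-bulk.html

《5.》Batch reads - mget

Reading in batches reduces the overhead of network connections and improves performance.

《6.》Batch queries - msearch

//msearch request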
POST kibana_sample_data_ecommerce/_msearch
{}
{"query" : {"match_all" : {}},"size" : 1}
{"index" : "kibana_sample_data_flights"}
{"query" : {"match_all" : {}},"size" : 2}
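Both _bulk and _msearch take newline-delimited JSON bodies (one JSON object per line, with a trailing newline) rather than a single JSON object. The sketch below builds such bodies in Python. The helper to_ndjson, the index names test and test2, and the field values are placeholders chosen for illustration; the msearch lines mirror the request above.

```python
import json

def to_ndjson(lines):
    """Serialize dicts as newline-delimited JSON: one object per line, ending with a newline."""
    return "".join(json.dumps(line) + "\n" for line in lines)

# _bulk: an action line, followed by a source line where the action needs one.
bulk_body = to_ndjson([
    {"index": {"_index": "test", "_id": "1"}},
    {"field1": "value1"},
    {"delete": {"_index": "test", "_id": "2"}},   # delete has no source line
    {"create": {"_index": "test2", "_id": "3"}},
    {"field1": "value3"},
    {"update": {"_index": "test", "_id": "1"}},
    {"doc": {"field2": "value2"}},
])

# _msearch: a header line followed by a query line, repeated per search.
msearch_body = to_ndjson([
    {},
    {"query": {"match_all": {}}, "size": 1},
    {"index": "kibana_sample_data_flights"},
    {"query": {"match_all": {}}, "size": 2},
])
print(bulk_body)
```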

7.Common error responses

Document API : https://www.elastic.co/guide/en/elasticsearch/reference/7.1/docs.html

IV. Introduction to the Inverted Index

Understanding the inverted index:

https://my.oschina.net/hanchao/blog/3053367

https://zh.wikipedia.org/wiki/%E5%80%92%E6%8E%92%E7%B4%A2%E5%BC%95

https://www.elastic.co/guide/cn/elasticsearch/guide/current/inverted-index.html

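1.Forward index and inverted index

2.Core components of an inverted index

《1.》Term Dictionary

Records every term across all documents, and maps each term to its posting list.

The term dictionary is usually large; it can be implemented with a B+ tree or hashing with chaining to keep inserts and lookups fast.

《2.》Posting List

-Records the set of documents that contain the term; it is made up of postings

Posting

① Document ID

② Term frequency (TF) - how many times the term occurs in the document, used for relevance scoring

③ Position - the position of the term among the document's tokens, used for phrase queries

④ Offset - the start and end character offsets of the term, used for highlighting

《3.》Example - Elasticsearch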
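As a stand-in illustration, the sketch below builds a tiny inverted index in Python with the four pieces of posting information listed above. The three sample titles, the function name, and the simple word tokenization are assumptions made for this example; the docs dict itself plays the role of the forward index (document ID to content).

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: {"tf", "positions", "offsets"}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase word tokens; a real analyzer may do more.
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(
                doc_id, {"tf": 0, "positions": [], "offsets": []})
            posting["tf"] += 1
            posting["positions"].append(position)
            posting["offsets"].append((match.start(), match.end()))
    return dict(index)

# Forward index: document ID -> content (made-up sample titles).
docs = {
    1: "Mastering Elasticsearch",
    2: "Elasticsearch Server",
    3: "Elasticsearch Essentials",
}
index = build_inverted_index(docs)
print(sorted(index["elasticsearch"]))  # [1, 2, 3]
print(index["elasticsearch"][1])       # {'tf': 1, 'positions': [1], 'offsets': [(10, 23)]}
```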

3.Notes on the inverted index

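V. Tokenizing with Analyzers

1.Analysis and Analyzer

《1.》Analysis - text analysis is the process of converting full text into a series of terms (term | token), also called tokenization

《2.》Analysis is carried out by an Analyzer

You can use Elasticsearch's built-in analyzers or build custom ones as needed

《3.》Besides converting terms when data is written, the same analyzer must also be applied to the query string when matching a query

eg:

2.Components of an Analyzer

3.Elasticsearch's built-in analyzers

4.Using the _analyze API

《1.》Test by specifying an Analyzer directly

//Test with an explicitly specified analyzer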
GET /_analyze
{
  "analyzer": "standard",
  "text" : "Masterting Elasticsearch, elasticsearch in Action"
}

Result:

{
  "tokens" : [
    {
      "token" : "masterting",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 11,
      "end_offset" : 24,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 26,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 40,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "action",
      "start_offset" : 43,
      "end_offset" : 49,
      "type" : "<ALPHANUM>",
      "position" : 4
    }
  ]
}

《2.》Test against a field of an index

//Test using the analyzer of a field in an index
POST users/_analyze
{
  "field": "message",
  "text" : "Mastering Elasticsearch"
}

Result:

{
  "tokens" : [
    {
      "token" : "mastering",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "elasticsearch",
      "start_offset" : 10,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

《3.》Test a custom analyzer

//Test with a custom combination of tokenizer and token filter
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text" : "Mastering Elastcisearch HANCHAO"
}

Result:

{
  "tokens" : [
    {
      "token" : "mastering",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "elastcisearch",
      "start_offset" : 10,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "hanchao",
      "start_offset" : 24,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
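The custom request above chains a tokenizer with a token filter. An analyzer in general runs character filters, then one tokenizer, then token filters, in that order. The Python sketch below imitates that pipeline; every function name here is hypothetical, and the regex tokenizer is only a rough stand-in for the standard tokenizer.

```python
import re

def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)

def word_tokenizer(text):
    """Tokenizer: split into runs of word characters (a rough stand-in for `standard`)."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def analyze(text, char_filters=(), tokenizer=word_tokenizer, token_filters=()):
    """Run character filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("Mastering Elastcisearch HANCHAO", token_filters=[lowercase]))
# ['mastering', 'elastcisearch', 'hanchao']
```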

5.Standard Analyzer

Example:

#standard
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

6.Simple Analyzer

Splits on non-letter characters (symbols are dropped) and lowercases the tokens.

#simple analyzer
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
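The behavior seen in this result (digits and punctuation dropped, everything lowercased) can be approximated in a few lines of Python. This is a sketch of the rule, not the analyzer's implementation, and it assumes ASCII letters only.

```python
import re

def simple_analyze(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    # Assumption: ASCII letters only, which is enough for this sentence.
    return [m.group().lower() for m in re.finditer(r"[A-Za-z]+", text)]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(simple_analyze(text))
# ['running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy', 'dogs', 'in', 'the', 'summer', 'evening']
```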

7.Whitespace Analyzer

#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown-foxes",
      "start_offset" : 16,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening.",
      "start_offset" : 62,
      "end_offset" : 70,
      "type" : "word",
      "position" : 11
    }
  ]
}
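Note that this output keeps "Quick" capitalized, "brown-foxes" in one piece, and the trailing period on "evening.". A rough Python equivalent, which also reproduces the character offsets and positions, might look like the sketch below (the function name is made up).

```python
import re

def whitespace_analyze(text):
    """Split on whitespace only, keeping case, punctuation, and character offsets."""
    return [
        {"token": m.group(), "start_offset": m.start(),
         "end_offset": m.end(), "position": i}
        for i, m in enumerate(re.finditer(r"\S+", text))
    ]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
tokens = whitespace_analyze(text)
print(tokens[3])   # {'token': 'brown-foxes', 'start_offset': 16, 'end_offset': 27, 'position': 3}
print(tokens[-1])  # {'token': 'evening.', 'start_offset': 62, 'end_offset': 70, 'position': 11}
```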

8.Stop Analyzer

Lowercases the tokens and filters out stop words (the, a, is)

# stop analyzer
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 11
    }
  ]
}
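In this result "in" and "the" are gone, yet "summer" still has position 10: removed stop words leave a gap in the positions. The sketch below reproduces that behavior. STOP_WORDS is a small illustrative subset rather than the analyzer's full default list, and the letter-only tokenization is an approximation.

```python
import re

# Assumption: a small illustrative subset, not the full default English stop list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def stop_analyze(text):
    """Split on non-letters and lowercase, then drop stop words but keep original positions."""
    tokens = []
    for position, m in enumerate(re.finditer(r"[A-Za-z]+", text)):
        term = m.group().lower()
        if term not in STOP_WORDS:
            tokens.append((term, position))
    return tokens

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(stop_analyze(text)[-3:])  # [('dogs', 7), ('summer', 10), ('evening', 11)]
```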

9.Keyword Analyzer★

10.Pattern Analyzer

#pattern analyzer
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "running",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "foxes",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "lazy",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "dogs",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "in",
      "start_offset" : 48,
      "end_offset" : 50,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "the",
      "start_offset" : 51,
      "end_offset" : 54,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "evening",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "word",
      "position" : 12
    }
  ]
}
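By default the pattern analyzer splits on the regular expression \W+ (runs of non-word characters) and lowercases the tokens, which is why "2" survives here but the hyphen and the period do not. A minimal Python sketch of the same rule, with a made-up function name:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    """Split on a regular expression (non-word characters by default), then lowercase."""
    return [piece.lower() for piece in re.split(pattern, text) if piece]

text = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
print(pattern_analyze(text))
# ['2', 'running', 'quick', 'brown', 'foxes', 'leap', 'over', 'lazy', 'dogs', 'in', 'the', 'summer', 'evening']

# A different separator pattern changes where the text is cut.
print(pattern_analyze("a,b;C", pattern=r"[,;]"))  # ['a', 'b', 'c']
```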

11.Language Analyzer

Analyzers for different languages produce different tokens for the same text.

Example:

#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Result:

{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "run",
      "start_offset" : 2,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "quick",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "fox",
      "start_offset" : 22,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "leap",
      "start_offset" : 28,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 33,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "lazi",
      "start_offset" : 38,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "dog",
      "start_offset" : 43,
      "end_offset" : 47,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "summer",
      "start_offset" : 55,
      "end_offset" : 61,
      "type" : "<ALPHANUM>",
      "position" : 11
    },
    {
      "token" : "even",
      "start_offset" : 62,
      "end_offset" : 69,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}

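12.Chinese word segmentation - ICU Analyzer

《1.》A Chinese sentence has to be split into words (not into individual characters)

《2.》In English, words are naturally separated by spaces

《3.》The same Chinese sentence can be read differently in different contexts

e.g. 这个苹果，不大好吃 ("this apple doesn't taste very good") => 这个苹果，不大，好吃 ("this apple is not big, but it tastes good")

For installation details, see: https://my.oschina.net/hanchao/blog/3070695

Example:

POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

################## Result ####################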
{
  "tokens" : [
    {
      "token" : "他",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "说",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "确",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "实",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "在",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "理",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}

icu_analyzer example:

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

############################### Result ###########################
{
  "tokens" : [
    {
      "token" : "他",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "说的",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "确实",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "在",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "理",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}

############## Result ###################
{
  "tokens" : [
    {
      "token" : "这个",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "苹果",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "不大",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "好吃",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    }
  ]
}
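To get a feel for why dictionary-based segmentation is ambiguous, here is a sketch of forward maximum matching, a classic greedy technique. It is not the algorithm the ICU plugin uses, and the mini-dictionary is hypothetical, containing just enough words for the two sentences in this section.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word, else one character."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Hypothetical mini-dictionary, just large enough for the two sentences used here.
dictionary = {"这个", "苹果", "不大", "好吃", "的确", "确实", "实在", "在理"}

print(forward_max_match("这个苹果不大好吃", dictionary))  # ['这个', '苹果', '不大', '好吃']
print(forward_max_match("他说的确实在理", dictionary))    # ['他', '说', '的确', '实在', '理']
```

The second sentence comes out differently from the icu_analyzer result above (的确 / 实在 rather than 确实 / 在), which shows how the same characters can be grouped into different valid words.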

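《4.》More Chinese analyzers

① IK

Supports custom dictionaries and hot-reloading of the segmentation dictionary

https://github.com/medcl/elasticsearch-analysis-ik

② THULAC

THU Lexical Analyzer for Chinese, a Chinese word segmenter from the Natural Language Processing and Computational Social Science Lab at Tsinghua University

https://github.com/microbun/elasticsearch-thulac-plugin

《5.》References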

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/indices-analyze.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html

Reposted from: https://my.oschina.net/hanchao/blog/3074002