ElasticSearch (1): Introduction and Basic Concepts

Latest recommended article published on 2021-09-04 21:05:34

coffejoy

Article link: https://blog.csdn.net/weixin_42142408/article/details/88987675

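This series mainly follows the official ES documentation. On the one hand, I translate the English documentation into Chinese; on the other, I add my own practice and understanding on top of it.

Looking back on my technical learning over the past two years, it has been scattered and one-sided, with no system to it. So I want to study ES systematically: read the official documentation several times, then dig into the internals. If that works out, I will be able to say with some confidence that I know ES and am familiar with it.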

Introduction

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.

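ES is a highly scalable open-source full-text search and analytics engine. It lets you store, search, and analyze large volumes of data quickly and in near real time. It is typically used as the underlying engine that powers applications with complex search requirements.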

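Features

Fast
- Implements inverted indexes for full-text search using finite state transducers
Scalable
- Scales horizontally
Resilient
- Detects hardware failures and network partitions to keep the cluster safe and available
- Cross-cluster replication
Flexible
- Handles numbers, text, geo data, structured data, and unstructured data.
- Covers application search, log analytics, metrics, and similar use cases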
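
To make the inverted index idea concrete, here is a minimal Python sketch. It is only a toy: Elasticsearch (through Lucene) keeps its term dictionary in finite state transducers and runs a real analysis pipeline, while this example uses a plain dictionary and whitespace tokenization. The names `build_inverted_index` and `search` and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every term of the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Elasticsearch is a search engine",
    2: "Kibana visualizes Elasticsearch data",
    3: "Logstash collects log data",
}
index = build_inverted_index(docs)
print(search(index, "elasticsearch data"))
```

The lookup cost depends on the number of query terms and the size of their posting lists, not on the total number of documents, which is the main reason this structure is fast for full-text search.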

Use cases

Here are a few sample use-cases that Elasticsearch could be used for:

You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.

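Product search for e-commerce, including search and autocomplete suggestions; and collecting log or transaction data for analysis and data mining.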

You run a price alerting platform which allows price-savvy customers to specify a rule like “I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month”. In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.

Price alerting platform.
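
The reverse-search idea can be sketched in plain Python: instead of running one query over many stored documents, the customer rules are stored and each incoming price is matched against them. This is a toy model under my own assumptions, not the Percolator API; `AlertRule`, `match_rules`, and the sample data are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    customer: str
    product: str
    max_price: float  # notify when the price falls below this value


def match_rules(rules, price_event):
    """Return the customers whose stored rule matches an incoming price event."""
    return [
        rule.customer
        for rule in rules
        if rule.product == price_event["product"]
        and price_event["price"] < rule.max_price
    ]


# Stored "queries": one rule per customer.
rules = [
    AlertRule("alice", "headphones", 100.0),
    AlertRule("bob", "headphones", 80.0),
    AlertRule("carol", "camera", 500.0),
]

# Incoming "document": a scraped vendor price.
event = {"product": "headphones", "vendor": "shop-a", "price": 89.99}
print(match_rules(rules, event))
```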

You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.

Business intelligence analytics.

Basic concepts

Near Realtime (NRT)

Elasticsearch is a near-realtime search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.

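ES is a near-real-time search platform. There is only a slight latency (normally one second) between indexing a document and that document becoming searchable.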
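
The delay between indexing and searchability can be pictured with a small toy model: writes go into a buffer and only become visible to searches after a refresh. This is a simplification I am assuming for illustration, not how Elasticsearch is implemented; `ToyIndex` is a hypothetical class. In Elasticsearch itself the refresh happens periodically and, as far as I know, the interval is controlled by the `index.refresh_interval` setting, which defaults to one second.

```python
class ToyIndex:
    """Toy model of near-real-time visibility: writes are buffered until a refresh."""

    def __init__(self):
        self._buffer = []      # indexed but not yet searchable
        self._searchable = []  # visible to searches

    def index(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        """Make everything indexed so far visible to searches."""
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, field, value):
        return [d for d in self._searchable if d.get(field) == value]


idx = ToyIndex()
idx.index({"user": "kimchy", "message": "hello"})
before = idx.search("user", "kimchy")  # no refresh has happened yet
idx.refresh()                          # a real cluster does this periodically
after = idx.search("user", "kimchy")
print(len(before), len(after))
```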

Cluster

A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is “elasticsearch”. This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

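A cluster consists of one or more ES nodes and is identified by a unique name. The name matters because a node becomes part of a cluster based on the cluster name it is configured with.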

Node

Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup.

Nodes are distinguished by name; the default is a random UUID assigned at startup.

A node can be configured to join a specific cluster by the cluster name.

A node joins a cluster by being configured with that cluster's name.

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

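An index is a collection of documents with similar characteristics. It is identified by a name (which must be all lowercase), and that name is used when indexing, searching, updating, and deleting the documents in it.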
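
As a rough illustration of the naming rule, the sketch below checks a candidate index name before it is sent to the cluster. Only the lowercase requirement comes from the text above; the other restrictions (forbidden characters, leading `-`, `_`, `+`, the 255-byte limit) reflect my reading of the create-index documentation and should be verified for your version. `is_valid_index_name` is a hypothetical helper, not part of any Elasticsearch client.

```python
# Characters assumed to be disallowed in index names (verify for your version).
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')


def is_valid_index_name(name):
    """Simplified check of an index name; not the full server-side validation."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():          # must be all lowercase
        return False
    if name[0] in "-_+":              # assumed: must not start with these
        return False
    if any(ch in FORBIDDEN_CHARS for ch in name):
        return False
    return len(name.encode("utf-8")) <= 255


for candidate in ["customer", "Customer", "product-catalog", "_orders"]:
    print(candidate, is_valid_index_name(candidate))
```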

In a single cluster, you can define as many indexes as you want.

Within a single cluster, you can define any number of indexes.

Type

A type used to be a logical category/partition of your index to allow you to store different types of documents in the same index, e.g. one type for users, another type for blog posts. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types for more.

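A type is a logical category of an index. Within one index, you used to be able to create several document collections of different types. For example, in an index of articles, one type could hold technical articles and another management articles.

The whole concept of types will be removed in a later version. (At least it has not been removed yet, as far as I can tell.)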

Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation) which is a ubiquitous internet data interchange format.

A document is the basic unit of information that can be indexed.
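
For example, a customer document might look like the following. The field names and values are invented for illustration; the point is only that a document is a JSON object, which Python can produce from an ordinary dictionary.

```python
import json

# A hypothetical customer document; all fields are made up for this example.
customer = {
    "name": "John Doe",
    "age": 34,
    "tags": ["vip", "newsletter"],
    "address": {"city": "Berlin", "zip": "10115"},
}

body = json.dumps(customer, ensure_ascii=False)  # the JSON text that would be sent
print(body)

restored = json.loads(body)  # parsing the JSON gives back the same structure
```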

Within an index/type, you can store as many documents as you want.

Within an index/type, you can store any number of documents.

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

An index may hold more data than the hardware of a single node can store.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent “index” that can be hosted on any node in the cluster.

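ES solves this by splitting an index into multiple pieces called shards. When you create an index, you can define the number of shards. Each shard is itself a fully functional, independent index and can be hosted on any node in the cluster.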

Sharding is important for two primary reasons:
It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

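Sharding is important for two main reasons:

It lets you scale your data volume horizontally
It lets you distribute and parallelize operations, which improves throughput and performance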
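
How documents end up spread across shards can be sketched as follows. The official documentation describes routing roughly as `shard = hash(routing) % number_of_primary_shards`, where the routing value defaults to the document id. In this sketch `zlib.crc32` is only a stand-in for the hash Elasticsearch actually uses (Murmur3, to my knowledge), and `pick_shard` is a hypothetical helper.

```python
import zlib
from collections import Counter


def pick_shard(routing_key, num_primary_shards):
    """Map a routing key (by default the document id) to a shard number."""
    # zlib.crc32 is a stand-in hash for illustration, not what Elasticsearch uses.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


num_shards = 5
doc_ids = [f"doc-{i}" for i in range(1000)]
distribution = Counter(pick_shard(doc_id, num_shards) for doc_id in doc_ids)

for shard in sorted(distribution):
    print(f"shard {shard}: {distribution[shard]} documents")
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing ids to different shards, which is one reason that number is chosen when the index is created.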

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

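In a distributed network, failures can happen at any time. A node or shard may suddenly go offline or disappear for some reason, so a failover mechanism is very important.

ES lets you create one or more copies of each shard; these copies are called replica shards (replicas for short).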
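
As a small sketch of how shards and replicas combine, the code below builds a settings body with `number_of_shards` and `number_of_replicas` and counts the resulting shard copies. The two setting names are real index settings; the helper functions `index_settings` and `total_shard_copies` and the chosen values are assumptions for illustration, and no request is actually sent to a cluster.

```python
import json


def index_settings(number_of_shards, number_of_replicas):
    """Build the settings body for an index-creation request."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def total_shard_copies(number_of_shards, number_of_replicas):
    """Primary shards plus all of their replica copies."""
    return number_of_shards * (1 + number_of_replicas)


settings = index_settings(number_of_shards=5, number_of_replicas=1)
print(json.dumps(settings, indent=2))
print(total_shard_copies(5, 1))
```

With 5 primary shards and 1 replica each, the cluster holds 10 shard copies in total.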

Replicas are important as well, again for two main reasons.