What is Elasticsearch?
* One or more independent process nodes on a network
* Externally, it exposes a search service (over HTTP or the transport protocol)
* Internally, it is a search database
- An open-source search engine built on Apache Lucene;
- Written in Java, with a simple, easy-to-use RESTful API;
- Scales out easily and can handle PB-scale structured or unstructured data;
Application scenarios
- Massive-data analytics engine
- Site search engine
- Data warehouse
Some well-known users:
- The Guardian: real-time analysis of readers' responses to articles
- Wikipedia, GitHub: real-time site search
- Baidu: real-time log monitoring platform
Installation
For Windows, see https://blog.csdn.net/yx1214442120/article/details/55102298
Basic concepts
- Index: a collection of documents sharing similar properties
- Type: an index can define one or more types; every document must belong to a type
- Document: the basic unit of data that can be indexed
- Shard: each index is split into several shards, and each shard is itself a Lucene index
- Replica: copying a shard produces a replica (backup) of that shard
"Index" is used in two senses:
1. As a noun: the database/table-like definition used for search
2. As a verb: the act of indexing performed when a document is stored
Analysis (tokenization):
1. Search treats the term (word) as the basic unit of matching
2. Terms are produced by an analyzer
3. The terms are used to build the inverted index
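The three steps above can be sketched in Python. This is a toy model (the whitespace "analyzer", the sample documents, and the field contents are all made up for illustration; a real analyzer such as ik for Chinese is far more sophisticated):

```python
from collections import defaultdict

def analyze(text):
    # Toy analyzer: lowercase and split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    # Map each term to the sorted list of document IDs containing it
    # (a simplified posting list, without positions or frequencies).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "open source search engine", 2: "search engine built on Lucene"}
index = build_inverted_index(docs)
# index["search"] -> [1, 2]; index["lucene"] -> [2]
```

Searching then becomes a lookup in `index` rather than a scan over every document, which is the whole point of the inverted index.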
Search process
The essence and principle of search
For example, if a query such as "Chinese" hits two different terms, how do we decide which match ranks higher?
TF-IDF scoring
- TF (term frequency): how many times this document contains the term; the more occurrences, the more relevant
- DF (document frequency): the total number of documents containing the term
- IDF (inverse document frequency): the inverse of DF, typically log(N / DF), where N is the total number of documents
Suppose the query is analyzed into terms and the term "谷歌" occurs in 5 documents; its posting list records that it appears once in document ID 1, at position 1.
The score is then proportional to TF × IDF: with DF = 5, a document containing the term once scores 1/5, another containing it once also scores 1/5, and one containing it twice scores 2/5.
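A minimal sketch of the scoring idea above, using the classic idf = log(N / DF) form (real Lucene scoring adds smoothing and document-length normalization, so treat this as the idea, not the implementation):

```python
import math

def tf_idf_score(tf, df, n_docs):
    # tf: occurrences of the term in this document
    # df: number of documents containing the term
    # n_docs: total number of documents in the index
    idf = math.log(n_docs / df)
    return tf * idf

# The example from the text: a term found in 5 of (say) 100 documents.
# A document containing it twice scores twice as high as one containing it once.
once = tf_idf_score(1, 5, 100)
twice = tf_idf_score(2, 5, 100)
```

The key intuition: rare terms (small DF) get a large IDF and dominate the score, while terms appearing in almost every document contribute almost nothing.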
Principles of the distributed index
- number_of_shards: the number of primary shards in the index, which serve write requests (and can serve reads too)
- number_of_replicas: the number of replica shards per primary, which serve read requests
- The distributed index spreads user requests evenly across nodes according to the shard configuration
- A master is elected from the master-eligible nodes via a quorum-based election (similar in spirit to Paxos); cluster-state changes go through the master, and every write for a document must go through that document's primary shard
- Read requests need not touch the master: they can be served directly by replica shards, and if the receiving node holds no relevant shard, the request is routed to a node that does.
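How a document is routed to a shard can be sketched as follows. Elasticsearch actually computes murmur3(_routing) % number_of_shards, where _routing defaults to the document _id; crc32 stands in for murmur3 here, so this is an illustration of the scheme, not the exact hash:

```python
import zlib

def pick_shard(routing_value, number_of_shards):
    # ES: shard = hash(_routing) % number_of_shards.
    # This is why number_of_shards is fixed at index creation time:
    # changing it would re-map every existing document to a different shard.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards

# The same document ID always lands on the same shard:
shard = pick_shard("doc-1", 3)
```

Any node can run this computation, which is why a read can start at any node and still be forwarded to a node that actually holds the target shard.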
Installing ES
Go to the official site: www.elastic.co
Download both packages (Elasticsearch and Kibana)
Run the .bat file under the Elasticsearch bin directory to start Elasticsearch
Likewise, run Kibana's .bat file to start Kibana
Usage
Once both applications are running, open localhost:5601
From there you can create an index
But we find the cluster health is yellow
That is because a simple create uses the default primary and replica shard counts, and on a single node the replica cannot be allocated on the same node as its primary
If we instead create the index with
PUT /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
then the health turns green, but with no replicas the index loses redundancy and scalability.
Distributed principles
1. Sharding (default primary shard count: 1 in recent versions)
2. Primary and replica shards
3. Routing
The cluster must know where each primary and replica shard lives.
When a node is added to the cluster,
shards are rebalanced across the nodes,
and a new master is elected if the current master is lost.
Building a cluster
Make two more copies of the Elasticsearch directory.
Note that an index was already created during the earlier exercise, so first delete the node folders under each copy's data directory;
then edit each copy's .yml configuration:
node-1
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: dianping-app
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 9200
transport.tcp.port: 9300
http.cors.enabled: true
http.cors.allow-origin: "*"
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301","127.0.0.1:9302"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["127.0.0.1:9300", "127.0.0.1:9301","127.0.0.1:9302"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
node-2
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: dianping-app
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: node-2
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 127.0.0.1
#
# Set a custom port for HTTP:
#
http.port: 9201
transport.tcp.port: 9301
http.cors.enabled: true
http.cors.allow-origin: "*"
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["127.0.0.1:9300", "127.0.0.1:9301","127.0.0.1:9302"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["127.0.0.1:9300", "127.0.0.1:9301","127.0.0.1:9302"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
node-3 is configured the same way, with its own node.name and ports.
Once the cluster is built, create the index in Kibana again; the health now shows green.
Basic syntax
Creating an index
1. Structured index
2. Unstructured index
DELETE employee
PUT /employee
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
PUT /employee/_doc/1
{
  "name": "xintu",
  "age": 30
}
To update, simply change the fields in the request body and run it again (note that a plain PUT replaces the whole document).
3. Forced create
A forced create (PUT /employee/_create/1) fails if the document already exists.
Retrieving documents
1. Query all documents
GET /employee/_search
2. Query all documents with an explicit (empty) condition
GET /employee/_search
{
  "query": {
    "match_all": {}
  }
}
3. Paged query
GET /employee/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0,
  "size": 1
}
4. Keyword (match) query
// query with a keyword condition
GET /employee/_search
{
  "query": {
    "match": { "name": "兄弟" }
  }
}
This finds the matching document.
Replacing 兄弟 with 兄, 兄长, and so on still finds it, because match analyzes the query into terms and matches on any of them.
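Why "兄" also matches: with the standard analyzer, Chinese text is broken into single-character terms, so "兄弟" is indexed as the terms "兄" and "弟", and a match query succeeds if any analyzed query term is among them. A toy sketch of that behavior (the single-character split is an assumption about the default analyzer, not a general rule for all analyzers):

```python
def analyze(text):
    # Toy stand-in for the standard analyzer's handling of Chinese:
    # split the text into single-character terms.
    return list(text)

def match(indexed_text, query):
    # A match query succeeds if any analyzed query term
    # appears among the analyzed terms of the field.
    doc_terms = set(analyze(indexed_text))
    return any(term in doc_terms for term in analyze(query))

# "兄弟" is indexed; the queries "兄弟", "兄", and "兄长" all match it.
```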
5. Query with sorting
// with sorting
GET /employee/_search
{
  "query": {
    "match": { "name": "兄" }
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
} // age in descending order
6. Query with a filter
GET /employee/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "age": "30" } }
      ]
    }
  }
} // find documents whose age is 30
A filter does not compute relevance scores (which also makes its results cacheable).
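The filter semantics can be sketched as a plain predicate: a document either passes or it does not, and no score is computed. The sample documents below are made up:

```python
def term_filter(docs, field, value):
    # A term filter is an exact yes/no test on a field value;
    # unlike a match query, no relevance score is computed,
    # which is what makes filter results cheap to cache.
    return [d for d in docs if d.get(field) == value]

docs = [{"name": "xintu", "age": 30}, {"name": "kaijie", "age": 25}]
thirty = term_filter(docs, "age", 30)  # -> [{"name": "xintu", "age": 30}]
```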
7. Query with an aggregation
// with an aggregation
GET /employee/_search
{
  "query": {
    "match": { "name": "兄" }
  },
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "group_by_age": {
      "terms": {
        "field": "age"
      }
    }
  }
}
This query counts how many matching documents fall into each age bucket:
"aggregations" : {
  "group_by_age" : {
    "doc_count_error_upper_bound" : 0,
    "sum_other_doc_count" : 0,
    "buckets" : [
      {
        "key" : 30,
        "doc_count" : 2
      }
    ]
  }
}
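What the terms aggregation computes can be sketched over the matching documents (the documents here are made up; a real terms agg also bounds the number of buckets and may report count errors across shards, as the doc_count_error_upper_bound field hints):

```python
from collections import Counter

def terms_agg(docs, field):
    # Bucket the matching documents by the field value and count
    # the documents per bucket, like the "group_by_age" agg above.
    counts = Counter(d[field] for d in docs)
    return [{"key": key, "doc_count": n} for key, n in counts.most_common()]

docs = [{"age": 30}, {"age": 30}]
buckets = terms_agg(docs, "age")  # -> [{"key": 30, "doc_count": 2}]
```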
Modifying documents
1. Partial update of specified fields
// update only the specified fields
POST /employee/_update/1
{
  "doc": {
    "name": "凯杰4"
  }
}
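Unlike a plain PUT, which replaces the whole document, the _update endpoint merges the supplied doc into the existing source. The merge semantics for flat fields can be sketched as (the sample documents are made up):

```python
def partial_update(source, doc):
    # Fields in `doc` overwrite the matching fields in `source`;
    # all other fields are kept (a plain PUT would drop them).
    merged = dict(source)
    merged.update(doc)
    return merged

source = {"name": "xintu", "age": 30}
updated = partial_update(source, {"name": "凯杰4"})
# -> {"name": "凯杰4", "age": 30}; the age field survives the update
```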