[ElasticSearch] ElasticSearch in Practice

Latest recommended article published on 2024-07-12 15:43:18

chen.yukang

Tags: elasticsearch, big data, search engine

Article link: https://blog.csdn.net/m0_64944491/article/details/139834535

Basic Retrieval

Querying ES Information

1) GET /_cat/nodes: list all nodes

127.0.0.1 44 83 1 0.01 0.01 0.00 dilm * 1b06a843b8e3

* marks the master node

2) GET /_cat/health: check cluster health

1718265331 07:55:31 elasticsearch yellow 1 1 4 4 0 0 1 0 - 80.0%

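green means the cluster is healthy; the yellow in the output above means every primary shard is allocated but at least one replica is not

3) GET /_cat/master: show the master node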

7NZD92ZKTTGcvCiRiYgipw 127.0.0.1 127.0.0.1 1b06a843b8e3

4) GET /_cat/indices: list all indices, comparable to show databases; in MySQL

green  open .kibana_task_manager_1   sDt5UmEmSHqFXBxT7O80KQ 1 0 2 0 21.7kb 21.7kb
green  open .apm-agent-configuration iQ8r6SPhRkm2Cq86D2koWg 1 0 0 0   283b   283b
yellow open index                    ilDKtPGtQDOSagS6tk9QPw 1 1 1 0  3.4kb  3.4kb
green  open .kibana_1                9vfcQSsNSWGunawX2uhkqQ 1 0 8 0 32.7kb 32.7kb

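Indexing a Document

To save a document, specify which index and which type it goes under (roughly, which database and which table), and identify it with a unique id

# Save document 1 under the external type of the customer index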
PUT customer/external/1
{
 "name":"John Doe"
}

#! Deprecation: [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
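{
  "_index" : "customer", // the index (the "database") the document was saved in
  "_type" : "external", // the type the document was saved under
  "_id" : "1", // id of the saved document
  "_version" : 1, // version of the saved document
  "result" : "created", // a new document was created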
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

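PUT vs. POST

POST: if no id is given, one is generated automatically; if an id is given, the document with that id is overwritten and its version is incremented
PUT: an id is required, so PUT is typically used for updates; omitting the id is an error

Retrieving a Document

GET /customer/external/1: fetch document 1 under the external type of the customer index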

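{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0, // concurrency-control field, incremented on every write; used for optimistic locking
  "_primary_term" : 1, // also used for concurrency control; changes when the primary shard is reassigned, e.g. after a restart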
  "found" : true,
  "_source" : {
    "name" : "John Doe"
  }
}
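
The _seq_no and _primary_term values can be sent back on a later write through the if_seq_no and if_primary_term query parameters, so that the write is rejected if the document changed in the meantime. The following is a minimal sketch that only builds the request line; the helper name conditional_put_path is hypothetical, and it assumes the typeless _doc endpoint that the deprecation warning above recommends.

```python
from urllib.parse import urlencode


def conditional_put_path(index: str, doc_id: str, seq_no: int, primary_term: int) -> str:
    """Build the path for a write that should only succeed if the document
    still has the given _seq_no and _primary_term (optimistic locking)."""
    params = urlencode({"if_seq_no": seq_no, "if_primary_term": primary_term})
    return f"/{index}/_doc/{doc_id}?{params}"


# Values taken from the GET response above.
print("PUT " + conditional_put_path("customer", "1", seq_no=0, primary_term=1))
```

If another client has modified the document first, ES answers such a request with a 409 version conflict instead of overwriting the newer data.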

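Updating a Document

POST customer/external/1/_update
{
    "doc":{
        "name":"John Smith"
    }
}
or
POST customer/external/1
{
    "name":"John Smith"
}
or
PUT customer/external/1
{
    "name":"John Smith"
}

With _update, POST wraps the changes in "doc", compares them with the existing document, and does nothing if they are identical. Without _update, POST and PUT take the document itself as the body, always re-save it, and increment the version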

Deleting a Document or an Index

ES provides no operation for deleting a type; only indices and documents can be deleted

DELETE customer/external/1
DELETE customer

{
    "_index": "customer",
    "_type": "external",
    "_id": "1",
    "_version": 14,
    "result": "deleted",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    },
    "_seq_no": 22,
    "_primary_term": 6
}

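Bulk Operations: _bulk

The bulk API executes all actions in order. If a single action fails for any reason, it continues processing the actions that follow. When the bulk API returns, it reports the status of every action, in the order they were sent, so you can check whether a particular action failed

# Bulk request with no index in the URL; each action names its own index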
POST /_bulk
{"delete":{"_index":"website","_type":"blog","_id":"123"}}
{"create":{"_index":"website","_type":"blog","_id":"123"}}
{"title":"my first blog post"}
{"index":{"_index":"website","_type":"blog"}}
{"title":"my second blog post"}
{"update":{"_index":"website","_type":"blog","_id":"123"}}
{"doc":{"title":"my updated blog post"}}
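
The request body above is newline-delimited JSON: one action line, optionally followed by one source line. A small sketch of how such a body could be assembled programmatically; the helper build_bulk_body is hypothetical, and the _type entries are left out on the assumption that the typeless form is used.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples into the NDJSON body
    expected by _bulk: one JSON object per line, ending with a newline.
    `source` is None for actions that carry no body line (delete)."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    ("delete", {"_index": "website", "_id": "123"}, None),
    ("create", {"_index": "website", "_id": "123"}, {"title": "my first blog post"}),
    ("index", {"_index": "website"}, {"title": "my second blog post"}),
    ("update", {"_index": "website", "_id": "123"}, {"doc": {"title": "my updated blog post"}}),
])
print(body, end="")
```

The trailing newline matters: the bulk endpoint requires the last line of the body to end with one.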

{
  "took" : 227,
  "errors" : false,
  "items" : [
    {
      "delete" : {
        "_index" : "website",
        "_type" : "blog",
        "_id" : "123",
        "_version" : 1,
        "result" : "not_found", // 1. no such document
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 404
      }
    },
    {
      "create" : {
        "_index" : "website",
        "_type" : "blog",
        "_id" : "123",
        "_version" : 2,
        "result" : "created", // 2. created successfully
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 1,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "index" : {
        "_index" : "website",
        "_type" : "blog",
        "_id" : "rDm8EJABRl6keg4IGZWd",
        "_version" : 1,
        "result" : "created", // 3. saved successfully
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 2,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "update" : {
        "_index" : "website",
        "_type" : "blog",
        "_id" : "123",
        "_version" : 3,
        "result" : "updated", // 4. updated successfully
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 3,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}
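
Because each action reports its own status, the caller has to walk the items array to find out what happened. A minimal sketch of such a check, using a trimmed copy of the response above; the helper failed_items and the choice of "status 400 or above" as the failure rule are assumptions made for illustration. Note that in the response above errors is false even though the delete reports 404, so this stricter rule flags an item that ES itself does not count as an error.

```python
def failed_items(bulk_response):
    """Return (position, action, status) for every bulk item whose
    HTTP status is 400 or above."""
    failures = []
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a one-key dict: {"<action>": {...result...}}
        action, result = next(iter(item.items()))
        if result["status"] >= 400:
            failures.append((position, action, result["status"]))
    return failures


# Trimmed copy of the response above (only the fields used here).
response = {
    "errors": False,
    "items": [
        {"delete": {"_id": "123", "result": "not_found", "status": 404}},
        {"create": {"_id": "123", "result": "created", "status": 201}},
        {"index": {"_id": "rDm8EJABRl6keg4IGZWd", "result": "created", "status": 201}},
        {"update": {"_id": "123", "result": "updated", "status": 200}},
    ],
}
print(failed_items(response))  # [(0, 'delete', 404)]
```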

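Advanced Retrieval

Searching with _search

_search supports two styles: parameters appended to the URI, or parameters in the request body

# Search with URI parameters
GET bank/_search?q=*&sort=account_number:asc
Notes:
q=* # match everything
sort # the field to sort on
asc # ascending

# Search with parameters in the request body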
GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" },
    { "balance":"desc"}
  ]
}

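Response fields:

took - how many milliseconds the search took
timed_out - whether the search timed out
_shards - how many shards were searched, and how many succeeded or failed
max_score - the highest relevance score among the matching documents
hits.total.value - how many matching documents were found
hits.sort - the sort key of each hit (absent when sorting by score)
hits._score - the relevance score (not applicable when using match_all)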

DSL

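ES provides a JSON-style DSL (domain-specific language) for executing queries, called the Query DSL; this query language is very comprehensive

For a query against a specific field, the structure is:
{
  QUERY_NAME: {   # the query type to use
     FIELD_NAME: {  # its parameters
       ARGUMENT: VALUE,
       ARGUMENT: VALUE,...
      }
   }
}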

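Example (remove the # comments before running it)

GET bank/_search
{
  "query": {  # the query definition
    "match_all": {}
  },
  "from": 0,  # offset of the first document to return
  "size": 5,  # number of documents to return
  "_source":["balance"],
  "sort": [
    {
      "account_number": {  # field to sort the results by
        "order": "desc"  # descending
      }
    }
  ]
}
_source lists the fields to return;

query defines how to search;

match_all is a query type [it matches every document in the index]; ES lets you combine many query types inside query to build complex queries;
from + size bound the result window and implement pagination;
sort orders the results; with multiple sort fields, a later field only orders documents that tie on the earlier fields;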
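
To make the from + size pagination concrete, here is a small sketch that turns a 1-based page number into a request body. The function name page_query is hypothetical; the body reuses the bank example above.

```python
def page_query(page: int, size: int) -> dict:
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be >= 1")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * size,  # offset of the first hit on this page
        "size": size,
        "sort": [{"account_number": {"order": "desc"}}],
    }


print(page_query(page=3, size=5)["from"])  # 10
```

By default ES rejects requests where from + size exceeds 10,000 (the index.max_result_window setting), so this style suits shallow paging rather than walking an entire index.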

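Mapping

A mapping defines how a document and the fields it contains are stored and indexed. Use a mapping to define:

which string fields should be treated as full-text fields;
which fields contain numbers, dates, or geolocations;
whether all fields in the document can be indexed (the _all setting);
the format of dates;
custom rules for mapping dynamically added fields.

View the mapping: GET bank/_mapping

Creating an Index with an Explicit Mapping

ES infers a mapping the first time data is stored; you can also specify the mapping yourself before storing any data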

PUT /my_index
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword" # mapped as keyword (exact value, not analyzed)
      },
      "name": {
        "type": "text" # full-text field: analyzed when stored and matched against analyzed terms when searched
      }
    }
  }
}

Viewing the Mapping

GET /my_index

{
  "my_index" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "email" : {
          "type" : "keyword"
        },
        "name" : {
          "type" : "text"
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1718270515822",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "OfDvFnDvQiq8Jq-3oDHDog",
        "version" : {
          "created" : "7040299"
        },
        "provided_name" : "my_index"
      }
    }
  }
}

Adding a Field Mapping

PUT /my_index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false # the field cannot be searched; it is only stored as redundant data
    }
  }
}

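Mappings Cannot Be Updated

The mapping of a field that already exists cannot be changed. To change it, create a new index with the desired mapping and migrate the data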

Data Migration

First create new_twitters with the correct mapping, then migrate the data as follows

Syntax for 6.0 and later
POST _reindex
{
  "source":{
      "index":"twitter"
   },
  "dest":{
      "index":"new_twitters"
   }
}

Syntax for older versions
POST _reindex
{
  "source":{
      "index":"twitter",
      "type":"twitter"
   },
  "dest":{
      "index":"new_twitters"
   }
}

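Analysis (Tokenization)

The built-in analyzers are designed for English; the default one splits Chinese text into single characters rather than words, so install the ik analyzer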

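Integrating ES with Spring Boot

Choosing a Client

1) Over port 9300 (also the port used for communication between ES cluster nodes), holding a long-lived TCP connection

spring provides spring-data-elasticsearch:transport-api.jar
- transport-api.jar differs between springboot versions and may not match the ES version
- deprecated as of 7.x and due to be removed from 8 onward

2) Over port 9200, sending HTTP requests to ES

JestClient: unofficial and slow to update
RestTemplate: many ES operations have to be wrapped by hand, which is tedious
HttpClient: same as above
Elasticsearch-Rest-Client: the official RestClient; it wraps the ES operations, has a clearly layered API, and is easy to pick up

Final choice: Elasticsearch-Rest-Client (elasticsearch-rest-high-level-client). Reference: Java High Level REST Client | Java REST Client [7.17] | Elastic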

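Product Search with ES

Product Listing Requirements

Only listed products are shown on the site
Listed products can be searched

How Products Are Searched

ES is better suited to full-text search than MySQL: it keeps hot index data cached in memory, and for the huge product catalogs of e-commerce it can be deployed as a cluster with sharded data, supporting both full-text search and complex queries

Search has to support filtering by brand, category, and specification attributes

How to Store SKUs in ES

Analysis: when a product is listed, should ES store SKUs or SPUs?

A search by name needs full-text search over the sku title
A search by specification uses attributes that belong to the spu and are identical for every sku under it
Entering through a category id lists spus directly, and the user can then switch between skus
Storing the full sku information in ES (including the spu attributes) would mean a very large number of fields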

Option 1

{
    "skuId": 1,
    "spuId": 11,
    "skuTitle": "Huawei xx",
    "price": 999,
    "saleCount": 99,
    "attr": [
        {"size": 5},
        {"CPU": "Qualcomm 945"},
        {"resolution": "Full HD"}
    ]
}

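The specification attributes are stored redundantly at the sku level
Drawback: if every sku stores the specification attributes (such as size), the storage is redundant, because all skus of one spu share the same attributes
Assuming 1 million products with an average of 2 KB of specification attributes each, the redundancy costs about 2 GB of extra memory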
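
The 2 GB figure follows directly from the two assumptions in the estimate. A quick check, using decimal units and treating the figure as an upper bound in which every one of the 1 million documents carries its own 2 KB copy:

```python
KB = 1000  # decimal units, matching the rough estimate in the text

products = 1_000_000     # number of products in the estimate above
attrs_per_doc = 2 * KB   # average attribute payload per document, in bytes

redundant_bytes = products * attrs_per_doc
print(redundant_bytes / KB**3, "GB")  # 2.0 GB
```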

Option 2

sku index
{
    "skuId": 1,
    "spuId": 11
}

attr index
{
    "spuId": 11,
    "attr": [
        {"size": 5},
        {"CPU": "Qualcomm 945"},
        {"resolution": "Full HD"}
    ]
}

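Nothing is stored redundantly; the specification attributes are kept once, at the spu level
Drawback: the specification attributes shown on the results page are computed dynamically. How? When a user searches for a keyword, ES finds every product whose title contains it, then aggregates over those products to work out all the attributes and attribute values involved. Under this option, suppose a search for "小米" (Xiaomi) matches 100,000 products belonging to 4,000 spus; a second query then has to fetch the attributes for those 4,000 spus
If spuId is a long (8 bytes), one such request carries 8 B * 4000 = 32,000 B = 32 KB
With 10,000 concurrent searches that is 320 MB of data in transit, which consumes a lot of network bandwidth and can easily congest the network

Option 1 is chosen in the end, trading space for time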
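
The traffic estimate for Option 2 can be reproduced the same way. This sketch only counts the spuId values sent in the second query, as the text does, and ignores protocol overhead and the size of the responses:

```python
KB = 1000
MB = 1000 * KB

spu_ids_per_search = 4000     # spus matched by one search, from the example
bytes_per_id = 8              # a long
concurrent_searches = 10_000

per_request = spu_ids_per_search * bytes_per_id
total = per_request * concurrent_searches
print(per_request / KB, "KB per request")  # 32.0 KB per request
print(total / MB, "MB in flight")          # 320.0 MB in flight
```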

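Creating the Index

{ "type": "keyword" }: keeps the exact value (avoiding precision loss); searchable but not analyzed
"analyzer": "ik_smart": the Chinese analyzer
"index": false: the field cannot be searched; no index is built for it
"doc_values": false: the default is true; when false the field cannot be aggregated, and ES does not maintain the data needed for aggregations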

PUT product
{
    "mappings":{
        "properties": {
            "skuId":{ "type": "long" },
            "spuId":{ "type": "keyword" },  # not analyzed
            "skuTitle": {
                "type": "text",
                "analyzer": "ik_smart"  # Chinese analyzer
            },
            "skuPrice": { "type": "keyword" },  # keyword to avoid precision loss
            "skuImg"  : { "type": "keyword" },  # the video also has false here
            "saleCount":{ "type":"long" },
            "hasStock": { "type": "boolean" }, # only whether it is in stock, not the quantity
            "hotScore": { "type": "long"  }, # popularity score
            "brandId":  { "type": "long" },
            "catalogId": { "type": "long"  },
            "brandName": {"type": "keyword"}, # the video also has false here
            "brandImg":{
                "type": "keyword",
                "index": false,  # not searchable, no index built; only used for page display
                "doc_values": false # cannot be aggregated; the default is true
            },
            "catalogName": {"type": "keyword" }, # the video also has false here
            "attrs": {
                "type": "nested", # nested objects, so the array is not flattened
                "properties": {
                    "attrId": {"type": "long"  },
                    "attrName": {
                        "type": "keyword",
                        "index": false, # not searchable, no index built
                        "doc_values": false
                    },
                    "attrValue": {"type": "keyword" }
                }
            }
        }
    }
}
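
As a rough illustration of how this mapping might be queried, the sketch below builds a search body that matches on skuTitle and filters by stock and by one specification attribute. The function name product_search_body and the sample attribute id 15 are hypothetical; the important detail is that attrs, being mapped as nested, has to be queried through a nested query with its path.

```python
import json


def product_search_body(keyword, attr_id, attr_values, in_stock=True):
    """Full-text match on skuTitle, filtered by stock and by one
    attribute (attrId plus any of the given attrValues)."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"skuTitle": keyword}}],
                "filter": [
                    {"term": {"hasStock": in_stock}},
                    {
                        # attrs is mapped as nested, so both conditions are
                        # evaluated against the same attribute object.
                        "nested": {
                            "path": "attrs",
                            "query": {
                                "bool": {
                                    "must": [
                                        {"term": {"attrs.attrId": attr_id}},
                                        {"terms": {"attrs.attrValue": attr_values}},
                                    ]
                                }
                            },
                        }
                    },
                ],
            }
        }
    }


search_body = product_search_body("Huawei", 15, ["Qualcomm 945"])
print(json.dumps(search_body, indent=2))
```

The stock and attribute conditions sit in filter rather than must because they are yes/no conditions that should not influence the relevance score.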

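nested Objects

An array of objects is flattened: the values of each property are collected together across all the objects

user.name=["aaa","bbb"]
user.addr=["ccc","ddd"]

This layout can produce wrong results such as:
a search wrongly matches {aaa,ddd}, a combination that does not exist

Flattening lets a search match combinations that never existed. The nested type solves this: use it when the array elements are objects (arrays of plain values don't need it)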
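
The false match can be reproduced without ES. The sketch below is a plain-Python simulation, not ES code: it assumes the two original objects were {aaa, ccc} and {bbb, ddd}, and compares matching against the pooled values with matching inside each object.

```python
users = [
    {"name": "aaa", "addr": "ccc"},
    {"name": "bbb", "addr": "ddd"},
]


def flattened_match(objects, name, addr):
    """Mimic the flattened layout: each field's values are pooled, so
    the link between the name and addr of one object is lost."""
    names = {o["name"] for o in objects}
    addrs = {o["addr"] for o in objects}
    return name in names and addr in addrs


def nested_match(objects, name, addr):
    """Mimic nested: both conditions must hold within the same object."""
    return any(o["name"] == name and o["addr"] == addr for o in objects)


print(flattened_match(users, "aaa", "ddd"))  # True  (false positive)
print(nested_match(users, "aaa", "ddd"))     # False
print(nested_match(users, "aaa", "ccc"))     # True
```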

Further reading on nested: ElasticSearch - Nested Objects (nested), CSDN blog

Using aggregations: Using nested-type embedded objects in Elastic search, CSDN blog
