Learning to Integrate Elasticsearch with Spring Boot

Latest recommended article published on 2024-08-11 18:56:00

Faith_777

Tags: search engine, java, elasticsearch

Article link: https://blog.csdn.net/Faith_777/article/details/122075201

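1. What Is a Search Engine?

As the name suggests, a search engine finds the content you want by keyword, much like Baidu.

Search Engine Use Cases

When a database table holds a large amount of data, a query such as select * from table_name where column_name like '%keyword%' is slow and puts heavy load on the database server, because a pattern with a leading wildcard generally cannot use an ordinary index and forces a full scan. A search engine relieves that load and speeds up searching. Common open-source options are Lucene, Solr, and Elasticsearch.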

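Inverted Index

In practice, we can take the original data structure in the database (left figure) and, during off-peak hours, build in advance a separate data area organized as an inverted index (right figure).
When a user issues a query, we first consult the inverted index (right figure) to obtain the document ids, then use those ids to look up the full documents in the original data (left figure) quickly and accurately.
We can implement this process ourselves, or rely on general-purpose open-source technology that already abstracts it.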
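
To make the two-step lookup concrete, here is a minimal sketch of a hand-written inverted index in plain Java. It is only an illustration under simplifying assumptions: the class name InvertedIndexSketch, the sample documents, and the naive "lowercase and split on non-alphanumeric characters" analysis are all made up for this example and are far simpler than what Lucene actually does.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    /** Builds a term -> document ids map from a document id -> text map. */
    public static Map<String, SortedSet<Integer>> build(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Naive analysis (assumption): lowercase, split on anything that is not a letter or digit.
            String[] terms = doc.getValue().toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
                }
            }
        }
        return index;
    }

    /** Step 1 of a query: look up the ids of the documents that contain the term. */
    public static SortedSet<Integer> lookup(Map<String, SortedSet<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySortedSet());
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Java is the best language");
        docs.put(2, "Elasticsearch is built on Lucene");
        docs.put(3, "Lucene is written in Java");

        Map<String, SortedSet<Integer>> index = build(docs);
        System.out.println(lookup(index, "java"));   // [1, 3]
        System.out.println(lookup(index, "lucene")); // [2, 3]

        // Step 2 of a query: fetch the original documents by id.
        for (int id : lookup(index, "java")) {
            System.out.println(id + " -> " + docs.get(id));
        }
    }
}
```

The lookup cost depends on the number of distinct terms rather than on scanning every document, which is why the index is worth building ahead of time.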

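What Is an Index?

An index is a collection of documents with similar characteristics. It is roughly equivalent to a database in SQL, or to a data storage scheme (schema). Website content does not stay the same forever: when content changes or new sites are added, the index has to be maintained.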

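How Elasticsearch Relates to Other Search Engines

Elasticsearch and Solr are both enterprise search products built on Lucene, which provides the underlying API. Elasticsearch has become one of the more popular search engine tools in recent years.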

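2. Elasticsearch Features

Distributed and able to handle PB-scale data, but it also runs on a single machine
Works out of the box
Full-text search, synonym handling, relevance ranking, complex data analysis, and near-real-time processing of massive data sets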

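Elasticsearch Architecture

The table below compares the logical concepts of Elasticsearch with those of a MySQL database. Types apply to the 6.x versions used in this article; mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0.

Elasticsearch	Relational database (MySQL)
index	database
type	table
document	row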

Elasticsearch Environment on Windows

Download Elasticsearch and run the bat file
https://www.elastic.co/cn/downloads/elasticsearch
Visit http://localhost:9200/

Installing Kibana

Download https://www.elastic.co/cn/downloads/kibana
Make sure the version matches your Elasticsearch version. Kibana depends on Elasticsearch and acts as a client for working with it.
Visit http://localhost:5601

Basic Elasticsearch Operations

Create an index: PUT http://127.0.0.1:9200/articleindex2/
Note that index names must be lowercase
Create a document: POST http://127.0.0.1:9200/articleindex/article
Search all documents: GET http://127.0.0.1:9200/articleindex/article/_search
Update a document: POST http://127.0.0.1:9200/articleindex/article/2LitrnABLd3o-sXwAWX2

What Is a Mapping?

When we index a document into a new type:

PUT /myindex/article/1 
{ 
  "post_date": "2018-05-10", 
  "title": "Java", 
  "content": "java is the best language", 
  "author_id": 119
}

Elasticsearch creates a default mapping for this type. To view the mapping: GET /myindex/article/_mapping

{
  "myindex" : {
    "mappings" : {
      "article" : {
        "properties" : {
          "author_id" : {
            "type" : "long"
          },
          "content" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "post_date" : {
            "type" : "date"
          },
          "title" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

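A mapping defines the data type of each field in a type, along with related properties such as how the field is analyzed.

Basic Data Types Supported by ES

String: string (covers text and keyword)

text is used to index long text. Before indexing, the text is analyzed into individual terms, and those terms are indexed so that ES can search on them. By default, text fields cannot be used for sorting or aggregation.
keyword is not analyzed and can be used for filtering, sorting, and aggregation. A keyword field can only be matched by its exact, whole value.
Numeric: long, integer, short, byte, double, float
Date: date
Boolean: boolean
Binary: binary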

Custom Mapping

# Create index lib2 with an explicit mapping for the books type
PUT /lib2
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 0
  },
  "mappings": {
    "books": {
      "properties": {
        "title": {"type": "text"},
        "name": {"type": "text", "index": false},
        "publish_date": {"type": "date", "index": false},
        "price": {"type": "double"},
        "number": {"type": "integer"}
      }
    }
  }
}

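Configuring the Analyzer

Download the IK analyzer from https://github.com/medcl/elasticsearch-analysis-ik/releases?after=v6.8.3
Unzip it into an ik folder under the Elasticsearch plugins directory, then restart Elasticsearch.

First send a request through Postman without the plugin.
After installing the plugin and restarting, send the request again, this time adding "analyzer":"ik_max_word".

ik_max_word: splits the text at the finest granularity. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、中华人民、中华、华人、人民共和国、人民、人、民、共和国、共和、和、国国、国歌」, exhausting every possible combination
ik_smart: splits the text at the coarsest granularity. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、国歌」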

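Some Basic Operations

A term query finds documents whose field contains the given term exactly; a terms query accepts several terms and matches documents containing any of them.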

GET /lib3/user/_search/
{
  "query": {
      "term": {"interests": "changge"}
  }
}

GET /lib3/user/_search
{
    "query":{
        "terms":{
            "interests": ["hejiu","changge"]
        }
    }
}

Controlling the Number of Results Returned

from: the offset of the first document to return
size: the number of documents to return

GET /lib3/user/_search
{
    "from":0,
    "size":2,
    "query":{
        "terms":{
            "interests": ["hejiu","changge"]
        }
    }
}
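
As a rough sketch of how a page number maps onto from and size, the helper below computes the offset for a zero-based page (the same convention as PageRequest.of(page, limit) in Spring Data) and checks it against index.max_result_window, whose default is 10000. The class and method names are hypothetical, and the match_all body is only a placeholder query.

```java
public class PagingSketch {

    /** Offset of the first hit for a zero-based page number. */
    public static long from(int page, int size) {
        if (page < 0 || size <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and size must be > 0");
        }
        return (long) page * size;
    }

    /** Elasticsearch rejects a search when from + size exceeds index.max_result_window. */
    public static boolean withinWindow(long from, int size, int maxResultWindow) {
        return from + size <= maxResultWindow;
    }

    /** Builds a minimal search body for the given page (placeholder match_all query). */
    public static String searchBody(int page, int size) {
        return "{\"from\":" + from(page, size) + ",\"size\":" + size
                + ",\"query\":{\"match_all\":{}}}";
    }

    public static void main(String[] args) {
        System.out.println(searchBody(2, 10));                  // {"from":20,"size":10,"query":{"match_all":{}}}
        System.out.println(withinWindow(6450, 10, 10_000));     // true
        System.out.println(withinWindow(9_995, 10, 10_000));    // false
    }
}
```

For result sets deeper than the window, from and size stop being usable and a different mechanism such as scroll or search_after is needed.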

Returning the Version Number

GET /lib3/user/_search
{
    "version":true,
    "query":{
        "terms":{
            "interests": ["hejiu","changge"]
        }
    }
}

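match Query
A match query first analyzes the input, then searches for the resulting terms. A document is returned as long as it contains any part of the match condition.
A term query does not analyze the search text. If the condition is hello world, only documents whose field stores the exact term "hello world" are returned. If the field was analyzed when it was stored, the original text "I say hello world" is stored as separate terms, the whole term "hello world" does not exist, and nothing is returned.
match example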

GET /articleindex/article/_search
{
    "query":{
        "match":{
            "content": "理解 搜狗"
        }
    }
}
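
The difference between term and match can be simulated without a running cluster. The sketch below is not Elasticsearch code: analyze is a stand-in analyzer that just lowercases and splits on whitespace, and the class and method names are invented for illustration. It only models the rule that term compares the raw query text against indexed terms, while match analyzes the query first and accepts any matching term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TermVsMatchSketch {

    /** Stand-in analyzer (assumption): lowercase, then split on whitespace. */
    public static List<String> analyze(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.<String>emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /** term: the query text is NOT analyzed and must equal one indexed term exactly. */
    public static boolean termMatches(List<String> indexedTerms, String queryText) {
        return indexedTerms.contains(queryText);
    }

    /** match: the query text is analyzed, and any one resulting term is enough. */
    public static boolean matchMatches(List<String> indexedTerms, String queryText) {
        for (String term : analyze(queryText)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A text field is analyzed at index time.
        List<String> textField = analyze("I say hello world");
        // A keyword field stores the whole value as a single term.
        List<String> keywordField = Collections.singletonList("hello world");

        System.out.println(textField);                                // [i, say, hello, world]
        System.out.println(termMatches(textField, "hello world"));    // false
        System.out.println(matchMatches(textField, "hello world"));   // true
        System.out.println(termMatches(keywordField, "hello world")); // true
        System.out.println(termMatches(textField, "hello"));          // true
    }
}
```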

Sorting

GET /goodswares/type/_search
{
    "from": 6450,
    "size":10,
    "query":{
        "term":{
            "act_id": "158392028905316498"
        }
    },
    "sort": [
        {
           "modi_date": {
               "order":"desc"
           }
        }
    ]
}

bool Query

GET /goodswares/type/_search
{
      "post_filter": {
        "bool": {
          "must": [
            {"term": { "act_id": "157905227213516906"}},
            {"term": { "goods_id":"7683" }}
          ]
        }
      }
}

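took: time the query took, in milliseconds
max_score: the highest relevance score in this query. The better a document matches the query, the higher its _score and the higher it ranks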

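Using Elasticsearch from Java

There are four kinds of client for the Elasticsearch Java API

TransportClient
RestClient
Jest
Spring Data Elasticsearch

TransportClient and RestClient are Elasticsearch's native APIs. TransportClient is deprecated as of Elasticsearch 7.0 and removed in 8.0; its replacement is the Java High Level REST Client.
Jest is a community-developed Java HTTP REST client for Elasticsearch. Spring Data Elasticsearch is Spring's integration package for Elasticsearch.
The recommendation is therefore the REST Client, or Spring Data Elasticsearch for simple business logic (not recommended for complex cases).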

Using Spring Data Elasticsearch

pom

        <dependency>
            <groupId>org.springframework.data</groupId>
            <artifactId>spring-data-elasticsearch</artifactId>
        </dependency>

yml

spring:
  application:
    name: elaticsearch-demo
  data:
    elasticsearch:
      cluster-nodes: 127.0.0.1:9300

dto

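import java.io.Serializable;
import java.util.Date;

import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

@Document(indexName="articleindex",type="article",indexStoreType="fs",shards=5,replicas=1,refreshInterval="-1")
@Data
public class Article implements Serializable {
    @Id
    private String id;
    private String title;  // title
    private String abstracts;  // abstract
    private String content;  // body text
    private Date postTime;  // publication time
    private Long clickCount;  // click count
    private Author author;  // author
    private Tutorial tutorial;  // tutorial this article belongs to
}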

dao

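import java.util.List;

import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;

public interface ArticleSearchRepository extends ElasticsearchRepository<Article, String>{

    /**
     * Find all articles (no paging parameters)
     */
    Page<Article> findAll();

    /**
     * Keyword search on content (paginated)
     */
    Page<Article> findByContent(String keyword, Pageable pageable);

    /**
     * Keyword search on content (not paginated)
     */
    List<Article> findByContent(String keyword);

}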

controller

@RestController
@RequestMapping("es")
public class EsController {
    @Autowired
    private ArticleSearchRepository articleSearchRepository;

    @RequestMapping("/save")
    public R save(@RequestBody Article article) {
        if (StringUtils.isBlank(article.getId())){
            article.setId(UniqueUUID.orderUUID());
        }
        articleSearchRepository.save(article);
        return R.success();
    }

    @GetMapping("/query")
    public R query( int page, int limit, String keyword) {
        Page<Article> articlePage = articleSearchRepository.findByContent(keyword,PageRequest.of(page, limit));
        return R.success(articlePage.getContent());
    }

    @DeleteMapping("/remove")
    public R remove(String id) {
        // articleSearchRepository.deleteById(id); supports a long id by default
        Article article = new Article();
        article.setId(id);
        articleSearchRepository.delete(article);
        return R.success();
    }
}

Common Query Rules

REST Client

Reference
https://blog.csdn.net/weixin_43174967/article/details/90673517

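Using Logstash to Sync Data into ES (Windows)

Download Logstash from the ES website
Unzip it, go to the bin directory, and run logstash.bat from the command line
logstash -e "input { stdin { } } output { stdout {} }"
Type hello world to test
Install the database connection plugin
logstash-plugin install --no-verify logstash-input-jdbc
Configure the conf file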

input {
	jdbc {
		jdbc_driver_library => "C:\Users\13105\.m2\repository\com\microsoft\sqlserver\mssql-jdbc\7.4.1.jre8\mssql-jdbc-7.4.1.jre8.jar"
		jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
		jdbc_connection_string => "jdbc:sqlserver://122.51.180.139:1433;database=znxp;"
		jdbc_user => "walmart"
		jdbc_password => "walmart@EC20191qaz~"
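		# Polling schedule in cron syntax. With five fields they are, left to right: minute, hour, day of month, month, day of week; all * means run every minute
		# The extra leading field here is seconds, so this runs every 5 seconds
		schedule => "*/5 * * * * *"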
		statement => "SELECT * FROM [znxp]..[syb_users]"
	}
}
output {
	elasticsearch {
		hosts => "localhost:9200"
		index => "world"
		document_type => "type"
		document_id => "%{user_id}"
	}
}

Run the test
logstash.bat -f c:\software\elasticsearch-6.6.0\config\test_es.conf

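Data Synchronization

Solutions

Dual writes: the application writes to both the database and ES
Either a full sync, or an incremental sync based on modification time
canal syncs from the database log and currently supports only MySQL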
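
The second option, syncing by modification time, can be sketched as a watermark loop. Everything below is an illustrative assumption: the Row class, its modifiedAt column, and the in-memory map standing in for the Elasticsearch index are invented for this example, and a real job would query the database and call an ES client instead.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IncrementalSyncSketch {

    /** Hypothetical database row with a last-modified timestamp column. */
    public static final class Row {
        final String id;
        final String name;
        final Instant modifiedAt;

        public Row(String id, String name, Instant modifiedAt) {
            this.id = id;
            this.name = name;
            this.modifiedAt = modifiedAt;
        }
    }

    // Stand-in for the search index, keyed by document id.
    private final Map<String, Row> searchIndex = new HashMap<>();
    // Watermark: newest modification time already copied. EPOCH makes the first run a full sync.
    private Instant lastSync = Instant.EPOCH;

    /** Copies every row modified after the watermark and returns how many were copied. */
    public int sync(Collection<Row> table) {
        Instant newest = lastSync;
        int copied = 0;
        for (Row row : table) {
            if (row.modifiedAt.isAfter(lastSync)) {
                searchIndex.put(row.id, row); // upsert by id, so re-syncing a row overwrites it
                copied++;
                if (row.modifiedAt.isAfter(newest)) {
                    newest = row.modifiedAt;
                }
            }
        }
        lastSync = newest;
        return copied;
    }

    public int indexedCount() {
        return searchIndex.size();
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        table.add(new Row("1", "alice", Instant.parse("2020-01-01T00:00:00Z")));
        table.add(new Row("2", "bob", Instant.parse("2020-01-02T00:00:00Z")));

        IncrementalSyncSketch job = new IncrementalSyncSketch();
        System.out.println(job.sync(table)); // 2 (first run copies everything)

        table.set(0, new Row("1", "alice2", Instant.parse("2020-01-03T00:00:00Z")));
        System.out.println(job.sync(table)); // 1 (only the modified row)
        System.out.println(job.indexedCount()); // 2
    }
}
```

One limitation of this approach is that rows physically deleted from the database never show up in a "modified since" query, so deletions need separate handling.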

Performance Test

9 million rows

Plain Paginated Query

On SQL Server the query takes 2.604 s

select temp.* from (
select ROW_NUMBER() OVER(ORDER BY org_id DESC) Num,* from ssp_act_list_goodswares 
) temp
where (temp.Num-1)/10+1=1

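On Elasticsearch: 19 ms

Paginated Keyword Query

Pitfalls I Ran Into

At first I used the latest Elasticsearch version at the time, 7.6.0. Starting the demo failed with NoNodeAvailableException: None of the configured nodes are available. The main causes were that the external IP had not been configured and that the imported spring-boot-starter-data-elasticsearch was incompatible with the Elasticsearch version. I then imported spring-boot-starter-data-elasticsearch again, which conflicted with spring-boot-parent. In the end, the lesson is to keep the versions aligned.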