ElasticSearch Search Engine - 5: Study Notes (2021.5.30)

懵懵懂懂程序员

Article link: https://blog.csdn.net/weixin_44600430/article/details/117407774

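Integrating the `IK` analyzer with `ElasticSearch`

Preface: (IK official site)

By default, Lucene's standard analyzer splits Chinese text into single characters, which does not meet real-world query needs, so an analyzer that supports Chinese has to be configured.

IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java.

IK provides an `ik_max_word` (finest-grained) mode, which splits “中华人民共和国国歌” into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”. It exhausts every possible combination and is suited to Term Query. IK also provides an `ik_smart` (coarsest-grained) mode, which is used as the search analyzer later in these notes.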
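One way to compare the two modes yourself is to send the same text to the Elasticsearch `_analyze` API with each analyzer. The sketch below only builds the request and prints it; the base URL `http://localhost:9200` and the class name `IkAnalyzeRequest` are placeholders chosen for illustration, and the IK plugin is assumed to be installed on the target node.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds (but does not send) a POST /_analyze request for a given analyzer. */
class IkAnalyzeRequest {

    /** Minimal JSON string escaping: backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** JSON body naming the analyzer and the text to tokenize. */
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** baseUrl is a placeholder such as http://localhost:9200 (assumption). */
    static HttpRequest build(String baseUrl, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                .build();
    }

    public static void main(String[] args) {
        String text = "中华人民共和国国歌";
        for (String analyzer : new String[] {"ik_max_word", "ik_smart"}) {
            HttpRequest req = build("http://localhost:9200", analyzer, text);
            // Print what would be sent; nothing goes over the network here.
            System.out.println(req.method() + " " + req.uri() + " " + body(analyzer, text));
        }
    }
}
```

To actually send a request, pass it to `HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString())` and inspect the `tokens` array in the response.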

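1.0 Download from the official site

Download the elasticsearch-analysis-ik-7.13.0.zip archive,

then upload it to the Linux server.

2.0 Unzip

After unzipping elasticsearch-analysis-ik-7.13.0.zip, copy the folder into elasticsearch-7.x.x/plugins and rename it to ik.

Restart ElasticSearch to load the IK analyzer.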

[Image: https://z3.ax1x.com/2021/05/30/2Zu01J.jpg]

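Startup then failed. The log showed that the downloaded IK plugin version did not match the ES version.

You have to download the IK release that matches your ES version.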

[Image: https://z3.ax1x.com/2021/05/30/2ZMCMd.jpg]
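A mismatch like this can be caught before restarting by reading the plugin's descriptor. The sketch below assumes the unzipped plugin folder contains a `plugin-descriptor.properties` file with an `elasticsearch.version` key, which is how Elasticsearch plugins normally declare their target version; the class name and the sample descriptor text are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Compares the ES version a plugin targets with the ES version actually running. */
class IkPluginVersionCheck {

    /** Returns the value of elasticsearch.version, or null if the key is absent. */
    static String targetEsVersion(Reader descriptor) {
        Properties props = new Properties();
        try {
            props.load(descriptor);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("elasticsearch.version");
    }

    /** True only when the descriptor names exactly the running ES version. */
    static boolean matches(Reader descriptor, String runningEsVersion) {
        return runningEsVersion.equals(targetEsVersion(descriptor));
    }

    public static void main(String[] args) {
        // Illustrative descriptor content, not copied from a real release.
        String sample = "name=analysis-ik\nversion=7.13.0\nelasticsearch.version=7.13.0\n";
        System.out.println(matches(new StringReader(sample), "7.13.0"));
        System.out.println(matches(new StringReader(sample), "7.12.1"));
    }
}
```

In practice you would pass a `FileReader` for the descriptor under the plugin folder, and the running version reported by the ES root endpoint.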

3.0 Quick example

3.1 Create the index (all requests are sent with Postman)

Send a PUT request: http://119.29.xxx.xxx:9200/my_ik_test

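3.2 Create the mapping

Send a PUT request: http://119.29.xxx.xxx:9200/my_ik_test/_mapping

The `title` field is indexed with `ik_max_word` (finest-grained segmentation) and searched with `ik_smart`.

{
    "properties": {
        "title": {
            "type": "text",
            "index": true,
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"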
        },
        "content": {
            "type": "text",
            "index": false
        },
        "count": {
            "type": "long",
            "index": false
        }
    }
}

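3.3 Create some documents

Send one POST request per document: http://119.29.xxx.xxx:9200/my_ik_test/_doc (POST lets Elasticsearch generate the `_id`; a PUT needs an explicit ID in the URL, such as `/_doc/1`)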

{
  "title":"中华人民共和国",
  "content":"我是文章内容",
  "count": 0
}

{
  "title":"人民：各地校车将享最高路权",
  "content":"我是文章内容",
  "count": 0
}
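The difference between the two ways of indexing a document can be sketched as a small request builder: without an ID it issues a POST to `/_doc`, with an ID it issues a PUT to `/_doc/{id}`. The class name `IndexDocRequest` and the base URL are placeholders, and the requests are only built and printed here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

/** Chooses POST or PUT for indexing a document, depending on whether an ID is given. */
class IndexDocRequest {

    static HttpRequest build(String baseUrl, String index, String id, String json) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .header("Content-Type", "application/json");
        if (id == null) {
            // No ID supplied: POST so that Elasticsearch generates the _id.
            return builder.uri(URI.create(baseUrl + "/" + index + "/_doc"))
                    .POST(BodyPublishers.ofString(json))
                    .build();
        }
        // Explicit ID: PUT creates or replaces the document with that _id.
        return builder.uri(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .PUT(BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        String doc = "{\"title\":\"中华人民共和国\",\"content\":\"我是文章内容\",\"count\":0}";
        HttpRequest auto = build("http://localhost:9200", "my_ik_test", null, doc);
        HttpRequest fixed = build("http://localhost:9200", "my_ik_test", "1", doc);
        System.out.println(auto.method() + " " + auto.uri());
        System.out.println(fixed.method() + " " + fixed.uri());
    }
}
```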

3.4 Highlight query

Send a POST request: http://119.29.xxx.xxx:9200/my_ik_test/_search

{
    "query": {
        "match": {
            "title": "人民"
        }
    },
    "highlight": {
        "pre_tags": "<font color='red'>",
        "post_tags": "</font>",
        "fields": {
            "title": {}
        }
    }
}

Result

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.4923848,
        "hits": [
            {
                "_index": "my_ik_test",
                "_type": "_doc",
                "_id": "zYnLvXkBl3hPrYVmu4Wt",
                "_score": 0.4923848,
                "_source": {
                    "title": "人民：各地校车将享最高路权",
                    "content": "我是文章内容",
                    "count": 0
                },
                "highlight": {
                    "title": [
                        "<font color='red'>人民</font>：各地校车将享最高路权"
                    ]
                }
            },
            {
                "_index": "my_ik_test",
                "_type": "_doc",
                "_id": "yInAvXkBl3hPrYVm14Wi",
                "_score": 0.4700036,
                "_source": {
                    "title": "中华人民共和国",
                    "content": "我是文章内容",
                    "count": 0
                },
                "highlight": {
                    "title": [
                        "中华<font color='red'>人民</font>共和国"
                    ]
                }
            }
        ]
    }
}
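On the client side, the matched terms can be pulled back out of a highlight fragment by looking for the configured tags. This is only a minimal sketch using a regular expression over a single fragment string; `HighlightTerms` is a hypothetical helper, not part of any Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the text wrapped between the highlight pre/post tags of one fragment. */
class HighlightTerms {

    static List<String> extract(String fragment, String preTag, String postTag) {
        // Quote the tags so characters like '<' or '\'' are matched literally.
        Pattern pattern = Pattern.compile(
                Pattern.quote(preTag) + "(.*?)" + Pattern.quote(postTag));
        Matcher matcher = pattern.matcher(fragment);
        List<String> terms = new ArrayList<>();
        while (matcher.find()) {
            terms.add(matcher.group(1));
        }
        return terms;
    }

    public static void main(String[] args) {
        String fragment = "中华<font color='red'>人民</font>共和国";
        // prints [人民]
        System.out.println(extract(fragment, "<font color='red'>", "</font>"));
    }
}
```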

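4.0 Extending the IK dictionary

Some company-specific terms get split when they should not be. For example, 万师傅 is tokenized into 万 and 师傅.

In that case you need a custom extension dictionary.

It is configured in IKAnalyzer.cfg.xml, located at {plugins}/ik/config/IKAnalyzer.cfg.xml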

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Custom extension dictionaries; separate multiple files with ";", e.g. my.dic;custom/single_word_low_freq.dic -->
	<entry key="ext_dict">my.dic</entry>
	<!-- Custom extension stopword dictionaries -->
	<entry key="ext_stopwords">my_stop.dic</entry>
	<!-- Remote extension dictionary; "location" is a placeholder for a URL -->
	<entry key="remote_ext_dict">location</entry>
	<!-- Remote extension stopword dictionary -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>

Create my.dic and my_stop.dic under {plugins}/ik/config/.

Both files must be UTF-8 encoded, with one entry per line.

[Image: https://z3.ax1x.com/2021/05/30/2Z35jg.jpg]

That completes the configuration. Restart ES for it to take effect.
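If the dictionary is generated by a program rather than typed by hand, the encoding should be set explicitly instead of relying on the platform default. A minimal sketch, assuming one word per line; `IkDictWriter` and the temp-directory path are illustrative, and in a real setup the target would be my.dic under the plugin's config folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Writes and re-reads an IK-style dictionary file, one word per line, in UTF-8. */
class IkDictWriter {

    /** Writes each word on its own line, explicitly encoded as UTF-8. */
    static void write(Path file, List<String> words) {
        try {
            Files.write(file, words, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file back as UTF-8 to confirm that it decodes cleanly. */
    static List<String> read(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative location only; point this at {plugins}/ik/config/my.dic in practice.
        Path dict = Paths.get(System.getProperty("java.io.tmpdir"), "my.dic");
        write(dict, List.of("万师傅"));
        System.out.println(read(dict));
    }
}
```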

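To extend the dictionary without restarting ES, use a remote extension dictionary; see the IK official site for details.

Addendum:

In Java code, whether a field uses the `IK` analyzer is specified when the mapping is created: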

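import java.io.Serializable;

import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

// Elasticsearch index names must be lowercase, hence "my_from_java".
@Document(indexName = "my_from_java", shards = 3, replicas = 1)
@Data
public class MyFromJava implements Serializable {

    // An id is required; it is the globally unique identifier and maps to "_id" in ES.
    @Id
    private Long id;

    /**
     * type     : field data type
     * analyzer : analyzer to use (IK)
     * index    : whether the field is indexed (default: true)
     */
    @Field(type = FieldType.Text, analyzer = "ik_max_word")
    private String name;

    @Field(type = FieldType.Integer, index = false)
    private Integer age;

    // Keyword: kept as a single term, not analyzed
    @Field(type = FieldType.Keyword, index = false)
    private String sex;

    @Field(type = FieldType.Text, index = true)
    private String address;
}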
