Chinese Word Segmentation in ElasticSearch

DanielMaster

Article link: https://blog.csdn.net/a805814077/article/details/110004226

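Scenario:

Tokenization is the foundation of text mining. Out of the box, ES returns almost nothing for Chinese word queries. Every text field has to be analyzed, that is, split into terms, before it can be indexed and searched, and the default analyzer that ES inherits from Lucene handles Chinese poorly: it breaks the text into single characters rather than words. Good Chinese analysis plugins exist, though, such as IK and Paoding (庖丁解牛).

This post covers the IK plugin.

First, create an index: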

curl -H "Content-Type: application/json" -XPUT 'http://hadoop01:9200/chinese'

Then add a few documents:

curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/1 -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/2 -d'{"content":"公安部：各地校车将享最高路权"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/3 -d'{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/4 -d'{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'

Now query them through the Java API.

ElasticSearchUtil.java

package com.daniel.util;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

import java.io.IOException;
import java.net.InetAddress;
import java.util.LinkedList;
import java.util.Properties;

/**
 * @Author Daniel
 * @Description Connects to the ElasticSearch cluster
 **/
public class ElasticSearchUtil {
    // Connection pool
    private static LinkedList<TransportClient> pool = new LinkedList<>();

    static {
        String CLUSTER_NAME = "cluster.name";
        String CLUSTER_HOSTS_PORT = "cluster.hosts.port";
        Properties properties = new Properties();
        try {
            // Load the configuration file from the classpath
            properties.load(ElasticSearchUtil.class.getClassLoader().getResourceAsStream("elastic.properties"));

            Settings setting = Settings.builder()
                    // Must be set explicitly when the cluster's cluster.name is not the default "elasticsearch"
                    .put(CLUSTER_NAME, properties.getProperty(CLUSTER_NAME))
                    .build();
            // Entry point for talking to the cluster
            TransportClient client;
            for (int i = 0; i < 5; i++) {
                client = new PreBuiltTransportClient(setting);
                // Addresses of the ES cluster nodes, given as comma-separated host:port pairs
                String[] hostAndPorts = properties.getProperty(CLUSTER_HOSTS_PORT).split(",");
                for (String hostAndPort : hostAndPorts) {
                    String host = hostAndPort.split(":")[0];
                    int port = Integer.parseInt(hostAndPort.split(":")[1]);
                    TransportAddress trans = new TransportAddress(InetAddress.getByName(host), port);
                    client.addTransportAddress(trans);
                }
                pool.push(client);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    public static TransportClient getClient() {
        while (pool.isEmpty()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        return pool.poll();
    }

    public static void release(TransportClient client) {
        pool.push(client);
    }
}
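
The pool above waits for a free client in a one-second sleep loop on a plain LinkedList, which is not safe when several threads call getClient() and release() at once. Below is a minimal sketch of the same acquire/release idea built on java.util.concurrent.BlockingQueue. The class name BlockingPool and its method names are illustrative and not part of the original code; in this article's setup the type parameter would be TransportClient.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper, not from the article: a fixed-size pool that blocks
// callers until a resource is free instead of polling in a sleep loop.
class BlockingPool<T> {
    private final BlockingQueue<T> idle;

    // resources must be non-empty: ArrayBlockingQueue rejects a capacity of 0.
    BlockingPool(Collection<T> resources) {
        // Capacity equals the number of pooled resources.
        idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    // Blocks until a resource is available.
    T acquire() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
        }
    }

    // Returns a resource to the pool; offer() silently ignores it if the pool is already full.
    void release(T resource) {
        idle.offer(resource);
    }

    int available() {
        return idle.size();
    }

    public static void main(String[] args) {
        BlockingPool<String> pool = new BlockingPool<>(Arrays.asList("client-1", "client-2"));
        String client = pool.acquire();
        System.out.println(pool.available()); // 1
        pool.release(client);
        System.out.println(pool.available()); // 2
    }
}
```

The queue handles the locking and the waiting, so callers need neither a sleep loop nor their own synchronization.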

elastic.properties

cluster.name=bde-es
cluster.hosts.port=hadoop01:9300,hadoop02:9300,hadoop03:9300

ChineseSegmentation.java

package com.daniel.api;

import com.daniel.util.ElasticSearchUtil;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;

/**
 * @Author Daniel
 * @Description Full-text search: Chinese word segmentation
 **/
public class ChineseSegmentation {
    private TransportClient client;

    public static void main(String[] args) {
        new ChineseSegmentation().search();
    }

    public void search() {
        client = ElasticSearchUtil.getClient();
        String[] indices = {"chinese"};
        SearchResponse response = client.prepareSearch(indices)
                .setSearchType(SearchType.QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.termQuery("content", "中国"))
                .get();
        SearchHits searchHits = response.getHits();
        long totalHits = searchHits.totalHits;
        System.out.println("Found about " + totalHits + " matching results");
        float maxScore = searchHits.getMaxScore();
        System.out.println("Max score: " + maxScore);
        SearchHit[] hits = searchHits.getHits();
        for (SearchHit hit : hits) {
            System.out.println("-------------------------------------");
            String index = hit.getIndex();
            String type = hit.getType();
            String id = hit.getId();
            float score = hit.getScore();
            String source = hit.getSourceAsString();
            System.out.printf("index:\t%s\n", index);
            System.out.printf("type:\t %s\n", type);
            System.out.printf("id:\t%s\n", id);
            System.out.printf("score:\t%f\n", score);
            System.out.printf("source:\t%s\n", source);
        }
        ElasticSearchUtil.release(client);

    }


}

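The query finds no news items containing “中国”. termQuery does not analyze the search text, so it looks up the exact term “中国” in the index. The documents, however, were indexed with the default analyzer, which split content into single characters. The index therefore holds the terms “中” and “国” but no term “中国”, and nothing matches.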
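
The mismatch can be reproduced without a cluster. The sketch below is a simplified stand-in, not ElasticSearch code: it assumes the default analyzer emits one term per Chinese character (and, for brevity, drops everything that is not a Han character), builds a tiny inverted index from two of the sample documents, and then does an exact term lookup. All class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative simulation only; this is not how ElasticSearch is implemented.
class TermLookupSketch {
    // Assumption: each Han character becomes its own term; other characters are ignored.
    static List<String> singleCharTerms(String text) {
        List<String> terms = new ArrayList<>();
        text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    // Inverted index: term -> ids of the documents that contain it.
    static Map<String, Set<Integer>> buildIndex(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        docs.forEach((id, text) -> {
            for (String term : singleCharTerms(text)) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        });
        return index;
    }

    // Like a term query: the search text is looked up as-is, with no analysis.
    static Set<Integer> termQuery(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(3, "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船");
        docs.put(4, "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首");
        Map<String, Set<Integer>> index = buildIndex(docs);

        System.out.println(termQuery(index, "中国")); // [] : no such term in the index
        System.out.println(termQuery(index, "中"));   // [3, 4] : single characters are indexed
    }
}
```

An analyzer that emits whole words such as “中国” at index time makes the same exact lookup succeed.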

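Solution:

Use the IK plugin for word segmentation.

Download

Download the plugin version that matches your ElasticSearch version. I used elasticsearch-analysis-ik-6.5.2 here.

Download link: https://github.com/medcl/elasticsearch-analysis-ik

Build

The download is a source package, so we have to build it ourselves with Maven (Maven must already be installed and configured):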
```
mvn clean package -DskipTests
```
When the build finishes, go to the releases folder under target and upload elasticsearch-analysis-ik-6.5.0.zip to the Linux machine. I used the rz command here:
```
rz 
```

Unzip

First create a dedicated directory for IK:

mkdir /home/hadoop/apps/elasticsearch/plugins/ik

Then unzip the archive into it:

unzip -d /home/hadoop/apps/elasticsearch/plugins/ik ~/apps/elasticsearch-analysis-ik-6.5.0.zip

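Edit the configuration file

The build produces version 6.5.0, but my ES version is 6.5.2. ElasticSearch refuses to load a plugin whose declared ElasticSearch version differs from the node's own, so change the elasticsearch.version entry in the plugin descriptor to 6.5.2: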
```
vim /home/hadoop/apps/elasticsearch/plugins/ik/plugin-descriptor.properties
```
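
The edit itself is a one-line change. As a rough sketch of what it amounts to, the snippet below rewrites the elasticsearch.version line in a descriptor's text and leaves every other line alone. It assumes the descriptor contains a line starting with elasticsearch.version=; the class name, method name and two-line sample content are illustrative, not the real file.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Illustrative helper, not part of the article's code.
class DescriptorVersionSketch {
    // Replaces only the elasticsearch.version line; all other lines are kept unchanged.
    static String withEsVersion(String descriptor, String esVersion) {
        return Arrays.stream(descriptor.split("\n", -1)) // -1 keeps a trailing empty line
                .map(line -> line.startsWith("elasticsearch.version=")
                        ? "elasticsearch.version=" + esVersion
                        : line)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Sample content for illustration only.
        String sample = "version=6.5.0\nelasticsearch.version=6.5.0\n";
        System.out.print(withEsVersion(sample, "6.5.2"));
    }
}
```

The plugin's own version line is deliberately left untouched here; only the ElasticSearch version the plugin declares itself built for is changed.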

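Sync to the other machines

Run these from the plugins directory (/home/hadoop/apps/elasticsearch/plugins) so that the relative path ik resolves:

scp -r ik hadoop@hadoop02:/home/hadoop/apps/elasticsearch/plugins
scp -r ik hadoop@hadoop03:/home/hadoop/apps/elasticsearch/plugins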

Restart

Restart the ES cluster: kill the processes manually, then start them again.

Delete the index

curl -XDELETE http://hadoop01:9200/chinese

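Apply the IK analyzer

The index was just deleted, so create it again first; the mapping request fails on an index that does not exist:

curl -H "Content-Type: application/json" -XPUT 'http://hadoop01:9200/chinese'

Then set the mapping so that the content field is analyzed with ik_max_word at both index time and search time:

curl -XPOST http://hadoop01:9200/chinese/fulltext/_mapping -H 'Content-Type:application/json' -d'{"_all": {"enabled": "false"},"properties": {"content": {"type": "text","analyzer": "ik_max_word","search_analyzer": "ik_max_word"}}}'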

Re-add the data

curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/1 -d'{"content":"美国留给伊拉克的是个烂摊子吗"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/2 -d'{"content":"公安部：各地校车将享最高路权"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/3 -d'{"content":"中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"}'
curl -H "Content-Type: application/json" -XPOST http://hadoop01:9200/chinese/fulltext/4 -d'{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'