Using Synonyms and Near-Synonyms in Elasticsearch

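This article describes how to configure and use a custom synonym analyzer in Elasticsearch 7.6.1, including creating indices, bulk-inserting data and searching. It also explains how to hot-update the synonym dictionary without restarting Elasticsearch by dynamically updating the synonym file, along with the version compatibility problem I ran into and how I solved it.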

1. Environment

ES 7.6.1
Kibana 7.6.1
CentOS 7

2. Without a synonym analyzer

2.1 Create the index

PUT /test_001
{
  "settings": {
    "index": {
      "max_result_window": 1000000
    },
    "analysis": {
      "analyzer": {
        "ik_max_word": {
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "goodsName": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

2.2 Add data

Bulk-insert documents into the test_001 index:

POST _bulk
{ "index" : { "_index" : "test_001","_id":1} }
{"id" : 1,"goodsName" : "克而瑞"}
{ "index" : { "_index" : "test_001","_id":2} }
{"id" : 2,"goodsName" : "随便"}
{ "index" : { "_index" : "test_001","_id":3} }
{"id" : 3,"goodsName" : "CRAC"}

2.3 Search

Search for documents related to 克而瑞:

GET test_001/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "goodsName": {
                    "query": "克而瑞",
                    "boost": 1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

The results do not include the CRAC document, even though CRAC is a synonym of 克而瑞.

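3. Using a custom synonym analyzer

3.1 Configure the synonym file

In the config directory under the ES installation directory, create an analysis directory and a synonym file: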

mkdir analysis
cd analysis/
vim my-synonym.txt

Fill my-synonym.txt with synonym groups, one comma-separated group per line:

[root@hadoop180 analysis]# cat my-synonym.txt
搜房,房天下
成交均价,成交单价,房价,售价
保障房,经济适用房,配套商品房,动迁房,廉租房
出租,租赁
买卖,销售
克而瑞,CRAC
[root@hadoop180 analysis]#
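
To make the file format concrete, here is a minimal Java sketch of how comma-separated groups like the ones above can be read as equivalence sets, where every term in a group expands to all terms of that group. It only models the comma-separated format with the default expand behaviour. `SynonymGroups`, `parse` and `expand` are hypothetical names for illustration, and real Elasticsearch also runs each term through the analyzer chain (for example lowercasing CRAC), which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SynonymGroups {

    /** Parses comma-separated synonym lines into a term -> group lookup. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> lookup = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and '#' comment lines.
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            List<String> group = new ArrayList<>();
            for (String term : trimmed.split(",")) {
                String t = term.trim();
                if (!t.isEmpty()) {
                    group.add(t);
                }
            }
            // Every member of the group points at the whole group.
            for (String term : group) {
                lookup.put(term, group);
            }
        }
        return lookup;
    }

    /** Returns the term's group, or just the term itself if it has no synonyms. */
    static List<String> expand(Map<String, List<String>> lookup, String term) {
        return lookup.getOrDefault(term, Arrays.asList(term));
    }

    public static void main(String[] args) {
        Map<String, List<String>> lookup = parse(Arrays.asList("克而瑞,CRAC", "出租,租赁"));
        System.out.println(expand(lookup, "克而瑞")); // term with a synonym group
        System.out.println(expand(lookup, "随便"));   // term without synonyms
    }
}
```

A term that appears in a group expands to the whole group, while an unknown term comes back unchanged.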

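3.2 Create the index

Create the index test_002. The main difference from test_001 is the my_synonym_filter token filter: it is defined under analysis.filter with type synonym and a synonyms_path, and it is appended to the analyzer's filter list.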

PUT /test_002
{
  "settings": {
    "index": {
      "max_result_window": 1000000
    },
    "analysis": {
      "analyzer": {
        "ik_max_word": {
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/my-synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "goodsName": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

3.3 Bulk-insert test data

Bulk-insert documents into the test_002 index:

POST _bulk
{ "index" : { "_index" : "test_002","_id":1} }
{"id" : 1,"goodsName" : "克而瑞"}
{ "index" : { "_index" : "test_002","_id":2} }
{"id" : 2,"goodsName" : "随便"}
{ "index" : { "_index" : "test_002","_id":3} }
{"id" : 3,"goodsName" : "CRAC"}

3.4 Search test

Search for documents related to 克而瑞:

GET test_002/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "goodsName": {
                    "query": "克而瑞",
                    "boost": 1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

This time the CRAC document, a synonym of 克而瑞, is returned as well.
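
The difference between the two indices can be pictured with a toy model. The Java sketch below is not Elasticsearch code: it assumes each goodsName value is a single token, skips lowercasing and scoring, and uses the hypothetical name `ToySynonymSearch`. It only shows that a query matches more documents once the query term is widened to its synonym group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class ToySynonymSearch {

    /** Returns ids of documents whose value equals the query or one of its synonyms. */
    static List<Integer> search(Map<Integer, String> docs, String query, List<List<String>> groups) {
        Set<String> accepted = new HashSet<>();
        accepted.add(query);
        for (List<String> group : groups) {
            if (group.contains(query)) {
                accepted.addAll(group);
            }
        }
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            if (accepted.contains(doc.getValue())) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "克而瑞");
        docs.put(2, "随便");
        docs.put(3, "CRAC");

        // No synonym groups, as in test_001.
        System.out.println(search(docs, "克而瑞", Collections.emptyList()));
        // One synonym group, as in test_002.
        System.out.println(search(docs, "克而瑞", Arrays.asList(Arrays.asList("克而瑞", "CRAC"))));
    }
}
```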

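3.5 What happens when new synonyms are added?

Append a new synonym group to my-synonym.txt (here, a group for 西红柿, 番茄 and 洋柿子), then bulk-insert three documents: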

POST _bulk
{ "index" : { "_index" : "test_002","_id":4} }
{"id" : 4,"goodsName" : "西红柿"}
{ "index" : { "_index" : "test_002","_id":5} }
{"id" : 5,"goodsName" : "番茄"}
{ "index" : { "_index" : "test_002","_id":6} }
{"id" : 6,"goodsName" : "洋柿子"}

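Searching for 西红柿 returns only the 西红柿 document; 番茄 and 洋柿子 are missing. After restarting ES and running the same query again, the synonym documents are returned as expected. The synonym filter reads the file when the analyzer is built, which is why the edit only took effect after the restart.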

GET test_002/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "goodsName": {
                    "query": "西红柿",
                    "boost": 1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}
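
The restart requirement follows from when the synonym file is read. As a rough model only (not Elasticsearch code), the Java sketch below copies the synonym lines once at construction, much like an analyzer reading synonyms_path when it is built. Later edits stay invisible until a new instance is created. `SynonymSnapshot` is a hypothetical name, and the in-memory list stands in for my-synonym.txt.

```java
import java.util.ArrayList;
import java.util.List;

final class SynonymSnapshot {
    private final List<String> rules;

    /** Copies the lines once, like reading the synonym file when the analyzer is built. */
    SynonymSnapshot(List<String> fileLines) {
        this.rules = new ArrayList<>(fileLines);
    }

    /** True if the term appears in any group captured at construction time. */
    boolean knows(String term) {
        for (String rule : rules) {
            for (String t : rule.split(",")) {
                if (t.trim().equals(term)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> file = new ArrayList<>(); // stand-in for my-synonym.txt
        file.add("克而瑞,CRAC");
        SynonymSnapshot beforeEdit = new SynonymSnapshot(file);

        file.add("西红柿,番茄,洋柿子"); // edit the "file" afterwards

        System.out.println(beforeEdit.knows("番茄"));                // snapshot taken before the edit
        System.out.println(new SynonymSnapshot(file).knows("番茄")); // rebuilt, as after a restart
    }
}
```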


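3.6 How can the synonym dictionary be updated dynamically without restarting ES?

Preparation:
Serve the synonym dictionary as a static resource from Tomcat.
The dictionary is reachable at http://192.168.6.180:8500/upload/text/my-remote-synonym.txt
my-remote-synonym.txt uses the same comma-separated format as my-synonym.txt; for the test further down it needs to contain a group for 地瓜 and 红薯.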

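Use the elasticsearch-analysis-dynamic-synonym plugin.
GitHub: https://github.com/bells/elasticsearch-analysis-dynamic-synonym
After cloning, change the version in pom.xml to the version of your installed ES, because the ES dependency version in this project is taken from the project version (${project.version}).
The relevant part of pom.xml: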

    <groupId>com.bellszhu.elasticsearch</groupId>
    <artifactId>elasticsearch-analysis-dynamic-synonym</artifactId>
    <version>7.6.1</version>
    <packaging>jar</packaging>
    <name>elasticsearch-dynamic-synonym</name>
    <description>Analysis-plugin for synonym</description>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <elasticsearch.version>${project.version}</elasticsearch.version>
        <maven.compiler.target>1.8</maven.compiler.target>
        <elasticsearch.plugin.name>analysis-dynamic-synonym</elasticsearch.plugin.name>
        <elasticsearch.assembly.descriptor>${project.basedir}/src/main/assemblies/plugin.xml
        </elasticsearch.assembly.descriptor>
        <elasticsearch.plugin.classname>com.bellszhu.elasticsearch.plugin.DynamicSynonymPlugin
        </elasticsearch.plugin.classname>
        <elasticsearch.plugin.jvm>true</elasticsearch.plugin.jvm>
    </properties>

See the project's README.md for the build and installation steps.

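Normally, after changing the project version, you build the package and unzip the zip file under target/releases into plugins/dynamic-synonym in the ES installation directory.
The cloned code defaults to version 7.13.2 and builds fine if the version is left unchanged.
Because I use ES 7.6.1, I changed the version, and compilation then failed.
The failing line was import org.elasticsearch.common.logging.DeprecationCategory;
The DeprecationCategory class exists in 7.13.2 but not in 7.6.1. After reading the 7.13.2 implementation, I modified the code (mainly how the private static final DeprecationLogger DEPRECATION_LOGGER field is initialized). The modified code: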

package com.bellszhu.elasticsearch.plugin.synonym.analysis;


import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.synonym.SynonymMap;
// import org.elasticsearch.common.logging.DeprecationCategory;
import org.elasticsearch.common.logging.DeprecationLogger;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
import org.elasticsearch.index.analysis.AnalysisMode;
import org.elasticsearch.index.analysis.CharFilterFactory;
import org.elasticsearch.index.analysis.CustomAnalyzer;
import org.elasticsearch.index.analysis.TokenFilterFactory;
import org.elasticsearch.index.analysis.TokenizerFactory;

/**
 * @author bellszhu
 */
public class DynamicSynonymTokenFilterFactory extends
        AbstractTokenFilterFactory {

    private static final DeprecationLogger DEPRECATION_LOGGER = getLogger(
            DynamicSynonymTokenFilterFactory.class);
    private static Logger logger = LogManager.getLogger("dynamic-synonym");

    // private DeprecationLogger(String parentLoggerName) {
    //     this.logger = LogManager.getLogger(getLoggerName(parentLoggerName));
    // }

    public static DeprecationLogger getLogger(Class<?> aClass) {
        return getLogger(toLoggerName(aClass));
    }

    public static DeprecationLogger getLogger(String name) {
        return new DeprecationLogger(LogManager.getLogger(name));
    }

    private static String toLoggerName(Class<?> cls) {
        String canonicalName = cls.getCanonicalName();
        return canonicalName != null ? canonicalName : cls.getName();
    }


    private static String getLoggerName(String name) {
        if (name.startsWith("org.elasticsearch")) {
            name = name.replace("org.elasticsearch.", "org.elasticsearch.deprecation.");
        } else {
            name = "deprecation." + name;
        }

        return name;
    }

    /**
     * Static id generator
     */
    private static final AtomicInteger id = new AtomicInteger(1);
    private static ScheduledExecutorService pool = Executors.newScheduledThreadPool(1, r -> {
        Thread thread = new Thread(r);
        thread.setName("monitor-synonym-Thread-" + id.getAndAdd(1));
        return thread;
    });
    private volatile ScheduledFuture<?> scheduledFuture;

    private final String location;
    private final boolean expand;
    private final boolean lenient;
    private final String format;
    private final int interval;
    protected SynonymMap synonymMap;
    protected Map<AbsSynonymFilter, Integer> dynamicSynonymFilters = new WeakHashMap<>();
    protected final Environment environment;
    protected final AnalysisMode analysisMode;

    public DynamicSynonymTokenFilterFactory(
            IndexSettings indexSettings,
            Environment env,
            String name,
            Settings settings
    ) throws IOException {
        super(indexSettings, name, settings);

        this.location = settings.get("synonyms_path");
        if (this.location == null) {
            throw new IllegalArgumentException(
                    "dynamic synonym requires `synonyms_path` to be configured");
        }
        if (settings.get("ignore_case") != null) {
            // In ES 7.6.1, deprecated(String msg, Object... params) takes the message first
            // and has no category or key argument, so pass only the message.
            DEPRECATION_LOGGER.deprecated(
                    "The ignore_case option on the synonym_graph filter is deprecated. " +
                            "Instead, insert a lowercase filter in the filter chain before the synonym_graph filter."
            );
        }

        this.interval = settings.getAsInt("interval", 60);
        this.expand = settings.getAsBoolean("expand", true);
        this.lenient = settings.getAsBoolean("lenient", false);
        this.format = settings.get("format", "");
        boolean updateable = settings.getAsBoolean("updateable", false);
        this.analysisMode = updateable ? AnalysisMode.SEARCH_TIME : AnalysisMode.ALL;
        this.environment = env;
    }

    @Override
    public AnalysisMode getAnalysisMode() {
        return this.analysisMode;
    }


    @Override
    public TokenStream create(TokenStream tokenStream) {
        throw new IllegalStateException(
                "Call getChainAwareTokenFilterFactory to specialize this factory for an analysis chain first");
    }

    @Override
    public TokenFilterFactory getChainAwareTokenFilterFactory(
            TokenizerFactory tokenizer,
            List<CharFilterFactory> charFilters,
            List<TokenFilterFactory> previousTokenFilters,
            Function<String, TokenFilterFactory> allFilters
    ) {
        final Analyzer analyzer = buildSynonymAnalyzer(tokenizer, charFilters, previousTokenFilters, allFilters);
        synonymMap = buildSynonyms(analyzer);
        final String name = name();
        return new TokenFilterFactory() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public TokenStream create(TokenStream tokenStream) {
                // fst is null means no synonyms
                if (synonymMap.fst == null) {
                    return tokenStream;
                }
                DynamicSynonymFilter dynamicSynonymFilter = new DynamicSynonymFilter(tokenStream, synonymMap, false);
                dynamicSynonymFilters.put(dynamicSynonymFilter, 1);

                return dynamicSynonymFilter;
            }

            @Override
            public TokenFilterFactory getSynonymFilter() {
                // In order to allow chained synonym filters, we return IDENTITY here to
                // ensure that synonyms don't get applied to the synonym map itself,
                // which doesn't support stacked input tokens
                return IDENTITY_FILTER;
            }

            @Override
            public AnalysisMode getAnalysisMode() {
                return analysisMode;
            }
        };
    }

    Analyzer buildSynonymAnalyzer(
            TokenizerFactory tokenizer,
            List<CharFilterFactory> charFilters,
            List<TokenFilterFactory> tokenFilters,
            Function<String, TokenFilterFactory> allFilters
    ) {
        return new CustomAnalyzer(
                tokenizer,
                charFilters.toArray(new CharFilterFactory[0]),
                tokenFilters.stream().map(TokenFilterFactory::getSynonymFilter).toArray(TokenFilterFactory[]::new)
        );
    }

    SynonymMap buildSynonyms(Analyzer analyzer) {
        try {
            return getSynonymFile(analyzer).reloadSynonymMap();
        } catch (Exception e) {
            logger.error("failed to build synonyms", e);
            throw new IllegalArgumentException("failed to build synonyms", e);
        }
    }

    SynonymFile getSynonymFile(Analyzer analyzer) {
        try {
            SynonymFile synonymFile;
            if (location.startsWith("http://") || location.startsWith("https://")) {
                synonymFile = new RemoteSynonymFile(
                        environment, analyzer, expand, lenient,  format, location);
            } else {
                synonymFile = new LocalSynonymFile(
                        environment, analyzer, expand, lenient, format, location);
            }
            if (scheduledFuture == null) {
                scheduledFuture = pool.scheduleAtFixedRate(new Monitor(synonymFile),
                                interval, interval, TimeUnit.SECONDS);
            }
            return synonymFile;
        } catch (Exception e) {
            logger.error("failed to get synonyms: " + location, e);
            throw new IllegalArgumentException("failed to get synonyms : " + location, e);
        }
    }

    public class Monitor implements Runnable {

        private SynonymFile synonymFile;

        Monitor(SynonymFile synonymFile) {
            this.synonymFile = synonymFile;
        }

        @Override
        public void run() {
            if (synonymFile.isNeedReloadSynonymMap()) {
                synonymMap = synonymFile.reloadSynonymMap();
                for (AbsSynonymFilter dynamicSynonymFilter : dynamicSynonymFilters.keySet()) {
                    dynamicSynonymFilter.update(synonymMap);
                    logger.info("success reload synonym");
                }
            }
        }
    }

}

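Unzip the packaged zip into the plugins/dynamic-synonym directory of the ES installation, then restart ES.
Next, create an index that uses the dynamic_synonym filter. The filter settings are:
synonyms_path: required. The file is treated as remote if the value starts with http:// or https://, and as local otherwise.
interval: optional, default 60, in seconds. How often the synonym file is checked for updates.
ignore_case: optional, default false. The code above logs a deprecation warning when it is set and recommends a lowercase filter before the synonym filter instead.
expand: optional, default true.
format: optional, default empty string. Set it to wordnet for synonyms in WordNet format.
Hot update:
1. Local file: the file's modification timestamp (Modify time) decides whether it is reloaded.
2. Remote file: synonyms_path is a URL. The HTTP response must carry two headers, Last-Modified and ETag. If either one changes, the plugin fetches the new synonyms and updates its synonym map.
3. After a synonym update, field values already in the index are not affected immediately; they need a reindex or an overwriting update.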
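
The two reload checks in the list above can be sketched roughly as follows. This is a simplified illustration of the idea, not the plugin's actual code: `ReloadCheck`, `remoteChanged` and `localChanged` are hypothetical names, and the header values and timestamps are passed in as plain arguments so the sketch needs neither an HTTP client nor a real file. It also assumes that the very first poll counts as a change.

```java
import java.util.Objects;

final class ReloadCheck {
    private String lastModified;  // last seen Last-Modified header (remote file)
    private String eTag;          // last seen ETag header (remote file)
    private long lastMtime = -1L; // last seen modification time in millis (local file)

    /** Remote file: reload if either header differs from the previously seen value. */
    boolean remoteChanged(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        lastModified = newLastModified;
        eTag = newETag;
        return changed;
    }

    /** Local file: reload if the modification timestamp differs from the last one seen. */
    boolean localChanged(long mtimeMillis) {
        boolean changed = mtimeMillis != lastMtime;
        lastMtime = mtimeMillis;
        return changed;
    }

    public static void main(String[] args) {
        ReloadCheck check = new ReloadCheck();
        String date = "Mon, 01 Jan 2024 00:00:00 GMT";
        System.out.println(check.remoteChanged(date, "\"v1\"")); // first poll
        System.out.println(check.remoteChanged(date, "\"v1\"")); // nothing changed
        System.out.println(check.remoteChanged(date, "\"v2\"")); // only the ETag changed
    }
}
```

A check like this would run once per interval; only when it reports a change does the synonym map need to be rebuilt.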

PUT /test_003
{
  "settings": {
    "index": {
      "max_result_window": 1000000
    },
    "analysis": {
      "analyzer": {
        "ik_max_word": {
          "tokenizer": "ik_max_word",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "dynamic_synonym",
          "interval": 30,
          "synonyms_path": "http://192.168.6.180:8500/upload/text/my-remote-synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "goodsName": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

Insert two test documents:

POST _bulk
{ "index" : { "_index" : "test_003","_id":1} }
{"id" : 1,"goodsName" : "地瓜"}
{ "index" : { "_index" : "test_003","_id":2} }
{"id" : 2,"goodsName" : "红薯"}

Query for 地瓜; both the 地瓜 and 红薯 documents are returned:

GET test_003/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "goodsName": {
                    "query": "地瓜",
                    "boost": 1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}

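Updating the synonym dictionary dynamically
Edit the static resource my-remote-synonym.txt in Tomcat
and add the group 汤圆,元宵.
Without restarting ES, check the ES log. It contains the message reload remote synonym,
which shows that ES has reloaded the remote synonym dictionary.
Then add two more test documents to ES: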

POST _bulk
{ "index" : { "_index" : "test_003","_id":3} }
{"id" : 3,"goodsName" : "汤圆"}
{ "index" : { "_index" : "test_003","_id":4} }
{"id" : 4,"goodsName" : "元宵"}

Query for 汤圆; both the 汤圆 and 元宵 documents are returned:

GET test_003/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "match": {
                  "goodsName": {
                    "query": "汤圆",
                    "boost": 1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
}


