My Architecture Evolution Notes 11: Customizing the ansj Tokenizer for ES: Dynamic Support for Stop Words and Synonyms

weixin_34023982

Original link: https://my.oschina.net/qiangzigege/blog/280075

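The previous article mentioned this approach in passing; this article treats it as a topic of its own.

The architecture is as follows:

To support adding terms dynamically, the ansj tokenizer here relies on Redis.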

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

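First, be clear about what dynamic support means:

1) Terms can be added to and removed from memory at runtime

2) Terms can be added to and removed from the dictionary file at runtime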
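
The two requirements above can be pictured with a minimal sketch that keeps an in-memory set and a one-word-per-line file in sync. The class name `DynamicWordStore` and its methods are hypothetical and not part of the ansj plugin; the sketch also assumes the file already ends with a newline before anything is appended.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: an in-memory word set mirrored to a one-word-per-line file. */
final class DynamicWordStore {
    private final Set<String> words = ConcurrentHashMap.newKeySet();
    private final Path file;

    DynamicWordStore(Path file) throws IOException {
        this.file = file;
        // Load whatever is already on disk into memory.
        if (Files.exists(file)) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (!line.trim().isEmpty()) {
                    words.add(line.trim());
                }
            }
        }
    }

    /** 1) update memory, 2) append the word to the file as its own line. */
    synchronized void add(String word) throws IOException {
        if (words.add(word)) {
            // Assumption: the file is empty or already ends with a newline.
            Files.write(file, (word + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    /** 1) update memory, 2) rewrite the file without the removed word. */
    synchronized void remove(String word) throws IOException {
        if (words.remove(word)) {
            List<String> kept = new ArrayList<>();
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (!line.trim().equals(word)) {
                    kept.add(line);
                }
            }
            // The default open options truncate the file before writing.
            Files.write(file, kept, StandardCharsets.UTF_8);
        }
    }

    boolean contains(String word) {
        return words.contains(word);
    }
}
```

Updating memory first and the file second mirrors the order used in the plugin code later in this article; a production version would also need to decide what to do when the file write fails after memory has already changed.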

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

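Start with the second problem: dynamic file support.

The AddTermRedisPubSub class shows that file operations are handled by the FileUtils class.

Add the following two methods to FileUtils: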

public static void appendStopWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_STOP_FILE_LIB_PATH);
			// "ansj/stopLibrary.dic");
			appendFile(content, file);
		} catch (IOException e) {
			logger.error("read exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

	public static void removeStopWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_STOP_FILE_LIB_PATH);
			// "ansj/stopLibrary.dic");
			removeFile(content, file, false);
		} catch (FileNotFoundException e) {
			logger.error("file not found $ES_HOME/config/ansj/stopLibrary.dic");
			e.printStackTrace();
		} catch (IOException e) {
			logger.error("read exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

During testing, adding a single stop word turned out to print a batch of unnecessary log lines:

[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswill
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswith
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswithin
[2014-06-16 11:59:13,847][INFO ][ansj-redis-msg-file      ] match is true text iswithout

The fix is simply to comment out the following call in the removeFile method of the FileUtils class:

logger.info("match is {} text is{}",
					new Object[] { Boolean.valueOf(match(content, text, head)),
							text });

Add the following branch to the AddTermRedisPubSub class:

else if ("stop".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				// add one stopWord into memory
				AnsjElasticConfigurator.filter.add(msg[2]);
				// add one stopWord into file
				FileUtils.appendStopWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				// remove one stopWord from memory
				AnsjElasticConfigurator.filter.remove(msg[2]);
				// remove one stopWord from file
				FileUtils.removeStopWord(msg[2]);
			}
		}
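
The branch above only shows the already-split `msg` array. As a minimal sketch of the whole dispatch step, assuming the Redis message is a string such as `stop:c:within` split on `:` (the delimiter is an assumption, since the source does not show it), the in-memory half could look like this. `StopWordMessageSketch` is a hypothetical name, and the file update is omitted.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of dispatching "stop" messages to an in-memory stop word set. */
final class StopWordMessageSketch {
    final Set<String> stopWords = ConcurrentHashMap.newKeySet();

    /**
     * Handles messages of the assumed form "stop:c:word" (create) or "stop:d:word" (delete).
     * Returns true if the message was recognized and applied.
     */
    boolean onMessage(String message) {
        // Limit 3 keeps any further ':' characters inside the word itself.
        String[] msg = message.split(":", 3);
        if (msg.length != 3 || !"stop".equals(msg[0])) {
            return false;
        }
        if ("c".equals(msg[1])) {
            stopWords.add(msg[2]);
            return true;
        }
        if ("d".equals(msg[1])) {
            stopWords.remove(msg[2]);
            return true;
        }
        return false;
    }
}
```

Checking the array length before indexing avoids an `ArrayIndexOutOfBoundsException` on malformed messages, which the original branch does not guard against.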

Finally, make sure stopLibrary.dic ends with a newline; otherwise the next word appended will land on the same line as the previous last word.
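
Instead of fixing the dictionary by hand, the append step itself could check for the trailing newline. The following is a minimal sketch of that idea, not the plugin's actual `appendFile` method; `NewlineSafeAppend` is a hypothetical name, and the sketch assumes a UTF-8 file with `\n` line endings.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/** Hypothetical sketch: append a word as its own line even if the file lacks a trailing newline. */
final class NewlineSafeAppend {
    static void appendLine(Path file, String word) throws IOException {
        // "rw" mode creates the file if it does not exist yet.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            long length = raf.length();
            if (length > 0) {
                // Inspect the last byte; after read() the file pointer is at the end.
                raf.seek(length - 1);
                if (raf.read() != '\n') {
                    raf.write('\n');
                }
            }
            raf.seek(raf.length());
            raf.write((word + "\n").getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

Because the appended word always ends with its own newline, only a file that was edited by hand can trigger the extra newline write.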

With that, stop words can be added and removed dynamically through Redis.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

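Next, how to add synonym support to ansj.

In Lucene 4.6, Chinese synonyms are easy to implement with the SynonymFilterFactory from lucene-analyzers-common-4.6.1.jar: all it takes is a few lines of code and a synonym dictionary.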
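
To make the role of the synonym dictionary concrete, here is a simplified model of what a comma-separated dictionary line means when `expand` is `true`: every term in the group maps to every term in the group. This is a plain-Java illustration under stated assumptions, not Lucene's implementation; it assumes the Solr-style format (one group per line, `#` for comments) and deliberately skips explicit `=>` rules and multi-word handling. The class name `SynonymExpandSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simplified model of expand=true semantics for comma-separated synonym groups. */
final class SynonymExpandSketch {
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> map = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            // Skip blanks, comments, and explicit "a => b" rules (not modeled here).
            if (line.isEmpty() || line.startsWith("#") || line.contains("=>")) {
                continue;
            }
            Set<String> group = new LinkedHashSet<>();
            for (String term : line.split(",")) {
                if (!term.trim().isEmpty()) {
                    group.add(term.trim());
                }
            }
            // expand=true: each term in the group maps to the whole group.
            for (String term : group) {
                map.computeIfAbsent(term, k -> new LinkedHashSet<>()).addAll(group);
            }
        }
        return map;
    }
}
```

Under this model, a dictionary line such as `laptop, notebook` makes a search for either word match documents containing the other.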

~~~~~~~~~~~~~~~~~~~

First, modify the startup class, AnsjElasticConfigurator:

public static SynonymFilterFactory factory = null;
	public static String DEFAULT_SYNONYM_FILE_LIB_PATH = "ansj/synonyms.dic";

	public static void loadSynonymFilter(Settings settings) {
		Version ver = Version.LUCENE_46;
		Map<String, String> filterArgs = new HashMap<String, String>();
		filterArgs.put("luceneMatchVersion", ver.toString());
		File path = new File(environment.configFile(), settings.get("synonyms",
				DEFAULT_SYNONYM_FILE_LIB_PATH));
		filterArgs.put("synonyms", path.getAbsolutePath());
		logger.info("synonyms.dict absolute path: " + path.getAbsolutePath());
		filterArgs.put("expand", "true");
		factory = new SynonymFilterFactory(filterArgs);
		try {
			factory.inform(new FilesystemResourceLoader());
		} catch (Exception e) {
			// Exception happens here!
			logger.info("load ansj/synonyms.dic fail,detail is as follows:"
					+ e.toString());
		}
	}

	public static void init(Settings indexSettings, Settings settings) {
		if (isLoaded()) {
			return;
		}
		environment = new Environment(indexSettings);
		initConfigPath(settings);
		loadFilter(settings);
		loadSynonymFilter(settings);
		try {
			preheat();
			logger.info("ansj preheat done! It can be used now!");
		} catch (Exception e) {
			logger.error("ansj preheat fail,please check file path.");
		}
		initRedis(settings);
		setLoaded(true);
	}

This compiles successfully.

Put the two compiled class files into elasticsearch-analysis-ansj-0.2.jar, replacing the corresponding files.

Next, modify AnsjIndexAnalysis.java:

@Override
	protected TokenStreamComponents createComponents(String fieldName,
			final Reader reader) {
		// TODO Auto-generated method stub
		Tokenizer tokenizer = new AnsjTokenizer(new IndexAnalysis(
				new BufferedReader(reader)), reader, filter, pstemming);
		return new TokenStreamComponents(tokenizer,
				AnsjElasticConfigurator.factory.create(tokenizer));
	}

Make the same change in AnsjAnalysis.java:

@Override
	protected TokenStreamComponents createComponents(String fieldName,
			final Reader reader) {
		// TODO Auto-generated method stub
		Tokenizer tokenizer = new AnsjTokenizer(new ToAnalysis(
				new BufferedReader(reader)), reader, filter, pstemming);
		// add by smallblack

		return new TokenStreamComponents(tokenizer,
				AnsjElasticConfigurator.factory.create(tokenizer));
	}

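After compiling, put the class files into ansj_lucene4_plug-1.3.jar, replacing the corresponding files.

Then, before starting ES, be sure to add a synonyms.dic file under the ansj directory.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far, however, this is only static support; what we want is dynamic support.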

First, modify FileUtils.java:

public static void appendSynonymWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_SYNONYM_FILE_LIB_PATH);
			// "ansj/synonyms.dic");
			appendFile(content, file);
		} catch (IOException e) {
			logger.error("read ansj/synonyms.dic exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

	public static void removeSynonymWord(String content) {
		try {
			File file = new File(
					AnsjElasticConfigurator.environment.configFile(),
					AnsjElasticConfigurator.DEFAULT_SYNONYM_FILE_LIB_PATH);
			// "ansj/synonyms.dic");
			removeFile(content, file, false);
		} catch (FileNotFoundException e) {
			logger.error("file not found $ES_HOME/config/ansj/synonyms.dic");
			e.printStackTrace();
		} catch (IOException e) {
			logger.error("read exception", e, new Object[0]);
			e.printStackTrace();
		}
	}

Then modify AddTermRedisPubSub.java:

} else if ("stop".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				AnsjElasticConfigurator.filter.add(msg[2]);
				FileUtils.appendStopWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				AnsjElasticConfigurator.filter.remove(msg[2]);
				FileUtils.removeStopWord(msg[2]);
			}
		} else if ("syn".equals(msg[0])) {
			if ("c".equals(msg[1])) {
				FileUtils.appendSynonymWord(msg[2]);
			} else if ("d".equals(msg[1])) {
				FileUtils.removeSynonymWord(msg[2]);
			}
			AnsjElasticConfigurator.factory
					.inform(new FilesystemResourceLoader());
		}
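
The "syn" branch above follows a simple pattern: change the file first, then reload everything from the file. As a minimal plain-Java sketch of that pattern, independent of Lucene, the reload can build a fresh snapshot and swap it in atomically so that readers never see a half-loaded dictionary. `ReloadableRules` is a hypothetical class; how the real factory exposes reloaded synonyms to analyzers that are already in use is not covered by this sketch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: reload a rule file and swap the snapshot atomically. */
final class ReloadableRules {
    private final Path file;
    private final AtomicReference<List<String>> rules =
            new AtomicReference<>(Collections.<String>emptyList());

    ReloadableRules(Path file) throws IOException {
        this.file = file;
        reload();
    }

    /** Re-reads the whole file; call this after every append or remove. */
    void reload() throws IOException {
        List<String> fresh = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.trim().isEmpty()) {
                fresh.add(line.trim());
            }
        }
        // Readers holding the old list keep a consistent view; new readers see the new one.
        rules.set(Collections.unmodifiableList(fresh));
    }

    List<String> current() {
        return rules.get();
    }
}
```

Reloading the full file on each change is simple and keeps the file as the single source of truth, at the cost of re-parsing the whole dictionary for every message.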

Compile, and add the class files to elasticsearch-analysis-ansj-0.2.jar.

Test results: