Resource downloads
Solr-6.6.4 toolkit: Baidu Netdisk download (extraction code: acsx)
IKAnalyzer2012FF jar: Baidu Netdisk download (extraction code: dshv)
IK-Analyzer-2012FF source code: Baidu Netdisk download (extraction code: u9n5)
Starting Solr
- Unpack the Solr toolkit, go into the solr-6.6.4/server/solr folder and create a demo_core folder for testing, then copy the configsets/sample_techproducts_configs/conf folder into demo_core.
- Go into solr-6.6.4/bin and start Solr from the command line: solr start (this uses the default port 8983; to pick a port, run solr start -p 8981).
- Open http://localhost:8983/solr in a browser, click Core Admin -> Add Core, set the properties, and save. (Note: the core name in this step must match the folder created in the first step; the remaining values can stay at their defaults.)
- Go back to the demo_core folder: a data folder and a core.properties file have appeared. The data directory holds the index files, and core.properties holds demo_core's configuration.
Configuring the IK Chinese analyzer
- Solr does not ship with the IK Chinese analyzer; you have to add the IK-Analyzer jar and configure it yourself.
- Copy the downloaded IKAnalyzer2012FF.jar into the solr-6.6.4/server/solr-webapp/webapp/WEB-INF/lib folder.
- Go into solr-6.6.4/server/solr/demo_core/conf and make a backup copy of managed-schema.
- Edit the managed-schema file and append the IK analyzer fieldType at the end:
```xml
<!-- IK Chinese analyzer -->
<fieldType name="text_ik" class="solr.TextField">
    <!-- analyzer used at index time -->
    <analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
    <!-- analyzer used at query time -->
    <analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
```
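The fieldType alone does not make Solr index anything with IK; a field has to reference it. A possible field declaration in the same managed-schema file (the field name title_ik is just an example, not something from this article):

```xml
<field name="title_ik" type="text_ik" indexed="true" stored="true"/>
```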
- Go into solr-6.6.4/bin and restart Solr from the command line: solr restart -p 8983
- Open http://localhost:8983/solr in a browser, click Core Selector, choose demo_core -> Analysis, type "帆布鞋" (canvas shoes) into Field Value (Index), pick text_ik under Select an Option, and click Analyse Values on the right.
Configuring the dynamic IK analyzer (dynamic dictionary loading)
- Create a Java project named IKAnalyzer6.6.4 with Maven, and copy the IKAnalyzer-2012FF source code and configuration files into the new project.
- Configure the pom.xml file:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.wltea.analyzer</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>6.6.4</version>
    <packaging>jar</packaging>

    <name>${project.artifactId}</name>
    <!-- FIXME change it to the project's website -->
    <url>http://www.example.com</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-memory</artifactId>
            <version>${project.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-backward-codecs</artifactId>
            <version>${project.version}</version>
        </dependency>
    </dependencies>

    <build>
        <resources>
            <resource>
                <directory>src/main/java/org/wltea/analyzer/dic/</directory>
                <filtering>true</filtering>
                <includes>
                    <include>*.dic</include>
                </includes>
                <targetPath>${project.build.directory}/classes/org/wltea/analyzer/dic</targetPath>
            </resource>
            <resource>
                <directory>src/main/resources</directory>
                <filtering>true</filtering>
                <targetPath>${project.build.directory}/classes/</targetPath>
            </resource>
        </resources>
    </build>
</project>
```
- Because of Lucene API changes between versions, a few compilation errors have to be fixed in these classes: IKAnalyzer, IKTokenizer, IKQueryExpressionParser, SWMCQueryBuilder, LuceneIndexAndSearchDemo.
- Create IKTokenizerFactory.java and UpdateKeeper.java, which implement the dynamic dictionary reloading:
```java
package org.wltea.analyzer.lucene;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;
import org.wltea.analyzer.dic.Dictionary;

import java.io.IOException;
import java.io.InputStream;
import java.util.*;
import java.util.logging.Logger;

/**
 * TokenizerFactory that supports dynamic updates of the IK extension dictionary.
 *
 * @Author: sunshuo
 * @Date: 2019/3/12 15:54
 * @Version: 1.0
 */
public class IKTokenizerFactory extends TokenizerFactory
        implements ResourceLoaderAware, UpdateKeeper.UpdateJob {

    private final static Logger LOGGER = Logger.getLogger(IKTokenizerFactory.class.getName());
    private boolean useSmart;
    private ResourceLoader loader;
    private long lastUpdateTime = -1L;
    private String conf;

    /** Initialize this factory via a set of key-value pairs. */
    public IKTokenizerFactory(Map<String, String> args) {
        super(args);
        this.useSmart = getBoolean(args, "useSmart", false);
        this.conf = get(args, "conf");
        System.out.println(String.format(":::ik:construction:::::::::::::::::::::::::: %s", this.conf));
    }

    @Override
    public void inform(ResourceLoader loader) throws IOException {
        System.out.println(String.format(":::ik:::inform:::::::::::::::::::::::: %s", this.conf));
        this.loader = loader;
        update();
        if ((this.conf != null) && (!this.conf.trim().isEmpty())) {
            UpdateKeeper.getInstance().register(this);
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory) {
        return new IKTokenizer(factory, useSmart());
    }

    /** Reload the dictionaries when ik.conf announces a new batch. */
    @Override
    public void update() throws IOException {
        Properties p = canUpdate();
        if (p != null) {
            List<String> dicPaths = splitFileNames(p.getProperty("files"));
            List<InputStream> inputStreamList = new ArrayList<>();
            for (String path : dicPaths) {
                if ((path != null) && (!path.isEmpty())) {
                    InputStream is = this.loader.openResource(path);
                    if (is != null) {
                        inputStreamList.add(is);
                    }
                }
            }
            if (!inputStreamList.isEmpty()) {
                Dictionary.reloadDic(inputStreamList);
            }
        }
    }

    /** Check whether an update is due by comparing lastUpdate in the conf file. */
    private Properties canUpdate() {
        if (this.conf == null) {
            return null;
        }
        Properties p;
        InputStream confStream = null;
        try {
            p = new Properties();
            confStream = this.loader.openResource(this.conf);
            p.load(confStream);
        } catch (IOException e) {
            System.err.println("IK parsing conf NullPointerException~~~~~" + Arrays.toString(e.getStackTrace()));
            return null;
        } finally {
            if (confStream != null) {
                try {
                    confStream.close();
                } catch (IOException ignored) {
                }
            }
        }
        String lastUpdate = p.getProperty("lastUpdate", "0");
        long t = Long.parseLong(lastUpdate);
        if (t > this.lastUpdateTime) {
            this.lastUpdateTime = t;
            String paths = p.getProperty("files");
            if ((paths != null) && (!paths.trim().isEmpty())) {
                System.out.println("loading conf files success.");
                return p;
            }
        }
        this.lastUpdateTime = t;
        return null;
    }

    private boolean useSmart() {
        return useSmart;
    }
}
```

```java
package org.wltea.analyzer.lucene;

import java.io.IOException;
import java.util.Vector;

/**
 * Checks for dictionary updates once a minute.
 *
 * @Author: sunshuo
 * @Date: 2019/3/12 15:55
 * @Version: 1.0
 */
public class UpdateKeeper implements Runnable {
    static final long INTERVAL = 60000L;
    // volatile is needed for the double-checked locking below to be safe
    private static volatile UpdateKeeper singleton;
    Vector<UpdateJob> filterFactorys;
    Thread worker;

    private UpdateKeeper() {
        this.filterFactorys = new Vector<UpdateJob>();
        this.worker = new Thread(this);
        this.worker.setDaemon(true);
        this.worker.start();
    }

    public static UpdateKeeper getInstance() {
        if (singleton == null) {
            synchronized (UpdateKeeper.class) {
                if (singleton == null) {
                    singleton = new UpdateKeeper();
                }
            }
        }
        return singleton;
    }

    public void register(UpdateJob filterFactory) {
        this.filterFactorys.add(filterFactory);
    }

    @Override
    public void run() {
        while (true) {
            try {
                Thread.sleep(INTERVAL);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            if (!this.filterFactorys.isEmpty()) {
                for (UpdateJob factory : this.filterFactorys) {
                    try {
                        factory.update();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    public interface UpdateJob {
        void update() throws IOException;
    }
}
```
- Add a constructor to IKTokenizer that accepts an AttributeFactory, since IKTokenizerFactory.create calls one.
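The original post does not show the constructor body. A minimal sketch of what it might look like, modeled on the Reader-based constructor in the 2012FF source (the attribute fields and the IKSegmenter call are assumptions about that source, not something given in this article):

```java
// In IKTokenizer.java: extra constructor so IKTokenizerFactory.create(AttributeFactory)
// can build the tokenizer (Lucene 5+ removed the Reader constructor argument).
public IKTokenizer(AttributeFactory factory, boolean useSmart) {
    super(factory);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(CharTermAttribute.class);
    typeAtt = addAttribute(TypeAttribute.class);
    _IKImplement = new IKSegmenter(input, useSmart);
}
```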
- Add a dictionary-reload method to Dictionary (the Dictionary.reloadDic called by IKTokenizerFactory.update).
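Again, the article does not show the method itself; below is a sketch of a reloadDic implementation matching the Dictionary.reloadDic(inputStreamList) call in IKTokenizerFactory (the singleton and _MainDict members are assumptions based on the 2012FF Dictionary class, not code from this article):

```java
// In Dictionary.java: re-read the user dictionary streams and fill the words
// into the main dictionary tree; a sketch, not the exact original implementation.
public static void reloadDic(List<InputStream> inputStreams) {
    for (InputStream is : inputStreams) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"), 512)) {
            String word;
            while ((word = br.readLine()) != null) {
                if (!word.trim().isEmpty()) {
                    singleton._MainDict.fillSegment(word.trim().toCharArray());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```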
- Run the main methods of IKAnalyzerDemo and LuceneIndexAndSearchDemo; if both produce normal results, the changes are working.
- In the project directory, run mvn clean install -Dmaven.test.skip=true to package the jar.
- Copy the newly built jar into the solr-6.6.4/server/solr-webapp/webapp/WEB-INF/lib folder.
- Create ik.conf and my.dic in the solr-6.6.4/server/solr/demo_core/conf folder. (Note: ik.conf and my.dic must be UTF-8 encoded without a BOM, otherwise IKAnalyzer reads garbled text.) The content of ik.conf:

```properties
# lastUpdate is the latest update batch; increment it by 1 on every dictionary change.
# files lists the custom dictionary files; separate multiple files with commas.
lastUpdate=1
files=my.dic
```
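The check that IKTokenizerFactory.canUpdate performs on this file boils down to parsing a properties file and comparing the batch number. A self-contained sketch of that logic (the class IkConfCheck and its check method are illustrative names, not part of the project):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class IkConfCheck {
    // Parse ik.conf-style content and decide whether a dictionary reload is due,
    // the same way canUpdate compares lastUpdate against the last batch it saw.
    static String check(String conf, long lastSeenBatch) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(conf));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for a StringReader
        }
        long batch = Long.parseLong(p.getProperty("lastUpdate", "0"));
        String files = p.getProperty("files");
        if (batch > lastSeenBatch && files != null && !files.trim().isEmpty()) {
            return "reload " + files.split(",").length + " file(s)";
        }
        return "no update";
    }

    public static void main(String[] args) {
        System.out.println(check("lastUpdate=2\nfiles=my.dic,other.dic\n", 1L)); // reload 2 file(s)
        System.out.println(check("lastUpdate=1\nfiles=my.dic\n", 1L));           // no update
    }
}
```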
- Edit the managed-schema file and append the dynamic IK analyzer fieldType at the end:
```xml
<!-- dynamic IK Chinese analyzer -->
<fieldType name="text_ik_dm" class="solr.TextField">
    <!-- analyzer used at index time -->
    <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
    <!-- analyzer used at query time -->
    <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
</fieldType>
```
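As with text_ik, a field has to reference the new type before documents are analyzed with it. An example declaration for managed-schema (the field name product_name is illustrative, not from the article):

```xml
<field name="product_name" type="text_ik_dm" indexed="true" stored="true"/>
```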
- Stop the Solr service (solr stop -p 8983), delete the old IKAnalyzer2012FF_u2.jar, and start Solr again.
- Open http://localhost:8983/solr in a browser, click Core Selector, choose demo_core -> Analysis, type "帆布鞋" into Field Value (Index), pick text_ik_dm under Select an Option, and click Analyse Values on the right.
- In the solr-6.6.4/server/solr/demo_core/conf folder, edit my.dic and add "鞋" (shoe) as a new word, then edit ik.conf and increment lastUpdate by 1. Wait one minute and run the Analysis step above again; the new word should now be picked up.