Setting up Solr 8.5.0 and configuring the latest IK analyzer

Author: 请持续率性

Categories: solr, java. Tags: solr, IK analyzer, full-text search, search engine

Article link: https://blog.csdn.net/k21325/article/details/105352617

News full-text search service

1. Fields to index
Client ID: info_classify.app_id
Client name: app_info.name
Column ID: info_classify.columns_id
Column name: columninfo.columnName
News content: info_classify.content_text
Creation time: info_classify.create_time
ID: info_classify.id
Label: info_classify.info_label
News ID: info_classify.information_id
News title: info_classify.list_title
List display type: info_classify.list_view_type
Online time: info_classify.online_time
Status: info_classify.status
News synopsis: information.synopsis
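
The relationship between these source columns and the Solr fields can be written down as a simple lookup. The sketch below is only an illustration: the Solr field names are the ones used in the managed-schema in step 11, but the pairing of each column with a field is inferred from the names, and `COLUMN_TO_FIELD` and `to_solr_doc` are hypothetical names that do not come from the original service.

```python
# Source column -> Solr field name.
# Field names come from the managed-schema in step 11; the pairing itself
# is inferred from the names and is an assumption.
COLUMN_TO_FIELD = {
    "info_classify.app_id": "appId",
    "app_info.name": "appName",
    "info_classify.columns_id": "columnsId",
    "columninfo.columnName": "columnName",
    "info_classify.content_text": "contentText",
    "info_classify.create_time": "createdTime",
    "info_classify.id": "id",
    "info_classify.info_label": "infoLabel",
    "info_classify.information_id": "informationId",
    "info_classify.list_title": "title",
    "info_classify.list_view_type": "listViewType",
    "info_classify.online_time": "onlineTime",
    "info_classify.status": "status",
    "information.synopsis": "synopsis",
}


def to_solr_doc(row):
    """Rename the keys of one row (a dict keyed by 'table.column').

    Columns that are not in the mapping are silently dropped.
    """
    return {
        COLUMN_TO_FIELD[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_FIELD
    }
```

Keeping the mapping in one place makes it easier to check the sync service against the schema when a field is added or renamed.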

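2. Set up the Solr search engine service

3. Write the data synchronization service
- The first sync is a full sync, done page by page with about 1000 records per page;
- Later syncs are incremental by maximum ID: look up the largest ID in the index, then query the news items with a larger ID and index them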
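
The two sync modes described above can be sketched as follows. This is a minimal illustration, not the actual service: `fetch_page`, `fetch_after_id` and `index_docs` are hypothetical callables standing in for the database queries and the Solr indexing call, and paging the incremental sync is an added assumption. `fetch_after_id` is assumed to return rows ordered by ascending ID.

```python
PAGE_SIZE = 1000  # page size suggested in the text


def full_sync(fetch_page, index_docs, page_size=PAGE_SIZE):
    """First run: walk the whole table page by page and index every row.

    fetch_page(page_number, page_size) -> list of rows (hypothetical DB query).
    index_docs(rows) -> sends the rows to Solr (hypothetical).
    Returns the number of rows indexed.
    """
    page = 0
    total = 0
    while True:
        rows = fetch_page(page, page_size)
        if not rows:
            break
        index_docs(rows)
        total += len(rows)
        if len(rows) < page_size:  # a short page means it was the last one
            break
        page += 1
    return total


def incremental_sync(max_indexed_id, fetch_after_id, index_docs, page_size=PAGE_SIZE):
    """Later runs: index only rows whose ID exceeds the largest indexed ID.

    fetch_after_id(last_id, page_size) -> rows with id > last_id, ascending.
    Returns the number of rows indexed.
    """
    last_id = max_indexed_id
    total = 0
    while True:
        rows = fetch_after_id(last_id, page_size)
        if not rows:
            break
        index_docs(rows)
        total += len(rows)
        last_id = max(row["id"] for row in rows)
    return total
```

Note that syncing by maximum ID only picks up newly inserted rows; updates to rows that are already indexed would need a separate mechanism.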

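4. Integrate the government cloud CMS with the search engine so that news can be searched; the action buttons still need to be confirmed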

Solr version: https://mirrors.tuna.tsinghua.edu.cn/apache/lucene/solr/8.5.0/solr-8.5.0.tgz
Tomcat version: https://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-9/v9.0.33/bin/apache-tomcat-9.0.33.tar.gz
Solr file location: /var/www/file/images2/solr/solr-8.5.0/
Application location: /usr/local/server/apache-tomcat-solr
solr-core-home:
/var/www/file/images2/solr/cores/info
/var/www/file/images2/solr/solr-8.5.0/server/solr/info/

5. Configure solr-home (an env-entry in the webapp's WEB-INF/web.xml)

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>/var/www/file/images2/solr/cores</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>

6. Restart the Solr service

/var/www/file/images2/solr/solr-8.5.0/bin/solr restart -force -m 4g

7. Delete all data (from the Documents page of the Solr admin UI):
1) Set Document Type to XML
2) Enter the following in Document(s)

<delete><query>*:*</query></delete>
<commit/>

3) Click Submit Document
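
The same delete-by-query can also be sent to the core's update handler over HTTP instead of through the admin UI. The sketch below only builds the request. The host and port (`localhost:8983`) are assumptions, the core name `info` is taken from the paths listed earlier, and `build_delete_all_request` is a hypothetical helper name. The `commit=true` URL parameter stands in for the separate `<commit/>` element.

```python
import urllib.request


def build_delete_all_request(base_url="http://localhost:8983/solr", core="info"):
    """Build a POST request that deletes every document in the core.

    base_url is an assumption; adjust it to the actual Solr address.
    """
    body = b"<delete><query>*:*</query></delete>"
    url = f"{base_url}/{core}/update?commit=true"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


# To actually send it (requires a running Solr instance):
# with urllib.request.urlopen(build_delete_all_request()) as response:
#     print(response.status)
```

Since this wipes the whole index, it is worth double-checking the core name before sending the request.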

8. Query that requires a 100% keyword match
q: title:通远门
Raw Query Parameters: defType=edismax&mm=100%
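
Outside the admin UI, the same parameters go on the query string of the core's `/select` handler. The sketch below shows how they could be URL-encoded. Note that the `%` in `mm=100%` has to be escaped as `%25`. The host, port and the helper name `build_exact_match_url` are assumptions; the core name `info` comes from the paths listed earlier.

```python
from urllib.parse import urlencode


def build_exact_match_url(keyword, field="title",
                          base_url="http://localhost:8983/solr", core="info"):
    """Build a /select URL that requires every query term to match (mm=100%)."""
    params = {
        "q": f"{field}:{keyword}",
        "defType": "edismax",  # extended DisMax query parser
        "mm": "100%",          # minimum-should-match: all terms
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Example: print(build_exact_match_url("通远门"))
```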

9. Configure the smartcn analyzer
9.1. Copy the bundled jar

cp /var/www/file/images2/solr/solr-8.5.0/contrib/analysis-extras/lucene-libs/lucene-analyzers-smartcn-8.5.0.jar /var/www/file/images2/solr/solr-8.5.0/server/solr-webapp/webapp/WEB-INF/lib/

9.2. Edit managed-schema and define the field type

<!-- Chinese analyzer configuration -->
<fieldType name="text_smartcn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    </analyzer>
</fieldType>

9.3. Edit managed-schema and assign the field type

<field name="content" type="text_smartcn" indexed="true" stored="true"/>

10. Startup script for the jar

nohup /usr/local/jdk1.8.0_181/bin/java \
-XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=128m \
-Xms1024m -Xmx1024m -Xmn256m -Xss256k \
-XX:SurvivorRatio=8 -XX:+UseConcMarkSweepGC \
-XX:+PrintGCDateStamps -XX:+PrintGCDetails \
-verbose:gc -Xloggc:/var/www/logs/cqliving-cloud-solr/gc.log \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/var/www/logs/cqliving-cloud-solr/oom.hprof \
-Dspring.profiles.active=prod \
-jar cqliving-cloud-solr-1.0.1-SNAPSHOT.jar  1>"/var/www/logs/cqliving-cloud-solr/console.log" 2>&1 </dev/null &
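
As a rough worked example of what the memory flags above imply: with `-Xmx1024m -Xmn256m -XX:SurvivorRatio=8`, the young generation is split into one eden space and two survivor spaces in an 8:1:1 ratio, and the rest of the heap is the old generation. The helper below is a hypothetical calculator, and the sizes the JVM actually uses may be rounded for alignment.

```python
def heap_layout(xmx_mb, xmn_mb, survivor_ratio):
    """Estimate generation sizes in MB from -Xmx, -Xmn and -XX:SurvivorRatio.

    SurvivorRatio is eden / one survivor space, and there are two survivor
    spaces, so the young generation has (survivor_ratio + 2) equal parts.
    """
    survivor = xmn_mb / (survivor_ratio + 2)
    return {
        "old": xmx_mb - xmn_mb,
        "eden": survivor * survivor_ratio,
        "survivor_each": survivor,
    }


layout = heap_layout(xmx_mb=1024, xmn_mb=256, survivor_ratio=8)
# old generation: 768 MB, eden: about 204.8 MB, each survivor: about 25.6 MB
```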

11. Configure the IK analyzer
1) Download the IK analyzer: https://github.com/magese/ik-analyzer-solr

<dependency>
    <groupId>com.github.magese</groupId>
    <artifactId>ik-analyzer</artifactId>
    <version>8.3.0</version>
</dependency>

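2) Copy the downloaded jar into Solr's WEB-INF/lib directory
2.1) Put the 5 configuration files from the resources directory into the webapp/WEB-INF/classes/ directory of the Jetty or Tomcat instance that runs Solr
① IKAnalyzer.cfg.xml
② ext.dic
③ stopword.dic
④ ik.conf
⑤ dynamicdic.txt

Note:

- After editing dynamicdic.txt, also update the lastupdate field in ik.conf; a timestamp can be used as its value
- After changing a dictionary, the index must be rebuilt before the new words take effect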
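
Updating `lastupdate` by hand is easy to forget, so it can be scripted. The sketch below assumes that ik.conf is a plain `key=value` text file containing a `lastupdate=` line; that layout is an assumption and should be checked against the file shipped with the analyzer. `bump_lastupdate` is a hypothetical helper name, and it works on the file's text so it can be tried without touching the real file.

```python
import time


def bump_lastupdate(conf_text, timestamp=None):
    """Return conf_text with the lastupdate value replaced by a timestamp.

    Assumes one key=value pair per line. If no lastupdate line exists,
    one is appended. timestamp defaults to the current Unix time in seconds.
    """
    if timestamp is None:
        timestamp = int(time.time())
    lines = conf_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.strip().startswith("lastupdate="):
            lines[i] = f"lastupdate={timestamp}"
            found = True
    if not found:
        lines.append(f"lastupdate={timestamp}")
    return "\n".join(lines) + "\n"
```

A typical use would be to read ik.conf, pass its content through this function right after editing dynamicdic.txt, and write the result back.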

3) Configure Solr's managed-schema and add the IK analyzer:

    <field name="id" type="pint" indexed="true" stored="true" required="true" multiValued="false" />
    <!-- docValues are enabled by default for long type so we don't need to index the version field  -->
    <field name="_version_" type="plong" indexed="false" stored="false"/>
    <!-- If you don't use child/nested documents, then you should remove the next two fields:  -->
    <!-- for nested documents (minimal; points to root document) -->
    <field name="_root_" type="pint" indexed="true" stored="false" docValues="false" />
    <!-- for nested documents (relationship tracking) -->
    <field name="_nest_path_" type="_nest_path_" />
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

    <field name="appId" type="plong" indexed="true" stored="true"/>
    <field name="appName" type="text_ik" indexed="true" stored="true"/>
    <field name="columnName" type="text_ik" indexed="true" stored="true"/>
    <field name="columnsId" type="plong" indexed="true" stored="true"/>
    <field name="contentText" type="text_ik" indexed="true" stored="true"/>
    <field name="createdTime" type="pdate" indexed="true" stored="true"/>
    <field name="infoLabel" type="text_ik" indexed="true" stored="true"/>
    <field name="informationId" type="plong" indexed="true" stored="true"/>
    <field name="title" type="text_ik" indexed="true" stored="true"/>
    <field name="listViewType" type="plong" indexed="true" stored="true"/>
    <field name="onlineTime" type="pdate" indexed="true" stored="true"/>
    <field name="status" type="pint" indexed="true" stored="true"/>
    <field name="synopsis" type="text_ik" indexed="true" stored="true"/>
    
    <!-- IK analyzer -->
    <fieldType name="text_ik" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="false" conf="ik.conf"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="org.wltea.analyzer.lucene.IKTokenizerFactory" useSmart="true" conf="ik.conf"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

4) Then change the type of the relevant fields to text_ik
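
As an alternative to editing managed-schema by hand, a field's type can also be changed through Solr's Schema API with a `replace-field` command, provided the schema is managed and mutable. This is a different route from the one the post uses. The sketch below only builds the request. The host and port are assumptions, the core name `info` comes from the paths listed earlier, and `build_replace_field_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_replace_field_request(field_name,
                                base_url="http://localhost:8983/solr", core="info"):
    """Build a Schema API request that redefines field_name with type text_ik."""
    payload = {
        "replace-field": {
            "name": field_name,
            "type": "text_ik",
            "indexed": True,
            "stored": True,
        }
    }
    return urllib.request.Request(
        f"{base_url}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually send it (requires a running Solr with the text_ik type defined):
# with urllib.request.urlopen(build_replace_field_request("title")) as response:
#     print(response.read().decode("utf-8"))
```

Either way, existing documents need to be reindexed after the field type changes.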

5) Configure the extension dictionaries
5.1) Extension words: ext.dic
5.2) Stop words: stopword.dic
5.3) Configuration file
vim /WEB-INF/classes/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
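    <comment>IK Analyzer extension configuration</comment>
    <!-- Whether to load the default dictionary -->
    <entry key="use_main_dict">true</entry>
    <!-- Custom extension dictionaries: words to use for indexing; separate multiple files with semicolons -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- Custom extension stop-word dictionaries: words to exclude from indexing; separate multiple files with semicolons -->
    <entry key="ext_stopwords">stopword.dic;</entry>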
</properties>
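
To double-check which dictionary files a given IKAnalyzer.cfg.xml refers to, the file can be parsed and the semicolon-separated values split apart. The sketch below is a small reading aid rather than part of the analyzer. `read_ik_dictionaries` is a hypothetical helper name, and it takes the XML as a string.

```python
import xml.etree.ElementTree as ET


def read_ik_dictionaries(cfg_xml):
    """Return the dictionary file names listed under ext_dict and ext_stopwords.

    Values are split on ';' and empty pieces (such as the one after a
    trailing ';') are dropped.
    """
    root = ET.fromstring(cfg_xml.encode("utf-8"))
    result = {}
    for entry in root.findall("entry"):
        key = entry.get("key")
        if key in ("ext_dict", "ext_stopwords"):
            text = entry.text or ""
            result[key] = [name.strip() for name in text.split(";") if name.strip()]
    return result
```

For the configuration shown above, this would list ext.dic under `ext_dict` and stopword.dic under `ext_stopwords`.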