Configuring and Using the ansj Tokenizer in Solr

Latest recommended article published on 2020-09-07 11:22:15

麦克是个程序员


Category: Retrieval. Tags: solr, ansj


The main steps are: download or build the ansj and nlp-lang jars, configure the field type in the schema, add the ansj and nlp-lang jars to Solr, and test the tokenization results.

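I. Download or build the ansj-seg and nlp-lang jars.

  1. You can download the jars from http://maven.ansj.org/org/ansj/ansj_seg/ and http://maven.ansj.org/org/nlpcn/.

    The ansj-seg jars: [Figure: listing of the ansj-seg jars]

    nlp-lang is the set of natural language processing utilities that ansj-seg relies on, and it is quite capable: [Figure: listing of the nlp-lang jars]

  2. Download the source and build it yourself.

    This route is more involved, but it is worth the effort if you plan to use ansj over the long term. A tokenizer this good also deserves a closer look at its source.

    GitHub repository: https://github.com/NLPchina/ansj_seg

    Git client download: http://git-scm.com/download/

    Command to fetch the source: git clone https://github.com/NLPchina/ansj_seg.git

    Directory layout after cloning: [Figure: ansj_seg source tree]

    As the layout shows, the code is organized and managed with Maven. Installing and configuring Maven is only sketched here. It mainly involves:

    Download the Maven distribution and unpack it: [Figure: unpacked Maven directory]

    Set the environment variable M2_HOME: C:\apache-maven-3.2.1

    Add to the PATH environment variable: %M2_HOME%\bin;

    Common mvn commands:

      mvn clean install      # removes previous build output, downloads dependency jars, then builds and installs the artifact; add -DskipTests=true to skip the unit tests
      mvn eclipse:clean      # removes the Eclipse project files generated by mvn
      mvn eclipse:eclipse    # generates an Eclipse project from pom.xml

    Steps:

    In the source root, run mvn clean install -DskipTests=true. The jar is generated in the target directory.

    target directory: [Figure: contents of the target directory]

    The nlp-lang jar can be built the same way from https://github.com/NLPchina/nlp-lang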
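If you already use Maven, a third option is to let it fetch the jars for you rather than downloading them by hand. The fragment below is only a sketch of what such a pom.xml section might look like. It assumes the coordinates org.ansj:ansj_seg and org.nlpcn:nlp-lang implied by the repository paths above; the repository id and the two version properties are placeholders introduced for this example and must be set to versions actually listed in the repository.

```xml
<!-- Sketch only: the id and the version properties are placeholders, not values from the article -->
<repositories>
  <repository>
    <id>ansj-repo</id> <!-- arbitrary id chosen for this example -->
    <url>http://maven.ansj.org/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>${ansj.version}</version> <!-- define this property yourself -->
  </dependency>
  <dependency>
    <groupId>org.nlpcn</groupId>
    <artifactId>nlp-lang</artifactId>
    <version>${nlplang.version}</version> <!-- define this property yourself -->
  </dependency>
</dependencies>
```

The resolved jars land in the local Maven repository and still have to be copied into Solr's library path, just like the manually downloaded or self-built ones.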

II. Configure the ansj field type in Solr's schema.xml.

  1. Create the ansj type.

    Open schema.xml and add the ansj field type text_ansj:

 
        <!--ansj start -->
        <fieldType name="text_ansj" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" isQuery="false"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="org.ansj.solr.AnsjTokenizerFactory"/>
            </analyzer>
        </fieldType>
        <!--ansj end -->

  org.ansj.solr.AnsjTokenizerFactory comes from the ansj-lucene plugin that we built.

  2. Configure the fields to be indexed.

 
   
        <!-- ansj_test field -->
        <field name="POI_OID" type="string" indexed="false" stored="true"/>
        <field name="POI_NAME" type="text_ansj" indexed="true" stored="false"/>
        <field name="POI_NAME_SUGGEST" type="string" indexed="false" stored="true"/>
        <field name="POI_ADDRESS" type="text_ansj" indexed="true" stored="false"/>
        <field name="POI_ADDRESS_SUGGEST" type="string" indexed="false" stored="true"/>
        <field name="POI_PHONE" type="string" indexed="true" stored="true"/>
        <field name="POI_TYPE" type="string" indexed="true" stored="true" multiValued="true"/>
        <field name="POI_URL" type="string" indexed="false" stored="true"/>
        <field name="POI_DIANPING" type="string" indexed="true" stored="true"/>
        <field name="POI_BRAND" type="string" indexed="true" stored="true"/>
        <field name="POI_CITY" type="string" indexed="true" stored="true" multiValued="true"/>
        <field name="POI_TAG" type="text_ansj" indexed="true" stored="true"/>
        <field name="POI_LAT" type="double" indexed="false" stored="true"/>
        <field name="POI_LON" type="double" indexed="false" stored="true"/>
        <field name="POI_DATA_TYPE" type="string" indexed="true" stored="false"/>
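
To make these field definitions more concrete, here is a sketch of a Solr XML update message that fills in some of them. Every value is invented for illustration. It also assumes that the rest of your schema.xml (required fields, uniqueKey, and so on) imposes no further constraints, which may not hold in your setup.

```xml
<!-- Hypothetical sample document: all values below are made up -->
<add>
  <doc>
    <field name="POI_OID">10001</field>
    <!-- Tokenized by ansj for search; not stored, so it is not returned in results -->
    <field name="POI_NAME">示例咖啡馆</field>
    <!-- Stored, untokenized copy of the name that can be returned to the client -->
    <field name="POI_NAME_SUGGEST">示例咖啡馆</field>
    <field name="POI_ADDRESS">示例市示例区示例路1号</field>
    <field name="POI_ADDRESS_SUGGEST">示例市示例区示例路1号</field>
    <!-- multiValued fields are populated by repeating the element -->
    <field name="POI_TYPE">餐饮</field>
    <field name="POI_TYPE">咖啡</field>
    <field name="POI_CITY">示例市</field>
    <field name="POI_TAG">咖啡 甜点 无线网络</field>
    <field name="POI_LAT">39.9</field>
    <field name="POI_LON">116.4</field>
  </doc>
</add>
```

With this layout, POI_NAME and POI_ADDRESS are searchable through the ansj analyzer but are not returned, while the matching *_SUGGEST fields presumably hold the displayable text. That reading is inferred from the indexed and stored flags, not stated by the original configuration.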