[lucene]04. Using SmartChineseAnalyzer for Chinese Word Segmentation

乐之者java

Published 2020-04-21 22:50:51

Category: java  Tags: lucene

Article link: https://blog.csdn.net/xiaozhuangyumaotao/article/details/105670107

Change the index-creation code from the previous article to the following:

package cn.zhao.cms.column;

import java.io.File;
import java.nio.file.Paths;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestLucene {
    public static void main(String[] args) throws Exception {
        // Directory that holds the index Lucene generates. Think of it as a table:
        // all of Lucene's index data is stored in this "table".
        Directory directory = FSDirectory.open(Paths.get("D:\\lucene\\index"));
        // The default analyzer, suited to English text (used in the previous article):
        // Analyzer analyzer = new StandardAnalyzer();
        // Use the Chinese analyzer instead:
        SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
        IndexWriterConfig conf = new IndexWriterConfig(analyzer);
        IndexWriter iwriter = new IndexWriter(directory, conf);
        // Delete any previously generated index before each run.
        iwriter.deleteAll();
        // Intended to purge the deleted documents as well. Note that deleteAll()
        // already drops all segments, so this call is probably redundant here.
        iwriter.forceMergeDeletes();

        // Index this one file. A real project would have many directories and files;
        // a single file is enough for this example.
        File file = new File("D:\\lucene\\data\\tomcat如何配置根目录访问.html");
        // A Document is like one record in the table, and a record has several fields,
        // which are like the table's columns.
        // A field has n (>= 1) terms. Terms have no direct counterpart in a relational
        // database: the content of a field is split by some rule (the analyzer) into
        // units, and each unit is called a term.
        Document document = new Document();
        document.add(new StringField("id", 0 + "", Field.Store.YES));
        document.add(new StringField("docurl", file.getAbsolutePath(), Field.Store.YES));
        // readFileToString(file) reads with the platform default charset; the overload
        // that takes a charset lets you pin the encoding explicitly.
        TextField textField = new TextField("content", FileUtils.readFileToString(file), Field.Store.NO);
        document.add(textField);
        // Analyze the document with the configured analyzer and write it to the
        // index directory D:\\lucene\\index.
        iwriter.addDocument(document);
        // numDocs() takes deletions into account; maxDoc() does not subtract them.
        System.out.println("Number of documents actually indexed: " + iwriter.numDocs());
        System.out.println("Maximum document count: " + iwriter.maxDoc());
        // Finally, remember to release the resources.
        iwriter.close();
        directory.close();
    }
}
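
To make the effect of swapping the analyzer more concrete, here is a small toy sketch that uses only the JDK, not Lucene. It contrasts two ways of cutting the same Chinese text into terms: one term per character, which is roughly what StandardAnalyzer produces for Chinese characters, and whole words looked up in a dictionary. Treat it as an illustration only. SmartChineseAnalyzer itself segments with a hidden Markov model backed by a large dictionary, whereas this sketch substitutes simple forward maximum matching. The class name `TermSplitSketch`, its methods, and the tiny dictionary are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy illustration only: this is NOT how SmartChineseAnalyzer works internally.
class TermSplitSketch {

    // Hypothetical mini dictionary; a real analyzer ships a far larger one.
    static final Set<String> DICTIONARY = Set.of("中文", "分词", "配置", "根目录", "访问");

    // Per-character splitting: every character becomes its own term.
    // Assumes the text contains no surrogate pairs.
    static List<String> perCharacter(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            terms.add(String.valueOf(text.charAt(i)));
        }
        return terms;
    }

    // Forward maximum matching: at each position take the longest dictionary
    // word of at most maxWordLength characters, and fall back to a single
    // character when no dictionary word matches.
    static List<String> dictionaryBased(String text, int maxWordLength) {
        List<String> terms = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !DICTIONARY.contains(text.substring(pos, end))) {
                end--;
            }
            terms.add(text.substring(pos, end));
            pos = end;
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "配置根目录访问";
        System.out.println(perCharacter(text));       // [配, 置, 根, 目, 录, 访, 问]
        System.out.println(dictionaryBased(text, 3)); // [配置, 根目录, 访问]
    }
}
```

With per-character terms, a search for a word can only be answered by combining single-character matches. With word-level terms, the word itself is stored in the index as one term, which is why the term list shown for the same document changes after the analyzer is swapped.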

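Code notes:

1. Changes caused by the analyzer:

Here the analyzer has been switched from the original StandardAnalyzer to SmartChineseAnalyzer, an analyzer that supports Chinese. Opening the Luke tool shows that, for the same document, the indexed information has changed: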