Getting Started with Lucene

yanghaoplus

Article link: https://blog.csdn.net/yanghao201607030101/article/details/107776858


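All of the notes, together with the dependency jars, configuration, and project files, can be downloaded below. The project itself also contains the dependencies, the extension dictionaries, and the configuration files.
Link: https://pan.baidu.com/s/133EkTSWEn-ohtaqLbzufUA
Extraction code: 9i44
Lucene is an open-source full-text search engine toolkit from Apache. It provides a complete query engine and indexing engine, plus part of a text-analysis engine. Its purpose is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system.
Full-text Search
Full-text search extracts part of the information in unstructured data and reorganizes it so that it has some structure, then searches that structured data, which makes searching comparatively fast. The information that is extracted from the unstructured data and reorganized in this way is called the index.
Take a Chinese dictionary as an example. Its pinyin table and radical lookup table are the dictionary's index. The explanation of each character is unstructured; without these tables, finding a character would mean scanning the whole dictionary in order. Some information about a character can be extracted and structured, however. Pronunciation, for instance, is fairly structured: it consists of an initial and a final, and there are only a limited number of each, so they can be listed exhaustively. The pronunciations are therefore pulled out and arranged in a fixed order, and each one points to the page that holds the detailed explanation of the character. To search, we look up the pronunciation in the structured pinyin table and follow the page number it points to, which leads us to the unstructured data, namely the explanation of the character.
This process of first building an index and then searching the index is called full-text search.
Building the index is time-consuming, but once built it can be used many times. Full-text search is dominated by queries, so the time spent building the index is worthwhile.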

How Lucene implements full-text search

[Figure: indexing and search flow chart]

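1. Create the index
1) Obtain the documents
Original documents: the data that searches will run against.
Search engines: obtain the original documents with a crawler.
Site search: the data in the database.
This example: read files from disk directly with IO streams.
2) Build document objects
Create one Document object for each original document.
Each Document object contains multiple fields (Field).
The fields hold the data of the original document; each field has a name and a value.
Every document has a unique number, the document id.
3) Analyze the documents
This is the tokenization process:
1. Split the string on whitespace to get a list of words.
2. Convert the words to lowercase.
3. Remove punctuation.
4. Remove stop words (words that carry no meaning).
Each keyword is wrapped in a Term object, which has two parts: the field the keyword belongs to and the keyword itself.
The same keyword extracted from different fields yields different Terms.
4) Create the index
Build an index from the keyword list and save it to the index store.
The index store holds the index, the Document objects, and the mapping between keywords and documents.
Finding documents through terms in this way is the structure known as an inverted index.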
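2. Query the index
1) User query interface
This is where the user enters the search conditions, for example Baidu's search box.
2) Wrap the keywords in a query object
The query object specifies the field to search and the keyword to search for.
3) Execute the query
Search the specified field for the keyword, then use the keyword to find the matching documents.
4) Render the results
Fetch the Document objects by document id.
Highlight the keywords.
Apply paging.
Finally, show the results to the user.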
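
The flow above can be sketched without Lucene at all. The following is a minimal toy in plain Java, not Lucene's implementation: the class name ToyInvertedIndex, the stop-word list, and the "field:keyword" key format are all assumptions made for illustration. It analyzes text (split, lowercase, drop punctuation and stop words), records which documents contain each term, and answers a query by looking the term up and following the ids back to the documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: maps "field:keyword" to the ids of the documents containing it. */
class ToyInvertedIndex {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // Term ("field:keyword") -> sorted ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // Document id -> original text, standing in for the stored Document objects.
    private final List<String> documents = new ArrayList<>();

    /** Analysis: split on whitespace and punctuation, lowercase, drop stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[\\s\\p{Punct}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Indexing: assign the next document id and record every term of the text. */
    int addDocument(String field, String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : analyze(text)) {
            postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Searching: look the term up and return the ids of the matching documents. */
    Set<Integer> search(String field, String keyword) {
        return postings.getOrDefault(field + ":" + keyword, new TreeSet<>());
    }

    /** Rendering: fetch the original document by its id. */
    String document(int docId) {
        return documents.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("content", "Lucene is a full-text search toolkit.");
        index.addDocument("content", "Spring is a Java framework.");
        index.addDocument("content", "The Lucene index is an inverted index.");
        for (int docId : index.search("content", "lucene")) {
            System.out.println(docId + " -> " + index.document(docId));
        }
    }
}
```

Because the key includes the field name, the same keyword in two different fields produces two separate entries, which mirrors the point that identical keywords from different fields are different Terms.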

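Getting-started example:

This example uses IKAnalyzer as the analyzer.
How to use IKAnalyzer
1) Add the IKAnalyzer jar to the project.
2) Add the configuration file and the extension dictionaries to the project's classpath.
Note: never edit the extension dictionaries with Windows Notepad; make sure the dictionaries stay encoded as UTF-8.
Extension dictionary hotword: holds new words such as company names or personal names. Once added, each of them is emitted as a single token.
Stop-word dictionary stopword: holds meaningless or sensitive words. For example, once 色情 (pornography) is added to the stop-word dictionary it is no longer emitted as a token, so it cannot be searched for, which effectively filters it out.
The required dependencies are in the lib directory. Three files sit under the src directory, that is, on the classpath: the hotword extension dictionary, the stopword stop-word dictionary, and the configuration file.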
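Field attributes
Analyzed: whether the field's content is tokenized. This only matters if the field's content is going to be queried.
Indexed: whether the terms produced by analysis, or the whole field value, are put into the index. Only indexed fields can be searched.
For example, product names and product descriptions are analyzed and then indexed; order numbers and ID card numbers are not analyzed but are still indexed, because all of these will be used as query conditions.
Stored: whether the field value is stored in the document. Only stored fields can be read back from a Document.
For example, product names and order numbers: any field that will later be read from a Document must be stored.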
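
The difference between "indexed" and "stored" can be made concrete with a small toy model. This is a sketch in plain Java, not Lucene's data structures; the class ToyFieldFlags and its methods are hypothetical, and analysis is left out so that a whole field value acts as one term. In Lucene terms, index=true with store=false corresponds to something like `new TextField(name, value, Field.Store.NO)`, and index=false with store=true corresponds to `new StoredField(name, value)`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the "indexed" and "stored" flags of a field. */
class ToyFieldFlags {
    // "field:value" -> ids of the documents that can be found through that value.
    private final Map<String, Set<Integer>> indexed = new HashMap<>();
    // "docId/field" -> value that can be read back from the document.
    private final Map<String, String> stored = new HashMap<>();

    void addField(int docId, String field, String value, boolean index, boolean store) {
        if (index) {
            indexed.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
        }
        if (store) {
            stored.put(docId + "/" + field, value);
        }
    }

    /** Only fields added with index=true can be found here. */
    Set<Integer> search(String field, String value) {
        return indexed.getOrDefault(field + ":" + value, new TreeSet<>());
    }

    /** Only fields added with store=true can be read back; otherwise null. */
    String get(int docId, String field) {
        return stored.get(docId + "/" + field);
    }

    public static void main(String[] args) {
        ToyFieldFlags index = new ToyFieldFlags();
        index.addField(0, "content", "body text", true, false);  // searchable, not retrievable
        index.addField(0, "path", "d:/temp/1.txt", false, true); // retrievable, not searchable
        System.out.println(index.search("content", "body text")); // [0]
        System.out.println(index.get(0, "content"));              // null
        System.out.println(index.search("path", "d:/temp/1.txt")); // []
        System.out.println(index.get(0, "path"));                 // d:/temp/1.txt
    }
}
```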


Create two classes in the com.yh.lucene package: IndexManager, which maintains the index, and SearchIndex, which searches it.

The IndexManager class:


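package com.yh.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Maintains the index: add, delete, and update documents.
 * Steps:
 *   1. Create a Directory object that specifies where the index is stored.
 *   2. Create an IndexWriter on that Directory. The IndexWriter constructor takes an
 *      IndexWriterConfig, which uses StandardAnalyzer by default (it does not segment
 *      Chinese into words), so it is replaced with IKAnalyzer.
 *   3. Read the files on disk and create one Document per file.
 *   4. Add fields to the Document.
 *   5. Write the Document to the index.
 *   6. Close the IndexWriter.
 */
public class IndexManager {
    private IndexWriter indexWriter;

    /**
     * Set up the parts shared by all tests.
     *
     * @throws IOException if the index directory cannot be opened
     */
    @Before
    public void init() throws IOException {
        // Location where the index is stored
        Directory directory = FSDirectory.open(new File("C:\\temp\\index").toPath());
        // Choose the analyzer
        IndexWriterConfig config = new IndexWriterConfig(new IKAnalyzer());
        // Create the IndexWriter
        indexWriter = new IndexWriter(directory, config);
    }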


    /**
     * Build the index from the files in a directory.
     */
    @Test
    public void createIndex() throws Exception {
        // Steps 1 and 2 (open the Directory, create the IndexWriter with IKAnalyzer instead
        // of the default StandardAnalyzer) are already done in init(). Opening a second
        // IndexWriter on the same directory here would fail with a
        // LockObtainFailedException, because the first writer holds the write lock.
        // To keep the index in memory instead of on disk, older Lucene versions offer:
        //     Directory ramDirectory = new RAMDirectory();

        // 3. Read the files on disk and create one Document per file.
        File dir = new File("E:\\java学习\\Lucene\\searchsource");
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + dir);
        }
        for (File file : files) {
            // File name
            String fileName = file.getName();
            // File path
            String filePath = file.getPath();
            // File content
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // File size
            long fileSize = FileUtils.sizeOf(file);

            // 4. Add fields to the Document.
            // TextField arguments: field name, value, whether to store the value
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            // The path is only stored, never searched, so a StoredField is enough
            Field fieldPath = new StoredField("path", filePath);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            // LongPoint indexes the size for numeric (range) queries but does not store it
            Field fieldSizeValue = new LongPoint("size", fileSize);
            // StoredField stores the size so that it can be read back
            Field fieldSizeStore = new StoredField("size", fileSize);

            // Create the Document
            Document document = new Document();
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldSizeStore);
            document.add(fieldSizeValue);
            document.add(fieldContent);

            // 5. Write the Document to the index.
            indexWriter.addDocument(document);
        }

        // 6. The IndexWriter is closed by the @After method.
    }



    /**
     * Add a document.
     */
    @Test
    public void addDocument() throws IOException {
        // The IndexWriter is created in init()
        // Create a Document
        Document document = new Document();
        // Add fields to the Document.
        // Different documents may have different fields, and one document may contain
        // several fields with the same name.
        document.add(new TextField("name", "新添加的文档", Field.Store.YES));
        // Not stored but still searchable: storing only decides whether the value
        // can be read back from the document
        document.add(new TextField("content", "新添加的文档的内容", Field.Store.NO));
        // LongPoint indexes the value
        document.add(new LongPoint("size", 1000L));
        // StoredField stores the value
        document.add(new StoredField("size", 1000L));
        // A field that does not need to be indexed is simply stored with StoredField
        document.add(new StoredField("path", "d:/temp/1.txt"));
        // Add the Document to the index
        indexWriter.addDocument(document);
    }


    @Test
    public void deleteAllDocuments() throws IOException {
        // Delete every document in the index
        indexWriter.deleteAll();
    }


    @Test
    public void deleteDocumentByQuery() throws IOException {
        /*
        // Alternative: build a query and delete the documents it matches
        Query query = new TermQuery(new Term("filename", "apache"));
        indexWriter.deleteDocuments(query);
        */
        // Delete every document whose "name" field contains the term "添加"
        indexWriter.deleteDocuments(new Term("name", "添加"));
    }

    /**
     * Update a document. Internally this is a delete followed by an add.
     */
    @Test
    public void updateDocuments() throws IOException {
        // Create a Document
        Document document = new Document();
        // Add fields to the Document.
        // Different documents may have different fields, and one document may contain
        // several fields with the same name.
        document.add(new TextField("name", "要更新的文档", Field.Store.YES));
        document.add(new TextField("content", " Lucene 简介 Lucene 是一个基于 Java 的全文信息检索工具包," +
                "它不是一个完整的搜索应用程序,而是为你的应用程序提供索引和搜索功能。",
                Field.Store.YES));
        // Delete the documents whose "name" field contains "spring", then add the new one
        indexWriter.updateDocument(new Term("name", "spring"), document);
    }


    /**
     * Release resources.
     */
    @After
    public void destroy() throws IOException {
        // Close the IndexWriter
        indexWriter.close();
    }
}
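
The "delete, then add" behaviour of updateDocument has a consequence that is easy to miss: every document containing the term is removed, and the replacement is added even when nothing matched. The following plain-Java toy illustrates that semantics only; the class ToyUpdate and its methods are hypothetical, and unlike Lucene it makes no attempt at atomicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of update-by-term: delete all matching documents, then add one. */
class ToyUpdate {
    // Each document is reduced to the list of keywords in its "name" field.
    private final List<List<String>> docs = new ArrayList<>();

    void add(List<String> keywords) {
        docs.add(keywords);
    }

    /** Delete every document that contains the keyword; returns how many were removed. */
    int delete(String keyword) {
        int before = docs.size();
        docs.removeIf(doc -> doc.contains(keyword));
        return before - docs.size();
    }

    /** Update: remove all matches, then add the replacement (even if nothing matched). */
    void update(String keyword, List<String> replacement) {
        delete(keyword);
        add(replacement);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyUpdate index = new ToyUpdate();
        index.add(Arrays.asList("spring", "mvc"));
        index.add(Arrays.asList("spring", "boot"));
        index.add(Arrays.asList("lucene"));
        index.update("spring", Arrays.asList("updated"));
        System.out.println(index.size()); // 2: both "spring" documents were replaced by one
    }
}
```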

Once the index has been created, Luke can be used to inspect its contents.

The SearchIndex class:

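package com.yh.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Queries the index.
 * Steps:
 * 1. Create a Directory object that points at the index.
 * 2. Create an IndexReader.
 * 3. Create an IndexSearcher, passing the IndexReader to its constructor.
 * 4. Create a Query object, for example a TermQuery.
 * 5. Execute the query to obtain a TopDocs object.
 * 6. Read the total number of hits.
 * 7. Read the list of matching documents.
 * 8. Print the content of each document.
 * 9. Close the IndexReader.
 */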
public class SearchIndex {
    private IndexReader indexReader;
    private IndexSearcher indexSearcher;

    @Before
    public void init() throws IOException {
        indexReader = DirectoryReader.open(FSDirectory.open(new File("C:\\temp\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

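    /**
     * Query by keyword.
     * TermQuery does not run the keyword through an analyzer, so the keyword must match
     * an indexed term exactly; it is best suited to fields that are not analyzed.
     */
    @Test
    public void testTermQuery() throws Exception {
        Query query = new TermQuery(new Term("content", "lucene"));
        printResult(query);
    }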



    /**
     * Range query on the numeric "size" field (both bounds are inclusive).
     *
     * @throws IOException if the index cannot be read
     */
    @Test
    public void testRangeQuery() throws IOException {
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    public void printResult(Query query) throws IOException {
        // 5. Execute the query; the arguments are the Query and the maximum number of hits to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // 6. Read the total number of hits
        System.out.println(topDocs.totalHits);
        // 7. Read the list of matching documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // 8. Print the content of each document
        for (ScoreDoc doc : scoreDocs) {
            // Document id
            int docId = doc.doc;
            // Fetch the Document by its id
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
            System.out.println(document.get("size"));
            System.out.println(document.get("content"));
            System.out.println("==================== separator");
        }
    }

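    /**
     * Analyze the query text first, then search with the resulting terms.
     * The query syntax that was actually produced can be inspected with
     * System.out.println(query).
     * This needs an analyzer; the analyzer used for querying should be the same one
     * that was used to build the index.
     * Requires the lucene-queryparser dependency.
     *
     * @throws Exception if the query text cannot be parsed or the index cannot be read
     */
    @Test
    public void testQueryParser() throws Exception {
        // Create the QueryParser:
        // the first argument is the default field to search,
        // the second argument is the analyzer
        QueryParser queryParser = new QueryParser("content", new IKAnalyzer());
        Query query = queryParser.parse("Lucene是java开发的");
        // Prints: content:lucene content:java content:开发
        System.out.println(query);
        // Results are ranked by score
        printResult(query);
    }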


    @After
    public void destroy() throws IOException {
        indexReader.close();
    }
}
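
The contrast between testTermQuery and testQueryParser comes down to whether the query text is analyzed before lookup. The toy below, in plain Java with hypothetical names (ToyQueryAnalysis, termLookup, parsedLookup) and a whitespace-and-lowercase stand-in for the analyzer, shows why an unanalyzed keyword such as "Lucene" finds nothing when the index only holds lowercased terms, and why index-time and query-time analysis should agree. It is a sketch of the idea only; it ignores scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy contrast between an unanalyzed term lookup and an analyzed query. */
class ToyQueryAnalysis {
    // Indexed keyword -> ids of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stand-in analyzer: split on whitespace and lowercase each token. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    /** Index-time analysis: only the analyzed tokens end up in the index. */
    void index(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Like a term query: the keyword is looked up exactly as given. */
    Set<Integer> termLookup(String keyword) {
        return new TreeSet<>(postings.getOrDefault(keyword, new TreeSet<>()));
    }

    /** Like a parsed query: analyze the input, then OR the matches of every token. */
    Set<Integer> parsedLookup(String queryText) {
        Set<Integer> result = new TreeSet<>();
        for (String token : analyze(queryText)) {
            result.addAll(termLookup(token));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyQueryAnalysis index = new ToyQueryAnalysis();
        index.index(0, "Lucene is written in Java");
        index.index(1, "Spring is a Java framework");
        System.out.println(index.termLookup("Lucene"));          // []
        System.out.println(index.termLookup("lucene"));          // [0]
        System.out.println(index.parsedLookup("Lucene Spring")); // [0, 1]
    }
}
```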