Learning Lucene: Creating an Index and Keyword Search

Bart_G

Category: Lucene. Tags: lucene, full-text search, index

Article link: https://blog.csdn.net/rocky_03/article/details/69053249

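Lucene is an open-source full-text search library: it builds an index over text and then retrieves the records that match a query keyword.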

1. Required JAR files

lucene-analyzers-3.0.2.jar
lucene-core-3.0.2.jar
lucene-highlighter-3.0.2.jar
lucene-memory-3.0.2.jar

2. Coding steps

2.1 Prepare the Article class

public class Article {
    private Integer id;
    private String title;
    private String content;

    public Article(){}

    public Article(Integer id, String title, String content) {
        super();
        this.id = id;
        this.title = title;
        this.content = content;
    }

    @Override
    public String toString() {
        return "Article [id=" + id + ", title=" + title + ", content="
                + content + "]";
    }
    // getter/setter methods omitted
    // ......
}

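2.2 Create the index

2.2.1 Steps

Create an Article object
Create a Document object
Bind the three property values of the Article object to the Document object
Create an IndexWriter object
Write the Document object to the Lucene index
Close the IndexWriter object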

2.2.2 Example method

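/**
     * Create the index.
     * Stores the Article in the index as an original record and builds the term table from it.
     * @throws IOException
     */
    @Test
    public void createIndexDB() throws IOException{
        // 1. Create the Article object
        //    (title: "processor"; content: "the processor is the core component of a computer")
        Article article = new Article(1,"处理器","处理器是电脑的核心部件");
        // 2. Create the Document object
        Document document = new Document();
        // 3. Bind the three property values of the Article object to the Document object
        /*
         * Argument 1: the field name in the Document (xid here, while the Article property is id);
         *             in a real project, use the same name for both
         * Argument 2: the value of the Article property; it must be converted to a String
         * Argument 3: whether the original value is stored in the index
         *          - Store.YES: the value is stored and can be read back with document.get(...)
         *          - Store.NO: the value is not stored (the field can still be searched if it is indexed)
         * Argument 4: whether the value is run through the analyzer before it is indexed
         *          - Index.ANALYZED: the value is split into terms
         *          - Index.NOT_ANALYZED: the value is indexed as one term, without splitting
         *          In a real project, analyze every field except the id
         */
        document.add(new Field("xid", article.getId().toString(),Store.YES,Index.NOT_ANALYZED));
        document.add(new Field("xtitle", article.getTitle(),Store.YES,Index.ANALYZED));
        document.add(new Field("xcontent", article.getContent(),Store.YES,Index.ANALYZED));

        // 4. Create the IndexWriter
        Directory directory = FSDirectory.open(new File("D:/IndextDB"));
        Version version = Version.LUCENE_30;
        Analyzer analyzer = new StandardAnalyzer(version);
        MaxFieldLength maxFieldLength = MaxFieldLength.LIMITED;
        /*
         * Argument 1: the directory on disk that holds the Lucene index, D:/IndextDB
         * Argument 2: the strategy for splitting text into terms; each strategy is a concrete Analyzer class
         * Argument 3: the maximum number of terms indexed per field; LIMITED means 10,000,
         *             so only the first 10,000 terms are indexed (fewer if the text has fewer)
         */
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, maxFieldLength);
        // 5. Write the Document to the Lucene index
        indexWriter.addDocument(document);
        // 6. Close the IndexWriter
        indexWriter.close();
    }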

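After this method runs, the index files are created in the corresponding directory on disk.
[Image: files in the index directory]
The directory shown contains three .cfs files because several documents were added later on; this is explained below.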
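To make the two structures named in the index-creation method more concrete (the original records and the term table), here is a minimal plain-Java sketch of an inverted index. It does not use Lucene. The class `TinyIndex` and its methods are hypothetical names chosen for illustration, and treating every single character as a term is only a rough stand-in for how StandardAnalyzer handles Chinese text, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class TinyIndex {
    // "Original records": document number -> stored text.
    private final List<String> records = new ArrayList<String>();
    // "Term table": term -> sorted document numbers that contain the term.
    private final Map<String, TreeSet<Integer>> terms =
            new LinkedHashMap<String, TreeSet<Integer>>();

    /** Stores the text, indexes each character as a term, returns the document number. */
    public int add(String text) {
        int docNo = records.size();
        records.add(text);
        for (int i = 0; i < text.length(); i++) {
            String term = String.valueOf(text.charAt(i));
            TreeSet<Integer> postings = terms.get(term);
            if (postings == null) {
                postings = new TreeSet<Integer>();
                terms.put(term, postings);
            }
            postings.add(docNo);
        }
        return docNo;
    }

    /** Looks the term up in the term table, then fetches the matching records. */
    public List<String> search(String term) {
        List<String> hits = new ArrayList<String>();
        TreeSet<Integer> postings = terms.get(term);
        if (postings != null) {
            for (int docNo : postings) {
                hits.add(records.get(docNo));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("处理器是电脑的核心部件");
        index.add("显卡是电脑的显示输出部件");
        System.out.println(index.search("处"));
        System.out.println(index.search("电"));
    }
}
```

In this sketch, searching for 处 finds only the first record, while 电 finds both. The lookup never scans the stored text: it goes from term to document numbers, and only then to the records.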

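2.3 Keyword search

2.3.1 Steps

Preparation: create the keyword to search for (a String)
Create a List<Article> to receive the results
Create an IndexSearcher object
Create a QueryParser object
Create a Query object that wraps the keyword
Search the index for the keyword; the result is a TopDocs object that holds the matching document numbers
Iterate over the matching document numbers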

2.3.2 Example method

/**
     * Retrieve the records that match a keyword from the index
     * @throws IOException
     * @throws ParseException
     */
    @Test
    public void findIndexDB() throws IOException, ParseException{
        // 0. Preparation: the keyword to search for
        String keyword = "处";
        // 1. Create the list that receives the results
        List<Article>articleList = new ArrayList<Article>();
        Directory directory = FSDirectory.open(new File("D:/IndextDB"));
        Version version = Version.LUCENE_30;
        Analyzer analyzer = new StandardAnalyzer(version);

        // 2. Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(directory);

        // 3. Create the query parser
        /*
         * Argument 1: the Lucene version to match; prefer the latest one available
         * Argument 2: the Document field to search
         * Argument 3: the analyzer used to split the query text
         * */
        QueryParser queryParser = new QueryParser(version,"xcontent", analyzer);
        // 4. Create the Query object that wraps the keyword
        Query query = queryParser.parse(keyword);
        // 5. Search the index for the keyword; the result is a TopDocs object
        /*
         * Argument 1: the Query object that wraps the keyword (built by the QueryParser)
         * Argument 2: MAX_RECORD caps the number of hits; if the keyword matches more records,
         *             only the first MAX_RECORD are returned, otherwise all matches are returned
         * */
        int MAX_RECORD = 100;
        TopDocs topDocs = indexSearcher.search(query, MAX_RECORD);
        // 6. Iterate over the matching document numbers
        for(int i=0;i<topDocs.scoreDocs.length;i++){
            // 6.1 Get the ScoreDoc, which holds a document number and its score
            ScoreDoc scoreDoc = topDocs.scoreDocs[i];
            // 6.2 Get the document number
            int no = scoreDoc.doc;
            // 6.3 Use the number to load the stored Document from the index
            Document document = indexSearcher.doc(no);
            // 6.4 Read the three field values of the Document
            String xid = document.get("xid");
            String xtitle = document.get("xtitle");
            String xcontent = document.get("xcontent");
            // 6.5 Wrap them in an Article object
            Article article = new Article(Integer.parseInt(xid), xtitle, xcontent);
            // 6.6 Add the Article to the list
            articleList.add(article);
        }
        // 7. Close the searcher and print the results
        indexSearcher.close();
        for(Article article : articleList){
            System.out.println(article);
        }
    }

Query output:

Article [id=1, title=处理器, content=处理器是电脑的核心部件]
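The search call returns at most MAX_RECORD hits, each one a document number paired with a score, best match first. The following plain-Java sketch illustrates that "score, sort, truncate" idea without Lucene. The class `TopHits`, the `Hit` type, the third sample sentence, and the scoring rule (number of keyword occurrences) are all assumptions made for illustration; Lucene's real scoring is considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class TopHits {
    /** Hypothetical counterpart of a ScoreDoc: a document number plus its score. */
    public static final class Hit {
        public final int doc;
        public final int score;
        Hit(int doc, int score) { this.doc = doc; this.score = score; }
    }

    /** Toy score: counts non-overlapping occurrences of keyword in text. */
    static int count(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int n = 0;
        int from = text.indexOf(keyword);
        while (from >= 0) {
            n++;
            from = text.indexOf(keyword, from + keyword.length());
        }
        return n;
    }

    /** Returns the matching documents, highest score first, at most maxRecord of them. */
    public static List<Hit> search(List<String> docs, String keyword, int maxRecord) {
        List<Hit> hits = new ArrayList<Hit>();
        for (int doc = 0; doc < docs.size(); doc++) {
            int score = count(docs.get(doc), keyword);
            if (score > 0) {
                hits.add(new Hit(doc, score));
            }
        }
        // Stable sort, descending by score: ties keep document order.
        Collections.sort(hits, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return b.score - a.score;
            }
        });
        // With fewer matches than maxRecord, all of them are kept.
        return new ArrayList<Hit>(hits.subList(0, Math.min(maxRecord, hits.size())));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "处理器是电脑的核心部件",
                "显卡是电脑的显示输出部件",
                "电脑的内存条插在电脑主板上"); // made-up sample sentence
        for (Hit hit : search(docs, "电脑", 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

As in the Lucene example, the caller receives only document numbers and scores and uses each number to fetch the stored record afterwards.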

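3. Improving the Lucene query code

3.1 Create a utility class

The code above has many steps and is tedious to read. Extracting the common and repetitive parts into a utility class makes it more concise.
The utility class uses the java.lang.reflect reflection package.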

package com.bart.lucene.util;

import java.io.File;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import com.bart.lucene.entity.Article;

/**
 * Wraps commonly used operations
 * @author bart
 *
 */
public class LuceneUtils {

    private static Directory directory;// directory on disk that holds the Lucene index
    private static Version version;// Lucene version
    private static Analyzer analyzer;// strategy for splitting text into terms; each strategy is a concrete Analyzer class
    private static MaxFieldLength maxFieldLength;// LIMITED: at most 10,000 terms per field

    static{
        try {
            directory =  FSDirectory.open(new File("E:/IndexDBDBDB"));
            version = Version.LUCENE_30;
            analyzer = new StandardAnalyzer(version);
            maxFieldLength = MaxFieldLength.LIMITED;
        } catch (Exception e) {
            e.printStackTrace();
            // rethrow as an unchecked exception
            throw new RuntimeException(e);
        }
    }

    public static Directory getDirectory() {
        return directory;
    }

    public static Version getVersion() {
        return version;
    }

    public static Analyzer getAnalyzer() {
        return analyzer;
    }

    public static MaxFieldLength getMaxFieldLength() {
        return maxFieldLength;
    }

    // private constructor: prevents instantiation from outside
    private LuceneUtils(){}

    /**
     * Converts a JavaBean into a Document
     * @param obj the bean; every declared field needs a public getter
     * @return Document
     * @throws Exception
     */
    public static Document javaBean2Document(Object obj) throws Exception{
        Document document = new Document();
        // 1. Get the class of obj
        Class<?> clazz = obj.getClass();
        // 2. Get the declared (private) fields
        Field[] fields = clazz.getDeclaredFields();
        // 3. Iterate over the fields
        for(Field field : fields){
            // 3.1 Get the field name
            String name = field.getName();
            // 3.2 Build the getter name: getId()/getTitle()
            String methodName = "get"+name.substring(0, 1).toUpperCase()+name.substring(1);
            // 3.3 Look up the getter (method name, no parameters)
            Method method = clazz.getMethod(methodName);
            // 3.4 Invoke the getter and read the return value
            Object value = method.invoke(obj);
            // 3.5 Add non-null values to the Document (a null value would fail on toString())
            if(value != null){
                document.add(new org.apache.lucene.document.Field(name,value.toString(),Store.YES,Index.ANALYZED));
            }
        }
        return document;
    }

    /**
     * Converts a Document into a JavaBean
     * @param document
     * @param clazz
     * @return t
     * @throws Exception
     */
    public static  <T> T document2JavaBean(Document document,Class<T>clazz) throws Exception{
        // 1. Create an instance of the target class
        T t = clazz.newInstance();
        // 2. Get the declared (private) fields through reflection
        Field[] fields = clazz.getDeclaredFields();
        // 3. Iterate over the fields
        for(Field field : fields){
            // 3.1 Get the field name
            String name = field.getName();
            // 3.2 Get the value stored in the Document under that name
            String value = document.get(name);
            // 3.3 Set it with BeanUtils.setProperty(instance, property name, value)
            if(value != null){
                BeanUtils.setProperty(t, name, value);
            }
        }
        return t;
    }

    // quick test
    public static void main(String[] args) throws Exception{
        Article article = new Article(1, "处理器","处理器是一台电子计算机的重要部件");
        Document document = LuceneUtils.javaBean2Document(article);
        System.out.println(document.toString());
        System.out.println("---------");
        Article article2 = LuceneUtils.document2JavaBean(document, Article.class);
        System.out.println(article2);
    }
}
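Both conversion methods rest on the same idea: walk the declared fields of a bean and copy values by name. The sketch below shows that idea using only the JDK, with a Map<String, String> standing in for the Lucene Document and direct field access standing in for the getters and BeanUtils. The names `BeanMapSketch`, `toMap`, `fromMap` and the `Item` bean are hypothetical, and only String and Integer fields are handled here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

public class BeanMapSketch {
    /** Hypothetical sample bean, shaped like a smaller Article. */
    public static class Item {
        private Integer id;
        private String title;
        public Item() {}
        public Item(Integer id, String title) { this.id = id; this.title = title; }
        @Override
        public String toString() { return "Item [id=" + id + ", title=" + title + "]"; }
    }

    /** Copies every non-static, non-null declared field into a name -> text map. */
    public static Map<String, String> toMap(Object bean) throws IllegalAccessException {
        Map<String, String> map = new LinkedHashMap<String, String>();
        for (Field field : bean.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true); // required to read private fields directly
            Object value = field.get(bean);
            if (value != null) { // skip nulls instead of failing on toString()
                map.put(field.getName(), value.toString());
            }
        }
        return map;
    }

    /** Creates a bean and fills its String and Integer fields from the map. */
    public static <T> T fromMap(Map<String, String> map, Class<T> clazz) throws Exception {
        T bean = clazz.getDeclaredConstructor().newInstance();
        for (Field field : clazz.getDeclaredFields()) {
            String text = map.get(field.getName());
            if (text == null || Modifier.isStatic(field.getModifiers())) {
                continue;
            }
            field.setAccessible(true);
            if (field.getType() == Integer.class) {
                field.set(bean, Integer.valueOf(text)); // text -> Integer conversion
            } else if (field.getType() == String.class) {
                field.set(bean, text);
            }
        }
        return bean;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> map = toMap(new Item(1, "处理器"));
        System.out.println(map);
        System.out.println(fromMap(map, Item.class));
    }
}
```

The design choice worth noting is that everything passes through text, as it does with stored Lucene fields, so the code that rebuilds the bean has to convert each value back to the field's type.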

3.2 Refactor the FirstApp class

Refactor FirstApp.java using the utility class

3.2.1 Create the index

@Test
    public void createIndexDB() throws Exception{
        // 1. Create the Article object (run once per article: processor, graphics card, memory module)
        //Article article = new Article(1,"处理器","处理器是电脑的核心部件");
        //Article article = new Article(2,"显卡","显卡是电脑的显示输出部件");
        Article article = new Article( 3,"内存条","内存条是电脑的核心部件之一");
        // 2. Use the LuceneUtils helper to build the Document
        Document document = LuceneUtils.javaBean2Document(article);
        // 3. Create the IndexWriter
        IndexWriter indexWriter = new IndexWriter(LuceneUtils.getDirectory(),LuceneUtils.getAnalyzer(),LuceneUtils.getMaxFieldLength());
        // 4. Write the Document through the IndexWriter
        indexWriter.addDocument(document);
        // 5. Close the IndexWriter
        indexWriter.close();
    }

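Three Articles were created here and each was added to the index in its own run. Every run opens and closes its own IndexWriter, which typically writes a new segment file, and that is why the image in 2.2.2 shows three .cfs files.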

3.2.2 Search by keyword

 @Test
    public void findIndexDB() throws Exception{
        // 1. Preparation: the keyword ("graphics card") and the result list
        String keyword = "显卡";
        List<Article>articleList = new ArrayList<Article>();
        // 2. Create the IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(LuceneUtils.getDirectory());
        // 3. Create the query parser
        QueryParser queryParser = new QueryParser(LuceneUtils.getVersion(),"content",LuceneUtils.getAnalyzer());
        // 4. Create the Query object that wraps the keyword
        Query query = queryParser.parse(keyword);
        // 5. Search
        int MAX_RECORD=100;
        TopDocs topDocs = indexSearcher.search(query, MAX_RECORD);
        // 6. Iterate over the matching document numbers
        for(int i=0;i<topDocs.scoreDocs.length;i++){
            // 6.1 Get the ScoreDoc, which holds a document number and its score
            ScoreDoc scoreDoc = topDocs.scoreDocs[i];
            // 6.2 Get the document number
            int no = scoreDoc.doc;
            // 6.3 Use the number to load the stored Document from the index
            Document document = indexSearcher.doc(no);
            // 6.4 Use the utility class to convert the Document into an Article
            Article article = LuceneUtils.document2JavaBean(document, Article.class);
            // 6.5 Add the Article to the list
            articleList.add(article);
        }
        // 7. Close the searcher
        indexSearcher.close();
        // Print the results
        for(Article article : articleList){
            System.out.println(article);
        }

    }

Output:

Article [id=2, title=显卡, content=显卡是电脑的显示输出部件]

Haha, as a hardware enthusiast, of course the first thing I searched for was the graphics card.

Summary

After refactoring, the code is much shorter and the utility class is easy to reuse. In later projects it can be imported and called directly.
