Getting Started with Lucene (Part 1)

yutao_Struggle

Article link: https://blog.csdn.net/yutao_Struggle/article/details/78524392

What Is Lucene

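Lucene is an open-source full-text search toolkit released by the Apache Software Foundation and originally written by Doug Cutting, a veteran of full-text retrieval. It is the architecture of a full-text search engine: it provides complete engines for building and querying indexes, plus part of a text-analysis engine. Its goal is to give developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete search engine on top of it. Lucene is a classic forerunner in the full-text search field; many of today's search engines are built on it, and the underlying ideas carry over.
In short: Lucene is a text search tool driven by keywords. It searches only the content that has been added to its index, so on its own it covers search inside a site or application rather than across websites.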

Where Lucene Is Used

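Lucene is not, on its own, an Internet search engine like Baidu; it is used for text search inside a site or system (for example inside CRM, RAX, or ERP applications), although the underlying ideas are the same.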

What Lucene Stores

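Lucene stores a set of compressed binary files and some control files on the computer's hard disk. Together they are called the index library, which has two parts:
(1) Original records
The original text stored in the index library, for example: spring是一款轻量级java框架 ("Spring is a lightweight Java framework")
(2) Term table
The terms produced by splitting each original record according to a splitting strategy (the analyzer, or tokenizer), stored in a table that later searches look up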
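
The two parts can be pictured with a small in-memory sketch. This is not Lucene code and not how Lucene lays out its files on disk; `TinyIndex` and its methods are hypothetical names used only for illustration. It keeps the original records in a list and builds a term table that maps each term to the numbers of the records containing it. The splitting strategy is an assumption as well: ASCII letters and digits are grouped into lowercase words, and every other non-blank character becomes its own term.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of "original records + term table"; not Lucene's API.
public class TinyIndex {
    private final List<String> records = new ArrayList<>();                // original records
    private final Map<String, Set<Integer>> terms = new LinkedHashMap<>(); // term table

    // Assumed splitting strategy: ASCII letter/digit runs become lowercase words,
    // any other non-whitespace character becomes a single-character term.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            boolean asciiWordChar = c < 128 && Character.isLetterOrDigit(c);
            if (asciiWordChar) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
            if (!Character.isWhitespace(c)) {
                out.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    // Stores the record and registers each of its terms; returns the record number.
    public int add(String text) {
        int id = records.size();
        records.add(text);
        for (String term : tokenize(text)) {
            terms.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Looks up one term in the term table, then fetches the matching original records.
    public List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (int id : terms.getOrDefault(term.toLowerCase(), new TreeSet<>())) {
            hits.add(records.get(id));
        }
        return hits;
    }
}
```

After `add("spring是一款轻量级java框架")`, a search for `java` or for the single character `框` finds the record, because both are entries in the term table; a search for a term that was never produced by the splitter finds nothing.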

Why Search Inside a Site with Lucene Instead of SQL

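(1) SQL can search only database tables; it cannot directly search text files on disk
(2) SQL has no relevance ranking
(3) SQL search results have no keyword highlighting
(4) SQL needs a database, and the database itself has a large memory footprint, for example Oracle
(5) SQL searches are sometimes slow, and very slow when the database is not local, for example Oracle
(6) For keyword search, fuzzy matching in the database with LIKE and similar operators is much slower than a full-text index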

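Lucene jar files:
lucene-core-3.0.2.jar [Lucene core]
lucene-analyzers-3.0.2.jar [analyzers (tokenizers)]
lucene-highlighter-3.0.2.jar [highlights the matched keywords in search results for the user]
lucene-memory-3.0.2.jar [in-memory index support, used for index optimization strategies]
Download: http://download.csdn.net/download/yutao_struggle/10129441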

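Lucene Call Diagram

Lucene converts a JavaBean object into a Document object and stores it in the index library according to a splitting strategy (the content is split into single characters or words, so that users can search by any of them).
[Figure: Lucene diagram]
Next we write a simple Lucene application. First, create a JavaBean class: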

public class Article {
    private int id;         // ID
    private String title;   // title
    private String content; // content

    public Article() {
        super();
    }
    public Article(int id, String title, String content) {
        super();
        this.id = id;
        this.title = title;
        this.content = content;
    }
    public int getId() {
        return id;
    }
    public void setId(int id) {
        this.id = id;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }
    @Override
    public String toString() {
        return "id:"+id+" title:"+title+"\ncontent:"+content;
    }
}

Write the Lucene test class:

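import java.io.File;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Test {
    /* Steps to create a Lucene index:
     * 1. Create the JavaBean object
     * 2. Create a Document object
     * 3. Copy the JavaBean's properties into the Document; the field names may match the property names or differ
     * 4. Specify the index directory, the analyzer, the maximum field length, and so on
     * 5. Create the IndexWriter
     * 6. Write the Document to the index library through the IndexWriter
     * 7. Close the IndexWriter
     */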
    @org.junit.Test
    public void createIndexDB() throws Exception{
        Article article = new Article(1,"java","面向对象的编程语言");
        Document document = new Document();
        // Store.YES keeps the original value in the index so it can be read back from a hit;
        // Index.ANALYZED runs the value through the analyzer and adds its terms to the term table
        document.add(new Field("id",String.valueOf(article.getId()),Store.YES,Index.ANALYZED));
        document.add(new Field("title",article.getTitle(),Store.YES,Index.ANALYZED));
        document.add(new Field("content",article.getContent(),Store.YES,Index.ANALYZED));
        Directory directory = FSDirectory.open(new File("E:/DBDB"));        // index directory
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);        // analyzer; Version is the Lucene version to match
        MaxFieldLength mfl = MaxFieldLength.LIMITED;                        // maximum number of terms indexed per field (10,000 by default)
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, mfl);
        try {
            indexWriter.addDocument(document);
        } finally {
            indexWriter.close();    // always release the index write lock
        }
    }
    /*
     * Steps to query the index library by keyword:
     * 1. Specify the index directory and the analyzer
     * 2. Create the IndexSearcher
     * 3. Create the QueryParser
     * 4. Wrap the keyword in a Query object
     * 5. Use the IndexSearcher to look up matching records in the term table, up to the requested number (if fewer match, all matches are returned)
     * 6. Get the number of each matching record
     * 7. Use the IndexSearcher to fetch the original record by number; it is returned as a Document
     * 8. Read the fields out of the Document and wrap them in a JavaBean
     */
    @org.junit.Test
    public void findIndexDB() throws Exception{
        String keyword = "编程";      // keyword
        ArrayList<Article> articles = new ArrayList<Article>();
        Directory directory = FSDirectory.open(new File("E:/DBDB"));
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);        // must be the same analyzer used for indexing
        IndexSearcher indexSearcher = new IndexSearcher(directory);
        try {
            QueryParser queryParser = new QueryParser(Version.LUCENE_30,"content",analyzer);    // search the "content" field
            Query query = queryParser.parse(keyword);
            TopDocs topDocs = indexSearcher.search(query, 100);     // at most 100 hits
            for(int i=0;i<topDocs.scoreDocs.length;i++){
                ScoreDoc scoreDoc = topDocs.scoreDocs[i];
                int doc = scoreDoc.doc;     // document number
                Document document = indexSearcher.doc(doc);
                Article article = new Article(Integer.parseInt(document.get("id")),document.get("title"),document.get("content"));
                articles.add(article);
            }
        } finally {
            indexSearcher.close();
        }
        System.out.println(articles);
    }
}

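The TopDocs object above has the fields totalHits and scoreDocs. scoreDocs is an array of ScoreDoc objects, each with two fields, doc and score: doc is the number, in the original records, of a record matched by the keyword, and score is its relevance score. The matched records are sorted by score.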
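
To make the relationship between the hit list and the scores concrete, here is a small sketch in plain Java. It does not use Lucene: `TopHits`, `Hit`, and `topN` are hypothetical names that only mimic the shape of a ScoreDoc and of a top-N search. Each hit carries a record number and a score, hits are ordered from highest to lowest score, and at most `n` of them are returned.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the shape of a search result; not Lucene's classes.
public class TopHits {
    public static final class Hit {
        public final int doc;      // record number in the original records
        public final float score;  // relevance score; higher means more relevant

        public Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns at most n hits, ordered from highest to lowest score.
    public static List<Hit> topN(List<Hit> candidates, int n) {
        List<Hit> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

Asking for 100 hits when only three candidates exist returns three, which mirrors the note in step 5 of the query steps: the requested number is an upper bound, not a guarantee.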

Next, wrap the code above in a utility class that uses reflection:

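import java.io.File;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneUtil {
    private static Directory directory;                 // index directory
    private static Version version;                     // Lucene version
    private static Analyzer analayzer;                  // analyzer
    private static MaxFieldLength maxFieldLength;       // maximum number of terms per field

    static{
        try {
            directory = FSDirectory.open(new File("D:/DBDBDB"));
            version = Version.LUCENE_30;
            analayzer = new StandardAnalyzer(version);
            maxFieldLength = MaxFieldLength.LIMITED;
        } catch (IOException e) {
            // fail loudly instead of leaving the static fields null
            throw new ExceptionInInitializerError(e);
        }
    }
    public static Directory getDirectory() {
        return directory;
    }
    public static Version getVersion() {
        return version;
    }
    public static Analyzer getAnalayzer() {
        return analayzer;
    }
    public static MaxFieldLength getMaxFieldLength() {
        return maxFieldLength;
    }
    // utility class: no instances
    private LuceneUtil(){

    }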
    /*
     * Converts a JavaBean object into a Document object
     */
    public static Document java2Document(Object obj){
        Document document = new Document();
        try {
            Class<?> clazz = obj.getClass();
            for(java.lang.reflect.Field field : clazz.getDeclaredFields()){
                String fieldName = field.getName();
                String methodName = "get"+fieldName.substring(0,1).toUpperCase()+fieldName.substring(1);
                Method method = clazz.getMethod(methodName);    // read the value through the public getter
                Object value = method.invoke(obj);
                if(value != null){      // Field does not accept null values, so skip unset properties
                    document.add(new Field(fieldName,value.toString(),Store.YES,Index.ANALYZED));
                }
            }
        } catch (SecurityException | NoSuchMethodException | IllegalAccessException | IllegalArgumentException
                      | InvocationTargetException e) {
            throw new RuntimeException("Failed to convert JavaBean to Document", e);
        }
        return document;
    }
    /*
     * Converts a Document object into a JavaBean object
     */
    public static <T> T document2Java(Document document,Class<T> clazz){
        try {
            T obj = clazz.newInstance();
            for(java.lang.reflect.Field field : clazz.getDeclaredFields()){
                String fieldName = field.getName();
                String methodName = "set"+fieldName.substring(0,1).toUpperCase()+fieldName.substring(1);
                clazz.getMethod(methodName,field.getType());    // throws NoSuchMethodException if the setter is missing
                // BeanUtils converts the stored string to the property type; requires commons-beanutils-1.7.0.jar
                BeanUtils.setProperty(obj, fieldName, document.get(fieldName));
            }
            return obj;
        } catch (SecurityException | NoSuchMethodException | IllegalAccessException | IllegalArgumentException | InvocationTargetException | InstantiationException e) {
            throw new RuntimeException("Failed to convert Document to JavaBean", e);
        }
    }
}
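
The same reflection pattern can be tried without Lucene or BeanUtils on the classpath. The sketch below is a stand-in built on assumptions: a `Map<String, String>` plays the role of the Document, `BeanMapper` and `Note` are hypothetical names, fields are read and written directly instead of through getters and setters, and only `int` and `String` fields are handled.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a Map<String, String> plays the role of Lucene's Document.
public class BeanMapper {
    // Sample bean used only for this sketch.
    public static class Note {
        private int id;
        private String title;

        public Note() { }

        public Note(int id, String title) {
            this.id = id;
            this.title = title;
        }

        public int getId() { return id; }
        public String getTitle() { return title; }
    }

    // Copies every declared, non-null field into the map as a string, keyed by field name.
    public static Map<String, String> toMap(Object bean) {
        Map<String, String> map = new LinkedHashMap<>();
        try {
            for (Field field : bean.getClass().getDeclaredFields()) {
                if (field.isSynthetic()) {
                    continue; // skip compiler- or tool-generated fields
                }
                field.setAccessible(true);
                Object value = field.get(bean);
                if (value != null) {
                    map.put(field.getName(), value.toString());
                }
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("Could not read bean", e);
        }
        return map;
    }

    // Creates a bean through its no-argument constructor and fills its fields from the map.
    // Assumption: only int and String fields are supported; other types are left untouched.
    public static <T> T fromMap(Map<String, String> map, Class<T> type) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                String raw = map.get(field.getName());
                if (raw == null || field.isSynthetic()) {
                    continue;
                }
                field.setAccessible(true);
                if (field.getType() == int.class) {
                    field.setInt(bean, Integer.parseInt(raw));
                } else if (field.getType() == String.class) {
                    field.set(bean, raw);
                }
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create bean", e);
        }
    }
}
```

Every value becomes a string on the way in, so the conversion back has to parse it according to the field type. In the utility class that job is delegated to BeanUtils; here it is spelled out for two types.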

So far we have seen how Lucene writes a JavaBean object to the index library and how to query records from it by keyword. The code below covers Lucene's update and delete features:

    // These methods also need org.junit.Test and org.apache.lucene.index.Term on the import list
    @Test
    public void testUpdate() throws Exception{
        Article article = new Article(1,"js","面向对象的脚本语言");
        IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(),LuceneUtil.getAnalayzer(),LuceneUtil.getMaxFieldLength());
        indexWriter.updateDocument(new Term("id","1"),LuceneUtil.java2Document(article));       // update by the document's id field
        indexWriter.close();
    }
    @Test
    public void testDelete() throws Exception{
        IndexWriter indexWriter = new IndexWriter(LuceneUtil.getDirectory(),LuceneUtil.getAnalayzer(),LuceneUtil.getMaxFieldLength());
        //indexWriter.deleteDocuments(new Term("id","1"));  // delete by the document's id field
        //indexWriter.deleteAll();                          // delete every document
        QueryParser queryParser = new QueryParser(LuceneUtil.getVersion(),"content",LuceneUtil.getAnalayzer());
        Query query = queryParser.parse("对象");
        indexWriter.deleteDocuments(query);                 // delete every document matching the query
        indexWriter.close();
    }

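updateDocument first deletes every document that matches the given Term and then adds the new document. If nothing matches, it therefore behaves like addDocument; if several documents match, they are all replaced by the single new document. deleteAll() removes every original record, and deleteDocuments can remove one or more documents selected by a Query or a Term.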
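
This delete-then-add behaviour can be sketched in plain Java. It is not Lucene code: each document is modelled as a `Map<String, String>`, a term is modelled as an exact field/value pair, and `MiniStore` with its methods are hypothetical names chosen for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of update/delete semantics; not Lucene's API.
public class MiniStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Removes every document whose field equals value; returns how many were removed.
    public int delete(String field, String value) {
        int before = docs.size();
        docs.removeIf(doc -> value.equals(doc.get(field)));
        return before - docs.size();
    }

    // Update = delete all matches, then add the replacement exactly once.
    // With no match this behaves like add; with several matches only one document remains.
    public void update(String field, String value, Map<String, String> replacement) {
        delete(field, value);
        add(replacement);
    }

    public void deleteAll() {
        docs.clear();
    }

    public int size() {
        return docs.size();
    }
}
```

Calling `update` on an empty store leaves one document, and calling it when two documents share the same id also leaves one, which is why running an update against an index that contains duplicates quietly collapses them.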

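Note: with StandardAnalyzer, Chinese text is split into single characters, while English text is split into whole lowercase words (common stop words such as "the" are dropped). An English keyword therefore matches only complete words, never a fragment of a word, which can make English searches or deletes appear to fail. More in the next part.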