Lucene: Getting Started and Hands-On Practice

Fenco_Han

Article link: https://blog.csdn.net/m0_37694106/article/details/103997854

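I. What is full-text search

1. Types of data

Structured data

Fixed format, fixed length, fixed data types

Unstructured data

Word documents, PDF documents, email, HTML

No fixed format, no fixed length, no fixed data types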

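2. Querying data

1) Querying structured data

Use SQL. Querying structured data is simple and fast.

2) Querying unstructured data

Example: among a set of text files, find the files that contain the word "spring".

Open the files one by one and search each of them by hand.
Use a program to read each file into memory and look for a match (sequential scan).
Convert the unstructured data into structured data first, then query that.

For the third approach, first split the text on whitespace to get a list of words, then build an index from the word list.

Then query the index and use the word-to-document mapping to find the list of matching documents. This process is full-text search.

3. Full-text search

Full-text search means building an index first and then querying that index. (The index is built once and queried many times.)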
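
The difference between a sequential scan and an index lookup can be sketched in plain Java. This is only an illustration of the idea, not how Lucene works internally; the class name TinyInvertedIndex and the sample file names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TinyInvertedIndex {
    // word -> names of the files that contain it
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records which file each word came from. */
    void add(String fileName, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(fileName);
            }
        }
    }

    /** Index lookup: a single map access, no file is read again. */
    Set<String> search(String word) {
        return postings.getOrDefault(word, Collections.emptySet());
    }

    /** Sequential scan, for comparison: every file is inspected on every query. */
    static List<String> scan(Map<String, String> files, String word) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : files.entrySet()) {
            if (Arrays.asList(entry.getValue().trim().split("\\s+")).contains(word)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical file contents, keyed by file name.
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "spring is a java framework");
        files.put("b.txt", "lucene is a search library");
        files.put("c.txt", "spring boot builds on spring");

        TinyInvertedIndex index = new TinyInvertedIndex();
        files.forEach(index::add);                    // build the index once

        System.out.println(scan(files, "spring"));    // prints [a.txt, c.txt]
        System.out.println(index.search("spring"));   // prints [a.txt, c.txt]
    }
}
```

Both calls return the same files. The scan re-reads all the text for each query, while the index pays that cost once at build time.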

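II. Where full-text search is used

1. Search engines
2. Site search: forum search, Weibo search, article search
3. E-commerce search: JD.com, Youpin, Tmall

III. What is Lucene

Lucene is a full-text search toolkit written in Java.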

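IV. How Lucene implements full-text search

1. Creating the index

1) Obtain the original documents

Original documents: the data that searches will run against.
Search engines: crawlers collect the original documents.
Site search: the data in the database.
This article's example: files read directly from disk with I/O streams.

2) Build document objects

Create one Document object for each original document.
Each Document object contains multiple fields (Field), and the fields hold the original data.
A field has a name and a value. Every document also has a unique number, the document ID.

3) Analyze the documents

1. Split the text on whitespace to get a list of words.
2. Normalize the words to a single case, either all uppercase or all lowercase.
3. Remove punctuation.
4. Remove stop words (words that carry no meaning for search).
5. Wrap each keyword in a Term object. A term contains the field the keyword came from and the keyword itself.
6. The same keyword extracted from different fields yields different terms.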
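
The six analysis steps above can be sketched in plain Java as follows. This is a simplified illustration, not Lucene's analyzer chain. The nested Term class here is a stand-in written for this example (it is not org.apache.lucene.index.Term), and the stop-word list is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

final class AnalysisSketch {
    /** Stand-in for a term: a keyword together with the field it came from. */
    static final class Term {
        final String field;
        final String text;

        Term(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override public boolean equals(Object o) {
            return o instanceof Term
                    && ((Term) o).field.equals(field)
                    && ((Term) o).text.equals(text);
        }

        @Override public int hashCode() {
            return Objects.hash(field, text);
        }

        @Override public String toString() {
            return field + ":" + text;
        }
    }

    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of"));

    static List<Term> analyze(String field, String value) {
        List<Term> terms = new ArrayList<>();
        for (String raw : value.split("\\s+")) {                  // 1. split on whitespace
            String word = raw.toLowerCase(Locale.ROOT)            // 2. normalize case
                             .replaceAll("\\p{Punct}", "");       // 3. strip punctuation
            if (word.isEmpty() || STOP_WORDS.contains(word)) {    // 4. drop stop words
                continue;
            }
            terms.add(new Term(field, word));                     // 5. wrap as a term
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("content",
                "The Spring Framework provides a comprehensive model."));
        // prints [content:spring, content:framework, content:provides, content:comprehensive, content:model]

        // 6. Same keyword, different fields: two distinct terms.
        System.out.println(new Term("name", "spring").equals(new Term("content", "spring")));
        // prints false
    }
}
```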

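4) Create the index

Build an index from the list of keywords and save it to the index store.

The index store contains: the index, the document objects, and the mapping between keywords and documents.

Finding documents by way of words is what makes this structure an inverted index.

2. Querying the index

1) User query interface: where the user enters the search criteria

2) Wrap the keywords in a query object

The field to search

The keyword to search for

3) Execute the query

Search the specified field for the keyword.

Once the keyword is found, follow it to the matching documents.

4) Render the results

Use the document IDs to fetch the document objects, highlight the keyword, paginate, and show the results to the user.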
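
To make the query flow concrete, here is a small plain-Java sketch: look up a field and keyword pair, fetch the documents by ID, highlight the keyword, and return one page of results. The class SearchFlowSketch, the `<em>` highlight markup, and the paging parameters are assumptions for this example. Lucene's own searching and highlighting work differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

final class SearchFlowSketch {
    // Stored documents, addressed by document ID (the position in the list).
    private final List<String> documents = new ArrayList<>();
    // "field:keyword" -> IDs of the documents that contain the keyword in that field.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Indexes the text under the given field and returns the new document ID. */
    int add(String field, String text) {
        int docId = documents.size();
        documents.add(text);
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(field + ":" + word, k -> new ArrayList<>());
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);   // record each document once per keyword
            }
        }
        return docId;
    }

    /** Looks up the term, fetches documents by ID, highlights, and returns one page. */
    List<String> search(String field, String keyword, int page, int pageSize) {
        List<Integer> ids = postings.getOrDefault(
                field + ":" + keyword.toLowerCase(Locale.ROOT), Collections.emptyList());
        int from = Math.min(Math.max(page - 1, 0) * pageSize, ids.size());
        int to = Math.min(from + pageSize, ids.size());

        List<String> rendered = new ArrayList<>();
        for (int docId : ids.subList(from, to)) {
            // (?i) matches case-insensitively; $0 re-inserts the matched text.
            rendered.add(documents.get(docId).replaceAll(
                    "(?i)\\b" + Pattern.quote(keyword) + "\\b", "<em>$0</em>"));
        }
        return rendered;
    }

    public static void main(String[] args) {
        SearchFlowSketch index = new SearchFlowSketch();
        index.add("content", "Spring is a Java framework");
        index.add("content", "Lucene is a search library");
        index.add("content", "Learning spring step by step");

        System.out.println(index.search("content", "spring", 1, 10));
        // prints [<em>Spring</em> is a Java framework, Learning <em>spring</em> step by step]
    }
}
```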

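V. A first program

1. Creating the index

Environment: download Lucene; JDK 9

Project setup: create a Java project and add these JARs:

lucene-analyzers-common-7.4.0.jar

lucene-core-7.4.0.jar

commons-io.jar (Apache Commons IO, which provides FileUtils)

Steps: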

    // Needs java.io.File, org.apache.commons.io.FileUtils, org.junit.Test,
    // and the Lucene classes from org.apache.lucene.document, .index and .store.
    @Test
    public void createIndex() throws Exception {
        // 1. Create a Directory object that specifies where the index is saved.

        // To keep the index in memory instead:
//        RAMDirectory directory = new RAMDirectory();

        // Save the index on disk.
        Directory directory = FSDirectory.open(new File("E:\\lucene\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory.
        //    The no-argument IndexWriterConfig uses StandardAnalyzer.
        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
        // 3. Read the files on disk and create one document object per file.
        File dir = new File("E:\\lucene\\searchSource");
        File[] files = dir.listFiles();
        for (File file : files) {
            // File name
            String fileName = file.getName();
            // File path
            String filePath = file.getPath();
            // File content
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // File size
            long fileSize = FileUtils.sizeOf(file);
            // Create the fields.
            Field fieldName = new TextField("name", fileName, Field.Store.YES);
            Field fieldPath = new TextField("path", filePath, Field.Store.YES);
            Field fieldContent = new TextField("content", fileContent, Field.Store.YES);
            // fileSize + "" converts the long to a String.
            Field fieldSize = new TextField("size", fileSize + "", Field.Store.YES);
            // Create the document object.
            Document document = new Document();
            // 4. Add the fields to the document.
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
            document.add(fieldSize);
            // 5. Write the document to the index.
            indexWriter.addDocument(document);
        }
        // 6. Close the IndexWriter.
        indexWriter.close();
    }

2. Use Luke to inspect the contents of the index

3. Querying the index

Steps:

    @Test
    public void searchIndex() throws Exception {
        // 1. Create a Directory object that points at the index location.
        FSDirectory directory = FSDirectory.open(new File("E:\\lucene\\index").toPath());
        // 2. Create an IndexReader.
        IndexReader indexReader = DirectoryReader.open(directory);
        // 3. Create an IndexSearcher; its constructor takes the IndexReader.
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 4. Create a Query object, here a TermQuery.
        Query query = new TermQuery(new Term("content", "spring"));
        // 5. Run the query and get a TopDocs object.
        //    Argument 1: the query.  Argument 2: the maximum number of results to return.
        TopDocs topDocs = indexSearcher.search(query, 10);
        // 6. Get the total number of hits.
        System.out.println("Total hits: " + topDocs.totalHits);
        // 7. Get the list of matching documents.
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // 8. Print the contents of each document.
        for (ScoreDoc doc : scoreDocs) {
            // Get the document ID.
            int docId = doc.doc;
            // Fetch the document object by ID.
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
//            System.out.println(document.get("content"));
            System.out.println(document.get("size"));
            System.out.println("========================");
        }
        // 9. Close the IndexReader.
        indexReader.close();
    }

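VI. Analyzers

The default is the standard analyzer: StandardAnalyzer

1. Viewing the output of the standard analyzer

Call the tokenStream method of an Analyzer object to get a TokenStream object. The TokenStream holds the final tokenization result.

Steps: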

    @Test
    public void testTokenStream() throws Exception {
        // 1. Create an Analyzer, here a StandardAnalyzer.
        Analyzer analyzer = new StandardAnalyzer();
        // 2. Call the analyzer's tokenStream method to get a TokenStream.
        TokenStream tokenStream = analyzer.tokenStream("", "The Spring Framework provides a comprehensive programming and configuration model.");
        // 3. Register an attribute on the TokenStream; it acts like a pointer to the current token.
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // 4. Call reset() on the TokenStream; skipping this call causes an exception.
        tokenStream.reset();
        // 5. Iterate over the TokenStream with a while loop.
        while (tokenStream.incrementToken()) {
            System.out.println(charTermAttribute.toString());
        }
        // 6. Close the TokenStream.
        tokenStream.close();
    }

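2. Chinese analyzer: how to use IKAnalyzer

1) Add the IKAnalyzer JAR to the project.

2) Put the configuration file and the extension dictionary on the project's classpath. Note: never open the extension dictionary with Windows Notepad, and make sure its encoding always stays UTF-8.

3) Extension dictionary: adds new words.

Stop-word dictionary: sensitive words and words that carry no meaning.

3. Usage: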

        // 1. Create a Directory object and save the index on disk.
        Directory directory = FSDirectory.open(new File("E:\\lucene\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory.
        //    To use the Chinese analyzer, pass an IKAnalyzer to IndexWriterConfig.
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

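VII. Maintaining the index

1. Adding to the index

Field properties (whether the field is analyzed, indexed, and stored)

Field class	Data type	Analyzed	Indexed	Stored	Notes
StringField(FieldName, FieldValue, Store.YES)	String	N	Y	Y or N	Builds a string field that is not analyzed; the whole string is indexed as a single value (for example an order number or a person's name). Store.YES or Store.NO decides whether the value is stored in the document.
LongPoint(String name, long... point)	long	Y	Y	N	LongPoint, IntPoint, and similar types index numeric values so they can be searched, but they do not store the value. To store it as well, add a StoredField.
StoredField(FieldName, FieldValue)	Overloaded, supports several types	N	N	Y	Builds a field of various types that is neither analyzed nor indexed, but is stored in the document.
TextField(FieldName, FieldValue, Store.NO) or TextField(FieldName, reader)	String or stream	Y	Y	Y or N	When given a Reader, Lucene assumes the content is large and does not store it.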
Usage:

    @Test
    public void createIndex() throws Exception {
        // 1. Create a Directory object that specifies where the index is saved.

        // To keep the index in memory instead:
//        RAMDirectory directory = new RAMDirectory();

        // Save the index on disk.
        FSDirectory directory = FSDirectory.open(new File("E:\\lucene\\index").toPath());
        // 2. Create an IndexWriter on top of the Directory.
        //    To use the Chinese analyzer, pass an IKAnalyzer to IndexWriterConfig.
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
//        IndexWriter indexWriter = new IndexWriter(directory, new IndexWriterConfig());
        // 3. Read the files on disk and create one document object per file.
        File dir = new File("E:\\lucene\\searchSource");
        File[] files = dir.listFiles();
        for (File file : files) {
            // File name
            String fileName = file.getName();
            // File path
            String filePath = file.getPath();
            // File content
            String fileContent = FileUtils.readFileToString(file, "utf-8");
            // File size
            long fileSize = FileUtils.sizeOf(file);
            // Create the fields.
            Field fieldName = new org.apache.lucene.document.TextField("name", fileName, Field.Store.YES);
//            Field fieldPath = new org.apache.lucene.document.TextField("path",filePath, Field.Store.YES);
            // The path needs neither indexing nor analysis, only storage, so a StoredField is enough.
            Field fieldPath = new StoredField("path", filePath);
            Field fieldContent = new org.apache.lucene.document.TextField("content", fileContent, Field.Store.YES);
            // Instead of converting the long to a String, index it with LongPoint
            // and add a StoredField under the same name so the value can be read back.
            Field fieldSizeValue = new LongPoint("size", fileSize);
            Field fieldSizeStore = new StoredField("size", fileSize);
            // Create the document object.
            Document document = new Document();
            // 4. Add the fields to the document.
            document.add(fieldName);
            document.add(fieldPath);
            document.add(fieldContent);
//            document.add(fieldSize);
            document.add(fieldSizeValue);
            document.add(fieldSizeStore);
            // 5. Write the document to the index.
            indexWriter.addDocument(document);
        }
        // 6. Close the IndexWriter.
        indexWriter.close();
    }

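2. Deleting from the index

1) Delete everything: deleteAll()

2) Delete the documents that match a query or term: deleteDocuments()

3. Updating the index

1) Update the index (delete first, then add): updateDocument()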

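import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * @author FENCO
 * @date 2020/1/12 14:36
 * @company HOWSO
 */
public class IndexManager {

    private IndexWriter indexWriter;

    @Before
    public void init() throws Exception {
        // 1. Create an IndexWriter that uses IKAnalyzer as its analyzer.
        indexWriter = new IndexWriter(FSDirectory.open(new File("E:\\lucene\\index").toPath()),
                                                       new IndexWriterConfig(new IKAnalyzer()));
    }
    /**
     * Adds a document to the index.
     * @throws Exception
     */
    @Test
    public void addDocument() throws Exception {
        // 2. Create a Document object.
        Document document = new Document();
        // 3. Add fields to the document.
        document.add(new TextField("name", "新添加的文件", Field.Store.YES));
        document.add(new StoredField("path","E:\\lucene\\searchSource"));
        document.add(new TextField("content", "新添加的文件", Field.Store.YES));
        // 4. Write the document to the index.
        indexWriter.addDocument(document);
        // 5. Close the IndexWriter.
        indexWriter.close();
    }


    /**
     * Deletes every document in the index.
     * @throws Exception
     */
    @Test
    public void deleteAllDocument() throws Exception {
        // Delete all documents.
        indexWriter.deleteAll();
        // Close the IndexWriter.
        indexWriter.close();
    }


    /**
     * Deletes the documents that match a term.
     */
    @Test
    public void deleteDocumentByQuery() throws Exception{
        // Delete the documents whose name field contains the term "apache".
        indexWriter.deleteDocuments(new Term("name","apache"));
        // Close the IndexWriter.
        indexWriter.close();
    }


    /**
     * Updates the index (delete first, then add).
     * @throws Exception
     */
    @Test
    public void updateDocument() throws Exception {
        // Create a new document object.
        Document document = new Document();
        // Add fields to the document.
        document.add(new TextField("name1", "更新之后的文档1",Field.Store.YES));
        document.add(new TextField("name2", "更新之后的文档2",Field.Store.YES));
        document.add(new TextField("name3", "更新之后的文档3",Field.Store.YES));
        // Update: documents whose name field contains "spring" are deleted, then the new document is added.
        indexWriter.updateDocument(new Term("name", "spring"),document);
        // Close the IndexWriter.
        indexWriter.close();
    }
}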
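
Because an update is "delete first, then add", it helps to see what that means when several documents match the term. Below is a toy model in plain Java. It is a sketch of the semantics only, and the class UpdateSketch with its word-based matching rule is an assumption for illustration. In Lucene the matching is done on indexed terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class UpdateSketch {
    // Each document is a map of field name -> field value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(doc);
    }

    /** Deletes every document whose field contains the keyword as a whole word. */
    int deleteDocuments(String field, String keyword) {
        int before = docs.size();
        docs.removeIf(d -> d.containsKey(field)
                && Arrays.asList(d.get(field).toLowerCase(Locale.ROOT).split("\\s+"))
                         .contains(keyword));
        return before - docs.size();
    }

    /** Update = delete all matching documents, then add the replacement once. */
    void updateDocument(String field, String keyword, Map<String, String> replacement) {
        deleteDocuments(field, keyword);
        addDocument(replacement);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        UpdateSketch index = new UpdateSketch();
        index.addDocument(Collections.singletonMap("name", "spring boot guide"));
        index.addDocument(Collections.singletonMap("name", "spring cloud notes"));
        index.addDocument(Collections.singletonMap("name", "lucene in action"));

        // Two documents match "spring"; both are removed and one document is added.
        index.updateDocument("name", "spring",
                Collections.singletonMap("name1", "updated document"));

        System.out.println(index.size());   // prints 2
    }
}
```

Note that the replacement document in this sketch, like the one in updateDocument() above, has no name field, so running the same update again would delete nothing and simply add another copy.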

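VIII. Querying a Lucene index

1. Using subclasses of Query

1) TermQuery

Searches by keyword. You must specify the field to search and the keyword to search for.

2) Range query

Searches a range of values. For numeric fields the query is built with a factory method such as LongPoint.newRangeQuery.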
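
The idea behind a numeric range query can be illustrated with a sorted map in plain Java. This is only a sketch: the class RangeLookupSketch is hypothetical, and Lucene's point fields use their own index structure rather than a TreeMap. Both bounds are treated as inclusive here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class RangeLookupSketch {
    // value -> IDs of the documents with that value, kept sorted by value
    private final TreeMap<Long, List<Integer>> byValue = new TreeMap<>();

    void add(int docId, long value) {
        byValue.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    /** Returns the IDs of documents whose value lies in [lower, upper], both ends inclusive. */
    List<Integer> range(long lower, long upper) {
        List<Integer> hits = new ArrayList<>();
        if (lower > upper) {
            return hits;   // empty range; subMap would throw for inverted bounds
        }
        for (List<Integer> ids : byValue.subMap(lower, true, upper, true).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        RangeLookupSketch sizes = new RangeLookupSketch();
        sizes.add(0, 57L);      // hypothetical file sizes in bytes
        sizes.add(1, 100L);
        sizes.add(2, 4096L);

        System.out.println(sizes.range(0L, 100L));   // prints [0, 1]
    }
}
```

Because the values are kept in sorted order, the lookup only visits the entries inside the requested range instead of checking every document.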

2. Querying with QueryParser

QueryParser analyzes (tokenizes) the text to be searched first, then queries based on the resulting tokens.

Add lucene-queryparser-7.4.0.jar

Steps:

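import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.Before;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * @author FENCO
 * @date 2020/1/12 15:58
 * @company HOWSO
 */
public class SearchIndex {

    private IndexReader indexReader;
    private IndexSearcher indexSearcher;
    @Before
    public void init() throws Exception {
        indexReader = DirectoryReader.open(FSDirectory.open(new File("E:\\lucene\\index").toPath()));
        indexSearcher = new IndexSearcher(indexReader);
    }

    /**
     * Numeric range query.
     * @throws Exception
     */
    @Test
    public void testRangeQuery() throws Exception {
        // Create a Query object for sizes from 0 to 100.
        Query query = LongPoint.newRangeQuery("size", 0L, 100L);
        printResult(query);
    }

    private void printResult(Query query) throws Exception {
        // Run the query.
        // Argument 1: the query.  Argument 2: the maximum number of results to return.
        TopDocs topDocs = indexSearcher.search(query,100);
        // Get the total number of hits.
        System.out.println("Total hits: " + topDocs.totalHits);
        // Get the list of matching documents.
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        // Print the contents of each document.
        for (ScoreDoc doc : scoreDocs) {
            // Get the document ID.
            int docId = doc.doc;
            // Fetch the document object by ID.
            Document document = indexSearcher.doc(docId);
            System.out.println(document.get("name"));
            System.out.println(document.get("path"));
//            System.out.println(document.get("content"));
            System.out.println(document.get("size"));
            System.out.println("========================");
        }
        indexReader.close();
    }

    /**
     * QueryParser:
     * analyze the text first, then query.
     * @throws Exception
     */
    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser with two arguments.
        // Argument 1: the default field to search.  Argument 2: the analyzer.
        QueryParser queryParser = new QueryParser("name",new IKAnalyzer());
        // Use the QueryParser to create a Query object.
        Query query = queryParser.parse("Lucene是一个Java开发的全文检索工具包");
        // Run the query.
        printResult(query);
    }
}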
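
The difference between looking up an exact term and parsing the query text first can be sketched in plain Java. This is an illustration only: the class ParsedQuerySketch is hypothetical, fields are left out for brevity, the tokenizer handles ASCII words only, and the tokens of a parsed query are assumed to be combined with OR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ParsedQuerySketch {
    // token -> IDs of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Minimal analyzer: lowercases and splits on non-word characters. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    void add(int docId, String text) {
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Exact-term lookup: the input is used as-is, with no analysis. */
    Set<Integer> termSearch(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    /** Parsed lookup: analyze the input, then take the union of each token's matches. */
    Set<Integer> parsedSearch(String queryText) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : analyze(queryText)) {
            hits.addAll(termSearch(token));
        }
        return hits;
    }

    public static void main(String[] args) {
        ParsedQuerySketch index = new ParsedQuerySketch();
        index.add(0, "Lucene in Action");
        index.add(1, "Java full-text search");
        index.add(2, "Spring guide");

        System.out.println(index.termSearch("Lucene Java"));     // prints []
        System.out.println(index.parsedSearch("Lucene Java"));   // prints [0, 1]
    }
}
```

The exact-term lookup finds nothing because no single indexed token equals the whole input, while the parsed lookup matches every document that contains at least one of the tokens.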
