Lucene

Most recent recommended article published on 2024-06-05 09:00:00

qq_39013701

Category: java

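I. Types of data
1. Structured data
Data with a fixed type or a fixed length.
Examples: data in a database, metadata (such as file metadata in Windows)

2. Unstructured data
Data with no fixed type and no fixed length.
Examples: the contents of a Word document, the contents of an email

Searching structured data:
e.g. data in a database can be searched with SQL statements
e.g. metadata (in Windows) can be searched through the search bar that Windows provides

Searching unstructured data:
e.g. a Word document can be searched with Ctrl + F
Ctrl + F uses sequential scanning, i.e. it walks through the text in order and tries to match at each position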

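II. Common search algorithms
1. Sequential scan: match character by character until the content is found
Advantage: if the content exists, it is guaranteed to be found
Disadvantage: slow and inefficient, because every search reads through the text again

2. Full-text search (inverted index): extract the content of each file, split the text into individual words (tokenization), and build an index from those words (comparable to the table of contents of a dictionary). A search looks up the index first and then follows the index to the documents. This process is called full-text search.
Advantage: searches are fast
Disadvantage: the index has to be stored on disk, so this approach uses more disk space; it trades space for time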
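
To make the contrast between the two algorithms concrete, here is a minimal sketch in plain Java (no Lucene involved). The class name `InvertedIndexSketch` and the whitespace-and-punctuation tokenizer are assumptions made for illustration; a real analyzer does far more. `search` answers from the index with a single map lookup, while `scan` re-reads every document on every query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> sorted IDs of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // original texts, kept only so that scan() has something to read
    private final List<String> documents = new ArrayList<>();

    // Naive tokenizer (an assumption): lowercase, split on anything that is not a letter or digit
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Index one document and return its ID
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Full-text search: look the term up in the index
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase());
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Sequential scan: tokenize and inspect every document again
    List<Integer> scan(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            if (tokenize(documents.get(i)).contains(term.toLowerCase())) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Apache Lucene is a search library");
        index.add("Think in Java");
        index.add("Lucene in action");
        System.out.println(index.search("lucene")); // [0, 2]
        System.out.println(index.scan("lucene"));   // [0, 2]
    }
}
```

Both methods return the same documents; the difference is that the `postings` map has to be built and kept around, which is the extra space the index costs.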

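III. Lucene
1. Concept
Lucene is a top-level Apache project and a toolkit for full-text search.
Lucene is a set of jar files for building a full-text search engine. You can use it to build such a system, but it cannot run on its own.

2. Application areas
(1) Internet-wide full-text search engines
(2) Site search engines

3. Lucene structure
The structure is divided into the index and the documents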
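Index:
Entries take the form fieldName:term,
and each entry holds pointers to the documents that the term came from

Index library: the folder that holds the index (you can create this folder anywhere; once the index is placed in it, it is the index library)
Term: a single word, the smallest unit of a word in Lucene

Document:
A Document object. One Document can contain multiple Field objects, and each Field is a key-value pair: a field name and a field value.
A Document corresponds to one record (row) in a database table, and a Field corresponds to the value of one column in that row.
This is a general-purpose storage structure.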

4. Demo
Code that creates the index library
IndexManagerTest.java

package com.miracle.lucene;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

@SuppressWarnings("all")
public class IndexManagerTest {

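    /**
     * Create the index.
     * @throws Exception
     */
    @Test
    public void testIndexCreate() throws Exception{
        // List that holds the Documents to be indexed
        List<Document> documentList = new ArrayList<Document>();

        // Collect documents from the file system and put them into Lucene
        // Directory that contains the source files
        File dir = new File("/home/miracle/Desktop/searchsource");
        // Loop over the directory and read each file
        for (File file : dir.listFiles()) {
            // Skip subdirectories: the index directory used below ("cache") lives inside this folder
            if (!file.isFile()) {
                continue;
            }
            // File name
            String fileName = file.getName();
            // File content
            String fileContext = FileUtils.readFileToString(file);
            // File size
            Long fileSize = FileUtils.sizeOf(file);
            // Document object: each file in the file system becomes one Document
            Document document = new Document();
            // First parameter: field name
            // Second parameter: field value
            // Third parameter: whether to store the value (Field.Store.YES stores it, Field.Store.NO does not)

            // Tokenize: yes, because it is indexed and is not a single atomic value, so tokenizing is meaningful
            // Index: yes, because it is searched on
            // Store: yes, because it is displayed directly on the page
            TextField nameField = new TextField("fileName", fileName , Field.Store.YES);
            // Tokenize: yes, because the content is searched and tokenizing it is meaningful
            // Index: yes, because it is searched on
            // Store: optional; if it is not stored, the content cannot be read back from a search result
            TextField contextField = new TextField("fileContext", fileContext , Field.Store.NO);
            // Tokenize: yes, because numbers are compared when searching by size; Lucene applies its own internal tokenization to numeric values
            // Index: yes, because it is searched on by size
            // Store: yes, because the file size is displayed
            LongField sizeField = new LongField("fileSize", fileSize, Field.Store.YES);

            // Add all fields to the document
            document.add(nameField);
            document.add(contextField);
            document.add(sizeField);

            // Add the document to the document list
            documentList.add(document);
        }

        // Create a Chinese analyzer. The alternative, StandardAnalyzer, works well for English but splits Chinese text into single characters
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        Directory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        // Create the configuration object for the writer
        // Specify the version and the analyzer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the writer for the index and documents
        // First parameter: directory where the index and documents are stored
        // Second parameter: the writer configuration created above
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // Add the documents to the writer
        for (Document doc : documentList) {
            indexWriter.addDocument(doc);
        }
        // Commit
        indexWriter.commit();
        // Close the writer
        indexWriter.close();
    }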

    /**
     * Delete from the index.
     * @throws Exception
     */
    @Test
    public void testIndexDel() throws Exception{
        // Create the Chinese analyzer
        IKAnalyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        Directory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        // Create the configuration object for the writer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the writer for the index and documents
        IndexWriter indexWriter = new IndexWriter(directory, config);
        // Delete everything
        // indexWriter.deleteAll();
        // Delete by name. Term: the first parameter is the field name, the second is the keyword; documents whose field contains this keyword are deleted
        indexWriter.deleteDocuments(new Term("fileName", "apache"));
        // Commit
        indexWriter.commit();
        // Close
        indexWriter.close();
    }

    /**
     * Update a document: search with the given Term; if matches are found, delete them and add the
     * updated content as a new Document. If nothing matches, the updated content is simply added
     * as a new Document.
     * @throws Exception
     */
    @Test
    public void testIndexUpdate() throws Exception{
        // Create the Chinese analyzer
        IKAnalyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        Directory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        // Create the configuration object for the writer
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
        // Create the writer for the index and documents
        IndexWriter indexWriter = new IndexWriter(directory, config);

        // Update by file name
        Term term = new Term("fileName", "web");
        // The replacement document
        Document doc = new Document();
        doc.add(new TextField("fileName", "xxxxxx", Field.Store.YES));
        doc.add(new TextField("fileContext", "think in java xxxxxxx", Field.Store.NO));
        doc.add(new LongField("fileSize", 100L, Field.Store.YES));
        // Update
        indexWriter.updateDocument(term, doc);
        // Commit
        indexWriter.commit();
        // Close
        indexWriter.close();
    }
}

Code that runs queries
IndexSearchTest.java

package com.miracle.lucene;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;

@SuppressWarnings("all")
public class IndexSearchTest {

    @Test
    public void testIndexSearch() throws Exception{
        // Create the Chinese analyzer (the analyzer used at query time must be the same one used when the index was built)
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        FSDirectory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        IndexReader indexReader = IndexReader.open(directory);
        // Create the searcher over the index
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Create the query parser
        // First parameter: the default search field. If the query string names a field, that field is searched; otherwise the default field is searched
        // Second parameter: the analyzer
        QueryParser queryParser = new QueryParser("fileContext", analyzer);
        // Query syntax:    fieldName:keyword
        Query query = queryParser.parse("fileName:apache");
        // Search
        // First parameter: the query object
        // Second parameter: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Total number of matching records
        System.out.println("=====count=====" + topDocs.totalHits);
        // Get the result set from the search result object
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the docID
            int docID = scoreDoc.doc;
            // Read the corresponding document from disk by its ID
            Document document = indexReader.document(docID);
            // Read stored values by field name and print them
            System.out.println("fileName " + document.get("fileName"));
            System.out.println("fileSize " + document.get("fileSize"));
            System.out.println("===============================================");
        }
        // Release the reader once the results have been printed
        indexReader.close();
    }

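    @Test
    public void testIndexTermQuery() throws Exception{
        // Create the Chinese analyzer (the analyzer used at query time must be the same one used when the index was built)
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        FSDirectory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        IndexReader indexReader = IndexReader.open(directory);
        // Create the searcher over the index
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Create the term, i.e. the word to look for
        // TermQuery does not run the analyzer, so the text must match an indexed term exactly
        Term term = new Term("fileName", "apache");
        // Query with TermQuery, which searches by the Term object
        TermQuery termQuery = new TermQuery(term);

        // Search
        // First parameter: the query object
        // Second parameter: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(termQuery, 10);
        // Total number of matching records
        System.out.println("=====count=====" + topDocs.totalHits);
        // Get the result set from the search result object
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the docID
            int docID = scoreDoc.doc;
            // Read the corresponding document from disk by its ID
            Document document = indexReader.document(docID);
            // Read stored values by field name and print them
            System.out.println("fileName " + document.get("fileName"));
            System.out.println("fileSize " + document.get("fileSize"));
            System.out.println("===============================================");
        }
        // Release the reader once the results have been printed
        indexReader.close();
    }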

    /**
     * Query by numeric range.
     * @throws Exception
     */
    @Test
    public void testNumericRangeQuery() throws Exception{
        // Create the Chinese analyzer (the analyzer used at query time must be the same one used when the index was built)
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        FSDirectory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        IndexReader indexReader = IndexReader.open(directory);
        // Create the searcher over the index
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Find files whose size is between 100 and 1000, both bounds included
        // Parameters: field name, minimum, maximum, whether the minimum is included, whether the maximum is included
        Query query = NumericRangeQuery.newLongRange("fileSize",100L ,1000L, true, true);

        // Search
        // First parameter: the query object
        // Second parameter: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Total number of matching records
        System.out.println("=====count=====" + topDocs.totalHits);
        // Get the result set from the search result object
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the docID
            int docID = scoreDoc.doc;
            // Read the corresponding document from disk by its ID
            Document document = indexReader.document(docID);
            // Read stored values by field name and print them
            System.out.println("fileName " + document.get("fileName"));
            System.out.println("fileSize " + document.get("fileSize"));
            System.out.println("===============================================");
        }
        // Release the reader once the results have been printed
        indexReader.close();
    }

    /**
     * Combined (boolean) query.
     * @throws Exception
     */
    @Test
    public void testBooleanQuery() throws Exception{
        // Create the Chinese analyzer (the analyzer used at query time must be the same one used when the index was built)
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        FSDirectory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        IndexReader indexReader = IndexReader.open(directory);
        // Create the searcher over the index
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Find files whose size is between 100 and 1000, both bounds included
        // Parameters: field name, minimum, maximum, whether the minimum is included, whether the maximum is included
        Query numericQuery = NumericRangeQuery.newLongRange("fileSize",100L ,1000L, true, true);

        // Create the term, i.e. the word to look for
        Term term = new Term("fileName", "apache");
        // Query with TermQuery, which searches by the Term object
        TermQuery termQuery = new TermQuery(term);

        // A boolean query combines several conditions into one query
        // Here: files whose name contains apache AND whose size is between 100 and 1000 bytes, inclusive
        BooleanQuery booleanQuery = new BooleanQuery();
        // Add the conditions to the BooleanQuery
        // booleanQuery.add: the first parameter is the condition, the second is how it combines with the others
        // (AND, OR, NOT correspond to Occur.MUST, Occur.SHOULD, Occur.MUST_NOT)
        booleanQuery.add(termQuery, BooleanClause.Occur.MUST);
        booleanQuery.add(numericQuery, BooleanClause.Occur.MUST);

        // Search
        // First parameter: the query object
        // Second parameter: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(booleanQuery, 10);
        // Total number of matching records
        System.out.println("=====count=====" + topDocs.totalHits);
        // Get the result set from the search result object
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the docID
            int docID = scoreDoc.doc;
            // Read the corresponding document from disk by its ID
            Document document = indexReader.document(docID);
            // Read stored values by field name and print them
            System.out.println("fileName " + document.get("fileName"));
            System.out.println("fileSize " + document.get("fileSize"));
            System.out.println("===============================================");
        }
        // Release the reader once the results have been printed
        indexReader.close();
    }

    /**
     * Search several fields at once with MultiFieldQueryParser.
     * @throws Exception
     */
    @Test
    public void testMultiFieldQueryParser() throws Exception{
        // Create the Chinese analyzer (the analyzer used at query time must be the same one used when the index was built)
        Analyzer analyzer = new IKAnalyzer();
        // Directory where the index and documents are stored
        FSDirectory directory = FSDirectory.open(new File("/home/miracle/Desktop/searchsource/cache"));
        IndexReader indexReader = IndexReader.open(directory);
        // Create the searcher over the index
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        String [] fields = {"fileName", "fileContext"};
        // Search both the file name and the file content; a document matches if either contains apache
        MultiFieldQueryParser multiFieldQueryParser = new MultiFieldQueryParser(fields, analyzer);
        // The keyword to search for
        Query query = multiFieldQueryParser.parse("apache");

        // Search
        // First parameter: the query object
        // Second parameter: the maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Total number of matching records
        System.out.println("=====count=====" + topDocs.totalHits);
        // Get the result set from the search result object
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Get the docID
            int docID = scoreDoc.doc;
            // Read the corresponding document from disk by its ID
            Document document = indexReader.document(docID);
            // Read stored values by field name and print them
            System.out.println("fileName " + document.get("fileName"));
            System.out.println("fileSize " + document.get("fileSize"));
            System.out.println("===============================================");
        }
        // Release the reader once the results have been printed
        indexReader.close();
    }
}
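
The AND / OR / NOT reading of `Occur.MUST`, `Occur.SHOULD` and `Occur.MUST_NOT` can be pictured as set operations on the document IDs each condition matches. The sketch below is plain Java, not Lucene code; the class name `BooleanCombineSketch` and the sample IDs are made up for illustration, and it ignores scoring entirely.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class BooleanCombineSketch {
    // Two MUST conditions: keep documents matched by both (AND)
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Two SHOULD conditions: keep documents matched by either (OR)
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // MUST combined with MUST_NOT: drop the excluded documents (NOT)
    static Set<Integer> not(Set<Integer> a, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical matches: IDs of documents whose name contains "apache"
        Set<Integer> nameHasApache = new TreeSet<>(Arrays.asList(1, 2, 5));
        // Hypothetical matches: IDs of documents whose size is in the range
        Set<Integer> sizeInRange = new TreeSet<>(Arrays.asList(2, 3, 5, 8));

        System.out.println(and(nameHasApache, sizeInRange)); // [2, 5]
        System.out.println(or(nameHasApache, sizeInRange));  // [1, 2, 3, 5, 8]
        System.out.println(not(nameHasApache, sizeInRange)); // [1]
    }
}
```

This is a simplification: in Lucene, when a query also has MUST clauses, SHOULD clauses are optional and mainly influence ranking rather than which documents match.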

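5. Fields in detail
(1) Whether to tokenize
Tokenization exists to support indexing.
Fields that need tokenizing: file name, file content
Fields that do not: fields that are not indexed, and fields whose value is meaningless once split into tokens
Examples that do not need tokenizing: database IDs, national ID numbers
(2) Whether to index
Indexing exists to support searching.
A field that needs to be searched must be indexed; only indexed fields can be found by a search.
A field that is never searched does not need to be indexed.
Examples that need indexing: file name, file content, IDs, national ID numbers
Examples that do not: an image path (e:\xxx.jsp)
(3) Whether to store
Whether to store depends on your needs. Storing saves the field value in the Document so that it can be read back from search results, at the cost of extra disk space. If the value has to be displayed as soon as the search returns, store it, which makes displaying results fast. If it does not need to be displayed right away, do not store it, because the extra disk space is not worth it.
The purpose of storing is display.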
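
The three choices (tokenize, index, store) are independent of each other, which a toy model may make easier to see. The following plain-Java sketch is not Lucene's implementation; `FieldOptionsSketch`, `FieldSpec`, the whitespace tokenizer and the sample values (including the ID `doc-42`) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FieldOptionsSketch {
    // One field value plus the three choices discussed above
    static final class FieldSpec {
        final String name;
        final String value;
        final boolean tokenized;
        final boolean indexed;
        final boolean stored;

        FieldSpec(String name, String value, boolean tokenized, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.tokenized = tokenized;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "fieldName:term" -> IDs of matching documents (what searching uses)
    private final Map<String, List<Integer>> index = new HashMap<>();
    // per document: field name -> original value (what displaying uses)
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int addDocument(List<FieldSpec> fields) {
        int docId = storedValues.size();
        Map<String, String> kept = new HashMap<>();
        for (FieldSpec field : fields) {
            if (field.indexed) {
                // Tokenized fields are split into words; others are indexed as one whole term
                String[] terms = field.tokenized
                        ? field.value.toLowerCase().split("\\s+")
                        : new String[] { field.value };
                for (String term : terms) {
                    List<Integer> ids = index.computeIfAbsent(field.name + ":" + term, k -> new ArrayList<>());
                    if (!ids.contains(docId)) {
                        ids.add(docId);
                    }
                }
            }
            if (field.stored) {
                kept.put(field.name, field.value);
            }
        }
        storedValues.add(kept);
        return docId;
    }

    List<Integer> search(String field, String term) {
        return index.getOrDefault(field + ":" + term, new ArrayList<>());
    }

    String get(int docId, String field) {
        return storedValues.get(docId).get(field);
    }

    public static void main(String[] args) {
        FieldOptionsSketch sketch = new FieldOptionsSketch();
        int id = sketch.addDocument(Arrays.asList(
                new FieldSpec("fileContext", "Lucene is a search library", true, true, false),
                new FieldSpec("id", "doc-42", false, true, true),
                new FieldSpec("imagePath", "e:\\xxx.jsp", false, false, true)));

        System.out.println(sketch.search("fileContext", "search"));     // [0]  indexed, so searchable
        System.out.println(sketch.get(id, "fileContext"));              // null not stored, so not displayable
        System.out.println(sketch.search("id", "doc-42"));              // [0]  matches only as a whole
        System.out.println(sketch.search("imagePath", "e:\\xxx.jsp"));  // []   not indexed
        System.out.println(sketch.get(id, "imagePath"));                // e:\xxx.jsp
    }
}
```

In Lucene 4.x these combinations roughly correspond to `TextField` (tokenized and indexed), `StringField` (indexed as a single term) and `StoredField` (stored only).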

Field types

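Note: in Lucene's underlying algorithm, monetary amounts are tokenized, because prices need to be compared.
For example: finding products priced above 12.5 yuan and below 100 yuan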
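
The price example amounts to a range lookup over an ordered numeric index. As a rough sketch of the idea (plain Java, not Lucene's actual numeric encoding), the hypothetical `PriceRangeSketch` below keeps prices sorted in a `TreeMap` so that a range can be answered without checking every product. The product IDs and prices are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PriceRangeSketch {
    // price -> IDs of the products with that price, kept in ascending price order
    private final TreeMap<Double, List<Integer>> byPrice = new TreeMap<>();

    void add(int docId, double price) {
        byPrice.computeIfAbsent(price, k -> new ArrayList<>()).add(docId);
    }

    // Same parameter order as the range query in the demo: min, max, then the two inclusive flags
    List<Integer> range(double min, double max, boolean minInclusive, boolean maxInclusive) {
        List<Integer> hits = new ArrayList<>();
        for (List<Integer> ids : byPrice.subMap(min, minInclusive, max, maxInclusive).values()) {
            hits.addAll(ids);
        }
        return hits;
    }

    public static void main(String[] args) {
        PriceRangeSketch index = new PriceRangeSketch();
        index.add(0, 9.9);
        index.add(1, 12.5);
        index.add(2, 58.0);
        index.add(3, 100.0);
        index.add(4, 250.0);

        // Above 12.5 and below 100, both bounds excluded
        System.out.println(index.range(12.5, 100.0, false, false)); // [2]
        // Same range with both bounds included
        System.out.println(index.range(12.5, 100.0, true, true));   // [1, 2, 3]
    }
}
```

Results come back in ascending price order because that is how the map is sorted; a real search engine would rank or sort them separately.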