Lucene 7 GroupingBy: a grouping wrapper class

Drift2333

Article link: https://blog.csdn.net/Drift2333/article/details/128715548

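This article introduces Lucene as the underlying architecture of a full-text search engine, providing indexing, querying, and partial text-analysis capabilities. Solr is an extension built on Lucene: an enterprise search server with a richer query language and management features. The article also covers the concept of grouped statistics in Lucene, the related parameters, and how to run a grouped query in a test.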

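I. Lucene

Lucene is an open-source full-text search toolkit. It is not a complete full-text search engine but rather the architecture for one: it provides a complete query engine and indexing engine, plus a partial text-analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text search to a target system, or for building a complete full-text search engine on top of it.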

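Lucene has four basic data types:

Index: the index, made up of many Documents.
Document: made up of many Fields; it is the smallest unit for indexing and searching.
Field: made up of many Terms; it has a Field Name and a Field Value.
Term: made up of bytes. Typically, each minimal unit produced by tokenizing a Text-type Field Value is called a Term.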
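
To make the containment relationship concrete, here is a toy model in plain Java. It is not Lucene's API: the class name ToyIndex, the map-based documents, and the whitespace tokenizer are all assumptions made for illustration, and real Lucene analysis and storage are far more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only (hypothetical, not Lucene): index -> documents -> fields -> terms. */
public class ToyIndex {
    // Each document is a map from field name to field value.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Adds a document and returns its id (its position in the index). */
    public int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Naive analysis: lower-case the value and split on whitespace to get terms.
            for (String term : field.getValue().toLowerCase().split("\\s+")) {
                if (term.isEmpty()) {
                    continue;
                }
                String key = field.getKey() + ":" + term;
                List<Integer> ids = postings.computeIfAbsent(key, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                    ids.add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the ids of documents whose given field contains the term. */
    public List<Integer> search(String field, String term) {
        List<Integer> ids = postings.get(field + ":" + term.toLowerCase());
        return ids == null ? new ArrayList<Integer>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("author", "author1");
        doc.put("content", "random text");
        index.addDocument(doc);
        System.out.println(index.search("content", "random"));
    }
}
```

The point of the sketch is only the nesting: a search resolves a (field, term) pair to document ids, which is why the Document is the smallest unit returned by a search.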

In Lucene, the read and write paths are separate: writing goes through an IndexWriter, while reading goes through an IndexSearcher.

Apache Solr

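Solr is a high-performance full-text search server developed in Java 5 and based on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a full-featured administration interface, which makes it an excellent full-text search engine. It exposes a Web-service-like API: users can submit XML files in a given format to the search server over HTTP to build the index, and can issue search requests with HTTP GET and receive the results as XML.

Apache Solr is a popular open-source search server. Because it exposes a REST-like HTTP API, you can use Solr from almost any programming language.

Solr is an open-source search platform for building search applications, and it is built on top of Lucene (a full-text search engine). Solr is enterprise-grade, fast, and highly scalable. Applications built with Solr can be sophisticated and still deliver high performance.

Solr differs from Lucene in three essential respects: it is a search server, it is enterprise-oriented, and it handles management. Lucene is essentially a search library, not a standalone application, whereas Solr is one. Lucene focuses on the low-level building blocks of search, whereas Solr focuses on enterprise applications. Lucene does not take care of the management needed to run a search service, whereas Solr does. In one sentence: Solr is an extension of Lucene for enterprise search applications.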

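II. Grouping

1. Introduction to grouping

When searching with Lucene, we may need statistics over the data that matches some condition, for example how many provinces there are. In SQL we could use DISTINCT for this, or GROUP BY to group the queried column.

Grouping is mainly used to compute grouped statistics over different documents in a Lucene index that share the same value for a given field.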
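
As a point of comparison, the DISTINCT / GROUP BY idea can be sketched in plain Java with streams. This is not Lucene code; the class name ProvinceGroupCount and the sample province values are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical illustration of group-by semantics, independent of Lucene. */
public class ProvinceGroupCount {
    /** Roughly: SELECT province, COUNT(*) FROM records GROUP BY province. */
    public static Map<String, Long> countByProvince(List<String> provinces) {
        return provinces.stream()
                .collect(Collectors.groupingBy(p -> p, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> provinces =
                Arrays.asList("Hunan", "Hubei", "Hunan", "Sichuan", "Hubei", "Hunan");
        Map<String, Long> counts = countByProvince(provinces);
        // The key set plays the role of DISTINCT.
        System.out.println(counts.keySet()); // prints [Hubei, Hunan, Sichuan]
        // The values are the per-group counts.
        System.out.println(counts);          // prints {Hubei=2, Hunan=3, Sichuan=1}
    }
}
```

Lucene grouping answers the same kind of question, but over the documents matched by a query and with the group key read from a field of each document.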

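2. Parameters accepted by grouping

groupField: the field to group by. For example, to group by province, pass province. Note that documents in which groupField does not exist are returned in a group whose value is null;

groupSort: how the groups are sorted; the sort field determines the order in which the groups are shown;

topNGroups: how many groups to show; only records 0 through topNGroups are computed;

groupOffset: which top group to start from. For example, with groupOffset set to 3, the records from 3 through topNGroups are shown; this value can be used for paginated queries;

withinGroupSort: how documents are sorted within each group;

maxDocsPerGroup: how many documents are processed per group;

withinGroupOffset: the starting position of the documents shown in each group;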
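
The way the offset and limit parameters interact can be sketched in plain Java. This is one reading of the descriptions above, not Lucene's implementation: it assumes groups are kept in the half-open range [groupOffset, topNGroups) and documents in [withinGroupOffset, maxDocsPerGroup), and that the input is already sorted by groupSort and withinGroupSort. The class name GroupWindow is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of how the grouping offsets and limits slice a result. */
public class GroupWindow {
    /**
     * Keeps groups in [groupOffset, topNGroups) and, inside each kept group,
     * documents in [withinGroupOffset, maxDocsPerGroup).
     * Assumes non-negative arguments and already-sorted input.
     */
    public static Map<String, List<String>> window(Map<String, List<String>> sortedGroups,
            int groupOffset, int topNGroups, int withinGroupOffset, int maxDocsPerGroup) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int groupIndex = 0;
        for (Map.Entry<String, List<String>> group : sortedGroups.entrySet()) {
            if (groupIndex >= topNGroups) {
                break; // groups beyond topNGroups are never computed
            }
            if (groupIndex >= groupOffset) {
                List<String> docs = group.getValue();
                int end = Math.min(maxDocsPerGroup, docs.size());
                int start = Math.min(withinGroupOffset, end);
                result.put(group.getKey(), new ArrayList<>(docs.subList(start, end)));
            }
            groupIndex++;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        groups.put("author1", Arrays.asList("d1", "d2", "d3"));
        groups.put("author2", Arrays.asList("d4"));
        groups.put("author3", Arrays.asList("d5", "d6"));
        // Skip the first group, stop after the third, keep one document per group.
        System.out.println(window(groups, 1, 3, 0, 1)); // prints {author2=[d4], author3=[d5]}
    }
}
```

Under this reading, a page of groups is obtained by advancing groupOffset and topNGroups together by the page size.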

3. Other important parameters

Sort constant	SortField constant	Meaning
Sort.INDEXORDER	SortField.FIELD_DOC	Sort by index order
Sort.RELEVANCE	SortField.FIELD_SCORE	Sort by relevance score

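III. Testing

The API currently recommends two main ways to group: a two-pass approach and a single-pass approach. The existing wrapper class GroupingSearch supports both.

Features implemented:

1. Group by author, with GroupDocsLimit setting the upper bound on documents per group

2. Pagination: each page interleaves all groups in round-robin order (TODO, to be added later)

Test code, as used in codeTest-easy-lucene
Step 1: create the index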
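
The round-robin pagination is left as a TODO in the original. One possible reading, sketched in plain Java and not taken from the author's code, is to take the first document of every group, then the second of every group, and so on, and then cut the merged list into pages. The class name RoundRobinPager and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of round-robin paging across groups. */
public class RoundRobinPager {
    /** Interleaves groups: round 0 takes each group's first item, round 1 the second, and so on. */
    public static <T> List<T> interleave(List<List<T>> groups) {
        List<T> merged = new ArrayList<>();
        int longest = 0;
        for (List<T> group : groups) {
            longest = Math.max(longest, group.size());
        }
        for (int round = 0; round < longest; round++) {
            for (List<T> group : groups) {
                if (round < group.size()) {
                    merged.add(group.get(round)); // exhausted groups are skipped
                }
            }
        }
        return merged;
    }

    /** Returns the 0-based page of the interleaved list; an empty list past the end. */
    public static <T> List<T> page(List<List<T>> groups, int page, int pageSize) {
        List<T> merged = interleave(groups);
        int from = Math.min(page * pageSize, merged.size());
        int to = Math.min(from + pageSize, merged.size());
        return new ArrayList<>(merged.subList(from, to));
    }

    public static void main(String[] args) {
        List<List<String>> groups = Arrays.asList(
                Arrays.asList("1", "2", "3"), // author1
                Arrays.asList("4"),           // author2
                Arrays.asList("5", "6"));     // author3
        System.out.println(page(groups, 0, 4)); // prints [1, 4, 5, 2]
        System.out.println(page(groups, 1, 4)); // prints [6, 3]
    }
}
```

This version merges every group before slicing, which is simple but does work proportional to the whole result set for each page; a real implementation would likely want to avoid that for large results.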

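// Imports needed by the snippets below (Lucene 7.x; JUnit 4 is assumed for @Test)
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.junit.Test;

// Index directory
    static String indexDir = "D:\\codeTest\\luceneTest\\easyLucene";
    static Analyzer analyzer = new StandardAnalyzer();
    // Name of the field to group on
    static String groupField = "author";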

    @Test
    public void mainTest() throws Exception{
        createIndex();
//        Directory directory = FSDirectory.open(Paths.get(indexDir));
//        IndexReader reader = DirectoryReader.open(directory);
//        IndexSearcher searcher = new IndexSearcher(reader);
//        Query query = new TermQuery(new Term("content", "random"));
//        /** Sort order inside each group */
//        Sort groupSort = Sort.RELEVANCE;
//        groupBy(searcher, query, groupSort);
    }


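    /**
     * Creates the index and the documents used for testing.
     * Note: with CREATE_OR_APPEND, running this more than once appends the same documents again.
     *
     * @throws IOException
     */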
    public static void createIndex() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, indexWriterConfig);
        addDocuments(groupField, writer);
    }

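    /**
     * Adds the test documents to the index, then commits and closes the writer.
     *
     * @param groupField name of the field to group on
     * @param writer     index writer to add the documents to
     * @throws IOException
     */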
    public static void addDocuments(String groupField, IndexWriter writer)
            throws IOException {
        // 0
        Document doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "random text", Field.Store.YES));
        doc.add(new StringField("id", "1", Field.Store.YES));
        writer.addDocument(doc);

        // 1
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random text",
                Field.Store.YES));
        doc.add(new StringField("id", "2", Field.Store.YES));
        writer.addDocument(doc);

        // 2
        doc = new Document();
        addGroupField(doc, groupField, "author1");
        doc.add(new StringField("author", "author1", Field.Store.YES));
        doc.add(new TextField("content", "some more random textual data",
                Field.Store.YES));
        doc.add(new StringField("id", "3", Field.Store.YES));
        writer.addDocument(doc);

        // 3
        doc = new Document();
        addGroupField(doc, groupField, "author2");
        doc.add(new StringField("author", "author2", Field.Store.YES));
        doc.add(new TextField("content", "some random text", Field.Store.YES));
        doc.add(new StringField("id", "4", Field.Store.YES));
        writer.addDocument(doc);

        // 4
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "some more random text",
                Field.Store.YES));
        doc.add(new StringField("id", "5", Field.Store.YES));
        writer.addDocument(doc);

        // 5
        doc = new Document();
        addGroupField(doc, groupField, "author3");
        doc.add(new StringField("author", "author3", Field.Store.YES));
        doc.add(new TextField("content", "random", Field.Store.YES));
        doc.add(new StringField("id", "6", Field.Store.YES));
        writer.addDocument(doc);

        // 6 -- author4 (the group field is added here as well, so this document does not produce a null group)
        doc = new Document();
        addGroupField(doc, groupField, "author4");
        doc.add(new StringField("author", "author4", Field.Store.YES));
        doc.add(new TextField("content",
                "random word stuck in alot of other text", Field.Store.YES));
        doc.add(new StringField("id", "7", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit();
        writer.close();
    }

    /**
     * Adds the grouping field to a document.
     *
     * @param doc
     *            the document being indexed
     * @param groupField
     *            name of the field to group on
     * @param value
     *            value of the field
     */
    private static void addGroupField(Document doc, String groupField,
                                      String value) {
        // The field used for grouping must be indexed as a SortedDocValuesField
        doc.add(new SortedDocValuesField(groupField, new BytesRef(value)));
    }

Step 2: groupingBy

@Test
    public void lucene7GroupBy() throws Exception{
        GroupingSearch groupingSearch = new GroupingSearch(groupField); // field to group on
        groupingSearch.setGroupSort(new Sort(SortField.FIELD_SCORE)); // how the groups are sorted
        groupingSearch.setFillSortFields(true); // whether to fill the sortValues of each SearchGroup
        groupingSearch.setCachingInMB(4.0, true);
        groupingSearch.setAllGroups(true);
        //groupingSearch.setAllGroupHeads(true);
        groupingSearch.setGroupDocsLimit(10); // upper bound on documents per group

        // Without search keywords: match documents by author value
        BooleanQuery query = new BooleanQuery.Builder()
                .add(new TermQuery(new Term("author", "author1")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term
                                ("author", "author2")),
                        BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author3")), BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("author", "author4")), BooleanClause.Occur.SHOULD).build();

        // With search keywords (alternative query, parsed against the content field)
//        Analyzer analyzer = new StandardAnalyzer();
//        QueryParser parser = new QueryParser("content", analyzer);
//        String queryExpression = "some content";
//        Query query = parser.parse(queryExpression);
        Directory directory = FSDirectory.open(Paths.get(indexDir));
        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Run the query and group the hits by the author field (group offset 0, at most 1000 groups)
        TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 1000);
        int totalHit = result.totalHitCount;
        // Total number of hits
        System.out.println("Total hits: " + totalHit);
        // Number of groups
        System.out.println("Groups: " + result.groups.length);
        // Print the results group by group
        Map<String, List<Document>> groupingMap = new HashMap<>();
        for (GroupDocs<BytesRef> groupDocs : result.groups){
            List<Document> totalDoc = new ArrayList<>();
            if (groupDocs != null) {
                if (groupDocs.groupValue != null) {
                    System.out.println("Group: " + groupDocs.groupValue.utf8ToString());
                }else{
                    // groupValue is null for documents indexed without a SortedDocValuesField on the group field
                    System.out.println("Group: " + "unknown");
                }
                System.out.println("Hits in group: " + groupDocs.totalHits);

                ScoreDoc[] scoreDocs = groupDocs.scoreDocs;
                int maxCount = Math.min(totalHit, scoreDocs.length);
                for(int i = 0; i < maxCount; i++){
                    Document document = searcher.doc(scoreDocs[i].doc);
                    totalDoc.add(document);
                }
                // Guard against an empty group before reading its first document
                if (!totalDoc.isEmpty()) {
                    groupingMap.put(totalDoc.get(0).get("author"), totalDoc);
                }
                for(ScoreDoc scoreDoc : groupDocs.scoreDocs){
                    System.out.println("author:" + searcher.doc(scoreDoc.doc).get("author"));
                    System.out.println("content:" + searcher.doc(scoreDoc.doc).get("content"));
                    System.out.println();
                }
                System.out.println("=====================================");
            }
        }
        // Release the index reader once the results have been read
        reader.close();
    }
