自定义中文全文索引

最新推荐文章于 2025-04-19 05:00:00 发布

马超的博客

最新推荐文章于 2025-04-19 05:00:00 发布

阅读量1.5k

点赞数

分类专栏：知识图谱 Neo4j 大数据应用文章标签： neo4j 中文分词中文全文索引

本文链接：https://blog.csdn.net/superman_xxx/article/details/89204456

版权

Neo4j 同时被 3 个专栏收录

90 篇文章

订阅专栏

知识图谱

31 篇文章

订阅专栏

大数据应用

18 篇文章

订阅专栏

自定义中文全文索引

一、中文分词插件
- 1、分词组件的调整
- 2、分词测试
二、样例数据准备
三、通过中文全文分词组件创建节点索引
四、中文分词索引查询
五、总结

一、中文分词插件

NEO4J中文全文索引，分词组件使用IKAnalyzer。为了支持高版本LUCENE，IKAnalyzer需要做一些调整。

IKAnalyzer-3.0 旧版本实现参考

ELASTICSEARCH-IKAnlyzer 高版本实现参考

1、分词组件的调整

调整之后的分词组件 casia.isiteam.zdr.wltea

// 调整之后的实现
public final class IKAnalyzer extends Analyzer {

    // 默认细粒度切分 true-智能切分 false-细粒度切分
    private Configuration configuration = new Configuration(false);

    /**
     * IK分词器Lucene  Analyzer接口实现类
     * <p>
     * 默认细粒度切分算法
     */
    public IKAnalyzer() {
    }

    /**
     * IK分词器Lucene Analyzer接口实现类
     *
     * @param configuration IK配置
     */
    public IKAnalyzer(Configuration configuration) {
        super();
        this.configuration = configuration;
    }

    /**
     * 重载Analyzer接口，构造分词组件
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer _IKTokenizer = new IKTokenizer(configuration);
        return new TokenStreamComponents(_IKTokenizer);
    }

}

2、分词测试

自定义分词函数

RETURN zdr.index.iKAnalyzer('复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？',true) AS words

在这里插入图片描述

 /**
     * @param text:待分词文本
     * @param useSmart:true 用智能分词，false 细粒度分词
     * @return
     * @Description: TODO(支持中英文本分词)
     */
    @UserFunction(name = "zdr.index.iKAnalyzer")
    @Description("Fulltext index iKAnalyzer - RETURN zdr.index.iKAnalyzer({text},true) AS words")
    public List<String> iKAnalyzer(@Name("text") String text, @Name("useSmart") boolean useSmart) {

        PropertyConfigurator.configureAndWatch("dic" + File.separator + "log4j.properties");
        Configuration cfg = new Configuration(useSmart);

        StringReader input = new StringReader(text.trim());
        IKSegmenter ikSegmenter = new IKSegmenter(input, cfg);

        List<String> results = new ArrayList<>();
        try {
            for (Lexeme lexeme = ikSegmenter.next(); lexeme != null; lexeme = ikSegmenter.next()) {
                results.add(lexeme.getLexemeText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return results;
    }

二、样例数据准备

# 构造样例数据
MERGE (a:Loc {name:'A'}) SET a.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (b:Loc {name:'B'}) SET b.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (c:Loc {name:'C'}) SET c.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (d:Loc {name:'D'}) SET d.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (e:Loc {name:'E'}) SET e.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (f:Loc {name:'F'}) SET f.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！吖啶基氨基甲烷磺酰甲氧基苯胺是一种药嘛？'
MERGE (a)-[:ROAD {cost:50}]->(b)
MERGE (a)-[:ROAD {cost:50}]->(c)
MERGE (a)-[:ROAD {cost:100}]->(d)
MERGE (b)-[:ROAD {cost:40}]->(d)
MERGE (c)-[:ROAD {cost:40}]->(d)
MERGE (c)-[:ROAD {cost:80}]->(e)
MERGE (d)-[:ROAD {cost:30}]->(e)
MERGE (d)-[:ROAD {cost:80}]->(f)
MERGE (e)-[:ROAD {cost:40}]->(f);

三、通过中文全文分词组件创建节点索引

自定义创建索引过程

CALL zdr.index.addChineseFulltextIndex('IKAnalyzer', 'Loc', ['description']) YIELD message RETURN message

在这里插入图片描述

 @Procedure(value = "zdr.index.addChineseFulltextIndex", mode = Mode.WRITE)
    @Description("CALL zdr.index.addChineseFulltextIndex(String indexName, String labelName, List<String> propKeys) YIELD message RETURN message," +
            "为一个标签下的所有节点的指定属性添加索引")
    public Stream<NodeIndexMessage> addChineseFulltextIndex(@Name("indexName") String indexName,
                                                            @Name("labelName") String labelName, @Name("properties") List<String> propKeys) {
        Label label = Label.label(labelName);

        List<NodeIndexMessage> output = new ArrayList<>();

//        // 按照标签找到该标签下的所有节点
        ResourceIterator<Node> nodes = db.findNodes(label);
        System.out.println("nodes:" + nodes.toString());

        int nodesSize = 0;
        int propertiesSize = 0;
        while (nodes.hasNext()) {
            nodesSize++;
            Node node = nodes.next();
            System.out.println("current nodes:" + node.toString());

            // 每个节点上需要添加索引的属性
            Set<Map.Entry<String, Object>> properties = node.getProperties(propKeys.toArray(new String[0])).entrySet();
            System.out.println("current node properties" + properties);

            // 查询该节点是否已有索引，有的话删除
            Index<Node> index = db.index().forNodes(indexName, FULL_INDEX_CONFIG);
            System.out.println("current node index" + index);
            index.remove(node);

            // 为了该节点的每个需要添加索引的属性添加全文索引
            for (Map.Entry<String, Object> property : properties) {
                propertiesSize++;
                index.add(node, property.getKey(), property.getValue());
            }
        }

        String message = "IndexName:" + indexName + ",LabelName:" + labelName + ",NodesSize:" + nodesSize + ",PropertiesSize:" + propertiesSize;
        NodeIndexMessage indexMessage = new NodeIndexMessage(message);
        output.add(indexMessage);
        return output.stream();
    }

四、中文分词索引查询

自定义查询索引过程

CALL zdr.index.chineseFulltextIndexSearch('IKAnalyzer', 'description:吖啶基氨基甲烷磺酰甲氧基苯胺', 100) YIELD node RETURN node

在这里插入图片描述

CALL zdr.index.chineseFulltextIndexSearch('IKAnalyzer', 'description:复联* AND year:1999', 100) YIELD node,weight RETURN node.name,node.year,node.description,weight

在这里插入图片描述

  @Procedure(value = "zdr.index.chineseFulltextIndexSearch", mode = Mode.WRITE)
    @Description("CALL zdr.index.chineseFulltextIndexSearch(String indexName, String query, long limit) YIELD node RETURN node," +
            "执行LUCENE全文检索，返回前{limit个结果}")
    public Stream<ChineseHit> chineseFulltextIndexSearch(@Name("indexName") String indexName,
                                                         @Name("query") String query, @Name("limit") long limit) {
        if (!db.index().existsForNodes(indexName)) {
            log.debug("如果索引不存在则跳过本次查询：`%s`", indexName);
            return Stream.empty();
        }
        return db.index()
                .forNodes(indexName, FULL_INDEX_CONFIG)
                .query(new QueryContext(query).sortByScore().top((int) limit))
                .stream()
                .map(ChineseHit::new);  // provider
    }

【跨标签类型检索】使用addChineseFulltextIndex给标签下节点属性添加的索引，默认可以使用chineseFulltextIndexSearch合并检索出来

// 增加一个非Loc标签的节点，然后使用检索
CREATE (n:LocProvince {name:'P'})  SET n.description='复联终章快上映了好激动，据说知识图谱与人工智能技术应用到了那部电影！' RETURN n
// 节点增加索引（索引名与已有相同）
CALL zdr.index.addChineseFulltextIndex('IKAnalyzer', 'LocProvince', ['description','year']) YIELD message RETURN message
// 通过属性检索节点
CALL zdr.index.chineseFulltextIndexSearch('IKAnalyzer', 'description:复联', 100) YIELD node,weight RETURN node