java对字符串进行分词并根据每个词出现的次数计算权重

t梧桐树t

于 2023-05-12 14:49:08 发布

阅读量419

点赞数

文章标签： lucene 搜索引擎

本文链接：https://blog.csdn.net/winerpro/article/details/130642009

版权

如果要对字符串进行分词的操作,我们可以借助很多成熟的工具包,例如Stanford, CoreNLP, Jieba, IK Analyzer, HanLP, lucene等诸多工具此处我是用的是lucene,原本我是想用IK分词器的,但是不知道为什么maven我始终下载不下来

lucene介绍

Lucene是一个基于Java语言开发的开源全文搜索引擎库。也就是说，它能够提供全文搜索、近似搜索、词法分析、查询解析以及索引等多种文本处理功能。Lucene 作为一个 Java开发包，强大、高效、精确地完成各种搜索任务，深受Java开发者的青睐。

Lucene采用简单明了的接口，易于操作，任何一名开发工程师都可以快速上手。通过使用Lucene，开发人员可以轻松地实现文本搜索、语言处理，提高其软件产品的质量和价值。它与 Solr和 ElasticSearch等搜索引擎的API兼容性良好，常用于企业级应用中，例如商务搜索引擎、文档管理系统、知识管理系统、邮件服务器、文件格式转换器等多种场景。

引入依赖

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>8.10.0</version>
        </dependency>

测试代码

public class Test {

    public static void main(String[] args) throws IOException {
        String text = "这句话是有回音的," +
                "是有回音的," +
                "有回音的," +
                "的";
        List<String> list = tokenizeString(text);
        // 遍历List，并统计每个字符串的出现次数
        Map<String, Integer> statistics = new HashMap<>(list.size());
        for (String str : list) {
            statistics.put(str, statistics.getOrDefault(str, 0) + 1);
        }

        for (Map.Entry<String, Integer> entry : statistics.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }

    public static List<String> tokenizeString(String str) throws IOException {
        //中文分词器
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("text", new StringReader(str));

        List<String> tokens = new ArrayList<>();
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();

        while (tokenStream.incrementToken()) {
            String term = charTermAttribute.toString();
            tokens.add(term);
        }

        analyzer.close();

        return tokens;
    }
}