Basic Usage of HanLP

一默一语

Category: Third-Party Integration. Tags: java, experience sharing

Article link: https://blog.csdn.net/qq_41610957/article/details/123055142

1. Introduction to HanLP

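HanLP is an NLP toolkit made up of a collection of models and algorithms. Its goal is to bring natural language processing into production environments. HanLP is designed to be full-featured, efficient, cleanly architected, built on up-to-date corpora, and customizable.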

GitHub: https://github.com/hankcs/HanLP
Official website: https://www.hanlp.com/

2. Download and Configuration

Add the dependency to pom.xml:

<dependency>
	<groupId>com.hankcs</groupId>
	<artifactId>hanlp</artifactId>
	<version>portable-1.8.2</version>
</dependency>

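Once the dependency above is added, the basic features work out of the box (everything except character-based word formation (由字构词) and dependency parsing).
User-customized features require installing the data package and configuring the hanlp.properties file.
Data package file: data.zip
HanLP's data is split into dictionaries and models: dictionaries are required for lexical analysis, and models are required for syntactic analysis. You can add, remove, or replace them yourself. If you do not need syntactic parsing or similar features, you can delete the model folder at any time.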

data
│
├─dictionary
└─model

3. File Configuration

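The dictionary data and the hanlp.properties configuration file are stored in the project directory (the location is up to you, as long as the configuration file points to where the dictionary data lives).
When editing hanlp.properties, these are the settings to pay attention to:

On Windows, you only need to change root so that it points to the data package location. To add custom words, append your custom dictionary file to CustomDictionaryPath.
On Linux, besides adjusting root and CustomDictionaryPath, you also need to override the default IO adapter.
When the custom dictionary is small, you can insert the words in code instead of writing them to a dictionary file, for example CustomDictionary.insert("custom word", "nz 1024"), where the second argument is the part-of-speech tag followed by its frequency.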

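# Location of the HanLP data package
# root=D:/JavaProjects/HanLP/
# root=/home/aword/
root=src/main/resources/
# Custom dictionary paths, separated by ';'. An entry that starts with a space is in the same directory as the previous one. The form "filename POS" makes that POS the default for the dictionary. Priority decreases from left to right.
# All dictionaries use UTF-8. Each line is one word, in the format [word] [POS A] [frequency of A] [POS B] [frequency of B] ... If no POS is given, the dictionary's default POS is used.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; 现代汉语补充词库.txt; 全国地名大全.txt ns; 人名词典.txt; 机构名词典.txt; 上海地名.txt ns;data/dictionary/person/nrf.txt nrf;
# The default IO adapter is shown below; it is based on the ordinary file system.
#IOAdapter=com.hankcs.hanlp.corpus.io.FileIOAdapter
# Override the adapter by pointing to your own class
IOAdapter=com.aword.config.ResourceFileIoAdapter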
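
The path rules described in the comments above (entries separated by ';', a leading space meaning "same directory as the previous entry", and an optional default POS after the file name) can be sketched in plain Java. This is only an illustration of those rules, not HanLP's own loading code; the class name CustomDictionaryPathSketch and its Entry type are hypothetical, and the sketch assumes file names contain no spaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: resolves a CustomDictionaryPath value into concrete entries. */
final class CustomDictionaryPathSketch {

    /** One resolved dictionary entry; defaultPos is null when no POS was given. */
    static final class Entry {
        final String path;
        final String defaultPos;

        Entry(String path, String defaultPos) {
            this.path = path;
            this.defaultPos = defaultPos;
        }
    }

    static List<Entry> resolve(String root, String value) {
        List<Entry> entries = new ArrayList<>();
        String previousDir = root;
        for (String raw : value.split(";")) {
            if (raw.trim().isEmpty()) {
                continue; // tolerate empty segments such as "a;;b"
            }
            boolean sameDir = raw.startsWith(" ");
            String item = raw.trim();
            // An optional default POS follows the file name after a space
            // (assumption: the file name itself contains no spaces).
            String defaultPos = null;
            int space = item.lastIndexOf(' ');
            if (space != -1) {
                defaultPos = item.substring(space + 1);
                item = item.substring(0, space).trim();
            }
            String path;
            if (sameDir) {
                path = previousDir + item;
            } else {
                path = root + item;
                int slash = path.lastIndexOf('/');
                if (slash != -1) {
                    previousDir = path.substring(0, slash + 1);
                }
            }
            entries.add(new Entry(path, defaultPos));
        }
        return entries;
    }

    public static void main(String[] args) {
        String value = "data/dictionary/custom/CustomDictionary.txt; 全国地名大全.txt ns;"
                + "data/dictionary/person/nrf.txt nrf;";
        for (Entry e : resolve("src/main/resources/", value)) {
            System.out.println(e.path + " -> " + e.defaultPos);
        }
    }
}
```

Running main prints each resolved path next to its default POS (or null), which is a quick way to check that a long CustomDictionaryPath value means what you intended.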

ResourceFileIoAdapter.java

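package com.aword.config;

import com.hankcs.hanlp.corpus.io.IIOAdapter;
import org.springframework.core.io.ClassPathResource;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ResourceFileIoAdapter implements IIOAdapter {
    @Override
    public InputStream open(String path) throws IOException {
        // Read from the classpath so that resources packaged with the application can be found
        return this.getClass().getClassLoader().getResourceAsStream(path);
    }

    @Override
    public OutputStream create(String path) throws IOException {
        // Note: getFile() requires the resource to already exist as a regular file on disk
        return new FileOutputStream(new ClassPathResource(path).getFile());
    }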
    
//    @Override
//    public InputStream open(String path) throws IOException {
//        String tempDir = Files.createTempDirectory("hanlp").toAbsolutePath().toString();
//        String cachePath = new File(tempDir + "/" + path).getPath().intern();
//        if (IOUtil.isFileExisted(cachePath)) {
//            return new FileInputStream(cachePath);
//        }
//        InputStream inputStream = IOUtil.getResourceAsStream("/" + path);
//        return inputStream;
//    }
//
//    @Override
//    public OutputStream create(String path) throws IOException {
//        String tempDir = Files.createTempDirectory("hanlp").toAbsolutePath().toString();
//        String cachePath = new File(tempDir + "/" + path).getPath().intern();
//        if (IOUtil.isResource(path)) {
//            mkdir(cachePath);
//            return new FileOutputStream(cachePath);
//        }
//        FileOutputStream fileOutputStream = new FileOutputStream(path);
//        return fileOutputStream;
//    }
//
//
//    private void mkdir(String cachePath) {
//        if (new File(cachePath).exists()) {
//            return;
//        }
//        String dir = cachePath.endsWith(File.separator) ? cachePath : StringUtils.substringBeforeLast(cachePath, File.separator);
//        new File(dir).mkdirs();
//    }    

}
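
The idea behind the adapter above is to look a path up on the classpath instead of relying on the working directory. As a library-independent sketch of that idea, the snippet below tries the classpath first and falls back to the file system. It deliberately does not implement HanLP's IIOAdapter so that it runs with the JDK alone; the class name ResourceOpenerSketch is hypothetical, and the fallback order is an assumption rather than something the original adapter does.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative only: classpath-first lookup with a file-system fallback. */
final class ResourceOpenerSketch {

    static InputStream open(String path) throws IOException {
        // 1. Try the classpath (works for resources packaged with the application).
        InputStream in = ResourceOpenerSketch.class.getClassLoader().getResourceAsStream(path);
        if (in != null) {
            return in;
        }
        // 2. Fall back to an ordinary file on disk.
        Path file = Paths.get(path);
        if (Files.isRegularFile(file)) {
            return Files.newInputStream(file);
        }
        // Fail loudly instead of returning null to the caller.
        throw new FileNotFoundException("Not found on classpath or file system: " + path);
    }
}
```

Throwing FileNotFoundException rather than returning null makes a misconfigured root or dictionary path show up at the point of failure instead of as a later NullPointerException.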

4. Basic Usage

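Almost all of HanLP's functionality can be called through the HanLP utility class. If you cannot remember a method name, type HanLP. and your IDE should offer suggestions along with HanLP's documentation. All demos live under com.hankcs.demo.
HanLP part-of-speech tags: see the HanLP POS tag set (HanLP词性标注集).
First demo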

System.out.println(HanLP.segment("你好，欢迎使用HanLP汉语处理包！"));

Standard segmentation
Algorithm details: "Generating the word graph" (词图的生成)

List<Term> termList = StandardTokenizer.segment("商品和服务");
System.out.println(termList);

NLP segmentation

System.out.println(NLPTokenizer.segment("我新造一个词叫幻想乡你能识别并标注正确词性吗？"));
// Note the POS tags of the two occurrences of "希望" and the two occurrences of "晚霞" below
System.out.println(NLPTokenizer.analyze("我的希望是希望张晚霞的背影被晚霞映红").translateLabels());
System.out.println(NLPTokenizer.analyze("支援臺灣正體香港繁體：微软公司於1975年由比爾·蓋茲和保羅·艾倫創立。"));

High-speed dictionary segmentation

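High-speed segmentation is dictionary-based longest-match segmentation: it is extremely fast, with mediocre accuracy.
It reached 45 million characters per second on an i7-6700K.

Algorithm details: "Aho Corasick自动机结合DoubleArrayTrie极速多模式匹配" (ultra-fast multi-pattern matching with an Aho-Corasick automaton combined with a DoubleArrayTrie)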

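import com.hankcs.hanlp.tokenizer.SpeedTokenizer;

/**
 * Demonstrates high-speed segmentation: dictionary-based segmentation built on
 * AhoCorasickDoubleArrayTrie, suited to high-throughput scenarios where mediocre accuracy is acceptable
 * @author hankcs
 */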
public class DemoHighSpeedSegment
{
    public static void main(String[] args)
    {
        String text = "江西鄱阳湖干枯，中国最大淡水湖变成大草原";
        System.out.println(SpeedTokenizer.segment(text));
        long start = System.currentTimeMillis();
        int pressure = 1000000;
        for (int i = 0; i < pressure; ++i)
        {
            SpeedTokenizer.segment(text);
        }
        double costTime = (System.currentTimeMillis() - start) / (double)1000;
        System.out.printf("Segmentation speed: %.2f characters per second%n", text.length() * pressure / costTime);
    }
}
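
To make "dictionary-based longest match" concrete, here is a minimal sketch of forward longest-match segmentation over a plain HashMap-based trie. It is not HanLP's implementation: SpeedTokenizer is built on AhoCorasickDoubleArrayTrie, which is far faster, and its output may differ. The class name LongestMatchSketch and the tiny dictionary are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: forward longest-match segmentation with a simple trie. */
final class LongestMatchSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    List<String> segment(String text) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            Node node = root;
            int longestEnd = start + 1; // fall back to a single character
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) {
                    break; // no dictionary word continues with this character
                }
                if (node.isWord) {
                    longestEnd = i + 1; // remember the longest word seen so far
                }
            }
            result.add(text.substring(start, longestEnd));
            start = longestEnd;
        }
        return result;
    }

    public static void main(String[] args) {
        LongestMatchSketch segmenter = new LongestMatchSketch();
        for (String word : new String[] {"中国", "最大", "淡水", "淡水湖", "大草原", "草原"}) {
            segmenter.addWord(word);
        }
        // prints [中国, 最大, 淡水湖, 变, 成, 大草原]
        System.out.println(segmenter.segment("中国最大淡水湖变成大草原"));
    }
}
```

Because the matcher always takes the longest dictionary word at each position and never looks at context, it is fast but cannot resolve ambiguity, which is why the accuracy is described as mediocre.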

User-defined dictionary
Algorithm details: "Trie树分词" (trie-based segmentation)

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.collection.AhoCorasick.AhoCorasickDoubleArrayTrie;
import com.hankcs.hanlp.dictionary.CoreDictionary;
import com.hankcs.hanlp.dictionary.CustomDictionary;

/**
 * Demonstrates adding and removing user dictionary entries at runtime
 *
 * @author hankcs
 */
public class DemoCustomDictionary
{
    public static void main(String[] args)
    {
        // Add a word dynamically
        CustomDictionary.add("攻城狮");
        // Force-insert a word
        CustomDictionary.insert("白富美", "nz 1024");
        // Remove a word (uncomment to try it)
//        CustomDictionary.remove("攻城狮");
        System.out.println(CustomDictionary.add("单身狗", "nz 1024 n 1"));
        System.out.println(CustomDictionary.get("单身狗"));

        String text = "攻城狮逆袭单身狗，迎娶白富美，走上人生巅峰";  // As if, haha!

        // The AhoCorasickDoubleArrayTrie automaton scans the text for custom words
        final char[] charArray = text.toCharArray();
        CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
        {
            @Override
            public void hit(int begin, int end, CoreDictionary.Attribute value)
            {
                System.out.printf("[%d:%d]=%s %s\n", begin, end, new String(charArray, begin, end - begin), value);
            }
        });

        // The custom dictionary takes effect in every tokenizer
        System.out.println(HanLP.segment(text));
    }
}
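
The attribute string passed to CustomDictionary.add and CustomDictionary.insert, such as "nz 1024 n 1", follows the dictionary format given in hanlp.properties: alternating POS tags and frequencies. The sketch below parses such a string into tag/frequency pairs with the JDK alone. It only illustrates the format and is not how HanLP parses attributes internally; the class name AttributeStringSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: parses "POS freq POS freq ..." into an ordered map. */
final class AttributeStringSketch {

    static Map<String, Integer> parse(String attribute) {
        String[] parts = attribute.trim().split("\\s+");
        if (parts.length % 2 != 0) {
            throw new IllegalArgumentException(
                    "Expected alternating POS tags and frequencies: " + attribute);
        }
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i += 2) {
            frequencies.put(parts[i], Integer.parseInt(parts[i + 1]));
        }
        return frequencies;
    }

    static int totalFrequency(Map<String, Integer> frequencies) {
        int total = 0;
        for (int frequency : frequencies.values()) {
            total += frequency;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> parsed = parse("nz 1024 n 1");
        // prints {nz=1024, n=1} total=1025
        System.out.println(parsed + " total=" + totalFrequency(parsed));
    }
}
```

In "nz 1024 n 1", the word is declared as nz with frequency 1024 and as n with frequency 1, so the first tag dominates.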

The above covers only a few basic segmentation methods. For more segmenters and further details, see the official HanLP documentation.