Learning and Applying HanLP

Less_weight

Categories: Java tech library, nlp. Tags: learning, java, programming languages, nlp, Chinese word segmentation

Article link: https://blog.csdn.net/Less_weight/article/details/127916953

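Introduction to HanLP

Official documentation

Official site: https://www.hanlp.com/index.html
GitHub: https://github.com/hankcs/HanLP/tree/v1.7.8

Overview

HanLP is an open-source third-party library that provides NLP features such as Chinese and English word segmentation, custom segmentation, part-of-speech tagging, keyword extraction, and sentiment analysis. Its main strengths are a quick start and simple configuration. In my own experience, it is usable with no prior NLP background.
If your project needs data cleaning, data analysis, sentiment analysis, or similar processing, consider using this library directly.
The rest of this article describes how to use it in a Java Spring project.

Quick start

For the full manual, see the official site and the GitHub documentation.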

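Adding the Maven dependency

If the Maven repository you use already hosts this library, reference the dependency directly. If it does not, first download the complete HanLP package from GitHub and upload it to your Maven repository.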

<dependency>
    <groupId>com.hankcs</groupId>
    <artifactId>hanlp</artifactId>
    <version>portable-1.7.8</version>
</dependency>

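After adding the dependency, reload the Maven project and run install. The available versions are listed on the official site; this article uses portable-1.7.8.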

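Adding the data package (data)

Most projects can download the official data package and use it as the base data, then extend it with custom or domain-specific data.
https://github.com/hankcs/HanLP/archive/refs/tags/v1.7.8.zip
Data package location
If you have already added the dependency, the jar already contains a data directory with the basic dictionaries. To use custom dictionaries, however, you must add the data directory to your own project.
It can go anywhere, but the resources directory is recommended.
Inside data, the dictionary directory holds the base dictionaries. Custom dictionaries added later are best placed there as well, so that everything is managed in one place.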
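
The layout described above can be checked at startup before any segmentation call runs. The following is a minimal sketch, not part of HanLP: the class name `DataDirectoryCheck` and the default location `src/main/resources` are assumptions made for illustration, and only the `data/dictionary` sub-path comes from the description above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class DataDirectoryCheck {

    /** Returns true if <root>/data/dictionary exists and is a directory. */
    static boolean hasDictionaryDirectory(Path root) {
        return Files.isDirectory(root.resolve("data").resolve("dictionary"));
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the directory that contains "data" as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "src/main/resources");
        System.out.println(root.toAbsolutePath() + " -> " + hasDictionaryDirectory(root));
    }
}
```

Running it against the directory that holds data tells you whether the dictionaries are where you expect them to be.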

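The hanlp.properties configuration file

hanlp.properties is the configuration file you need to write in order to configure custom dictionaries. It is best placed in the resources directory as well.

The file specifies the path of each customizable resource. From the source code we can see: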

try
   {
       p.load(new InputStreamReader(Predefine.HANLP_PROPERTIES_PATH == null ?
                                        loader.getResourceAsStream("hanlp.properties") :
                                        new FileInputStream(Predefine.HANLP_PROPERTIES_PATH)
           , "UTF-8"));
   }
   catch (Exception e)
   {
       String HANLP_ROOT = System.getProperty("HANLP_ROOT");
       if (HANLP_ROOT == null) HANLP_ROOT = System.getenv("HANLP_ROOT");
       if (HANLP_ROOT != null)
       {
           HANLP_ROOT = HANLP_ROOT.trim();
           p = new Properties();
           p.setProperty("root", HANLP_ROOT);
           logger.info("使用环境变量 HANLP_ROOT=" + HANLP_ROOT);
       }
       else throw e;
   }
   String root = p.getProperty("root", "").replaceAll("\\\\", "/");
   if (root.length() > 0 && !root.endsWith("/")) root += "/";
   CoreDictionaryPath = root + p.getProperty("CoreDictionaryPath", CoreDictionaryPath);
   CoreDictionaryTransformMatrixDictionaryPath = root + p.getProperty("CoreDictionaryTransformMatrixDictionaryPath", CoreDictionaryTransformMatrixDictionaryPath);
   BiGramDictionaryPath = root + p.getProperty("BiGramDictionaryPath", BiGramDictionaryPath);
   CoreStopWordDictionaryPath = root + p.getProperty("CoreStopWordDictionaryPath", CoreStopWordDictionaryPath);
   CoreSynonymDictionaryDictionaryPath = root + p.getProperty("CoreSynonymDictionaryDictionaryPath", CoreSynonymDictionaryDictionaryPath);
   PersonDictionaryPath = root + p.getProperty("PersonDictionaryPath", PersonDictionaryPath);
   PersonDictionaryTrPath = root + p.getProperty("PersonDictionaryTrPath", PersonDictionaryTrPath);
   String[] pathArray = p.getProperty("CustomDictionaryPath", "data/dictionary/custom/CustomDictionary.txt").split(";");
   String prePath = root;
   for (int i = 0; i < pathArray.length; ++i)
    {
        if (pathArray[i].startsWith(" "))
        {
            pathArray[i] = prePath + pathArray[i].trim();
        }
        else
        {
            pathArray[i] = root + pathArray[i];
            int lastSplash = pathArray[i].lastIndexOf('/');
            if (lastSplash != -1)
            {
                prePath = pathArray[i].substring(0, lastSplash + 1);
            }
        }
    }
    CustomDictionaryPath = pathArray;
    tcDictionaryRoot = root + p.getProperty("tcDictionaryRoot", tcDictionaryRoot);
    if (!tcDictionaryRoot.endsWith("/")) tcDictionaryRoot += '/';
    PinyinDictionaryPath = root + p.getProperty("PinyinDictionaryPath", PinyinDictionaryPath);
    TranslatedPersonDictionaryPath = root + p.getProperty("TranslatedPersonDictionaryPath", TranslatedPersonDictionaryPath);
    JapanesePersonDictionaryPath = root + p.getProperty("JapanesePersonDictionaryPath", JapanesePersonDictionaryPath);
    PlaceDictionaryPath = root + p.getProperty("PlaceDictionaryPath", PlaceDictionaryPath);
    PlaceDictionaryTrPath = root + p.getProperty("PlaceDictionaryTrPath", PlaceDictionaryTrPath);
    OrganizationDictionaryPath = root + p.getProperty("OrganizationDictionaryPath", OrganizationDictionaryPath);
    OrganizationDictionaryTrPath = root + p.getProperty("OrganizationDictionaryTrPath", OrganizationDictionaryTrPath);
    CharTypePath = root + p.getProperty("CharTypePath", CharTypePath);
    CharTablePath = root + p.getProperty("CharTablePath", CharTablePath);
    PartOfSpeechTagDictionary = root + p.getProperty("PartOfSpeechTagDictionary", PartOfSpeechTagDictionary);
    WordNatureModelPath = root + p.getProperty("WordNatureModelPath", WordNatureModelPath);
    MaxEntModelPath = root + p.getProperty("MaxEntModelPath", MaxEntModelPath);
    NNParserModelPath = root + p.getProperty("NNParserModelPath", NNParserModelPath);
    PerceptronParserModelPath = root + p.getProperty("PerceptronParserModelPath", PerceptronParserModelPath);
    CRFSegmentModelPath = root + p.getProperty("CRFSegmentModelPath", CRFSegmentModelPath);
    HMMSegmentModelPath = root + p.getProperty("HMMSegmentModelPath", HMMSegmentModelPath);
    CRFCWSModelPath = root + p.getProperty("CRFCWSModelPath", CRFCWSModelPath);
    CRFPOSModelPath = root + p.getProperty("CRFPOSModelPath", CRFPOSModelPath);
    CRFNERModelPath = root + p.getProperty("CRFNERModelPath", CRFNERModelPath);
    PerceptronCWSModelPath = root + p.getProperty("PerceptronCWSModelPath", PerceptronCWSModelPath);
    PerceptronPOSModelPath = root + p.getProperty("PerceptronPOSModelPath", PerceptronPOSModelPath);
    PerceptronNERModelPath = root + p.getProperty("PerceptronNERModelPath", PerceptronNERModelPath);
    ShowTermNature = "true".equals(p.getProperty("ShowTermNature", "true"));
    Normalization = "true".equals(p.getProperty("Normalization", "false"));
    String ioAdapterClassName = p.getProperty("IOAdapter");
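
The least obvious part of the excerpt above is how CustomDictionaryPath is resolved: the value is split on semicolons, and an entry that starts with a space reuses the directory of the previous entry. The sketch below reproduces only that loop with the standard library so the rule can be tried in isolation. The class name `CustomDictionaryPathSketch`, the root `D:\hanlp`, and the file `myWords.txt` are assumptions made for illustration.

```java
import java.util.Arrays;

final class CustomDictionaryPathSketch {

    /** Resolves a semicolon-separated CustomDictionaryPath value the way the excerpt does. */
    static String[] resolve(String root, String customDictionaryPath) {
        // Same normalization as the excerpt: backslashes become slashes,
        // and a non-empty root gets a trailing slash.
        root = root.replace('\\', '/');
        if (root.length() > 0 && !root.endsWith("/")) root += "/";

        String[] paths = customDictionaryPath.split(";");
        String prePath = root;
        for (int i = 0; i < paths.length; i++) {
            if (paths[i].startsWith(" ")) {
                // Leading space: resolve against the directory of the previous full entry.
                paths[i] = prePath + paths[i].trim();
            } else {
                paths[i] = root + paths[i];
                int lastSlash = paths[i].lastIndexOf('/');
                if (lastSlash != -1) prePath = paths[i].substring(0, lastSlash + 1);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical root and file names, for illustration only.
        String[] resolved = resolve("D:\\hanlp",
                "data/dictionary/custom/CustomDictionary.txt; myWords.txt");
        System.out.println(Arrays.toString(resolved));
    }
}
```

With these inputs, both entries end up under `D:/hanlp/data/dictionary/custom/`, which is why additional dictionaries in the same directory can be listed by file name alone.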

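The settings we can customize in the configuration file include:

root: the parent directory of the data folder. As the source above shows, every other path is resolved as root + path, so this is the setting to change if you put the data directory somewhere else.
CoreDictionaryPath: the path of the core dictionary, relative to root.
CoreDictionaryTransformMatrixDictionaryPath: I have not used this one; it is presumably the path of the core dictionary's transition matrix.
BiGramDictionaryPath: I have not used this one either; it is presumably the path of the bigram dictionary.
CoreStopWordDictionaryPath: the path of the core stop-word dictionary.
CustomDictionaryPath: the paths of the custom dictionary files, separated by semicolons. This is the setting we need most: list the files that hold your custom words here.
IOAdapter: the class name of a custom IO adapter. I needed this when running on Linux.

I have not explored the remaining settings; feel free to investigate them yourself.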
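
For orientation, a hanlp.properties that uses only the keys discussed above might look like the following. This is a sketch, not the official template: the key names come from the source excerpt, while the root path, the file `myWords.txt`, and the adapter class name are hypothetical values to replace with your own.

```properties
# Parent directory of the data folder (hypothetical path; adjust to your project)
root=D:/projects/demo/src/main/resources/
# Semicolon-separated; an entry with a leading space reuses the previous entry's directory
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; myWords.txt
# Optional: fully qualified class name of a custom IO adapter (hypothetical class, commented out)
#IOAdapter=com.example.nlp.ResourceIOAdapter
```

Any key left out falls back to the default built into the library, as the getProperty calls in the excerpt show.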

Quick try

Segmentation example (official):

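import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;
import java.util.List;

public class DemoSegment
{
    public static void main(String[] args)
    {
        // Chinese test sentences, several of them deliberately ambiguous to segment
        String[] testCase = new String[]{
                "商品和服务",
                "当下雨天地面积水分外严重",
                "结婚的和尚未结婚的确实在干扰分词啊",
                "买水果然后来世博园最后去世博会",
                "中国的首都是北京",
                "欢迎新老师生前来就餐",
                "工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作",
                "随着页游兴起到现在的页游繁盛，依赖于存档进行逻辑判断的设计减少了，但这块也不能完全忽略掉。",
        };
        for (String sentence : testCase)
        {
            // Segment with the default segmenter and print the resulting terms
            List<Term> termList = HanLP.segment(sentence);
            System.out.println(termList);
        }
    }
}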