hanlp中文语言处理--词典加载源码过程分析及自定义用户词汇添加

最新推荐文章于 2023-09-10 00:47:44 发布

风zi

最新推荐文章于 2023-09-10 00:47:44 发布

阅读量3.9k

点赞数

分类专栏：中文解析工具调用

本文链接：https://blog.csdn.net/qq_35241080/article/details/83185790

版权

中文解析工具调用专栏收录该内容

4 篇文章 0 订阅

订阅专栏

一、hanlp本地词典加载源码分析

hanlp在调用提供的函数处理文本时会先初始化本地词典，加载词典进入内存中

以中文分词接口为例子

1.调用分词函数入口

   public class DemoAtFirstSight
{
    public static void main(String[] args)
    {
        System.out.println("首次编译运行时，HanLP会自动构建词典缓存，请稍候……");
//        HanLP.Config.enableDebug();         // 为了避免你等得无聊，开启调试模式说点什么:-)
//调用分词函数 
System.out.println(HanLP.segment("我也不知道你好一个心眼儿啊，一半天欢迎使用HanLP汉语处理包！" +
        		"接下来请从其他Demo中体验HanLP丰富的功能~"));
    }
}

    /**跟下去
     * 分词
     *
     * @param text 文本
     * @return 切分后的单词
     */
    public static List<Term> segment(String text)
    {
        return StandardTokenizer.segment(text.toCharArray());
    }
    /**继续走
     * 分词
     * @param text 文本
     * @return 分词结果
     */
    public static List<Term> segment(char[] text)
    {
        return SEGMENT.seg(text);
    }


    /**还要跟进
     * 分词
     *
     * @param text 待分词文本
     * @return 单词列表
     */
    public List<Term> seg(char[] text)
    {
        assert text != null;
        //是否执行字符串正规化，例如繁体转换成简体，当然这不是主要的
        if (HanLP.Config.Normalization)
        {
            CharTable.normalization(text);
        }
        return segSentence(text);
    }

    /**继续前进
     * 给一个句子分词，可以看出这是一个抽象类，具体有子类来实现
     *
     * @param sentence 待分词句子
     * @return 单词列表
     */
    protected abstract List<Term> segSentence(char[] sentence);

2.具体功能由ViterbiSegment子类来实现
在这里插入图片描述

3.分词器

/**
 * Viterbi分词器<br>
 * 也是最短路分词，最短路求解采用Viterbi算法
 *
 * @author hankcs
 */
public class ViterbiSegment extends WordBasedSegment
{
    @Override
    protected List<Term> segSentence(char[] sentence)
    {
//        long start = System.currentTimeMillis();
		//1.进入加载核心词典
		/**
         * 核心词典路径
         */
      //  public static String CoreDictionaryPath = "data/dictionary/CoreNatureDictionary.txt";
        WordNet wordNetAll = new WordNet(sentence);
        生成词网
        generateWordNet(wordNetAll);
        ///生成词图
//        System.out.println("构图：" + (System.currentTimeMillis() - start));
        if (HanLP.Config.DEBUG)
        {
            System.out.printf("粗分词网：\n%s\n", wordNetAll);
        }
//        start = System.currentTimeMillis();
        List<Vertex> vertexList = viterbi(wordNetAll);
//        System.out.println("最短路：" + (System.currentTimeMillis() - start));
		
		//2.这里加载的是自定义词典
        if (config.useCustomDictionary)
        {
            if (config.indexMode > 0)
                combineByCustomDictionary(vertexList, wordNetAll);
            else combineByCustomDictionary(vertexList);
        }

        if (HanLP.Config.DEBUG)
        {
            System.out.println("粗分结果" + convert(vertexList, false));
        }

        // 数字识别
        if (config.numberQuantifierRecognize)
        {
            mergeNumberQuantifier(vertexList, wordNetAll, config);
        }

        // 实体命名识别
        if (config.ner)
        {
            WordNet wordNetOptimum = new WordNet(sentence, vertexList);
            int preSize = wordNetOptimum.size();
            if (config.nameRecognize)
            {
                PersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll);
            }
            if (config.translatedNameRecognize)
            {
                TranslatedPersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll);
            }
            if (config.japaneseNameRecognize)
            {
                JapanesePersonRecognition.recognition(vertexList, wordNetOptimum, wordNetAll);
            }
            if (config.placeRecognize)
            {
                PlaceRecognition.recognition(vertexList, wordNetOptimum, wordNetAll);
            }
            if (config.organizationRecognize)
            {
                // 层叠隐马模型——生成输出作为下一级隐马输入
                wordNetOptimum.clean();
                vertexList = viterbi(wordNetOptimum);
                wordNetOptimum.clear();
                wordNetOptimum.addAll(vertexList);
                preSize = wordNetOptimum.size();
                OrganizationRecognition.recognition(vertexList, wordNetOptimum, wordNetAll);
            }
            if (wordNetOptimum.size() != preSize)
            {
                vertexList = viterbi(wordNetOptimum);
                if (HanLP.Config.DEBUG)
                {
                    System.out.printf("细分词网：\n%s\n", wordNetOptimum);
                }
            }
        }

        // 如果是索引模式则全切分
        if (config.indexMode > 0)
        {
            return decorateResultForIndexMode(vertexList, wordNetAll);
        }

        // 是否标注词性
        if (config.speechTagging)
        {
            speechTagging(vertexList);
        }

        return convert(vertexList, config.offset);
    }

4.核心词典加载

	//1.由此接口进入
      生成词网
        generateWordNet(wordNetAll);
 /**
     * 生成一元词网
     *
     * @param wordNetStorage
     */
    protected void generateWordNet(final WordNet wordNetStorage)
    {
        final char[] charArray = wordNetStorage.charArray;

        // 核心词典查询，点击进入查看核心词典加载入内存中
        DoubleArrayTrie<CoreDictionary.Attribute>.Searcher searcher = CoreDictionary.trie.getSearcher(charArray, 0);
        while (searcher.next())
        {
            wordNetStorage.add(searcher.begin + 1, new Vertex(new String(charArray, searcher.begin, searcher.length), searcher.value, searcher.index));
        }
        // 强制用户词典查询，是否优先使用用户自定义词典
        if (config.forceCustomDictionary)
        {
            CustomDictionary.parseText(charArray, new AhoCorasickDoubleArrayTrie.IHit<CoreDictionary.Attribute>()
            {
                @Override
                public void hit(int begin, int end, CoreDictionary.Attribute value)
                {
                    wordNetStorage.add(begin + 1, new Vertex(new String(charArray, begin, end - begin), value));
                }
            });
        }
        // 原子分词，保证图连通
        LinkedList<Vertex>[] vertexes = wordNetStorage.getVertexes();
        for (int i = 1; i < vertexes.length; )
        {
            if (vertexes[i].isEmpty())
            {
                int j = i + 1;
                for (; j < vertexes.length - 1; ++j)
                {
                    if (!vertexes[j].isEmpty()) break;
                }
                wordNetStorage.add(i, quickAtomSegment(charArray, i - 1, j - 1));
                i = j;
            }
            else i += vertexes[i].getLast().realWord.length();
        }
    }

//CoreDictionary 核心词典加载---先确定是否有缓存文件，有则先加载缓存文件，无则生成缓存文件同时加载进入内存中，需要注意的是用户自定义词典添加词汇后，没有新生成缓存文件则词汇无法识别，缓存文件在词典文件同一目录下 .bin结尾的文件

//以下是核心词典类CoreDictionary中的一些主要属性******************
//加载词典在内存中的容器
  public static DoubleArrayTrie<Attribute> trie = new DoubleArrayTrie<Attribute>();
  //核心词典位置
    public final static String path = HanLP.Config.CoreDictionaryPath;
    public static final int totalFrequency = 221894;

    // 自动加载词典
    static
    {
        long start = System.currentTimeMillis();
        if (!load(path))//进入load函数
        {
            throw new IllegalArgumentException("核心词典" + path + "加载失败");
        }
        else
        {
            logger.info(path + "加载成功，" + trie.size() + "个词条，耗时" + (System.currentTimeMillis() - start) + "ms");
        }
    }

5.加载词典函数

//load函数中源码中作者已有注释这里不再多写。
    private static boolean load(String path)
    {
        logger.info("核心词典开始加载:" + path);
        //这个函数判断是否有缓存文件 bin 存在，有则直接加载缓存文件
        if (loadDat(path)) return true;
        TreeMap<String, CoreDictionary.Attribute> map = new TreeMap<String, Attribute>();
        BufferedReader br = null;
        try
        {
            br = new BufferedReader(new InputStreamReader(IOUtil.newInputStream(path), "UTF-8"));
            String line;
            int MAX_FREQUENCY = 0;
            long start = System.currentTimeMillis();
            while ((line = br.readLine()) != null)
            {
                String param[] = line.split("\\s");
                int natureCount = (param.length - 1) / 2;
                CoreDictionary.Attribute attribute = new CoreDictionary.Attribute(natureCount);
                for (int i = 0; i < natureCount; ++i)
                {
                    attribute.nature[i] = Nature.create(param[1 + 2 * i]);
                    attribute.frequency[i] = Integer.parseInt(param[2 + 2 * i]);
                    attribute.totalFrequency += attribute.frequency[i];
                }
                map.put(param[0], attribute);
                MAX_FREQUENCY += attribute.totalFrequency;
            }
            logger.info("核心词典读入词条" + map.size() + " 全部频次" + MAX_FREQUENCY + "，耗时" + (System.currentTimeMillis() - start) + "ms");
            br.close();
            trie.build(map);
            logger.info("核心词典加载成功:" + trie.size() + "个词条，下面将写入缓存……");
            try
            {
                DataOutputStream out = new DataOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT));
                Collection<CoreDictionary.Attribute> attributeList = map.values();
                out.writeInt(attributeList.size());
                for (CoreDictionary.Attribute attribute : attributeList)
                {
                    out.writeInt(attribute.totalFrequency);
                    out.writeInt(attribute.nature.length);
                    for (int i = 0; i < attribute.nature.length; ++i)
                    {
                        out.writeInt(attribute.nature[i].ordinal());
                        out.writeInt(attribute.frequency[i]);
                    }
                }
                trie.save(out);
                out.close();
            }
            catch (Exception e)
            {
                logger.warning("保存失败" + e);
                return false;
            }
        }
        catch (FileNotFoundException e)
        {
            logger.warning("核心词典" + path + "不存在！" + e);
            return false;
        }
        catch (IOException e)
        {
            logger.warning("核心词典" + path + "读取错误！" + e);
            return false;
        }

        return true;
    }

6.加载核心词典缓存文件

    /**
     * 从磁盘加载双数组
     *
     * @param path
     * @return
     */
    static boolean loadDat(String path)
    {
        try
        {
    //public final static String BIN_EXT = ".bin"; 这里可以看出加载的是缓存文件
            ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT);

            if (byteArray == null) return false;
            int size = byteArray.nextInt();
            CoreDictionary.Attribute[] attributes = new CoreDictionary.Attribute[size];
            final Nature[] natureIndexArray = Nature.values();
            for (int i = 0; i < size; ++i)
            {
                // 第一个是全部频次，第二个是词性个数
                int currentTotalFrequency = byteArray.nextInt();
                int length = byteArray.nextInt();
                attributes[i] = new CoreDictionary.Attribute(length);
                attributes[i].totalFrequency = currentTotalFrequency;
                for (int j = 0; j < length; ++j)
                {
                    attributes[i].nature[j] = natureIndexArray[byteArray.nextInt()];
                    attributes[i].frequency[j] = byteArray.nextInt();
                }
            }
            if (!trie.load(byteArray, attributes) || byteArray.hasMore()) return false;
        }
        catch (Exception e)
        {
            logger.warning("读取失败，问题发生在" + e);
            return false;
        }
        return true;
    }

上述过程是核心词典的加载过程，用户词典大致相同
此图可以看出bin结尾的就是词典的缓存文件。

在这里插入图片描述

二、自定义词汇添加

过程分析
1.添加新词需要确定无缓存文件，否则无法使用成功，因为词典会优先加载缓存文件
2.再确认缓存文件不在时，打开本地词典按照格式添加自定义词汇。
3.调用分词函数重新生成缓存文件，这时会报一个找不到缓存文件的异常，不用管，因为加载词典进入内存是会优先加载缓存，缓存不在当然会报异常，然后加载词典生成缓存文件，最后处理字符进行分词就会发现新添加的词汇可以进行分词了。

操作过程图解：

1.有缓存文件的情况下

 System.out.println(HanLP.segment("张三丰在一起我也不知道你好一个心眼儿啊，一半天欢迎使用HanLP汉语处理包！" +"接下来请从其他Demo中体验HanLP丰富的功能~"))

//首次编译运行时，HanLP会自动构建词典缓存，请稍候……
//[张/q, 三丰/nz, 在/p, 一起/s, 我/rr, 也/d, 不/d, 知道/v, 你好/vl, 一个心眼儿/nz, 啊/y, ，/w, 一半天/nz, 欢迎/v, 使用/v, HanLP/nx, 汉语/gi, 处理/vn, 包/v, ！/w, 接下来/vl, 请/v, 从/p, 其他/rzv, Demo/nx, 中/f, 体验/v, HanLP/nx, 丰富/a, 的/ude1, 功能/n, ~/nx]

2.打开用户词典–添加 ‘张三丰在一起’ 为一个 nz词性的新词
在这里插入图片描述

2.2 原始缓存文件下运行–会发现不成功，没有把 ‘张三丰在一起’ 分词一个nz词汇

 System.out.println(HanLP.segment("张三丰在一起我也不知道你好一个心眼儿啊，一半天欢迎使用HanLP汉语处理包！" +"接下来请从其他Demo中体验HanLP丰富的功能~"))

//首次编译运行时，HanLP会自动构建词典缓存，请稍候……
//[张/q, 三丰/nz, 在/p, 一起/s, 我/rr, 也/d, 不/d, 知道/v, 你好/vl, 一个心眼儿/nz, 啊/y, ，/w, 一半天/nz, 欢迎/v, 使用/v, HanLP/nx, 汉语/gi, 处理/vn, 包/v, ！/w, 接下来/vl, 请/v, 从/p, 其他/rzv, Demo/nx, 中/f, 体验/v, HanLP/nx, 丰富/a, 的/ude1, 功能/n, ~/nx]

3.1 删除缓存文件 bin
在这里插入图片描述

3.2 再次运行程序，此时会报错—无法找到缓存文件

 System.out.println(HanLP.segment("张三丰在一起我也不知道你好一个心眼儿啊，一半天欢迎使用HanLP汉语处理包！" +"接下来请从其他Demo中体验HanLP丰富的功能~"));

/**首次编译运行时，HanLP会自动构建词典缓存，请稍候……
十月 19, 2018 6:12:49 下午 com.hankcs.hanlp.corpus.io.IOUtil readBytes
WARNING: 读取D:/datacjy/hanlp/data/dictionary/custom/CustomDictionary.txt.bin时发生异常java.io.FileNotFoundException: D:\datacjy\hanlp\data\dictionary\custom\CustomDictionary.txt.bin (系统找不到指定的文件。)   找不到缓存文件


[张三丰在一起/nz, 我/rr, 也/d, 不/d, 知道/v, 你好/vl, 一个心眼儿/nz, 啊/y, ，/w, 一半天/nz, 欢迎/v, 使用/v, HanLP/nx, 汉语/gi, 处理/vn, 包/v, ！/w, 接下来/vl, 请/v, 从/p, 其他/rzv, Demo/nx, 中/f, 体验/v, HanLP/nx, 丰富/a, 的/ude1, 功能/n, ~/nx]

*/