An Analysis of How HanLP Segments Words

天下无敌笨笨熊

Article link: https://blog.csdn.net/tlxamulet/article/details/130582486

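HanLP is a Chinese NLP library that provides word segmentation, pinyin conversion, summarization, and many other practical features. This article looks only at its word segmentation.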

How segmentation works

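First, a coarse segmentation is made against the core dictionary (CoreNatureDictionary.txt). For example, "话统计算" is expanded into the following word net, which lists every dictionary word starting at each position: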

[[ ], [话], [统, 统计], [计, 计算], [算], [ ]]

This step is similar to jieba's full-mode segmentation.
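To make the word net concrete, here is a minimal sketch of the idea. It is not HanLP's code: the class name `WordNetSketch`, the `build` method, and the two-entry dictionary are hypothetical, a plain `HashSet` stands in for the double-array trie, and the empty begin/end sentinel rows shown above are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordNetSketch {
    // For each start offset, collect every dictionary word that begins there.
    // The single character is always kept so that the graph stays connected.
    static List<List<String>> build(String text, Set<String> dict, int maxLen) {
        List<List<String>> net = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            List<String> row = new ArrayList<>();
            row.add(text.substring(i, i + 1));
            for (int len = 2; len <= maxLen && i + len <= text.length(); len++) {
                String candidate = text.substring(i, i + len);
                if (dict.contains(candidate)) row.add(candidate);
            }
            net.add(row);
        }
        return net;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("统计", "计算"));
        System.out.println(build("话统计算", dict, 4));
        // prints [[话], [统, 统计], [计, 计算], [算]]
    }
}
```

Every row is one start offset, so overlapping candidates such as 统计 and 计算 coexist in the net; choosing between them is left to the shortest-path step.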

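Then the shortest path through this word net is computed with the help of the bigram dictionary (CoreNatureDictionary.ngram.mini.txt), which gives the coarse segmentation result: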

[ , 话, 统计, 算, ]

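Next comes the recognition of person names, place names, and transliterated names.

Here is the code for the coarse segmentation step: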

protected void GenerateWordNet(final WordNet wordNetStorage)
    {
        final char[] charArray = wordNetStorage.charArray;
        // Look up the core dictionary
        DoubleArrayTrie<CoreDictionary.Attribute>.Searcher searcher = CoreDictionary.trie.getSearcher(charArray, 0);
        while (searcher.next())
        {
            wordNetStorage.add(searcher.begin + 1, new Vertex(new String(charArray, searcher.begin, searcher.length), searcher.value, searcher.index));
        }
        // Look up the user dictionary
        //        if (config.useCustomDictionary)
        //        {
        //            searcher = CustomDictionary.dat.getSearcher(charArray, 0);
        //            while (searcher.next())
        //            {
        //                wordNetStorage.add(searcher.begin + 1, new Vertex(new String(charArray, searcher.begin, searcher.length), searcher.value));
        //            }
        //        }
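        // Atomic segmentation, to keep the graph connected
        // If the input contains English letters or digits, vertexes will contain empty slots; in that case the English letters are merged into one word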
        LinkedList<Vertex>[] vertexes = wordNetStorage.getVertexes();
        for (int i = 1; i < vertexes.length; )
        {
            if (vertexes[i].isEmpty())
            {
                // Find the extent of the English + digit run
                int j = i + 1;
                for (; j < vertexes.length - 1; ++j)
                {
                    if (!vertexes[j].isEmpty()) break;
                }
                // Split the English + digit run with quickAtomSegment (mainly to merge letters into English words)
                wordNetStorage.add(i, quickAtomSegment(charArray, i - 1, j - 1));
                i = j;
            }
            else i += vertexes[i].getLast().realWord.length();
        }
    }

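After the coarse segmentation, the segmentation along the shortest path (which can also be read as the maximum-probability path) is computed: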

private static List<Vertex> viterbi(WordNet wordNet)
    {
        // Avoid creating objects, for speed
        LinkedList<Vertex> nodes[] = wordNet.getVertexes();
        LinkedList<Vertex> vertexList = new LinkedList<Vertex>();
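        // Each Vertex holds its source Vertex and a weight; updateFrom keeps the smallest weight, which ultimately yields a shortest path. Weights are updated forward, from the first Vertex to the last.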
        for (Vertex node : nodes[1])
        {
            node.updateFrom(nodes[0].getFirst());
        }
        for (int i = 1; i < nodes.length - 1; ++i)
        {
            LinkedList<Vertex> nodeArray = nodes[i];
            if (nodeArray == null) continue;
            for (Vertex node : nodeArray)
            {
                if (node.from == null) continue;
                for (Vertex to : nodes[i + node.realWord.length()])
                {
                    to.updateFrom(node);
                }
            }
        }
        // The shortest path is read off backwards; addFirst behaves like a last-in-first-out stack, so vertexList still ends up in forward order
        Vertex from = nodes[nodes.length - 1].getFirst();
        while (from != null)
        {
            vertexList.addFirst(from);
            from = from.from;
        }
        return vertexList;
    }

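The function name viterbi is debatable here; it would be more accurate to call it a minimum-weight (maximum-probability) path algorithm.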
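The forward relaxation and backward trace can be sketched on the toy word net for "话统计算". This is a simplified illustration under stated assumptions, not HanLP's implementation: `ShortestPathSketch`, `Node`, and `relax` are hypothetical names, and the transition costs in `main` are made up rather than derived from real frequencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

class ShortestPathSketch {
    static final class Node {
        final String word;
        double weight; // weight of the best path found so far that ends at this node
        Node from;     // predecessor on that best path
        Node(String word) { this.word = word; }
    }

    // rows.get(i) lists the candidate words that start at offset i.
    static List<String> shortest(List<List<String>> rows,
                                 ToDoubleBiFunction<String, String> cost) {
        int n = rows.size();
        List<List<Node>> nodes = new ArrayList<>();
        for (List<String> row : rows) {
            List<Node> r = new ArrayList<>();
            for (String w : row) r.add(new Node(w));
            nodes.add(r);
        }
        Node begin = new Node("<s>");
        Node end = new Node("</s>");
        nodes.add(new ArrayList<>(Arrays.asList(end))); // end sentinel at offset n

        // Forward pass: relax every edge, keeping the smallest weight per node.
        for (Node first : nodes.get(0)) relax(begin, first, cost);
        for (int i = 0; i < n; i++) {
            for (Node node : nodes.get(i)) {
                if (node.from == null) continue; // not reachable from the start
                for (Node to : nodes.get(i + node.word.length())) relax(node, to, cost);
            }
        }
        // Backward pass: follow the predecessor links; addFirst restores forward order.
        LinkedList<String> path = new LinkedList<>();
        for (Node cur = end.from; cur != null && cur != begin; cur = cur.from) {
            path.addFirst(cur.word);
        }
        return path;
    }

    static void relax(Node from, Node to, ToDoubleBiFunction<String, String> cost) {
        double w = from.weight + cost.applyAsDouble(from.word, to.word);
        if (to.from == null || to.weight > w) {
            to.from = from;
            to.weight = w;
        }
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("话"), Arrays.asList("统", "统计"),
            Arrays.asList("计", "计算"), Arrays.asList("算"));
        // Made-up costs: every transition costs 1.0, except 统计 -> 算, which is cheaper.
        System.out.println(shortest(rows,
            (a, b) -> a.equals("统计") && b.equals("算") ? 0.5 : 1.0));
        // prints [话, 统计, 算]
    }
}
```

Because the weight of an edge depends on both the source word and the target word, the best weight is stored per node rather than per character offset, which mirrors the role of `from` and `weight` on each Vertex.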

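Vertex.updateFrom computes the weight. Since the weight determines which path has the maximum probability, this method is crucial. Here is its implementation: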

public void updateFrom(Vertex from)
    {
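        // Because the weight of the source Vertex is included, this weight already covers the whole path
        // MathTools.calculateWeight computes only the weight of the edge from the source Vertex to this one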
        double weight = from.weight + MathTools.calculateWeight(from, this);
        // If there is no source Vertex yet, set it; otherwise keep whichever source gives the smaller weight
        if (this.from == null || this.weight > weight)
        {
            this.from = from;
            this.weight = weight;
        }
    }

MathTools.calculateWeight is implemented as follows:

public static double calculateWeight(Vertex from, Vertex to)
    {
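        // Frequency of the source word, i.e. the number stored in CoreNatureDictionary.txt
        int frequency = from.getAttribute().totalFrequency;
        if (frequency == 0)
        {
            frequency = 1;  // avoid division by zero
        }

        // Bigram frequency, looked up from CoreNatureDictionary.ngram.txt
        int nTwoWordsFreq = CoreBiGramTableDictionary.getBiFrequency(from.wordID, to.wordID);
        // Combine the source word frequency and the bigram frequency into a weight. Note the negative log: the log avoids floating-point underflow, and the negation turns "maximum probability" into its mirror image, "shortest path"
        double value = -Math.log(dSmoothingPara * frequency / (MAX_FREQUENCY) + (1 - dSmoothingPara) * ((1 - dTemp) * nTwoWordsFreq / frequency + dTemp));
        // Personally I think this sign flip is unnecessary; still to be analyzed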
        if (value < 0.0)
        {
            value = -value;
        }
        //        logger.info(String.format("%5s frequency:%6d, %s nTwoWordsFreq:%3d, weight:%.2f", from.word, frequency, from.word + "@" + to.word, nTwoWordsFreq, value));
        return value;
    }

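The weight is computed as:
$weight=-\log\left(\alpha*\frac{uni\_freq}{total}+(1-\alpha)*\left((1-\delta)*\frac{bi\_freq}{uni\_freq}+\delta\right)\right)$
where $\alpha$ is dSmoothingPara, $\delta$ is dTemp, $total$ is MAX_FREQUENCY, $uni\_freq$ is the frequency of the source word, and $bi\_freq$ is the bigram frequency of the two words. From this formula it is easy to see that: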

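Once the bigram frequency changes from 0 to a positive number, that is, once we state that two words are likely to appear next to each other, the weight drops markedly, which means this segmentation path is more likely to be chosen.

We can therefore use bigram frequencies to resolve some segmentation ambiguities.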
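A small numeric check makes the effect visible. The sketch below re-implements the formula in isolation; `WeightSketch` is a hypothetical class, and the three constant values are assumptions chosen for illustration (they do not come from this article, so check the HanLP source for the values it actually uses).

```java
class WeightSketch {
    // Assumed values, for illustration only.
    static final double ALPHA = 0.1;                   // plays the role of dSmoothingPara
    static final double TOTAL = 25_146_057;            // plays the role of MAX_FREQUENCY
    static final double DELTA = 1.0 / TOTAL + 0.00001; // plays the role of dTemp

    // Same shape as calculateWeight: negative log of the smoothed bigram probability.
    static double weight(int uniFreq, int biFreq) {
        double f = Math.max(uniFreq, 1); // avoid division by zero
        double p = ALPHA * f / TOTAL + (1 - ALPHA) * ((1 - DELTA) * biFreq / f + DELTA);
        return Math.abs(-Math.log(p));
    }

    public static void main(String[] args) {
        // Same source word frequency, with and without a bigram entry.
        System.out.printf("bi_freq = 0:  %.2f%n", weight(1000, 0));
        System.out.printf("bi_freq = 10: %.2f%n", weight(1000, 10));
    }
}
```

With these assumed constants, the second weight comes out far smaller than the first: when `biFreq` is 0 the probability is dominated by the tiny smoothing terms, so even a modest bigram count changes the result by several units of log probability.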

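This stage is similar to jieba's precise mode: both find the most likely segmentation path by maximizing probability. The difference is that jieba only considers the frequency of individual words (the "unigram frequency") when computing the probability of each DAG node, whereas HanLP combines the unigram and bigram frequencies.

Once the shortest-path segmentation is available, its words are merged using the custom dictionary: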

protected static List<Vertex> combineByCustomDictionary(List<Vertex> vertexList)
    {
        // wordNet here is the coarse segmentation result, which is the basis for merging with the custom dictionary
        Vertex[] wordNet = new Vertex[vertexList.size()];
        vertexList.toArray(wordNet);
        // Merge using the DAT (double-array trie)
        DoubleArrayTrie<CoreDictionary.Attribute> dat = CustomDictionary.dat;
        for (int i = 0; i < wordNet.length; ++i)
        {
            // The algorithm below checks whether adjacent words of the coarse result can be combined into a longer word
            int state = 1;
            state = dat.transition(wordNet[i].realWord, state);
            if (state > 0)
            {
                int start = i;
                int to = i + 1;
                int end = to;
                CoreDictionary.Attribute value = dat.output(state);
                for (; to < wordNet.length; ++to)
                {
                    state = dat.transition(wordNet[to].realWord, state);
                    if (state < 0) break;
                    CoreDictionary.Attribute output = dat.output(state);
                    if (output != null)
                    {
                        value = output;
                        end = to + 1;
                    }
                }
                // Assemble the "atomic words" to be combined into one larger word
                if (value != null)
                {
                    StringBuilder sbTerm = new StringBuilder();
                    for (int j = start; j < end; ++j)
                    {
                        sbTerm.append(wordNet[j]);
                        wordNet[j] = null;
                    }
                    wordNet[i] = new Vertex(sbTerm.toString(), value);
                    i = end - 1;
                }
            }
        }
        // Merge using the BinTrie; the algorithm is the same as above
        if (CustomDictionary.trie != null)
        {
            for (int i = 0; i < wordNet.length; ++i)
            {
                if (wordNet[i] == null) continue;
                BaseNode<CoreDictionary.Attribute> state = CustomDictionary.trie.transition(wordNet[i].realWord.toCharArray(), 0);
                if (state != null)
                {
                    int start = i;
                    int to = i + 1;
                    int end = to;
                    CoreDictionary.Attribute value = state.getValue();
                    for (; to < wordNet.length; ++to)
                    {
                        if (wordNet[to] == null) continue;
                        state = state.transition(wordNet[to].realWord.toCharArray(), 0);
                        if (state == null) break;
                        if (state.getValue() != null)
                        {
                            value = state.getValue();
                            end = to + 1;
                        }
                    }
                    if (value != null)
                    {
                        StringBuilder sbTerm = new StringBuilder();
                        for (int j = start; j < end; ++j)
                        {
                            if (wordNet[j] == null) continue;
                            sbTerm.append(wordNet[j]);
                            wordNet[j] = null;
                        }
                        wordNet[i] = new Vertex(sbTerm.toString(), value);
                        i = end - 1;
                    }
                }
            }
        }
        vertexList.clear();
        for (Vertex vertex : wordNet)
        {
            if (vertex != null) vertexList.add(vertex);
        }
        return vertexList;
    }
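Stripped of the trie details, the merging logic above amounts to a greedy longest match over adjacent tokens. The sketch below is a simplified illustration of that idea and not HanLP's code: `MergeSketch` and its `merge` method are hypothetical, and a `HashSet` replaces the DAT and BinTrie, so there is no early exit when no dictionary entry can continue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MergeSketch {
    // Merge runs of adjacent tokens whose concatenation is a custom dictionary word,
    // preferring the longest run. Existing tokens are never split.
    static List<String> merge(List<String> tokens, Set<String> customDict) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            StringBuilder sb = new StringBuilder(tokens.get(i));
            String best = null;
            int bestEnd = i + 1;
            for (int j = i + 1; j < tokens.size(); j++) {
                sb.append(tokens.get(j));
                if (customDict.contains(sb.toString())) {
                    best = sb.toString(); // remember the longest match so far
                    bestEnd = j + 1;
                }
            }
            out.add(best != null ? best : tokens.get(i));
            i = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> coarse = Arrays.asList("话", "统计", "算");
        // 统计 + 算 can be glued together into a custom word.
        System.out.println(merge(coarse, new HashSet<>(Arrays.asList("统计算"))));
        // prints [话, 统计算]
        // 话统 would require splitting 统计, so nothing changes.
        System.out.println(merge(coarse, new HashSet<>(Arrays.asList("话统"))));
        // prints [话, 统计, 算]
    }
}
```

The second call shows the limitation: a custom word whose boundary falls inside an existing token can never be produced at this stage.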

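To sum up: the author originally consulted the user dictionary during coarse segmentation but later commented that code out, so coarse segmentation candidates can only come from the core dictionary. In practice, coarse segmentation is critical. Once it goes wrong, it is hard to correct afterwards because, as we have seen, custom dictionary merging only assembles larger words on top of the coarse result and never re-splits words that have already been segmented. My own understanding is therefore that atomic new words (new words that cannot be split further) should be added to the core dictionary, while compound new words should go into the user dictionary. Compound new words can of course also be put into the core dictionary, but they must be given a high enough frequency, otherwise they may still not be segmented out. The bigram dictionary, in turn, can be used to resolve segmentation ambiguities.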

Blog of HanLP's author: http://www.hankcs.com/nlp/segment/the-word-graph-is-generated.html