Chinese Word Segmentation with CRF++

liu_zhlai

Article link: https://blog.csdn.net/liu_zhlai/article/details/52335527

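An earlier post covered my understanding of the basics of using CRFs for sequence labeling; this post records the basic steps for applying a CRF to Chinese word segmentation. The CRF implementation used here is CRF++, which is currently in wide use, and the segmentation corpus is the People's Daily news corpus from January 1998, annotated by the natural language processing lab at Peking University. The steps are as follows: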

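1. Installing CRF++

Official CRF++ site: http://crfpp.sourceforge.net/

I use Ubuntu, so I downloaded the source package CRF++-0.54.tar.gz from http://sourceforge.net/projects/crfpp/files/

After downloading the source, unpack it, then build and install. The build may print some warnings, which can be ignored. Some versions may fail with a missing-header error because the source includes a Windows-only header; commenting out that include fixes it.

    ./configure
    make
    sudo make install

If these commands succeed, executables such as crf_learn and crf_test will be in the build directory.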

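2. Testing and trying it out

The source package contains examples; run ./exec.sh in one of them to try CRF++ out. Each example has these files:

exec.sh #training and test script
template #template file
test.data #test file
train.data #training file

Open them to see what they contain.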

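3. Preparing the corpus and writing the template

I use the 4-tag scheme. Some blog posts have compared the 4-tag and 6-tag schemes: 6-tag is slightly more accurate, but training takes much longer. Since I ran the experiments for this post on my own laptop, which has limited performance, I chose the simpler 4-tag scheme. By the same reasoning, given enough training data, a more complex template with more features should give higher accuracy; interested readers can try more complex tag sets and templates.

The 4 tags are:

S: a single-character word; B: the first character of a word; E: the last character of a word; M: a character in the middle of a word

Taking "中华人民共和国" and a few shorter words as examples, the tagged results look like this:

One-character word:
和 S
Two-character word (the CRF data actually has one character per line; I put them on one line here for layout):
中 B 国 E
Three-character word:
河 B 北 M 省 E

Words of four or more characters:

中 B 华 M 人 M 民 M 共 M 和 M 国 E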
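The tagging rule above can be written as a small function. This is a minimal Python 3 sketch; the name tag_word is hypothetical and not part of CRF++ or of the scripts in this post.

```python
def tag_word(word):
    """Return (character, tag) pairs for one word under the 4-tag scheme."""
    if not word:
        return []
    if len(word) == 1:
        return [(word, "S")]          # single-character word
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


# Print the four example words with their tags, one word per line.
for word in ["和", "中国", "河北省", "中华人民共和国"]:
    print(" ".join(f"{ch}/{tag}" for ch, tag in tag_word(word)))
```

A two-character word has no middle characters, so it gets only B and E; every word of three or more characters gets one M per inner character.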

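4. Training data

The annotated People's Daily corpus can be downloaded from: http://www.cnblogs.com/eaglet/archive/2007/09/10/888377.html

The corpus is annotated with word boundaries and part-of-speech tags, in the following format:

19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n ——/w 一九九八年/t 新年/t 讲话/n （/w 附/v 图片/n １/m 张/q ）/w

The first token is the date and ID of the news article; it is dropped in the processing that follows.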
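To make the corpus format concrete, here is a minimal Python 3 sketch that splits one annotated line into (word, part-of-speech) pairs. The function name parse_line is hypothetical. It assumes tokens have the form word/pos separated by whitespace, and that compound entities are wrapped as [w1/p1 w2/p2]tag, with the bracket and its trailing tag discarded.

```python
def parse_line(line):
    """Split one annotated corpus line into (word, pos) pairs."""
    pairs = []
    for token in line.split():
        token = token.lstrip("[")          # opening bracket of a compound entity
        token = token.split("]", 1)[0]     # closing bracket and its trailing tag
        if "/" not in token:
            continue
        word, pos = token.rsplit("/", 1)
        if word:
            pairs.append((word, pos))
    return pairs


sample = "19980101-01-001-001/m 迈向/v 充满/v 希望/n 的/u 新/a 世纪/n"
for word, pos in parse_line(sample):
    print(word, pos)
```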

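5. Processing the corpus

The processing script is below. It writes 90% of the corpus to train.data as training data and the remaining 10% to test.data as test data. For the later evaluation, the test data is also saved with its reference tags to test_rel.data.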

# -*- coding: utf-8 -*-
# Split the People's Daily corpus into CRF++ training and test files (Python 3).

home_dir = "./"


def write_tagged(out_file, word):
    """Write one character per line with its 4-tag label (S/B/M/E)."""
    if len(word) == 1:
        out_file.write(word + "\tS\n")
        return
    out_file.write(word[0] + "\tB\n")
    for ch in word[1:-1]:
        out_file.write(ch + "\tM\n")
    out_file.write(word[-1] + "\tE\n")


def convert_tag():
    with open(home_dir + "people-daily.txt", encoding="utf-8") as src_file, \
         open(home_dir + "test_rel.data", "w", encoding="utf-8") as test_real_file, \
         open(home_dir + "test.data", "w", encoding="utf-8") as test_file, \
         open(home_dir + "train.data", "w", encoding="utf-8") as train_file:
        i = 0
        for line in src_file:
            line = line.strip("\r\n\t ")
            if line == "":
                continue
            i += 1
            # Every 10th non-empty line goes to the test set, the rest to training.
            is_test = (i % 10 == 0)

            for term in line.split(" "):
                term = term.strip("\t ")
                # Strip the [ ... ] wrapper around compound entities.
                i1 = term.find("[")
                if i1 >= 0 and len(term) > i1 + 1:
                    term = term[i1 + 1:]
                i2 = term.find("]")
                if i2 > 0:
                    term = term[:i2]
                if "/" not in term:
                    continue

                word, pos = term.rsplit("/", 1)
                # Skip tokens tagged /m. This drops the leading date/ID token,
                # and also every other numeral tagged /m in the corpus.
                if pos == "m" or not word:
                    continue

                if is_test:
                    # test.data: every character gets the placeholder tag B.
                    for ch in word:
                        test_file.write(ch + "\tB\n")
                    # test_rel.data: the reference tags.
                    write_tagged(test_real_file, word)
                else:
                    write_tagged(train_file, word)

            # A blank line separates sentences.
            if is_test:
                test_file.write("\n")
                test_real_file.write("\n")
            else:
                train_file.write("\n")
    print(i)


if __name__ == "__main__":
    convert_tag()

6. Training the model

The template used in this post is:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]

# Bigram
B
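To illustrate what the template lines mean: in CRF++, %x[row,col] refers to the value in column col of the token at relative position row from the current token. The Python 3 sketch below expands template lines for one position in a sentence. It is only an illustration, not CRF++ code; the function name expand is hypothetical, and the boundary placeholders (_B-1, _B+1, and so on) follow CRF++'s behavior as I understand it, so treat them as an assumption.

```python
import re

# Matches a CRF++ macro such as %x[-1,0].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, rows, pos):
    """Expand one template line at position pos.

    rows is a list of tokens, each a list of column values.
    """
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if idx < 0:
            return f"_B{idx}"                      # before the sentence start
        if idx >= len(rows):
            return f"_B+{idx - len(rows) + 1}"     # past the sentence end
        return rows[idx][col]
    return MACRO.sub(lookup, template_line)


rows = [[ch] for ch in "中国人民"]   # one column per token: the character
print(expand("U02:%x[0,0]", rows, 1))
print(expand("U08:%x[-1,0]/%x[0,0]", rows, 1))
print(expand("U05:%x[-2,0]/%x[-1,0]/%x[0,0]", rows, 1))
```

Each expanded string, combined with an output tag, becomes one unigram feature. The single B line in the template asks CRF++ to also generate bigram features over the previous and current output tags.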

Train the model:

../../crf_learn -c 10.0 template train.data seg_model

On my machine, training took 3479.77 s and ran for 464 iterations.

7. Testing

../../crf_test -m seg_model test.data > test.res

This step is very fast and finishes within seconds.

The tagged output has the following format:

１	B	B
９	B	M
９	B	M
８	B	M
年	B	E
，	B	S
中	B	B
国	B	E
人	B	B
民	B	E

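The first column is the character, the second column is the placeholder tag that test.data carried before tagging, and the last column is the tag assigned by the CRF.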
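To turn tagged output like this back into segmented text, the characters can be joined until an E or S tag closes a word. The following is a minimal Python 3 sketch; the names read_rows and tags_to_words are hypothetical, and it assumes tab-separated columns with the predicted tag in the last column.

```python
def read_rows(lines):
    """Yield (character, predicted_tag) from tab-separated output lines."""
    for line in lines:
        cols = line.rstrip("\r\n").split("\t")
        if len(cols) >= 2:            # skips blank sentence separators
            yield cols[0], cols[-1]


def tags_to_words(rows):
    """Join characters into words; a word ends at an E or S tag."""
    words, current = [], ""
    for ch, tag in rows:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:                       # trailing characters with no closing tag
        words.append(current)
    return words


lines = ["１\tB\tB", "９\tB\tM", "９\tB\tM", "８\tB\tM", "年\tB\tE",
         "，\tB\tS", "中\tB\tB", "国\tB\tE", "人\tB\tB", "民\tB\tE"]
print(" ".join(tags_to_words(read_rows(lines))))
# prints: １９９８年 ， 中国 人民
```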

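8. Evaluation

Before evaluating, I ran into a problem: the tagged output test.res has a different number of lines from test_rel.data. My guess is that CRF++ drops consecutive blank lines during tagging; confirming the exact cause would require reading the code. To make evaluation easier, I first removed the blank lines from both files with these commands:

grep -v '^$' test_rel.data > test_rel.data.new
grep -v '^$' test.res   > test.res.new

The evaluation script is as follows: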

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Word-level evaluation of CRF++ segmentation output (Python 3).
# Usage: python evaluate.py test.res.new test_rel.data.new

import sys

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as flag_file:
        flag_lines = flag_file.readlines()
    with open(sys.argv[2], encoding="utf-8") as real_file:
        real_lines = real_file.readlines()

    if len(flag_lines) != len(real_lines):
        print("flag_lines != real_lines")
        sys.exit(1)

    wc_of_test = 0      # words in the CRF output
    wc_of_real = 0      # words in the reference segmentation
    wc_of_correct = 0   # words in the CRF output that match the reference

    # True while every tag since the last predicted word boundary matches.
    flag = True

    for flag_line, real_line in zip(flag_lines, real_lines):
        flag_line = flag_line.rstrip("\r\n")
        real_line = real_line.rstrip("\r\n")

        # Skip blank lines (sentence separators), if any remain.
        if flag_line == "":
            continue

        _, _, g = flag_line.split("\t")
        _, r = real_line.split("\t")

        if r != g:
            flag = False

        if g in ("E", "S"):
            wc_of_test += 1
            if flag:
                wc_of_correct += 1
            flag = True

        if r in ("E", "S"):
            wc_of_real += 1

    print("WordCount from test result:", wc_of_test)
    print("WordCount from real data:", wc_of_real)
    print("WordCount of correct segs :", wc_of_correct)

    # Precision: correct words / words in the CRF output
    P = wc_of_correct / float(wc_of_test)
    # Recall: correct words / words in the reference
    R = wc_of_correct / float(wc_of_real)

    print("P = %f, R = %f, F-score = %f" % (P, R, (2 * P * R) / (P + R)))

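Test results:

sudo python evaluate.py  test.res.new test_rel.data.new 
WordCount from test result: 109243
WordCount from real data: 109740
WordCount of correct segs : 105781
P = 0.963924, R = 0.968309, F-score = 0.966112

In this logged output the P and R labels are swapped relative to the usual definitions: precision (correct words / words in the CRF output) is 0.968309 and recall (correct words / words in the reference) is 0.963924. The F-score is unaffected.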
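As a quick check of the arithmetic, the three word counts above are enough to recompute the metrics. This is a minimal Python 3 sketch using the standard definitions; the function name prf is hypothetical.

```python
def prf(correct, predicted, reference):
    """Return (precision, recall, F1) from word counts."""
    precision = correct / predicted    # share of predicted words that are right
    recall = correct / reference       # share of reference words that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = prf(correct=105781, predicted=109243, reference=109740)
print(f"precision = {p:.6f}, recall = {r:.6f}, F1 = {f:.6f}")
# prints: precision = 0.968309, recall = 0.963924, F1 = 0.966112
```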
