Using the CRF Conditional Random Field Toolkit for Chinese Word Segmentation

weixin_39785081

Tags: conditional random field, python

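I previously tried Chinese word segmentation with maxent and the results were not especially good, so I ran the same experiment with a CRF.

First, a brief introduction to what a CRF (conditional random field) is:

Introduction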

Conditional random fields (CRFs) are a probabilistic framework for labeling and segmenting structured data, such as sequences, trees and lattices. The underlying idea is that of defining a conditional probability distribution over label sequences given a particular observation sequence, rather than a joint distribution over both label and observation sequences. The primary advantage of CRFs over hidden Markov models is their conditional nature, resulting in the relaxation of the independence assumptions required by HMMs in order to ensure tractable inference. Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum entropy Markov models (MEMMs) and other conditional Markov models based on directed graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world tasks in many fields, including bioinformatics, computational linguistics and speech recognition.

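The passage above compares CRFs with MEMMs (maximum entropy Markov models) and HMMs (hidden Markov models) and describes the advantages CRFs have over both. A linear-chain CRF can be seen as a discriminative counterpart of the HMM: instead of the joint distribution of labels and observations, it models the conditional distribution of the label sequence given the observation sequence.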
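For reference, the conditional distribution defined by a linear-chain CRF can be written in its standard textbook form. The notation below (feature functions $f_k$, weights $\lambda_k$, normalizer $Z(x)$) is the usual one and is not taken from the original post:

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because $Z(x)$ normalizes over whole label sequences rather than at each position, the model avoids the label bias problem mentioned in the introduction.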

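Installing CRF++ is much like installing maxent: ./configure, make, sudo su, make install. CRF++ provides bindings for several languages, including Java, Python, and Perl. I use the Python bindings, so they have to be installed as well: python setup.py build, then (sudo) python setup.py install. Once that is done, open a Python shell and run import CRFPP; if the import succeeds, the installation worked.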
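The import check can also be scripted. The following is a minimal sketch; the helper name crfpp_available is made up for illustration, and the only thing it relies on is the module name CRFPP used in this post:

```python
def crfpp_available():
    """Return True if the CRF++ Python bindings can be imported."""
    try:
        import CRFPP  # noqa: F401  (module name used throughout this post)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if crfpp_available():
        print("CRFPP imported successfully")
    else:
        print("CRFPP not found: install the CRF++ Python bindings first")
```

Running this before the segmentation script gives a clearer message than a bare ImportError traceback.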

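The example directory of CRF++ contains a seg directory with a Japanese word segmentation example. Japanese, like Chinese, is written without spaces between words, so this example's feature template is a good fit for training a Chinese segmentation model.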

The training set is again icwb2-data, which I used before. To turn it into data suitable for training (4-tag character labeling), I use a Python script, make_crf_train_data.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# make_crf_train_data.py
# Convert a segmented corpus into the training format CRF++ expects.
# Usage: python make_crf_train_data.py input_file output_file
# 4 tags for character tagging: B(Begin), E(End), M(Middle), S(Single)

import codecs
import sys


def character_tagging(input_file, output_file):
    input_data = codecs.open(input_file, 'r', 'utf-8')
    output_data = codecs.open(output_file, 'w', 'utf-8')
    # Iterate lazily so the whole corpus is not held in memory.
    for line in input_data:
        word_list = line.strip().split()
        for word in word_list:
            if len(word) == 1:
                output_data.write(word + "\tS\n")
            else:
                output_data.write(word[0] + "\tB\n")
                for w in word[1:-1]:
                    output_data.write(w + "\tM\n")
                output_data.write(word[-1] + "\tE\n")
        # A blank line marks the end of a sentence for CRF++.
        output_data.write("\n")
    input_data.close()
    output_data.close()


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: python make_crf_train_data.py input output")
        sys.exit(1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    character_tagging(input_file, output_file)

Running "python make_crf_train_data.py ./icwb2-data/training/msr_training.utf8 msr_training.tagging4crf.utf8" produces msr_training.tagging4crf.utf8, a training file in the format CRF++ requires.
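To make the 4-tag scheme concrete, here is a small in-memory sketch of the same tagging logic applied to one sentence. The function names tag_word and tag_sentence and the sample sentence are illustrative only and do not appear in the original script:

```python
def tag_word(word):
    """Return (character, tag) pairs for one word under the B/M/E/S scheme."""
    if len(word) == 1:
        return [(word, "S")]
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


def tag_sentence(sentence):
    """Tag a whitespace-segmented sentence character by character."""
    pairs = []
    for word in sentence.split():
        pairs.extend(tag_word(word))
    return pairs


if __name__ == "__main__":
    # Prints one "character<TAB>tag" line per character, as in the training file.
    for ch, tag in tag_sentence(u"中文 分词 很 有意思"):
        print(ch + "\t" + tag)
```

A two-character word such as 中文 gets B and E, a single character such as 很 gets S, and only words of three or more characters produce M tags.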

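With the training corpus ready, the model can be trained with crf_learn, the training tool that ships with CRF++. Change to the CRF++0.58 directory and run:

crf_learn -f 3 -c 1.5 ./example/seg/template ../msr_training.tagging4crf.utf8 crf_model

This produces crf_model.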
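In this command, -f 3 tells crf_learn to use only features that occur at least 3 times, and -c 1.5 sets the hyperparameter that controls how closely the model fits the training data. If you prefer to drive training from Python, the command can be assembled as below. This is only a sketch: build_crf_learn_cmd is a hypothetical helper, and the paths are the ones used in this post.

```python
import subprocess


def build_crf_learn_cmd(template, train_file, model_file, min_freq=3, cost=1.5):
    """Assemble the crf_learn argument list used in this post."""
    return [
        "crf_learn",
        "-f", str(min_freq),  # ignore features seen fewer than min_freq times
        "-c", str(cost),      # larger values fit the training data more closely
        template,
        train_file,
        model_file,
    ]


if __name__ == "__main__":
    cmd = build_crf_learn_cmd(
        "./example/seg/template",
        "../msr_training.tagging4crf.utf8",
        "crf_model",
    )
    print(" ".join(cmd))
    # Uncomment to actually train (requires crf_learn on PATH):
    # subprocess.check_call(cmd)
```

Raising min_freq shrinks the feature set, which is one lever to try when training runs out of memory.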

Finally, run "python crf_segmenter.py crf_model blog_test.utf8 blog_test_segment.utf8".

crf_segmenter.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# crf_segmenter.py
# Usage: python crf_segmenter.py crf_model input_file output_file
# Segment the input text with the Python bindings that ship with CRF++.
# Written for Python 2: strings are passed to the tagger as UTF-8 bytes.

import codecs
import sys

import CRFPP


def crf_segmenter(input_file, output_file, tagger):
    input_data = codecs.open(input_file, 'r', 'utf-8')
    output_data = codecs.open(output_file, 'w', 'utf-8')
    for line in input_data:
        tagger.clear()
        # Feed the line to the tagger one character at a time.
        for word in line.strip():
            word = word.strip()
            if word:
                tagger.add((word + "\to\tB").encode('utf-8'))
        tagger.parse()
        size = tagger.size()
        xsize = tagger.xsize()
        # Write each character with spacing determined by its predicted tag.
        for i in range(0, size):
            for j in range(0, xsize):
                char = tagger.x(i, j).decode('utf-8')
                tag = tagger.y2(i)
                if tag == 'B':
                    output_data.write(' ' + char)
                elif tag == 'M':
                    output_data.write(char)
                elif tag == 'E':
                    output_data.write(char + ' ')
                else:  # tag == 'S'
                    output_data.write(' ' + char + ' ')
        output_data.write('\n')
    input_data.close()
    output_data.close()


if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: python crf_segmenter.py model input output")
        sys.exit(1)
    crf_model = sys.argv[1]
    input_file = sys.argv[2]
    output_file = sys.argv[3]
    tagger = CRFPP.Tagger("-m " + crf_model)
    crf_segmenter(input_file, output_file, tagger)
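The script writes spaces around characters directly, which can leave leading or doubled spaces in the output. As an alternative sketch (not the author's method), the predicted tags can first be grouped into a list of words and then joined. The function tags_to_words is a hypothetical helper that works on plain Python lists, so it does not need CRFPP:

```python
def tags_to_words(chars, tags):
    """Group characters into words according to their B/M/E/S tags."""
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            # A new word starts, so flush whatever was being built.
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            # The word is complete.
            words.append(current)
            current = ""
    if current:
        # Sequence ended without a closing E tag; keep the partial word.
        words.append(current)
    return words


if __name__ == "__main__":
    chars = list(u"中文分词很有意思")
    tags = list("BEBESBME")
    print(" ".join(tags_to_words(chars, tags)))
```

Joining with a single space yields exactly one separator between words, and the two flush branches keep the function from dropping characters when the tagger emits an inconsistent tag sequence.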

Here blog_test.utf8 is again the blog post "Some important suggestions for jobseeker" used earlier, and blog_test_segment.utf8 is the resulting segmented text.

blog_test_segment.utf8:

下面这些建议都是我在《 Th eG oo gl eR es um e: Ho wt op re pa re fo ra ca re er an dl an da jo ba tA pp le ,M ic ro so ft ,G oo gl e, or an yT op Te ch Co mp an y 》 ( 中文名比较俗：《金领简历敲开苹果、微软、谷歌的大门》 ) 这本书上看到的。我对一些内容进行了去粗取精。有兴趣的可以去看原书。

1. 建立自己的成就记录。招聘人员想知道你能够设定宏达目标，并拥有完成这些目标的能力。你的成就可以体现在学业、项目工作、志愿者工作、职业经历和体育活动上。

2. 具备良好的笔头和口头表达能力。不管是书面沟通还是口头沟通都对你未来的职业发展至关重要。如果你还做不到游刃有余地在公众面前演讲，那就得多加练习。如果你的写作能力比较薄弱，可以上一门写作课或是写博客来锻炼自己。你不需要博览群书或提笔成章，但确实需要表述得清晰、专业。

3. 做得精而不是做得杂。 ( 关于这点必须做一个取舍。你做的事情越多越杂，就越有可能掌握通用技能。但如果你想专于一个领域的话，就得学会专注。 )

4. 成为一名领导者。

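As the output above shows, the result is no better than the earlier maxent one, but the main culprit here is my computer. The original training data is too large, more than four million lines. I ran the training three times and every run crashed the system. On the last attempt I started at ten in the evening and it was still running at one in the morning, so I went to bed and left it running; when I woke up the program had crashed again. I therefore trained the model on only about fifty thousand lines cut from the training data, which is why the accuracy is low. With a machine that could handle the full training set, segmentation accuracy would be around 96%. This is also characteristic of CRFs: they need a large amount of training data to give good results.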
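When cutting the training file down, it is safer to cut at sentence boundaries (the blank lines) than at an arbitrary line, so that no sentence is truncated mid-word. A minimal sketch, assuming the blank-line-separated format produced by make_crf_train_data.py; take_sentences is a hypothetical helper and the sentence count is arbitrary:

```python
def take_sentences(lines, max_sentences):
    """Yield the lines of the first max_sentences sentences.

    A sentence is a run of non-blank lines followed by a blank line.
    """
    seen = 0
    in_sentence = False
    for line in lines:
        if seen >= max_sentences:
            break
        yield line
        if line.strip():
            in_sentence = True
        elif in_sentence:
            # The blank line just emitted closes one sentence.
            seen += 1
            in_sentence = False


if __name__ == "__main__":
    sample = [u"中\tB\n", u"文\tE\n", u"\n", u"很\tS\n", u"\n", u"好\tS\n", u"\n"]
    # Keep only the first two sentences of the sample.
    for line in take_sentences(sample, 2):
        print(repr(line))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly and the result written out with writelines, without loading the full corpus into memory.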

Final segmentation scores:

=== SUMMARY:

=== TOTAL INSERTIONS: 99

=== TOTAL DELETIONS: 30

=== TOTAL SUBSTITUTIONS: 156

=== TOTAL NCHANGE: 285

=== TOTAL TRUE WORD COUNT: 821

=== TOTAL TEST WORD COUNT: 890

=== TOTAL TRUE WORDS RECALL: 0.773

=== TOTAL TEST WORDS PRECISION: 0.713

=== F MEASURE: 0.742

=== OOV Rate: 0.141

=== OOV Recall Rate: 0.259

=== IV Recall Rate: 0.858

### blog_test_segment.utf8 99 30 156 285 821 890 0.773 0.713 0.742 0.141 0.259 0.858
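The recall, precision, and F measure in this summary can be recomputed from the raw counts. The sketch below assumes that the number of correctly segmented words equals the true word count minus deletions and substitutions (821 - 30 - 156 = 635); that assumption is mine, but it reproduces the reported figures. The function name is illustrative:

```python
def segmentation_scores(true_words, test_words, deletions, substitutions):
    """Return (recall, precision, f_measure) from word-level error counts."""
    # Reference words that were neither deleted nor substituted.
    correct = true_words - deletions - substitutions
    recall = correct / float(true_words)
    precision = correct / float(test_words)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


if __name__ == "__main__":
    r, p, f = segmentation_scores(821, 890, 30, 156)
    print("recall=%.3f precision=%.3f f=%.3f" % (r, p, f))
```

With the counts above this gives recall 0.773, precision 0.713, and F measure 0.742, matching the summary.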

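2 thoughts on "Using the CRF Conditional Random Field Toolkit for Chinese Word Segmentation"

I followed your steps for CRF segmentation, but running python crf_segmenter.py crf_model blog_test.utf8 blog_test_segment.utf8 fails with this error: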

Traceback (most recent call last):
  File "crf_segmenter.py", line 10, in <module>
    import CRFPP
ImportError: No module named CRFPP

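Do all your programs run on Linux? How should I solve this problem?

All my programs were run and tested on Ubuntu. You can run them on Windows too, but installation and encoding issues are more troublesome there. The error you hit means CRF++ is not installed on your machine. Search Google for this open-source tool; all the programs here depend on that package.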
