Chinese Word Segmentation with CRF++ (Beginner's Edition)

Latest recommended article published 2021-09-28 18:36:20

weixin_39535527

Article link: https://blog.csdn.net/weixin_39535527/article/details/111609383

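On the company's Linux server everything just worked, but after moving to my local Mac I ran into some trouble, and I still don't know why. So I'm writing the steps down again.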

Installation:

Environment:

macOS

conda python3

A C++ compiler is required (gcc 3.0 or higher).

After downloading, unpack the archive, then cd into the directory and install:

./configure
make
sudo make install

After that, there is a python directory in the source tree; cd into it and build the Python bindings:

python setup.py build
python setup.py install

I did not copy any files anywhere, and I don't know why guides online include that step. However, on Linux you do need to create the following symlink:

sudo ln -s /usr/local/lib/libcrfpp.so* /usr/lib64/

Once installed, test it from Python:

import CRFPP
# If this raises no error, the installation works

Reference and download:

CRF++: Yet Another CRF toolkit (taku910.github.io)

Mirror in China:

Link: https://pan.baidu.com/s/1EuJS7gAMTSg1fZSwdY_eWg  Password: 9hue

Data:

Link: https://pan.baidu.com/s/1LPEmgImmNIOHKCvm9ZibUQ  Password: saa1

Training:

## Training a segmenter with CRF++ 0.58

Here I train on the bakeoff2005 data. I still need to look into how the score is defined.

Second International Chinese Word Segmentation Bakeoff

1. Prepare the training data: convert the raw training corpus into the format CRF++ expects:

python make_crf_train_data.py raw_seg_train_data.txt crf_seg_train_data.txt
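The conversion script itself lives at the reference link, so here is only a minimal sketch of the idea, assuming the 4-tag B/M/E/S scheme that the segmenter in step 7 decodes. The function names are mine, not the original script's. Each character goes on its own line as "character, tab, tag"; in a CRF++ data file, sentences are separated by a blank line.

```python
def tag_word(word):
    """Return (char, tag) pairs for one word using the 4-tag B/M/E/S scheme."""
    if len(word) == 1:
        return [(word, "S")]  # single-character word
    middle = [(ch, "M") for ch in word[1:-1]]
    return [(word[0], "B")] + middle + [(word[-1], "E")]


def segmented_line_to_crf(line):
    """Convert one whitespace-segmented sentence into CRF++ training rows."""
    rows = []
    for word in line.split():
        for ch, tag in tag_word(word):
            rows.append(ch + "\t" + tag)
    return rows


if __name__ == "__main__":
    # Hypothetical example sentence, already segmented by spaces.
    for row in segmented_line_to_crf("我 喜欢 自然语言"):
        print(row)
    print()  # blank line ends the sentence
```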

2. Train:

crf_learn -m 10 template msr_training.tagging4crf.utf8 crf_model
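The command expects a feature template file named template, which this post does not show. As an assumption, a typical character-window template for two-column data (character, tag) can be generated like this. The window size and feature numbering are my choices for illustration and are not necessarily what the referenced article uses.

```python
def build_template(window=2):
    """Build a CRF++ feature template over a character window of +/- `window`."""
    lines = ["# Unigram"]
    idx = 0
    # Single-character features, e.g. U00:%x[-2,0] ... U04:%x[2,0]
    for off in range(-window, window + 1):
        lines.append("U%02d:%%x[%d,0]" % (idx, off))
        idx += 1
    # Adjacent-character pair features, e.g. U06:%x[-1,0]/%x[0,0]
    for off in range(-window, window):
        lines.append("U%02d:%%x[%d,0]/%%x[%d,0]" % (idx, off, off + 1))
        idx += 1
    # A bare "B" line adds features over the previous and current output tags.
    lines += ["", "# Bigram", "B"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_template(), end="")
```

Writing the returned string to a file called template would match the file name used in the crf_learn command.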

NOTE: If training fails, read up on the parameters below first. Here I only capped the number of iterations (-m 10); tune the rest yourself.

###

TODO: I need to study these options in detail.

Usage: crf_learn [options] files

-f, --freq=INT use features that occuer no less than INT(default 1)

-m, --maxiter=INT set INT for max iterations in LBFGS routine(default 10k)

-c, --cost=FLOAT set FLOAT for cost parameter(default 1.0)

-e, --eta=FLOAT set FLOAT for termination criterion(default 0.0001)

-C, --convert convert text model to binary model

-t, --textmodel build also text model file for debugging

-a, --algorithm=(CRF|MIRA) select training algorithm

-p, --thread=INT number of threads (default auto-detect)

-H, --shrinking-size=INT set INT for number of iterations variable needs to be optimal before considered for shrinking. (default 20)

-v, --version show the version and exit

-h, --help show this help and exit

###

3. Prepare the test data:

python make_crf_test_data.py raw_seg_test_data.txt crf_seg_test_data.txt
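Again, the real script is at the reference link. A rough sketch of what this step has to do, under the assumption that the model was trained on two-column data: put every non-space character on its own line and fill the tag column with a placeholder, since crf_test appends its own prediction as an extra last column. The function name and the placeholder value "B" are assumptions.

```python
def raw_line_to_crf_test(line, placeholder="B"):
    """Turn one unsegmented line into CRF++ test rows: character, tab, dummy tag."""
    rows = []
    for ch in line.strip():
        if ch.strip():  # skip spaces and other whitespace inside the line
            rows.append(ch + "\t" + placeholder)
    return rows


if __name__ == "__main__":
    # Hypothetical raw sentence without segmentation.
    for row in raw_line_to_crf_test("我喜欢自然语言\n"):
        print(row)
    print()  # blank line ends the sentence
```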

4. Test:

crf_test -m crf_model msr_test4crf.utf8 > msr_test4crf.tag.utf8

###

TODO: I need to study these options in detail.

Usage: crf_test [options] files

-m, --model=FILE set FILE for model file

-n, --nbest=INT output n-best results

-v, --verbose=INT set INT for verbose level

-c, --cost-factor=FLOAT set cost factor

-o, --output=FILE use FILE as output file

-v, --version show the version and exit

-h, --help show this help and exit

###

5. Convert the tagged output back to plain text:

python crf_data_2_word.py msr_test4crf.tag.utf8 msr_test4crf.tag2word.utf8
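The idea behind this step, as a minimal sketch rather than the original script: read the rows of one tagged sentence, take the first column as the character and the last column as the predicted tag, and close a word on every E or S. The function name is hypothetical.

```python
def tagged_rows_to_words(rows):
    """Rebuild words from rows whose first column is a character and last column a tag."""
    words, current = [], ""
    for row in rows:
        cols = row.split()
        if not cols:
            continue  # blank line, nothing to decode
        ch, tag = cols[0], cols[-1]
        # A new word starts at B or S; flush anything left over from a broken sequence.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # A word ends at E or S.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    rows = ["我\tB\tS", "喜\tB\tB", "欢\tB\tE"]
    print(" ".join(tagged_rows_to_words(rows)))
```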

6. Evaluate:

./icwb2-data/scripts/score ./icwb2-data/gold/msr_training_words.utf8 ./icwb2-data/gold/msr_test_gold.utf8 msr_test4crf.tag2word.utf8 > msr_crf_segment.score

The results are as follows:

=== SUMMARY:

=== TOTAL INSERTIONS: 1412

=== TOTAL DELETIONS: 1305

=== TOTAL SUBSTITUTIONS: 2449

=== TOTAL NCHANGE: 5166

=== TOTAL TRUE WORD COUNT: 106873

=== TOTAL TEST WORD COUNT: 106980

=== TOTAL TRUE WORDS RECALL: 0.965

=== TOTAL TEST WORDS PRECISION: 0.964

=== F MEASURE: 0.964

=== OOV Rate: 0.026

=== OOV Recall Rate: 0.647

=== IV Recall Rate: 0.974

### msr_test4crf.tag2word.utf8 1412 1305 2449 5166 106873 106980 0.965 0.964 0.964 0.026 0.647 0.974
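On how the score is defined: I have not read the score script's source, so the formulas below are an assumption. They do reproduce the summary figures above from the raw counts: words that were neither deleted nor substituted count as correct for recall, and words that were neither inserted nor substituted count as correct for precision.

```python
def score_summary(insertions, deletions, substitutions, true_words, test_words):
    """Recompute recall, precision and F-measure from the summary counts."""
    recall = (true_words - deletions - substitutions) / true_words
    precision = (test_words - insertions - substitutions) / test_words
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


if __name__ == "__main__":
    # Counts copied from the SUMMARY block above.
    r, p, f = score_summary(1412, 1305, 2449, 106873, 106980)
    print("recall=%.3f precision=%.3f f=%.3f" % (r, p, f))
```

NCHANGE is simply the sum of the three error counts: 1412 + 1305 + 2449 = 5166.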

7. Put it to use. You can load the trained model directly and segment text:

python crf_segmenter.py crf_model ./icwb2-data/testing/msr_test.utf8 msr_test.seg.utf8

NOTE:

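All of the code can be found at the link below; to avoid infringing copyright I won't copy it here. Except for the last script, porting from Python 2 to 3 only requires commenting out the print statements or adding parentheses to them. The last script needs a few changes, so I'm including it here:

import codecs

# Changed from the original: in Python 3 the manual UTF-8 encode/decode calls
# are no longer needed. codecs.open is not strictly needed either, but I left
# it alone... too much hassle.
def crf_segmentor(input_file, output_file, tagger):
    input_data = codecs.open(input_file, 'r', 'utf-8')
    output_data = codecs.open(output_file, 'w', 'utf-8')
    for line in input_data.readlines():
        tagger.clear()
        # Feed the tagger one character per row, with placeholder columns.
        for word in line.strip():
            word = word.strip()
            if word:
                tagger.add(word + "\to\tB")
        tagger.parse()
        size = tagger.size()
        xsize = tagger.xsize()
        # Write each character, adding spaces at word boundaries by tag.
        for i in range(0, size):
            for j in range(0, xsize):
                char = tagger.x(i, j)
                tag = tagger.y2(i)
                if tag == 'B':
                    output_data.write(' ' + char)
                elif tag == 'M':
                    output_data.write(char)
                elif tag == 'E':
                    output_data.write(char + ' ')
                else: # tag == 'S'
                    output_data.write(' ' + char + ' ')
        output_data.write('\n')
    input_data.close()
    output_data.close()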

8. Main reference. The scripts are the original work of the linked author:

中文分词入门之字标注法4 | 我爱自然语言处理

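9. Afterword:

Many of the hard parts of word segmentation have already been implemented by others; CRF++, used here, is one such ready-made tool. I'll explain CRFs themselves in a later post. In practice, the difficulty of segmentation lies in post-processing: dealing with text that was split where it shouldn't be, or not split where it should be. I won't cover that here; have a look at existing segmenters.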

End