Notes on Using CRF++ for Chinese Named Entity Recognition

Yumath

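When you first start working with CRF++, it is easy to feel a bit lost.
My teacher Zhan is right: documentation should be read in the original English, because blog posts translated into Chinese inevitably lose information.
CRF++ home page: https://taku910.github.io/crfpp/
CRF++-0.58.tar.gz download: http://code.google.com/p/crfpp/downloads/list
Tip: you may need your own VPN or proxy to reach the download page.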

Installation

% ./configure
% make
% su
# make install
You can change the default install path with the --prefix option of the configure script.
Try the --help option to find out about other options.

Training and Test file formats

Both the training file and the test file must be written in a specific format for CRF++ to work properly. Generally speaking, training and test files must consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns. The definition of a token depends on the task, but in most typical cases tokens simply correspond to words. Each token must be written on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. An empty line marks the boundary between sentences.
You can use as many columns as you like, as long as the number of columns is the same for every token. Furthermore, there are some kinds of "semantics" among the columns. For example, the first column is the word, the second column is the POS tag, the third column is the sub-category of the POS tag, and so on.
The last column holds the true answer tag that CRF++ is trained to predict.
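
The format rules above can be illustrated with a small sketch. It assumes character-level tokens with BIO-style tags, which is a common setup for Chinese NER but is not stated in this post; the function name `to_crfpp_lines` and the toy sentences are hypothetical.

```python
def to_crfpp_lines(sentences):
    """Serialize sentences into CRF++ column format.

    `sentences` is a list of sentences; each sentence is a list of tokens;
    each token is a tuple of string columns whose last item is the answer tag.
    """
    lines = []
    n_cols = None
    for sentence in sentences:
        for token in sentence:
            # Every token must have the same number of columns.
            if n_cols is None:
                n_cols = len(token)
            elif len(token) != n_cols:
                raise ValueError("every token must have the same number of columns")
            lines.append("\t".join(token))
        lines.append("")  # an empty line marks the sentence boundary
    return "\n".join(lines) + "\n"


# Hypothetical toy data: (character, tag) pairs, two sentences.
sample = [
    [("我", "O"), ("在", "O"), ("北", "B-LOC"), ("京", "I-LOC")],
    [("你", "O"), ("好", "O")],
]
crfpp_text = to_crfpp_lines(sample)
print(crfpp_text)
```

Writing `crfpp_text` to a file would give something usable as `train_file`. Extra feature columns (for example a POS tag) would go between the character and the tag.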

Note: the feature template section will be added when I have time.

Training (encoding)

% crf_learn template_file train_file model_file

Both template_file and train_file must be prepared in advance. crf_learn writes the trained model to model_file.
During training, crf_learn reports the following values for each iteration:
- iter: number of iterations processed
- terr: error rate with respect to tags (# of error tags / # of all tags)
- serr: error rate with respect to sentences (# of error sentences / # of all sentences)
- obj: current objective value. When this value converges to a fixed point, CRF++ stops iterating.
- diff: relative difference from the previous objective value
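
To make the terr and serr definitions concrete, here is a minimal sketch that computes both from gold and predicted tag sequences. It assumes a sentence counts as an error sentence when at least one of its tags is wrong; the function name `error_rates` and the toy tags are hypothetical, and this is not CRF++'s own code.

```python
def error_rates(gold_sentences, pred_sentences):
    """Return (terr, serr) for two parallel lists of tag sequences."""
    total_tags = 0
    wrong_tags = 0
    wrong_sentences = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        errors = sum(g != p for g, p in zip(gold, pred))
        total_tags += len(gold)
        wrong_tags += errors
        if errors:
            wrong_sentences += 1  # at least one wrong tag in this sentence
    terr = wrong_tags / total_tags               # error tags / all tags
    serr = wrong_sentences / len(gold_sentences)  # error sentences / all sentences
    return terr, serr


# Hypothetical toy example: one wrong tag out of six, in one of two sentences.
gold = [["O", "O", "B-LOC", "I-LOC"], ["O", "O"]]
pred = [["O", "O", "B-LOC", "O"], ["O", "O"]]
terr, serr = error_rates(gold, pred)
print(terr, serr)
```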

Note: how to choose the training parameters will be added when I have time.

Testing (decoding)

% crf_test -m model_file test_files > result.txt
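
crf_test echoes each input line and appends the predicted tag as an extra last column, keeping the empty lines between sentences. A minimal sketch for reading result.txt back in follows. It assumes the test file also carried the gold tag as its last column, so the gold tag ends up second to last; the function name and the sample text are hypothetical.

```python
def parse_crf_test_output(text):
    """Split crf_test output into sentences of (token, gold_tag, predicted_tag)."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        # First column: token; second to last: gold tag; last: predicted tag.
        current.append((cols[0], cols[-2], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample shaped like crf_test output (not real CRF++ results).
sample_output = "北\tB-LOC\tB-LOC\n京\tI-LOC\tO\n\n你\tO\tO\n好\tO\tO\n"
parsed = parse_crf_test_output(sample_output)
mismatches = sum(gold != pred for sent in parsed for _, gold, pred in sent)
print(len(parsed), mismatches)
```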

Evaluate

conlleval script: http://www.cnts.ua.ac.be/conll2000/chunking/output.html

conlleval.pl -d "\t" < result.txt
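
conlleval scores whole chunks rather than individual tags and reports precision, recall, and F-score. The sketch below illustrates that idea for a single tag sequence. It is a simplified stand-in, not conlleval's actual logic: it assumes strict BIO tags and ignores an `I-` tag that does not continue a chunk of the same type. The function names and toy tags are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    chunks = set()
    start, kind = None, None
    # A trailing "O" sentinel closes a chunk that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            chunks.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return chunks


def chunk_prf(gold, pred):
    """Chunk-level precision, recall, and F1 for one pair of tag sequences."""
    gold_chunks = extract_chunks(gold)
    pred_chunks = extract_chunks(pred)
    correct = len(gold_chunks & pred_chunks)  # exact span and type match
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical toy example: the LOC chunk is found, the PER chunk is missed.
gold_tags = ["O", "B-LOC", "I-LOC", "O", "B-PER"]
pred_tags = ["O", "B-LOC", "I-LOC", "O", "O"]
precision, recall, f1 = chunk_prf(gold_tags, pred_tags)
print(precision, recall, f1)
```

For real evaluation the conlleval script remains the reference; this only shows why a partially matched entity counts as an error at the chunk level.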

Note: an example trained on the annotated People's Daily corpus will be added over the weekend when I have time.
