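When you first start working with CRF++, it is easy to feel a bit lost.
Teacher Zhan is right: documentation should be read in the original English, because blog posts translated into Chinese inevitably lose information.
CRF++ homepage: https://taku910.github.io/crfpp/
CRF++-0.58.tar.gz download: http://code.google.com/p/crfpp/downloads/list
Tip: you may need your own proxy or VPN to reach the download link.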
Installation
% ./configure
% make
% su
# make install
You can change the default install path with the --prefix option of the configure script.
Try the --help option to find out about other options.
Training and Test file formats
Both the training file and the test file must be written in a specific format for CRF++ to run properly. Generally speaking, training and test files must consist of multiple tokens. In addition, each token consists of multiple columns, and the number of columns is fixed. The definition of a token depends on the task; in most typical cases, tokens simply correspond to words. Each token must be written on its own line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a sentence. An empty line marks the boundary between sentences.
You can use as many columns as you like, provided that the number of columns is the same for every token. Furthermore, the columns carry some kind of “semantics”. For example, the first column is the ‘word’, the second column is the ‘POS tag’, the third column is the ‘sub-category of POS’, and so on.
The last column holds the true answer tag that CRF++ is trained to predict.
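As a minimal sketch of the format described above, the Python snippet below serializes sentences into this column layout and checks that every token has the same number of columns. The function name `to_crfpp` and the sample words and tags are illustrative assumptions, not part of CRF++.

```python
from typing import List, Sequence


def to_crfpp(sentences: Sequence[Sequence[Sequence[str]]]) -> str:
    """Serialize sentences into a column format with one token per line.

    Each sentence is a list of tokens; each token is a list of columns,
    with the answer tag as the last column.
    """
    lines: List[str] = []
    width = None  # number of columns, fixed by the first token seen
    for sentence in sentences:
        for token in sentence:
            if width is None:
                width = len(token)
            if len(token) != width:
                raise ValueError(f"expected {width} columns, got {len(token)}: {token}")
            if any(not col or any(ch.isspace() for ch in col) for col in token):
                raise ValueError(f"columns must be non-empty and free of whitespace: {token}")
            lines.append("\t".join(token))
        lines.append("")  # an empty line marks the sentence boundary
    return "\n".join(lines) + "\n"


# Hypothetical sample data: word, POS tag, answer tag.
sample = [
    [["He", "PRP", "B-NP"], ["reckons", "VBZ", "B-VP"]],
    [["Hello", "UH", "O"]],
]
text = to_crfpp(sample)
print(text, end="")
```

Writing `text` to a file should give something usable as train_file, assuming the template refers only to columns that actually exist.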
Note: I will add a section on feature templates when I have time.
Training (encoding)
% crf_learn template_file train_file model_file
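template_file and train_file must both be prepared in advance. crf_learn writes the trained model to model_file. During training, crf_learn prints the following fields for each iteration: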
- iter: number of iterations processed
- terr: error rate with respect to tags (# of error tags / # of all tags)
- serr: error rate with respect to sentences (# of error sentences / # of all sentences)
- obj: current objective value. When this value converges to a fixed point, CRF++ stops iterating.
- diff: relative difference from the previous objective value.
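If you want to log or plot these fields, a small parser is enough. The sketch below assumes each progress line is a series of space-separated `key=value` pairs (for example `iter=1 terr=0.12780 ...`); that line layout and the name `parse_progress` are assumptions made for illustration, so check them against what your crf_learn build actually prints.

```python
import re
from typing import Dict

# Matches key=value pairs such as "terr=0.12780".
_FIELD = re.compile(r"(\w+)=(\S+)")


def parse_progress(line: str) -> Dict[str, float]:
    """Turn one progress line into a dict mapping field name to float."""
    return {key: float(value) for key, value in _FIELD.findall(line)}


# Hypothetical sample line in the assumed key=value layout.
sample_line = "iter=1 terr=0.12780 serr=0.87121 obj=1406.329356 diff=0.324843"
stats = parse_progress(sample_line)
print(stats["terr"], stats["diff"])
```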
Note: I will add a section on choosing training parameters when I have time.
Testing (decoding)
% crf_test -m model_file test_files > result.txt
Evaluate
conlleval: http://www.cnts.ua.ac.be/conll2000/chunking/output.html
conlleval.pl -d "\t" < result.txt
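For a quick sanity check without conlleval, the tag and sentence error rates (terr and serr as defined in the training section) can be computed directly from result.txt. The sketch below assumes the default crf_test output, where each token line is the original columns followed by the predicted tag, so the last two columns are the true tag and the prediction. The name `error_rates` is hypothetical. It does not reproduce the chunk-level precision, recall, and F-score that conlleval reports.

```python
from typing import Iterable, Tuple


def error_rates(lines: Iterable[str]) -> Tuple[float, float]:
    """Return (tag error rate, sentence error rate) for tagged output lines."""
    tokens = wrong_tokens = sentences = wrong_sentences = 0
    in_sentence = sentence_has_error = False
    for raw in lines:
        cols = raw.split()
        if not cols:  # blank line: sentence boundary
            if in_sentence:
                sentences += 1
                wrong_sentences += sentence_has_error
            in_sentence = sentence_has_error = False
            continue
        if len(cols) < 2:
            raise ValueError(f"need a true tag and a predicted tag: {raw!r}")
        in_sentence = True
        tokens += 1
        if cols[-1] != cols[-2]:  # prediction differs from the true tag
            wrong_tokens += 1
            sentence_has_error = True
    if in_sentence:  # last sentence without a trailing blank line
        sentences += 1
        wrong_sentences += sentence_has_error
    return wrong_tokens / max(tokens, 1), wrong_sentences / max(sentences, 1)


# Hypothetical output: word, POS tag, true tag, predicted tag.
sample_output = """He\tPRP\tB-NP\tB-NP
reckons\tVBZ\tB-VP\tB-NP

Hello\tUH\tO\tO
""".splitlines()
terr, serr = error_rates(sample_output)
print(terr, serr)
```

On a real run, pass an open file instead of the sample, for example `error_rates(open("result.txt", encoding="utf-8"))`.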
Note: I will add an example trained on the People's Daily annotated corpus when I have time this weekend.