python-crfsuite API (translated notes)

power0405hf

Source: the original python-crfsuite API documentation

1. class pycrfsuite.ItemSequence

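A wrapper for the crfsuite ItemSequence: a class that stores the features of all items in a single sequence.
Using this class is an alternative to passing data to Trainer and Tagger directly.
It can save time when the same input sequence is passed to a Trainer/Tagger more than once, because the features are not processed repeatedly.
It also gives access to the "processed" features/attributes that are sent to CRFsuite, which can help, for example, to check which attributes (as returned by Tagger.info()) are active for a given observation.
Initialize an ItemSequence with a list of item features:
ItemSequence([{'foo': 1, 'bar': 0}, {'foo': 1.5, 'baz': 2}])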

Item features can take any of the following forms:

{"string_key": float_weight, ...} dict: keys are observed features, values are their weights
{"string_key": bool, ...} dict: True is converted to weight 1.0, False to 0.0
{"string_key": "string_value", ...} dict: the same as {"string_key=string_value": 1.0, ...}
["string_key1", "string_key2", ...] list: the same as {"string_key1": 1.0, "string_key2": 1.0, ...}
{"string_prefix": {...}} dict: the nested dict is processed and "string_prefix" is prepended to each key
{"string_prefix": set([...])} dict: the nested set is processed and "string_prefix" is prepended to each key
Dict-based features can be mixed, for example:

{"key1": float_weight,
 "key2": "string_value",
 "key3": bool_value,
 "key4": {"key5": ["x", "y"], "key6": float_value},
 }

items(self)

Returns a list of prepared item features: a list of {unicode_key: float_value} dicts.
For example:

print(ItemSequence([["foo"], {"bar": {"baz": 1}}]).items())
Output: [{'foo': 1.0}, {'bar:baz': 1.0}]
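The conversion rules listed for ItemSequence can be sketched in plain Python. This is only an illustration of the documented formats, not the library's implementation: flatten_item is a hypothetical helper, and the ":" and "=" separators are taken from the examples above.

```python
def flatten_item(item, prefix=""):
    """Flatten one item's features into a {key: weight} dict (illustrative only)."""
    if not isinstance(item, dict):
        # A list or set of keys: every key gets weight 1.0.
        return {prefix + key: 1.0 for key in item}
    flat = {}
    for key, value in item.items():
        name = prefix + key
        if isinstance(value, bool):
            # bool must be checked before int, since bool is an int subclass.
            flat[name] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            flat[name] = float(value)
        elif isinstance(value, str):
            # "key": "value" becomes the single feature "key=value".
            flat[name + "=" + value] = 1.0
        else:
            # Nested dict, list or set: recurse with "key:" as the prefix.
            flat.update(flatten_item(value, prefix=name + ":"))
    return flat


mixed = flatten_item({
    "key1": 0.5,
    "key2": "v",
    "key3": True,
    "key4": {"key5": ["x", "y"], "key6": 2},
})
```

With this sketch, flatten_item({"bar": {"baz": 1}}) gives {"bar:baz": 1.0}, which matches the items() output shown above.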

2. Training

1. class pycrfsuite.Trainer

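The trainer class. It maintains a data set for training and provides an interface to several training algorithms.
Parameters:

algorithm: {'lbfgs', 'l2sgd', 'ap', 'pa', 'arow'}
    Name of the training algorithm; see Trainer.select()
params: dict, optional
    Training parameters; see Trainer.set_params() and Trainer.set()
verbose: boolean
    Whether to print debug messages during training. Default is True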

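append(self,xseq,yseq,int group=0)
    Appends an instance (an item sequence and its label sequence) to the data set.
    Parameters: xseq: a sequence of item features
                    The item sequence of the instance. It should be a list of item features or an ItemSequence instance.
                    The allowed item feature formats are the same as those described for ItemSequence.
                yseq: a sequence of strings
                    The label sequence of the instance. It must contain the same number of elements as xseq.
                group: int, optional
                    The group number of the instance. Group numbers are used to select a subset of the data (see the holdout parameter of train()).
clear(self)
    Removes all instances from the data set.
get(self,name)
    Returns the value of a training parameter, for the graphical model and training algorithm chosen with Trainer.select().
    Parameters: name: string
                    The parameter name
get_params(self)
    Returns the training parameters.
    Returns: dict
                A {parameter_name: parameter_value} dict with all trainer parameters.
help(self,name)
    Returns the description of a training parameter, i.e. the help message for the parameter called name. The graphical model and training algorithm must be selected with Trainer.select() before calling this function.
    Parameters: name: string
                    The parameter name
    Returns: string
                The parameter description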
logparser=None

message(self,message)

on_end(self,log)

on_featgen_end(self,log)

on_featgen_progress(self,log,percent)

on_iteration(self,log,info)

on_optimization_end(self,log)

on_prepare_error(self,log)

on_prepared(self,log)

on_start(self,log)

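params(self)
    Returns the list of parameters.
    This is the list of parameter names available for the graphical model and training algorithm given to the Trainer constructor or chosen with Trainer.select().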

select(self,algorithm,type='crf1d')
    Initializes the training algorithm.
    Parameters: algorithm: {'lbfgs', 'l2sgd', 'ap', 'pa', 'arow'}
                    'lbfgs' for Gradient descent using the L-BFGS method (a quasi-Newton method)
                    'l2sgd' for Stochastic Gradient Descent with L2 regularization term
                    'ap' for Averaged Perceptron
                    'pa' for Passive Aggressive
                    'arow' for Adaptive Regularization Of Weight Vector
                type: string, optional
                    The name of the graphical model
set(self,name,value)
    Sets one training parameter for the graphical model and training algorithm chosen with Trainer.select().
    Parameters: name: string
                    The parameter name
                value: string
                    The parameter value
set_params(self,params)
    Sets several training parameters at once.
    Parameters: params: dict
                    A dict of parameters: {name: value}
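A minimal sketch of how select(), params() and help() fit together, assuming the python-crfsuite package is installed. describe_algorithm is a hypothetical helper name, and the import is placed inside the function only so that the definition can be loaded without the package.

```python
def describe_algorithm(algorithm="l2sgd"):
    """Return {parameter_name: help_text} for one training algorithm."""
    import pycrfsuite  # provided by the python-crfsuite package

    trainer = pycrfsuite.Trainer(verbose=False)
    trainer.select(algorithm)  # graphical model type defaults to 'crf1d'
    # params() lists the names valid for the selected algorithm;
    # help() returns the description of each one.
    return {name: trainer.help(name) for name in trainer.params()}
```

The names returned this way are the ones that set() and set_params() accept for the selected algorithm, so printing the dict is a quick way to see what can be tuned.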

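train(self,model,int holdout=-1)
    Runs the training algorithm on the data set built with Trainer.append().
    Parameters: model: string
                    The name of the file in which the trained model is stored. If it is empty, no model file is written.
                holdout: int, optional
                    The group number used for holdout evaluation. Instances with this group number are not used for training, only for holdout evaluation. The default is -1, which means "use all instances for training".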
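The training workflow described above (append, set_params, train) can be sketched as follows, assuming the python-crfsuite package is installed. The toy data, the helper name train_toy_model and the parameter values are made up for illustration; c1, c2 and max_iterations are parameters of the 'lbfgs' algorithm.

```python
def train_toy_model(model_path):
    """Train a tiny CRF, write it to model_path, and return the parameters used."""
    import pycrfsuite  # provided by the python-crfsuite package

    # Toy data (made up): one feature dict per token, one label per token.
    xseqs = [
        [{"word": "the"}, {"word": "cat"}, {"word": "sleeps"}],
        [{"word": "a"}, {"word": "dog"}, {"word": "barks"}],
    ]
    yseqs = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]]

    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    for xseq, yseq in zip(xseqs, yseqs):
        trainer.append(xseq, yseq)  # group is left at its default, 0

    trainer.set_params({"c1": 0.1, "c2": 0.01, "max_iterations": 20})
    trainer.train(model_path)  # holdout is left at -1: train on every instance
    return trainer.get_params()
```

Calling train_toy_model("toy.crfsuite") should leave a model file at that path. To hold out part of the data, one would pass a different group number to append() for those instances and the same number as holdout to train().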

verbose
    verbose:object

3. Tagging

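class pycrfsuite.Tagger
    The tagger class.
    It uses a trained model to predict label sequences for input sequences.
close(self)
    Closes the model.
dump(self,filename=None)
    Dumps the CRF model in plain-text format.
info(self)
    Returns a ParsedDump structure with the model's internal information.

labels(self)
    Returns the list of labels.
    Returns: list of strings
            the list of labels in the model.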

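marginal(self,y,pos)
    Computes the marginal probability of label y at position pos for the current input sequence (that is, a sequence set with Tagger.set() or the sequence used in a previous Tagger.tag() call).
    Parameters: y: string
                    The label
                pos: int
                    The position of the label
    Returns: float
                The marginal probability of label y at position pos.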

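set(self,xseq)
    Sets an instance (an item sequence) for later calls to Tagger.tag(), Tagger.probability() and Tagger.marginal().
    Parameters: xseq: item sequence
                    The item sequence of the instance. It should be a list of item features or an ItemSequence instance.

tag(self,xseq=None)
    Predicts the label sequence for an item sequence.
    Returns: list of strings
                The predicted label sequence.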
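A minimal sketch of tagging with a saved model, assuming the python-crfsuite package is installed and a model file already exists. tag_with_model is a hypothetical helper name. Tagger.open() and Tagger.probability() are library methods that this section references but does not describe.

```python
def tag_with_model(model_path, xseq):
    """Return (labels, per-position marginals, sequence probability)."""
    import pycrfsuite  # provided by the python-crfsuite package

    tagger = pycrfsuite.Tagger()
    tagger.open(model_path)  # load the model file written by Trainer.train()
    try:
        # tag() predicts the labels and makes xseq the current sequence,
        # so marginal() and probability() below refer to the same input.
        labels = tagger.tag(xseq)
        marginals = [tagger.marginal(y, pos) for pos, y in enumerate(labels)]
        probability = tagger.probability(labels)
    finally:
        tagger.close()
    return labels, marginals, probability
```

The marginals give a per-token confidence for the predicted labels, while probability scores the predicted sequence as a whole.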

4. Debugging

class pycrfsuite._dumpparser.ParsedDump
CRFsuite model parameters. Objects of this type are returned by the pycrfsuite.Tagger.info() method.

Attributes
transitions	(dict) {(from_label, to_label): weight} dict with learned transition weights
state_features	(dict) {(attribute, label): weight} dict with learned (attribute, label) weights
header	(dict) Metadata from the file header
labels	(dict) {name: internal_id} dict with model labels
attributes	(dict) {name: internal_id} dict with known attributes
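As a sketch of how these attributes might be inspected, the helper below ranks the learned state features by absolute weight. strongest_state_features is a hypothetical name; it only assumes that its argument has a state_features dict shaped as described above, such as the object returned by Tagger.info().

```python
def strongest_state_features(info, n=5):
    """Return the n (attribute, label, weight) triples with the largest |weight|."""
    ranked = sorted(
        info.state_features.items(),
        key=lambda pair: abs(pair[1]),  # rank by magnitude, keep the sign
        reverse=True,
    )
    return [(attr, label, weight) for (attr, label), weight in ranked[:n]]
```

The same pattern applies to transitions, whose keys are (from_label, to_label) pairs instead of (attribute, label) pairs.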
