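Java / OpenNLP: How do I train a Chunker in OpenNLP?

I need to train a Chunker in OpenNLP so that it classifies the training data into noun phrases. How should I proceed? The online documentation does not explain how to do this from within a program, without the command line. It says to use en-chunker.train, but how do I create that file?

Edit: @Alaye

After running the code you gave in your answer, I get the following error, which I have not been able to fix: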

Indexing events using cutoff of 5

Computing event counts... done. 3 events

Dropped event B-NP:[w_2=bos, w_1=bos, w0=He, w1=reckons, w2=., w_1=bosw0=He, w0=Hew1=reckons, t_2=bos, t_1=bos, t0=PRP, t1=VBZ, t2=., t_2=bost_1=bos, t_1=bost0=PRP, t0=PRPt1=VBZ, t1=VBZt2=., t_2=bost_1=bost0=PRP, t_1=bost0=PRPt1=VBZ, t0=PRPt1=VBZt2=., p_2=bos, p_1=bos, p_2=bosp_1=bos, p_1=bost_2=bos, p_1=bost_1=bos, p_1=bost0=PRP, p_1=bost1=VBZ, p_1=bost2=., p_1=bost_2=bost_1=bos, p_1=bost_1=bost0=PRP, p_1=bost0=PRPt1=VBZ, p_1=bost1=VBZt2=., p_1=bost_2=bost_1=bost0=PRP, p_1=bost_1=bost0=PRPt1=VBZ, p_1=bost0=PRPt1=VBZt2=., p_1=bosw_2=bos, p_1=bosw_1=bos, p_1=bosw0=He, p_1=bosw1=reckons, p_1=bosw2=., p_1=bosw_1=bosw0=He, p_1=bosw0=Hew1=reckons]

Dropped event B-VP:[w_2=bos, w_1=He, w0=reckons, w1=., w2=eos, w_1=Hew0=reckons, w0=reckonsw1=., t_2=bos, t_1=PRP, t0=VBZ, t1=., t2=eos, t_2=bost_1=PRP, t_1=PRPt0=VBZ, t0=VBZt1=., t1=.t2=eos, t_2=bost_1=PRPt0=VBZ, t_1=PRPt0=VBZt1=., t0=VBZt1=.t2=eos, p_2=bos, p_1=B-NP, p_2=bosp_1=B-NP, p_1=B-NPt_2=bos, p_1=B-NPt_1=PRP, p_1=B-NPt0=VBZ, p_1=B-NPt1=., p_1=B-NPt2=eos, p_1=B-NPt_2=bost_1=PRP, p_1=B-NPt_1=PRPt0=VBZ, p_1=B-NPt0=VBZt1=., p_1=B-NPt1=.t2=eos, p_1=B-NPt_2=bost_1=PRPt0=VBZ, p_1=B-NPt_1=PRPt0=VBZt1=., p_1=B-NPt0=VBZt1=.t2=eos, p_1=B-NPw_2=bos, p_1=B-NPw_1=He, p_1=B-NPw0=reckons, p_1=B-NPw1=., p_1=B-NPw2=eos, p_1=B-NPw_1=Hew0=reckons, p_1=B-NPw0=reckonsw1=.]

Dropped event O:[w_2=He, w_1=reckons, w0=., w1=eos, w2=eos, w_1=reckonsw0=., w0=.w1=eos, t_2=PRP, t_1=VBZ, t0=., t1=eos, t2=eos, t_2=PRPt_1=VBZ, t_1=VBZt0=., t0=.t1=eos, t1=eost2=eos, t_2=PRPt_1=VBZt0=., t_1=VBZt0=.t1=eos, t0=.t1=eost2=eos, p_2B-NP, p_1=B-VP, p_2B-NPp_1=B-VP, p_1=B-VPt_2=PRP, p_1=B-VPt_1=VBZ, p_1=B-VPt0=., p_1=B-VPt1=eos, p_1=B-VPt2=eos, p_1=B-VPt_2=PRPt_1=VBZ, p_1=B-VPt_1=VBZt0=., p_1=B-VPt0=.t1=eos, p_1=B-VPt1=eost2=eos, p_1=B-VPt_2=PRPt_1=VBZt0=., p_1=B-VPt_1=VBZt0=.t1=eos, p_1=B-VPt0=.t1=eost2=eos, p_1=B-VPw_2=He, p_1=B-VPw_1=reckons, p_1=B-VPw0=., p_1=B-VPw1=eos, p_1=B-VPw2=eos, p_1=B-VPw_1=reckonsw0=., p_1=B-VPw0=.w1=eos]

Indexing... done.

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

at java.util.ArrayList.rangeCheck(ArrayList.java:653)

at java.util.ArrayList.get(ArrayList.java:429)

at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)

at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)

at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)

at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)

at opennlp.tools.ml.model.TrainUtil.train(TrainUtil.java:53)

at opennlp.tools.chunker.ChunkerME.train(ChunkerME.java:253)

at com.oracle.crm.nlp.CustomChunker2.main(CustomChunker2.java:91)

Sorting and merging events... Process exited with exit code 1.

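(My en-chunker.train contains only the first 2 lines and the last line of the sample dataset.)

Could you tell me why this happens and how to fix it?

EDIT2: I got the Chunker working, but it throws an error when I change the sentences in the training set to anything other than the ones you gave in your answer. Can you tell me why that happens?

Solution:

A sample sentence of training data: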

He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O

Data in this format makes up the en-chunker.train file, and you can create the corresponding .bin file with the CLI:

$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8

Or use the API:

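import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

import opennlp.tools.chunker.ChunkSample;
import opennlp.tools.chunker.ChunkSampleStream;
import opennlp.tools.chunker.ChunkerFactory;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainer {

    public static void trainModel(String inputFile, String modelFile) throws IOException {
        // requireNonNull throws on null; Objects.nonNull only returns a boolean.
        Objects.requireNonNull(inputFile);
        Objects.requireNonNull(modelFile);

        // Read the file that was passed in rather than a hard-coded file name.
        MarkableFileInputStreamFactory factory =
                new MarkableFileInputStreamFactory(new File(inputFile));
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(factory, StandardCharsets.UTF_8);
        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream);

        ChunkerModel model;
        try {
            // A default ChunkerFactory uses the default chunker context generator.
            model = ChunkerME.train("en", sampleStream,
                    TrainingParameters.defaultParams(), new ChunkerFactory());
        } finally {
            sampleStream.close();
        }

        // try-with-resources closes the output stream even if serialization fails.
        try (OutputStream modelOut =
                new BufferedOutputStream(new FileOutputStream(modelFile))) {
            model.serialize(modelOut);
        }
    }
}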

The main method is:

import java.io.IOException;

public class Main {

    public static void main(String[] args) throws IOException {
        // Placeholder paths: point these at your own training file and output model.
        String inputFile = "//path//to//data.train";
        String modelFile = "//path//to//.bin";
        SentenceTrainer.trainModel(inputFile, modelFile);
    }
}
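On the IndexOutOfBoundsException from the question's edit: the log reports a cutoff of 5 and only 3 events, and every one of those events is listed as dropped. A likely explanation (an assumption drawn from that log, not something stated in the answer) is that features seen fewer than 5 times are discarded, so a three-line training file leaves nothing to train on. The sketch below imitates that kind of filtering with plain Java collections. CutoffSketch and applyCutoff are hypothetical names, the feature lists are shortened, and this is not OpenNLP's actual indexer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CutoffSketch {

    /** Returns the events that still have at least one feature seen >= cutoff times. */
    public static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        // First pass: count how often each feature occurs across all events.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        // Second pass: keep only frequent features and drop events left with none.
        List<List<String>> kept = new ArrayList<>();
        for (List<String> event : events) {
            List<String> frequent = new ArrayList<>();
            for (String feature : event) {
                if (counts.get(feature) >= cutoff) {
                    frequent.add(feature);
                }
            }
            if (!frequent.isEmpty()) {
                kept.add(frequent);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three events, loosely modeled on the three lines in the question's file.
        List<List<String>> events = Arrays.asList(
                Arrays.asList("w0=He", "t0=PRP", "p_1=bos"),
                Arrays.asList("w0=reckons", "t0=VBZ", "p_1=B-NP"),
                Arrays.asList("w0=.", "t0=.", "p_1=B-VP"));
        System.out.println(applyCutoff(events, 5).size()); // prints 0: nothing survives
        System.out.println(applyCutoff(events, 1).size()); // prints 3: everything survives
    }
}
```

If that reading is right, the practical options are a much larger training file, or a lower cutoff set on the TrainingParameters object (the TrainingParameters.CUTOFF_PARAM key) before calling ChunkerME.train. The same cause may also be behind the error described in EDIT2, though the question does not include that error message.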

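Reference: this blog

Hope this helps!

PS: Collect or write the data in a .txt file in the format shown above, then rename it with a .train extension. Even a name like trainingdata.txt will work. That is how you make the .train file.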
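Because the training file is written by hand, a quick format check before training can save a confusing failure later. The sketch below is a minimal example under two assumptions: each non-blank line holds exactly three whitespace-separated columns (word, POS tag, chunk tag), and a blank line separates one sentence from the next, as in the CoNLL-2000 chunking data. ChunkTrainFormatCheck and findProblems are hypothetical names and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkTrainFormatCheck {

    /** Returns the problems found; an empty list means every line looks well-formed. */
    public static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                continue; // assumed sentence boundary, nothing to check
            }
            String[] cols = line.split("\\s+");
            if (cols.length != 3) {
                problems.add("line " + (i + 1) + ": expected 3 columns, found " + cols.length);
            } else if (!cols[2].matches("O|[BI]-\\S+")) {
                // Chunk tags are expected to look like B-NP, I-NP or O.
                problems.add("line " + (i + 1) + ": unexpected chunk tag '" + cols[2] + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "He PRP B-NP", "reckons VBZ B-VP", ". . O", "", "broken-line");
        // prints: line 5: expected 3 columns, found 1
        findProblems(sample).forEach(System.out::println);
    }
}
```

To check a real file, the lines could be loaded with java.nio.file.Files.readAllLines and passed to findProblems before the trainer is called.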

Tags: training-data, opennlp, java, text-chunking

Source: https://codeday.me/bug/20191026/1939385.html
