Notes on an Information Extraction Baseline System

Information Extraction Baseline System: InfoExtractor

Abstract

InfoExtractor is an information extraction baseline system built on the Schema constrained Knowledge Extraction dataset (SKED).

InfoExtractor adopts a pipeline architecture with a p-classification model and a so-labeling model, both implemented with PaddlePaddle.

The p-classification model is a multi-label classifier that employs a stacked Bi-LSTM with max-pooling to identify the predicates involved in a given sentence.
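As a rough illustration of that architecture, the snippet below sketches a single Bi-LSTM layer with max-pooling in PaddlePaddle Fluid. It is not the repository's actual network definition: the layer sizes and variable names are illustrative, only one Bi-LSTM layer is shown (the real model stacks several), and the training loss is omitted.

import paddle.fluid as fluid

VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, NUM_PREDICATES = 30000, 128, 256, 50  # illustrative sizes

# Word ids of one sentence (variable-length LoD tensor).
words = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1)
emb = fluid.layers.embedding(input=words, size=[VOCAB_SIZE, EMB_DIM])

# One Bi-LSTM layer; dynamic_lstm expects an input of width 4 * hidden size.
proj = fluid.layers.fc(input=emb, size=HIDDEN_DIM * 4)
fwd, _ = fluid.layers.dynamic_lstm(input=proj, size=HIDDEN_DIM * 4, is_reverse=False)
bwd, _ = fluid.layers.dynamic_lstm(input=proj, size=HIDDEN_DIM * 4, is_reverse=True)
bi = fluid.layers.concat(input=[fwd, bwd], axis=1)

# Max-pooling over time collapses the sequence into one vector per sentence.
pooled = fluid.layers.sequence_pool(input=bi, pool_type='max')

# One logit per predicate; independent sigmoid scores allow several predicates per sentence.
logits = fluid.layers.fc(input=pooled, size=NUM_PREDICATES)
probs = fluid.layers.sigmoid(logits)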

Then, given the predicate identified by the p-classification model, the so-labeling model adopts a deep Bi-LSTM-CRF network with a BIEO tagging scheme to label the subject and object mentions (the s-o pair) corresponding to that predicate.
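To make the tagging scheme concrete, here is a small framework-free sketch of how a BIEO tag sequence could be decoded back into subject and object mentions. The role suffixes (-SUB, -OBJ) and the example sentence are illustrative assumptions, not the repository's actual tag set; BIEO stands for Begin, Inside, End, Outside.

def decode_bieo(tokens, tags):
    """Collect spans whose tags form a B-...(I-...)*E-... run for one role."""
    spans, start, role = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            start, role = i, tag[2:]
        elif tag.startswith('E-') and start is not None and tag[2:] == role:
            spans.append((role, ''.join(tokens[start:i + 1])))
            start, role = None, None
        elif tag == 'O':
            start, role = None, None
    return spans

# Hypothetical example for the predicate "作者" (author) linking a work to a person.
tokens = ['三', '体', '的', '作', '者', '是', '刘', '慈', '欣']
tags   = ['B-SUB', 'E-SUB', 'O', 'O', 'O', 'O', 'B-OBJ', 'I-OBJ', 'E-OBJ']
print(decode_bieo(tokens, tags))  # [('SUB', '三体'), ('OBJ', '刘慈欣')]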

The F1 value of InfoExtractor on the development set is 0.668.

The F1 score is a standard evaluation metric for classification problems and is often used as the final metric in multi-class machine learning competitions. It is the harmonic mean of precision and recall, ranging from 0 (worst) to 1 (best):

F1 = 2 * precision * recall / (precision + recall)
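For triple extraction this is typically computed from exact matches between predicted and gold SPO triples. The helper below is a generic sketch of that calculation, not the competition's official scoring script.

def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Compute precision, recall, and F1 from counts of correct, predicted, and gold items."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 60 correct triples out of 80 predicted, against 100 gold triples.
print(precision_recall_f1(60, 80, 100))  # (0.75, 0.6, 0.666...)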

Getting Started

Environment Requirements

PaddlePaddle v1.2.0

Numpy

Memory requirement: 10 GB for training and 6 GB for inference

Step 1: Install PaddlePaddle

For now we have only tested on PaddlePaddle Fluid v1.2.0. Please install PaddlePaddle first; see the PaddlePaddle homepage for more details.

Step 2: Download the training data, dev data, and schema files

Please download the training data, development data, and schema files from the competition website, then unzip the files and put them in the ./data/ folder.


cd data

unzip train_data.json.zip 

unzip dev_data.json.zip

cd -

Step 3: Get the vocabulary file

Obtain high-frequency words from the 'postag' field of the training and dev data, then compose these high-frequency words into a vocabulary list.


python lib/get_vocab.py ./data/train_data.json ./data/dev_data.json > ./dict/word_idx
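Conceptually, this step just counts how often each segmented word appears in the 'postag' field and keeps the frequent ones. The snippet below is a simplified sketch of that idea, assuming each line of the data is a JSON object whose postag list holds {"word", "pos"} entries; it is not the actual lib/get_vocab.py, and the frequency cutoff is illustrative.

import json
import sys
from collections import Counter

MIN_FREQ = 5  # illustrative cutoff; the real script may use a different rule

counts = Counter()
for path in sys.argv[1:]:
    with open(path, encoding='utf-8') as f:
        for line in f:
            sentence = json.loads(line)
            for item in sentence.get('postag', []):
                counts[item['word']] += 1

# Emit frequent words, most common first, one per line (redirect to ./dict/word_idx).
for word, freq in counts.most_common():
    if freq >= MIN_FREQ:
        print(word)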

Step 4: Train the p-classification model

First, the classification model is trained to identify predicates in sentences. If you need to change the default hyper-parameters, e.g. the hidden layer size or whether to use GPU for training (by default, CPU training is used), modify the corresponding arguments in ./conf/IE_extraction.conf, then run the following command:


python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf

The trained p-classification model will be saved in the folder ./model/p_model.

Step 5: Train the so-labeling model

After obtaining the predicates that exist in a sentence, a sequence labeling model is trained to identify the s-o pairs (s: subject, o: object) corresponding to the relations that appear in the sentence.

Before training the so-labeling model, you need to convert the training data into the format that this model expects:


python lib/get_spo_train.py ./data/train_data.json > ./data/train_data.p

python lib/get_spo_train.py  ./data/dev_data.json > ./data/dev_data.p
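This preparation essentially flattens each sentence into one training instance per gold predicate. The snippet below sketches that idea, assuming the SKE-style fields text and spo_list (with subject, predicate, and object keys); the actual output format produced by lib/get_spo_train.py may differ.

import json

# Hypothetical sample line in the style of the SKE training data.
line = json.dumps({
    'text': '《三体》的作者是刘慈欣',
    'spo_list': [{'subject': '三体', 'predicate': '作者', 'object': '刘慈欣'}]
}, ensure_ascii=False)

sentence = json.loads(line)
for spo in sentence.get('spo_list', []):
    # One (sentence, predicate) instance per gold triple; the so-labeling model
    # then learns to tag the subject and object spans for that predicate.
    print(sentence['text'], spo['predicate'], spo['subject'], spo['object'])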

To train the so-labeling model, you can run:


python bin/so_labeling/spo_train.py --conf_path=./conf/IE_extraction.conf

The trained so-labeling model will be saved in the folder ./model/spo_model.

Step 6: Infer with the two trained models

After training is completed, you can choose a trained model for prediction. The following commands predict with the last saved model; you can also use the development set to select the optimal model for prediction. To run inference with the two trained models on the demo test data (under ./data/test_demo.json), execute the commands in two steps:


python bin/p_classification/p_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/p_model/final/ --predict_file=./data/test_demo.json > ./data/test_demo.p

python bin/so_labeling/spo_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/spo_model/final/ --predict_file=./data/test_demo.p > ./data/test_demo.res

The predicted SPO triples will be saved in the file ./data/test_demo.res.
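Assuming each line of the result file is a JSON object with the sentence text and its predicted spo_list (the same style of fields as the training data; this is an assumption, not a documented guarantee), you could inspect the predictions like this:

import json

with open('./data/test_demo.res', encoding='utf-8') as f:
    for line in f:
        result = json.loads(line)
        for spo in result.get('spo_list', []):
            print(result.get('text', ''), '->',
                  spo.get('subject'), spo.get('predicate'), spo.get('object'))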

Discussion

If you have any questions, you can submit an issue on GitHub and we will respond periodically.

Copyright and License

Copyright 2019 Baidu.com, Inc. All Rights
Reserved

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed
to in writing, software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing
permissions and limitations under the License.

APPENDIX

In the released dataset, the 'postag' field of each sentence contains the word segmentation and part-of-speech tagging information. The abbreviations of the part-of-speech tags (PosTag) and their corresponding meanings are shown in the following table.

In addition, the segmentation and part-of-speech tagging provided in the dataset are only for reference and can be replaced with other segmentation results.

POS   Meaning
n     common nouns
f     localizer
s     space
t     time
nr    noun of people
ns    noun of space
nt    noun of time
nw    noun of work
nz    other proper noun
v     verbs
vd    verb of adverbs
vn    verb of noun
a     adjective
ad    adjective of adverb
an    adnoun
d     adverbs
m     numeral
q     quantity
r     pronoun
p     prepositions
c     conjunction
u     auxiliary
xc    other function word
w     punctuations