Information Extraction Baseline System: InfoExtractor
Abstract
InfoExtractor is an information extraction baseline system based on the Schema constrained Knowledge Extraction Dataset (SKED).
InfoExtractor adopts a pipeline architecture with a p-classification model and a so-labeling model, both implemented with PaddlePaddle.
The p-classification model is a multi-label classifier that uses a stacked Bi-LSTM with a max-pooling layer to identify the predicates involved in the given sentence.
The so-labeling model then uses a deep Bi-LSTM-CRF network with the BIEO tagging scheme to label the subject and object mentions (the s-o pair), given a predicate identified by the p-classification model.
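The F1 score of InfoExtractor on the development set is 0.668.
The F1 score is an evaluation metric for classification problems, and machine learning competitions on multi-class problems often use it as the final evaluation measure. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0.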
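As a minimal sketch of the metric (not the official competition scorer), precision, recall and F1 over a set of predicted triples could be computed as follows. The function name `f1_score`, the `(subject, predicate, object)` tuple layout and the toy triples are assumptions made for illustration.

```python
def f1_score(predicted, gold):
    """Return (precision, recall, f1) for two collections of SPO triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # triples that match exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy triples, not taken from the dataset.
gold = {("A", "author", "B"), ("C", "capital", "D")}
pred = {("A", "author", "B"), ("C", "capital", "E"), ("F", "founder", "G")}
precision, recall, f1 = f1_score(pred, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here one of three predictions is correct and one of two gold triples is found, so the harmonic mean sits between the two values.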
Getting Started
Environment Requirements
PaddlePaddle v1.2.0
NumPy
Memory: 10 GB for training and 6 GB for inference
Step 1: Install PaddlePaddle
So far we have only tested on PaddlePaddle Fluid v1.2.0. Please install PaddlePaddle first; see the PaddlePaddle homepage for more details.
Step 2: Download the training data, dev data and schema files
Please download the training data, development data and schema files from the competition website, then unzip the files and put them in the ./data/ folder.
cd data
unzip train_data.json.zip
unzip dev_data.json.zip
cd -
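After unzipping, it can help to check that the files load as expected. The sketch below assumes each data file holds one JSON object per line and that the 'postag' field is a list of objects with "word" and "pos" keys; only the 'postag' field name comes from this document, so the layout, the helper `read_samples` and the sample record are assumptions to verify against the downloaded files.

```python
import json
import os
import tempfile


def read_samples(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical sample record; the real files may contain other fields.
sample = {
    "text": "ab",
    "postag": [{"word": "a", "pos": "n"}, {"word": "b", "pos": "v"}],
}

# Write the record to a temporary file so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    samples = list(read_samples(path))

words = [item["word"] for item in samples[0]["postag"]]
print(words)
```

With the real data, `read_samples("./data/train_data.json")` would replace the temporary file.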
Step 3: Get the vocabulary file
Obtain high-frequency words from the ‘postag’ field of the training and dev data, then compile them into a vocabulary list.
python lib/get_vocab.py ./data/train_data.json ./data/dev_data.json > ./dict/word_idx
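The idea behind this step can be sketched in a few lines. This is not the code of lib/get_vocab.py: the function `build_vocab`, the `min_count` threshold and the assumed shape of the 'postag' entries (objects with a "word" key) are illustrative assumptions.

```python
from collections import Counter


def build_vocab(samples, min_count=2):
    """Return words seen at least min_count times, most frequent first."""
    counts = Counter(
        item["word"] for sample in samples for item in sample["postag"]
    )
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_count]


# Hypothetical toy samples standing in for the training and dev data.
samples = [
    {"postag": [{"word": "the", "pos": "r"}, {"word": "cat", "pos": "n"}]},
    {"postag": [{"word": "the", "pos": "r"}, {"word": "dog", "pos": "n"}]},
]
vocab = build_vocab(samples)
word_to_id = {word: idx for idx, word in enumerate(vocab)}
print(word_to_id)
```

Words below the threshold are left out of the list, so a model using it would need some fallback for unknown words.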
Step 4: Train the p-classification model
First, the classification model is trained to identify predicates in sentences. If you need to change the default hyper-parameters, e.g. the hidden layer size or whether to use a GPU for training (CPU training is used by default), modify the corresponding arguments in ./conf/IE_extraction.conf, then run the following command:
python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf
The trained p-classification model will be saved in the folder ./model/p_model.
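To make the multi-label decision concrete, here is a small pure-Python sketch of max-pooling over time followed by an independent sigmoid score per predicate. It is not the PaddlePaddle network itself: the toy hidden states stand in for Bi-LSTM outputs, and the weights, biases, label names and the 0.5 threshold are all hypothetical.

```python
import math


def max_pool_over_time(hidden_states):
    """Take the element-wise maximum across all time steps."""
    return [max(column) for column in zip(*hidden_states)]


def predict_predicates(hidden_states, weights, biases, labels, threshold=0.5):
    """Score every label independently and keep those at or above threshold."""
    pooled = max_pool_over_time(hidden_states)
    chosen = []
    for label, row, bias in zip(labels, weights, biases):
        logit = sum(w * x for w, x in zip(row, pooled)) + bias
        prob = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, one per label
        if prob >= threshold:
            chosen.append(label)
    return chosen


# Toy stand-in for Bi-LSTM outputs: 3 tokens, hidden size 2.
hidden_states = [[0.1, -0.3], [0.9, 0.2], [0.4, 0.8]]
labels = ["author", "capital", "founder"]  # hypothetical predicates
weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]
biases = [-1.0, -1.0, 0.0]
print(predict_predicates(hidden_states, weights, biases, labels))
```

Because each label is scored on its own, a sentence can receive zero, one or several predicates, which is what distinguishes multi-label from single-label classification.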
Step 5: Train the so-labeling model
After getting the predicates that exist in the sentence, a sequence labeling model is trained to identify the s-o pairs (s: subject, o: object) corresponding to each relation that appears in the sentence.
Before training the so-labeling model, you need to prepare training data in the format the model expects:
python lib/get_spo_train.py ./data/train_data.json > ./data/train_data.p
python lib/get_spo_train.py ./data/dev_data.json > ./data/dev_data.p
To train the so-labeling model, run:
python bin/so_labeling/spo_train.py --conf_path=./conf/IE_extraction.conf
The trained so-labeling model will be saved in the folder ./model/spo_model.
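The BIEO tagging scheme used by the so-labeling model marks the Beginning, Inside and End of each mention and tags every other token as O. The sketch below illustrates the scheme only; the tag names (`B-SUB`, `E-OBJ`, and so on), the helper `bieo_tags`, the span format and the choice to tag a single-token mention with `B-` alone are assumptions, not the format produced by lib/get_spo_train.py.

```python
def bieo_tags(tokens, mentions):
    """Tag tokens with B/I/E/O given {role: (start, end)} spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for role, (start, end) in mentions.items():
        tags[start] = f"B-{role}"
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{role}"
        if end - start > 1:
            # Mentions longer than one token close with an E tag.
            tags[end - 1] = f"E-{role}"
    return tags


# Hypothetical example for the predicate "author".
tokens = ["Leo", "Tolstoy", "wrote", "War", "and", "Peace"]
mentions = {"SUB": (3, 6), "OBJ": (0, 2)}
tags = bieo_tags(tokens, mentions)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

One such tag sequence would be produced per (sentence, predicate) pair, since the subject and object depend on which predicate is being labeled.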
Step 6: Infer with the two trained models
After training is complete, you can choose a trained model for prediction. The following commands predict with the last saved model; you can also use the development set to select the best model for prediction. To run inference with the two trained models on the demo test data (./data/test_demo.json), execute the following two commands in order:
python bin/p_classification/p_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/p_model/final/ --predict_file=./data/test_demo.json > ./data/test_demo.p
python bin/so_labeling/spo_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/spo_model/final/ --predict_file=./data/test_demo.p > ./data/test_demo.res
The predicted SPO triples will be saved in the file ./data/test_demo.res.
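Conceptually, the second step turns a predicate plus a tag sequence into SPO triples. The following sketch shows one way that decoding could work; it is not the logic of spo_infer.py. The tag names (`B-SUB`, `I-SUB`, `E-SUB`, `B-OBJ`, `I-OBJ`, `E-OBJ`, `O`), the helper `decode_spo` and the choice to pair every subject with every object are assumptions.

```python
def decode_spo(tokens, tags, predicate, joiner=" "):
    """Collect SUB and OBJ mentions from BIEO tags and pair them with predicate."""
    mentions = {"SUB": [], "OBJ": []}
    role, span = None, []

    def flush():
        # Store the mention collected so far, if any.
        if span:
            mentions.setdefault(role, []).append(joiner.join(span))

    for token, tag in zip(tokens, tags):
        prefix, _, tag_role = tag.partition("-")
        if prefix == "B":
            flush()
            role, span = tag_role, [token]
        elif prefix in ("I", "E") and tag_role == role:
            span.append(token)
            if prefix == "E":
                flush()
                role, span = None, []
        else:
            # An O tag or an inconsistent tag ends the current mention.
            flush()
            role, span = None, []
    flush()
    return [(s, predicate, o) for s in mentions["SUB"] for o in mentions["OBJ"]]


# Hypothetical model output for the predicate "author".
tokens = ["Leo", "Tolstoy", "wrote", "War", "and", "Peace"]
tags = ["B-OBJ", "E-OBJ", "O", "B-SUB", "I-SUB", "E-SUB"]
triples = decode_spo(tokens, tags, "author")
print(triples)
```

For Chinese text, where words are not separated by spaces, `joiner=""` would be the natural choice.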
Discussion
If you have any questions, you can submit an issue on GitHub and we will respond periodically.
Copyright and License
Copyright 2019 Baidu.com, Inc. All Rights Reserved
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed
to in writing, software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing
permissions and limitations under the License.
APPENDIX
In the released dataset, the ‘postag’ field of each sentence holds the word segmentation and part-of-speech tagging information. The abbreviations of the part-of-speech tags (PosTag) and their meanings are shown in the following table.
In addition, the segmentation and part-of-speech tags provided with the dataset are for reference only and can be replaced with other segmentation results.
POS | Meaning |
---|---|
n | common nouns |
f | localizer |
s | space |
t | time |
nr | noun of people (person names) |
ns | noun of space (place names) |
nt | noun of organization |
nw | noun of work |
nz | other proper noun |
v | verbs |
vd | verb of adverbs |
vn | verb of noun |
a | adjective |
ad | adjective of adverb |
an | adnoun |
d | adverbs |
m | numeral |
q | quantity (measure word) |
r | pronoun |
p | prepositions |
c | conjunction |
u | auxiliary |
xc | other function word |
w | punctuations |