Notes on an Information Extraction Baseline System

Latest recommended article published on 2024-08-16 16:51:44

啊布吉


Information Extraction Baseline System: InfoExtractor

Abstract

InfoExtractor is an information extraction baseline system based on the Schema constrained Knowledge Extraction dataset (SKED).

InfoExtractor adopts a pipeline architecture with a p-classification model and a so-labeling model, both implemented with PaddlePaddle.

The p-classification model is a multi-label classifier that employs a stacked Bi-LSTM with a max-pooling network to identify the predicates involved in the given sentence.
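To make the max-pooling and multi-label steps concrete, here is a minimal pure-Python sketch. It is not the PaddlePaddle implementation: the hidden states stand in for Bi-LSTM outputs, and the predicate names, weights, biases and the 0.5 threshold are all made-up assumptions for illustration.

```python
import math


def max_pool_over_time(hidden_states):
    """Element-wise max over the time axis: [T][H] -> [H]."""
    return [max(column) for column in zip(*hidden_states)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_predicates(hidden_states, weights, biases, labels, threshold=0.5):
    """Score each predicate independently and keep those at or above the threshold."""
    pooled = max_pool_over_time(hidden_states)
    predicted = []
    for label, w_row, b in zip(labels, weights, biases):
        logit = sum(w * h for w, h in zip(w_row, pooled)) + b
        if sigmoid(logit) >= threshold:
            predicted.append(label)
    return predicted


# Toy input: 3 time steps, hidden size 2 (a stand-in for Bi-LSTM outputs).
hidden_states = [[0.1, -0.4], [0.9, 0.2], [-0.3, 0.7]]
labels = ["author", "birthplace", "capital"]      # hypothetical predicates
weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]  # hypothetical parameters
biases = [-1.0, -1.0, 0.0]

print(predict_predicates(hidden_states, weights, biases, labels))
```

Because every predicate gets its own sigmoid score, a sentence can yield zero, one, or several predicates, which is what distinguishes multi-label classification from picking a single best class.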

Then, given the predicate identified by the p-classification model, the so-labeling model adopts a deep Bi-LSTM-CRF network with the BIEO tagging scheme to label the subject and object mentions (the s-o pairs).
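As an illustration of how BIEO tags (Begin, Inside, End, Outside) map back to mentions, the sketch below decodes a tag sequence into subject and object spans. The tag names `B-SUB`, `E-OBJ` and so on, the rule that a lone `B-` tag counts as a one-token mention, and the English tokens (the dataset itself is Chinese) are assumptions made for readability, not details taken from the system.

```python
def decode_bieo(tokens, tags):
    """Collect (type, text) mentions from a BIEO tag sequence."""
    mentions = []
    current_type, current_tokens = None, []

    def flush():
        # Close the mention that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            mentions.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            flush()
            current_type, current_tokens = label, [token]
        elif prefix in ("I", "E") and label == current_type:
            current_tokens.append(token)
            if prefix == "E":
                flush()
        else:
            # "O" or an inconsistent tag closes any open mention.
            flush()
    flush()
    return mentions


tokens = ["Lu Xun", "wrote", "Diary", "of", "a", "Madman"]
tags = ["B-OBJ", "O", "B-SUB", "I-SUB", "I-SUB", "E-SUB"]
print(decode_bieo(tokens, tags))
```

Here the predicate is assumed to be "author", so the work is the subject and the person is the object.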

The F1 value of InfoExtractor on the development set is 0.668.

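The F1 score is an evaluation metric for classification problems, and machine learning competitions on multi-class problems often use it as the final evaluation measure. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0.
F1 expression: F1 = 2 * precision * recall / (precision + recall)
Related calculation: precision = TP / (TP + FP), recall = TP / (TP + FN), where TP, FP and FN are the numbers of true positives, false positives and false negatives.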
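For SPO extraction, the same metric can be computed over sets of triples. The sketch below assumes exact-match scoring, where a predicted (subject, predicate, object) triple counts as correct only if it is identical to a gold triple. The official scoring of the competition may differ, and the sample triples are invented.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 over collections of triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 4 gold triples, 3 predictions, 2 of them correct.
gold = [
    ("Diary of a Madman", "author", "Lu Xun"),
    ("Lu Xun", "birthplace", "Shaoxing"),
    ("France", "capital", "Paris"),
    ("Japan", "capital", "Tokyo"),
]
predicted = [
    ("Diary of a Madman", "author", "Lu Xun"),
    ("France", "capital", "Paris"),
    ("Lu Xun", "birthplace", "Beijing"),
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```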

Getting Started

Environment Requirements

PaddlePaddle v1.2.0

NumPy

Memory requirement: 10G for training and 6G for inference

Step 1: Install PaddlePaddle

For now we have only tested on PaddlePaddle Fluid v1.2.0. Please install PaddlePaddle first; see the PaddlePaddle homepage for more details.

Step 2: Download the training data, dev data and schema files

Please download the training data, development data and schema files from the competition website, then unzip the files and put them in the ./data/ folder.


cd data

unzip train_data.json.zip 

unzip dev_data.json.zip

cd -

Step 3: Get the vocabulary file

Obtain high-frequency words from the field ‘postag’ of the training and dev data, then compose these high-frequency words into a vocabulary list.


python lib/get_vocab.py ./data/train_data.json ./data/dev_data.json > ./dict/word_idx
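The idea behind this step can be sketched in a few lines of Python. This is not the code of lib/get_vocab.py. It assumes each input line is a JSON object whose ‘postag’ field is a list of items with `word` and `pos` keys, and the `min_count` cutoff, the `<UNK>` entry and the word-then-index output layout are hypothetical choices for illustration.

```python
import json
from collections import Counter


def build_vocab(lines, min_count=2):
    """Count words in the 'postag' field and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for item in record.get("postag", []):
            counts[item["word"]] += 1
    # Most frequent first; ties are broken alphabetically for a stable order.
    frequent = sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )
    # Reserve index 0 for out-of-vocabulary words (an assumption).
    return ["<UNK>"] + frequent


sample = [
    json.dumps({"postag": [{"word": "Paris", "pos": "ns"}, {"word": "is", "pos": "v"}]}),
    json.dumps({"postag": [{"word": "Paris", "pos": "ns"}, {"word": "France", "pos": "ns"}]}),
]

for index, word in enumerate(build_vocab(sample)):
    print(f"{word}\t{index}")
```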

Step 4: Train p-classification model

First, the classification model is trained to identify predicates in sentences. If you need to change the default hyper-parameters, e.g. the hidden layer size or whether to use GPU for training (CPU training is used by default), modify the corresponding argument in ./conf/IE_extraction.conf, then run the following command:


python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf

The trained p-classification model will be saved in the folder ./model/p_model.

Step 5: Train so-labeling model

After getting the predicates that exist in the sentence, a sequence labeling model is trained to identify the s-o pairs (s: subject; o: object) corresponding to the relations that appear in the sentence.

Before training the so-labeling model, you need to convert the training data into the format that the so-labeling model expects:


python lib/get_spo_train.py ./data/train_data.json > ./data/train_data.p

python lib/get_spo_train.py ./data/dev_data.json > ./data/dev_data.p
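Since the so-labeling model labels subjects and objects for one given predicate, a natural preparation step is to turn each sentence into one training example per predicate. The sketch below illustrates that idea only. It is not the logic of lib/get_spo_train.py, whose output format is not described here. The field names `text`, `spo_list`, `predicate`, `subject` and `object` are assumed from the dataset, and the `so_pairs` output key is hypothetical.

```python
from collections import defaultdict


def expand_by_predicate(record):
    """Turn one annotated sentence into one example per predicate."""
    grouped = defaultdict(list)
    for spo in record.get("spo_list", []):
        grouped[spo["predicate"]].append((spo["subject"], spo["object"]))
    return [
        {"text": record["text"], "predicate": predicate, "so_pairs": pairs}
        for predicate, pairs in grouped.items()
    ]


# Invented record in the assumed dataset layout.
record = {
    "text": "Diary of a Madman was written by Lu Xun, who was born in Shaoxing.",
    "spo_list": [
        {"subject": "Diary of a Madman", "predicate": "author", "object": "Lu Xun"},
        {"subject": "Lu Xun", "predicate": "birthplace", "object": "Shaoxing"},
    ],
}

for example in expand_by_predicate(record):
    print(example["predicate"], example["so_pairs"])
```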

To train a so-labeling model, you can run:


python bin/so_labeling/spo_train.py --conf_path=./conf/IE_extraction.conf

The trained so-labeling model will be saved in the folder ./model/spo_model.

Step 6: Infer with two trained models

After training is completed, you can choose a trained model for prediction. The following commands predict with the last saved model; you can also use the development set to select the best model for prediction. To run inference with the two trained models on the demo test data (./data/test_demo.json), execute the commands in two steps:


python bin/p_classification/p_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/p_model/final/ --predict_file=./data/test_demo.json > ./data/test_demo.p

python bin/so_labeling/spo_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/spo_model/final/ --predict_file=./data/test_demo.p > ./data/test_demo.res

The predicted SPO triples will be saved in the file ./data/test_demo.res.
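If you run the two inference steps often, they can be chained in a small wrapper. The sketch below only assembles the two commands shown above and, when `run_pipeline` is called, redirects each command's standard output to the next file. The function names and the way output paths are derived from the input file name are hypothetical conveniences, not part of the repository.

```python
import os
import subprocess
import sys

CONF = "./conf/IE_extraction.conf"


def build_commands(test_file="./data/test_demo.json"):
    """Return [(argv, output_path), ...] for the two inference steps."""
    stem, _ = os.path.splitext(test_file)
    p_out, res_out = stem + ".p", stem + ".res"
    p_step = [
        sys.executable, "bin/p_classification/p_infer.py",
        "--conf_path=" + CONF,
        "--model_path=./model/p_model/final/",
        "--predict_file=" + test_file,
    ]
    so_step = [
        sys.executable, "bin/so_labeling/spo_infer.py",
        "--conf_path=" + CONF,
        "--model_path=./model/spo_model/final/",
        "--predict_file=" + p_out,  # the second step reads the first step's output
    ]
    return [(p_step, p_out), (so_step, res_out)]


def run_pipeline(test_file="./data/test_demo.json"):
    """Run both steps in order; stop at the first failure."""
    for argv, output_path in build_commands(test_file):
        with open(output_path, "w", encoding="utf-8") as out:
            subprocess.run(argv, stdout=out, check=True)


# Print the commands instead of running them, so the sketch works anywhere.
for argv, output_path in build_commands():
    print(" ".join(argv[1:]), ">", output_path)
```

Call `run_pipeline()` from the repository root once both models have been trained.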

Discussion

If you have any questions, you can submit an issue on GitHub and we will respond periodically.

Copyright and License

Licensed under the Apache License, Version
2.0 (the “License”); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed
to in writing, software distributed under the License is distributed on an
“AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing
permissions and limitations under the License.

APPENDIX

In the released dataset, the field ‘postag’ of each sentence holds the word segmentation and part-of-speech tagging information. The part-of-speech tag (PosTag) abbreviations and their meanings are shown in the following table.

In addition, the given segmentation and
part-of-speech tagging of the dataset are only references and can be replaced
with other segmentation results.

POS	Meaning
n	common nouns
f	localizer
s	space
t	time
nr	noun of people
ns	noun of space
nt	noun of time
nw	noun of work
nz	other proper noun
v	verbs
vd	verb of adverbs
vn	verb of noun
a	adjective
ad	adjective of adverb
an	adnoun
d	adverbs
m	numeral
q	quantity
r	pronoun
p	prepositions
c	conjunction
u	auxiliary
xc	other function word
w	punctuations
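When inspecting the data, it can help to attach these meanings to a sentence's ‘postag’ entries. The sketch below copies the table into a dictionary. The assumption that each ‘postag’ item is an object with `word` and `pos` keys, and the sample sentence, are illustrative only.

```python
POS_MEANINGS = {
    "n": "common nouns", "f": "localizer", "s": "space", "t": "time",
    "nr": "noun of people", "ns": "noun of space", "nt": "noun of time",
    "nw": "noun of work", "nz": "other proper noun", "v": "verbs",
    "vd": "verb of adverbs", "vn": "verb of noun", "a": "adjective",
    "ad": "adjective of adverb", "an": "adnoun", "d": "adverbs",
    "m": "numeral", "q": "quantity", "r": "pronoun", "p": "prepositions",
    "c": "conjunction", "u": "auxiliary", "xc": "other function word",
    "w": "punctuations",
}


def describe_postag(postag):
    """Pair every word with its tag and the tag's meaning from the table."""
    return [
        (item["word"], item["pos"], POS_MEANINGS.get(item["pos"], "unknown tag"))
        for item in postag
    ]


# Invented sample in the assumed 'postag' layout.
sample_postag = [
    {"word": "Lu Xun", "pos": "nr"},
    {"word": "wrote", "pos": "v"},
    {"word": "Diary of a Madman", "pos": "nw"},
    {"word": ".", "pos": "w"},
]

for word, pos, meaning in describe_postag(sample_postag):
    print(f"{word}\t{pos}\t{meaning}")
```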