Beginner's Notes: Studying the UIE Information Extraction Code - Data Preparation

青萍之默

Category: Information Extraction Paper Reading Notes. Tags: learning, python, deep learning

Article link: https://blog.csdn.net/wmq104/article/details/129613296

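Data Preparation

Note: UIE is a prompt-based universal information extraction framework. These are my personal notes from studying the UIE code. Along the way I loosely translated the data preparation part of the README (reordered to follow the actual processing sequence); my own remarks were formatted as block quotes.

Paper: Unified Structure Generation for Universal Information Extraction.

Source code: https://github.com/universal-ie/UIE

Folder covered here: dataset_processing

Directory structure: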

.
├── converted_data/   # Final converted datasets
├── data/             # Raw data
├── data_config/      # Dataset config
├── README.md
├── run_data_generation.bash  # Convert all datasets
├── run_sample.bash           # Sample low-resource datasets
├── scripts/                  # Scripts for preprocessing
├── uie_convert.py            # Main Python File
└── universal_ie/             # Code for preprocessing

Dataset Preprocessing

The following earlier preprocessing work is reused:

Dataset	Preprocessing
ACE04	mrc-for-flat-nested-ner
ACE05	mrc-for-flat-nested-ner
ACE05-Rel	sincere
CoNLL 04	sincere
NYT	JointER
SCIERC	dygiepp
ACE05-Evt	OneIE
CASIE	CASIE (the UIE authors' preprocessing code is in data/casie)
14lap	BARTABSA
14res	BARTABSA
15res	BARTABSA
16res	BARTABSA

ABSA

git clone https://github.com/yhcc/BARTABSA data/BARTABSA
mv data/BARTABSA/data data/absa

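No problems here; the download works as written. If git clone is too slow, download the BARTABSA repository directly and copy its data into the corresponding directory.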

Entity (Entity Extraction)

# CoNLL03: download and put the files directly under data/conll03
mkdir data/conll03
wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.train -P data/conll03
wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testa -P data/conll03
wget https://raw.githubusercontent.com/synalp/NER/master/corpus/CoNLL-2003/eng.testb -P data/conll03

# gdown >= 4.4.0
pip install -U gdown
mkdir data/mrc_ner
# ACE04
gdown 1U-hGOgLmdqudsRdKIGles1-QrNJ7SSg6 -O data/mrc_ner/ace2004.tar.gz
tar zxvf data/mrc_ner/ace2004.tar.gz -C data/mrc_ner

# ACE05
gdown 1iodaJ92dTAjUWnkMyYm8aLEi5hj3cseY -O data/mrc_ner/ace2005.tar.gz
tar zxvf data/mrc_ner/ace2005.tar.gz -C data/mrc_ner

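CoNLL03 would not download on my own machine. I ran the commands above on AI Studio (Baidu's AI platform) and then downloaded the files from AI Studio.

ACE04 and ACE05 would not download on either my own machine or AI Studio because gdown is blocked, and buying the data myself is too expensive (about $1,500, as I recall), so I gave up on them.

Relation (Relation Extraction)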

NYT

mkdir data/NYT-multi
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/train.json
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/dev.json
wget -P data/NYT-multi https://raw.githubusercontent.com/yubowen-ph/JointER/master/dataset/NYT-multi/data/test.json

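The commands could not connect from my own machine, so I ran them on AI Studio. Alternatively, download the JointER code, find this data inside it, and copy it to the data/NYT-multi folder.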

CoNLL04/ACE05-rel

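The preprocessing code from the sincere repository is used, and the data is collated into a common format. Follow the sincere instructions to preprocess the CoNLL04 and ACE05-rel datasets, and put the results under data/sincere/ with the following layout: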

 $ tree data/sincere/
data/sincere/
├── ace05.json
└── conll04.json

conll04 can be obtained by downloading the sincere repository; I gave up on ace05-rel.

Convert conll04/ace05-rel from the sincere format:

python scripts/sincere_processing.py

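PS: since I could not get ACE05-rel, 'ace05' has to be removed from line 26 of scripts/sincere_processing.py (the original post illustrates this with a screenshot).

What the code does: it splits one file into three (train, test, dev); saves one sample per line; renames the sample key ['relations'] to ['span_pair_list'] and ['entities'] to ['span_list']; and subtracts 1 from the end index (['end']) of every entity.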
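
To make the description above concrete, here is a minimal sketch of the per-sample conversion. It is not the code from scripts/sincere_processing.py: the function names and the sample (including the `tokens`, `start`, `head`, and `tail` fields) are assumptions for illustration, and only the key renames and the `end - 1` adjustment come from the notes above.

```python
import json


def convert_sample(sample):
    """Rename the keys and subtract 1 from every entity's end index."""
    # Copy each entity so the input sample is left untouched.
    entities = [dict(ent, end=ent["end"] - 1) for ent in sample["entities"]]
    converted = {k: v for k, v in sample.items()
                 if k not in ("entities", "relations")}
    converted["span_list"] = entities
    converted["span_pair_list"] = sample["relations"]
    return converted


def to_json_lines(samples):
    """Serialize converted samples one per line."""
    return "\n".join(json.dumps(convert_sample(s)) for s in samples)


# Hypothetical sample; the real sincere files may carry different fields.
sample = {
    "tokens": ["John", "lives", "in", "Paris"],
    "entities": [{"type": "Peop", "start": 0, "end": 1},
                 {"type": "Loc", "start": 3, "end": 4}],
    "relations": [{"type": "Live_In", "head": 0, "tail": 1}],
}
converted = convert_sample(sample)
print(to_json_lines([sample]))
```

Subtracting 1 from `end` presumably turns an exclusive end offset into an inclusive one, which is why the one-token entity above ends up with equal start and end.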

After running it, the converted data appears in the data/relation/conll04 directory.

SciERC

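First, use the SciERC preprocessing code from the dygiepp repository. Put the processed dataset from dygiepp's collated_data directory into the following directory:

PS: this repository references quite a few datasets; worth revisiting when I need data.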

$ tree data/dygiepp
data/dygiepp
└── scierc
    ├── dev.json
    ├── test.json
    └── train.json

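This requires downloading the dygiepp code first and then downloading and processing the data with bash scripts/data/get_scierc.sh from that repository. wget failed to connect for me again, so I copied the download links straight out of get_scierc.sh: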

http://nlp.cs.washington.edu/sciIE/data/sciERC_raw.tar.gz
http://nlp.cs.washington.edu/sciIE/data/sciERC_processed.tar.gz

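The first file downloads directly; the second is 600 MB+, so I used Xunlei (a download manager), which reached 500 KB/s+.

Next, copy the files to data/scierc and run the remaining part of scripts/data/get_scierc.sh: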

out_dir=data/scierc
mkdir -p $out_dir

# Decompress.
tar -xf $out_dir/sciERC_raw.tar.gz -C $out_dir
tar -xf $out_dir/sciERC_processed.tar.gz -C $out_dir

# Normalize by adding dataset name.
python scripts/data/shared/normalize.py \
    $out_dir/processed_data/json \
    $out_dir/normalized_data/json \
    --file_extension=json \
    --max_tokens_per_doc=0 \
    --dataset=scierc

# Collate for more efficient non-coref training.
python scripts/data/shared/collate.py \
    $out_dir/processed_data/json \
    $out_dir/collated_data/json \
    --file_extension=json \
    --dataset=scierc

Then convert SciERC into the same format as CoNLL04/ACE05-Rel:

mkdir -p data/relation/scierc
python scripts/scierc_processing.py

Event (Event Extraction)

ACE05-Evt

ACE05-Evt preprocessing uses the OneIE code. Follow its instructions and put the preprocessed dataset in data/oneie:

## OneIE Preprocessing, ACE_DATA_FOLDER -> ace_2005_td_v7
$ tree data/oneie/ace05-EN 
data/oneie/ace05-EN
├── dev.oneie.json
├── english.json
├── english.oneie.json
├── test.oneie.json
└── train.oneie.json

Note:

The experiments used nltk==3.5; using nltk==3.6+ may change the number of sentences.

I could not obtain this dataset.

CASIE

The CASIE preprocessing code is in the data/casie directory. Process the data with the following commands:

cd data/casie
bash scripts/download_data.bash
bash scripts/download_corenlp.bash
bash run.bash
cd ../../

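Running these directly failed, so I went through the scripts one by one:

scripts/download_data.bash first downloads the CASIE repository, locates the data and copies it out, and then deletes, from the command line, the three files with file_id 999, 10001, and 10002, which contain errors.

scripts/download_corenlp.bash downloads CoreNLP. The connection was very slow, so I copied the link from the script and downloaded the file with Xunlei, which was faster.

Finally, copy the files and extract them into the corresponding directories.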

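Quick Data Processing Commands

Prepare the data as described in Dataset Preprocessing above.

Before running, delete from data_config the config files of every dataset you could not obtain; otherwise an exception may be raised and the remaining datasets of the same type skipped.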

$ tree data_config
data_config
├── absa
│   ├── pengb_14lap.yaml
│   ├── pengb_14res.yaml
│   ├── pengb_15res.yaml
│   └── pengb_16res.yaml
├── entity
│   └── conll03.yaml
├── entity_zh
│   └── zh_weibo.yaml
├── event
│   └── casie.yaml
└── relation
    ├── conll04.yaml
    ├── NYT-multi.yaml
    └── scierc.yaml
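
Finding the configs to delete can be scripted. The sketch below is my own helper, not part of the UIE repository: it assumes every config has a top-level `path:` line like the conll03.yaml example in this post, and it only lists the configs whose data folder is missing, leaving the deletion to you.

```python
from pathlib import Path


def read_dataset_path(config_file):
    """Return the value of the top-level `path:` key, or None if absent."""
    for line in Path(config_file).read_text(encoding="utf-8").splitlines():
        if line.startswith("path:"):
            # Drop a trailing "# comment" and surrounding whitespace.
            return line.split(":", 1)[1].split("#", 1)[0].strip()
    return None


def configs_without_data(config_root, base_dir="."):
    """List config files whose dataset path does not exist under base_dir."""
    missing = []
    for config_file in sorted(Path(config_root).rglob("*.yaml")):
        data_path = read_dataset_path(config_file)
        if data_path is None or not (Path(base_dir) / data_path).exists():
            missing.append(config_file)
    return missing
```

Calling `configs_without_data("data_config")` from the dataset_processing folder should then return the config files that are safe to remove. The line-based parsing avoids a YAML dependency but would miss a quoted or nested `path` value.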

Run bash run_data_generation.bash to generate all datasets; the output goes to converted_data/.
Run bash run_sample.bash to sample the low-resource datasets.

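Detailed Conversion Steps

1. Read the dataset config file and automatically locate the task format (task_format) class that reads the data.
2. Based on the config file, the task format instance reads the data.
3. Generate data in each of the generation formats (generation_formats):
   - read the label mapping in the config file, which renames the labels of the original annotation;
   - generate the data files.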
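
The steps above can be sketched roughly as follows. This is an illustration of the idea, not the code in uie_convert.py: the `TASK_FORMATS` registry, the stand-in `CoNLL03` class, and the helper names are hypothetical, and the config is written as a plain dict mirroring the conll03.yaml example in this post so that no YAML library is needed.

```python
class CoNLL03:
    """Hypothetical stand-in for a task-format reader class."""

    def __init__(self, path, split):
        self.path = path
        self.split = split


# Hypothetical registry mapping the config's data_class string to a class.
TASK_FORMATS = {"CoNLL03": CoNLL03}


def build_reader(config):
    """Steps 1 and 2: look up the reader class named in the config and
    instantiate it with the dataset folder and split files."""
    reader_cls = TASK_FORMATS[config["data_class"]]
    return reader_cls(config["path"], config["split"])


def map_label(label, mapper):
    """Step 3: rename a raw annotation label via the config's mapper.
    Falling back to the raw label when it is unmapped is an assumption."""
    return mapper.get(label, label)


config = {
    "name": "conll03",
    "path": "data/conll03",
    "data_class": "CoNLL03",
    "split": {"train": "eng.train", "val": "eng.testa", "test": "eng.testb"},
    "mapper": {"LOC": "location", "ORG": "organization",
               "PER": "person", "MISC": "miscellaneous"},
}
reader = build_reader(config)
print(type(reader).__name__, map_label("ORG", config["mapper"]))
```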

Dataset Config Example

# data_config/entity/conll03.yaml
name: conll03               # Dataset Name
path: data/conll03  # Dataset Folder
data_class: CoNLL03         # Task Format
split:                      # Dataset Split
  train: eng.train
  val: eng.testa
  test: eng.testb
language: en
mapper: # Label Mapper
  LOC: location
  ORG: organization
  PER: person
  MISC: miscellaneous

 $ tree converted_data/text2spotasoc/entity/conll03
converted_data/text2spotasoc/entity/conll03
├── entity.schema
├── event.schema
├── record.schema
├── relation.schema
├── test.json
├── train.json
└── val.json

Example of entity

{
  "text": "EU rejects German call to boycott British lamb .",
  "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
  "record": "<extra_id_0> <extra_id_0> organization <extra_id_5> EU <extra_id_1> <extra_id_0> miscellaneous <extra_id_5> German <extra_id_1> <extra_id_0> miscellaneous <extra_id_5> British <extra_id_1> <extra_id_1>",
  "entity": [{"type": "miscellaneous", "offset": [2], "text": "German"}, {"type": "miscellaneous", "offset": [6], "text": "British"}, {"type": "organization", "offset": [0], "text": "EU"}],
  "relation": [],
  "event": [],
  "spot": ["organization", "miscellaneous"],
  "asoc": [],
  "spot_asoc": [{"span": "EU", "label": "organization", "asoc": []}, {"span": "German", "label": "miscellaneous", "asoc": []}, {"span": "British", "label": "miscellaneous", "asoc": []}]
}
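
To see how the "record" string relates to "spot_asoc", here is a small round-trip sketch. The marker roles (`<extra_id_0>` opens, `<extra_id_1>` closes, `<extra_id_5>` separates label from span) are inferred from this single example, and the functions are my own illustration rather than the repository's code. It only handles flat entity records with empty "asoc" lists and spans that contain no "<" character.

```python
import re

# Marker roles are an assumption inferred from the example record above.
OPEN, CLOSE, SEP = "<extra_id_0>", "<extra_id_1>", "<extra_id_5>"

# One flat spot: OPEN label SEP span CLOSE
FLAT_SPOT = re.compile(
    re.escape(OPEN) + r"\s*([^<]+?)\s*" + re.escape(SEP)
    + r"\s*([^<]+?)\s*" + re.escape(CLOSE)
)


def build_record(spot_asoc):
    """Linearize flat spots into a record string wrapped in OPEN ... CLOSE."""
    inner = " ".join(
        f"{OPEN} {spot['label']} {SEP} {spot['span']} {CLOSE}"
        for spot in spot_asoc
    )
    return f"{OPEN} {inner} {CLOSE}"


def parse_flat_record(record):
    """Recover the flat spots (label and span) from a record string."""
    return [
        {"span": span, "label": label, "asoc": []}
        for label, span in FLAT_SPOT.findall(record)
    ]


spot_asoc = [
    {"span": "EU", "label": "organization", "asoc": []},
    {"span": "German", "label": "miscellaneous", "asoc": []},
    {"span": "British", "label": "miscellaneous", "asoc": []},
]
record = build_record(spot_asoc)
print(record)
print(parse_flat_record(record))
```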

Low-Resource Dataset Sampling

See details in run_sample.bash, it will generate all low-resource datasets for experiments.

bash run_sample.bash

Low-ratio Sample

 $ python scripts/sample_data_ratio.py -h
usage: sample_data_ratio.py [-h] [-src SRC] [-tgt TGT] [-seed SEED]

optional arguments:
  -h, --help  show this help message and exit
  -src SRC
  -tgt TGT
  -seed SEED

Usage: sample 0.01/0.05/0.1 of training instances for low-ratio experiments

python scripts/sample_data_ratio.py \
  -src converted_data/text2spotasoc/entity/mrc_conll03 \
  -tgt test_conll03_ratio
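
As a rough picture of what low-ratio sampling means, the sketch below draws 1%, 5%, and 10% of a list of training instances with a fixed seed. It is a simplified illustration, not the implementation of scripts/sample_data_ratio.py; the function name, the "at least one instance" rule, and the rounding are my assumptions.

```python
import random


def sample_ratio(instances, ratio, seed=None):
    """Return a random subset holding about `ratio` of the instances.

    Keeping at least one instance is an assumption of this sketch.
    """
    rng = random.Random(seed)
    k = max(1, round(len(instances) * ratio))
    return rng.sample(instances, k)


# Hypothetical training set of 1000 instances.
train = [f"instance-{i}" for i in range(1000)]
subsets = {ratio: sample_ratio(train, ratio, seed=1)
           for ratio in (0.01, 0.05, 0.1)}
for ratio, subset in subsets.items():
    print(ratio, len(subset))
```

Using a private `random.Random(seed)` keeps the sampling reproducible without touching the global random state.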

N-shot Sample

 $ python scripts/sample_data_shot.py -h
usage: sample_data_shot.py [-h] -src SRC -tgt TGT -task {entity,relation,event} [-seed SEED]

optional arguments:
  -h, --help            show this help message and exit
  -src SRC              Source Folder Name
  -tgt TGT              Target Folder Name, n shot sampled
  -task {entity,relation,event}
                        N-Shot Task name
  -seed SEED            Default is None, no random

Usage: sample 1/5/10-shot of training instances for low-shot experiments

python scripts/sample_data_shot.py \
  -src converted_data/text2spotasoc/entity/mrc_conll03 \
  -tgt test_conll03_shot \
  -task entity

Note:

-task indicates the target task: entity, relation and event
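
For intuition, n-shot sampling can be sketched as picking up to n training instances for each label of the chosen task. The code below is my own simplified illustration and not the logic of scripts/sample_data_shot.py; it assumes instances shaped like the converted JSON example in this post, where `instance[task]` is a list of items with a "type" field.

```python
import random
from collections import defaultdict


def sample_n_shot(instances, n, task="entity", seed=None):
    """Pick up to n instances per label that occurs under `task`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, inst in enumerate(instances):
        # Count each instance once per distinct label it contains.
        for label in {item["type"] for item in inst[task]}:
            by_label[label].append(idx)
    chosen = set()
    for label in sorted(by_label):
        pool = by_label[label]
        chosen.update(rng.sample(pool, min(n, len(pool))))
    return [instances[i] for i in sorted(chosen)]


# Hypothetical instances; only the fields used by the sketch are present.
instances = [
    {"text": "a", "entity": [{"type": "person"}]},
    {"text": "b", "entity": [{"type": "person"}]},
    {"text": "c", "entity": [{"type": "location"}]},
    {"text": "d", "entity": []},
]
one_shot = sample_n_shot(instances, 1, seed=0)
print([inst["text"] for inst in one_shot])
```

Because an instance can contain several labels, a label may end up with more than n supporting instances in the result; a stricter sampler would have to account for that overlap.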
