[Paper Reading Notes 36] Notes on Running the CASREL Code


Previous post: "[Paper Reading Notes 33] CASREL: Entity and Relation Extraction Based on Tagging and BERT", https://blog.csdn.net/ld326/article/details/116465089
Overall, the repo's documentation is well written, and following the README.md is enough. The only small differences concern file naming, so this post records them as a supplement.

0. About the code structure: worth studying, very clear

1. Environment

I ran the key commands from the instructions, but the dependency versions still did not match. Below is the requirements.txt I ended up with; there are still some warnings, which I am leaving alone for now:

absl-py==0.12.0
astor==0.8.1
blessings==1.7
cached-property==1.5.2
certifi==2020.12.5
gast==0.4.0
gpustat==0.4.1
grpcio==1.37.1
h5py==2.10.0
importlib-metadata==4.0.1
Keras==2.2.4
Keras-Applications==1.0.8
keras-bert==0.80.0
keras-embed-sim==0.8.0
keras-layer-normalization==0.14.0
keras-multi-head==0.27.0
keras-pos-embd==0.11.0
keras-position-wise-feed-forward==0.6.0
Keras-Preprocessing==1.1.2
keras-self-attention==0.46.0
keras-transformer==0.30.0
Markdown==3.3.4
mock==4.0.3
numpy==1.20.2
nvidia-ml-py3==7.352.0
protobuf==3.16.0
psutil==5.8.0
PyYAML==5.4.1
scipy==1.6.3
six==1.16.0
tensorboard==1.13.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
termcolor==1.1.0
tqdm==4.60.0
typing-extensions==3.10.0.0
Werkzeug==1.0.1
zipp==3.4.1
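
Since the version mismatch issue described in section 7 is easy to hit, a small sketch like the following (not part of the repo; package names taken from the list above) can be used to compare the installed versions against that list:

# Quick version check against the requirements list above
# (tensorflow-gpu 1.13.1, Keras 2.2.4, keras-bert 0.80.0, ...).
import pkg_resources

for pkg in ['tensorflow-gpu', 'Keras', 'keras-bert', 'keras-transformer', 'numpy']:
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')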
2. Data

Downloading from Google is not very convenient, so I uploaded a copy of the NYT data to CSDN: https://download.csdn.net/download/ld326/18544111
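
As a quick sanity check after unpacking, the following sketch (file names taken from the preprocessing snippet in section 4; it assumes the files sit in the current directory) simply confirms the raw NYT files are present:

# Sanity-check the unpacked NYT download; the file names come from the
# preprocessing snippet in section 4 (adjust paths if your layout differs).
import json
import os

for name in ['train.json', 'valid.json', 'test.json',
             'relations2id.json', 'words2id.json']:
    print(name, 'found' if os.path.exists(name) else 'MISSING')

with open('relations2id.json', 'r') as f:
    print('relation types:', len(json.load(f)))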

3. Download BERT
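
The README points to Google's pre-trained BERT; the default configuration in section 5 uses cased_L-12_H-768_A-12. A minimal sketch (the directory name is my assumption; adjust it to wherever run.py expects the checkpoint) to confirm the standard checkpoint files are in place:

# Check the unpacked Google BERT checkpoint for the usual files
# (bert_config.json / vocab.txt / bert_model.ckpt.*).
# The directory name matches the default bert_model parameter in section 5.
import glob
import os

bert_dir = 'cased_L-12_H-768_A-12'  # assumed location
for name in ['bert_config.json', 'vocab.txt']:
    path = os.path.join(bert_dir, name)
    print(path, 'found' if os.path.exists(path) else 'MISSING')
print('checkpoint shards:', glob.glob(os.path.join(bert_dir, 'bert_model.ckpt*')))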

4. Data preprocessing

Step 1: convert the numeric IDs in the downloaded files back to strings.
I modified the script slightly so that the train, dev (valid), and test splits are all processed:

import json  # load_data is the conversion function from the repo's preprocessing script

# Invert the id->string mappings once, then convert all three splits.
with open('relations2id.json', 'r') as f1, open('words2id.json', 'r') as f2:
    rel2id = json.load(f1)
    words2id = json.load(f2)
rel_dict = {j: i for i, j in rel2id.items()}    # relation id -> relation name
word_dict = {j: i for i, j in words2id.items()}  # word id -> word

for file_type in ['train', 'valid', 'test']:
    file_name = f'{file_type}.json'
    output = f'new_{file_type}.json'
    output_normal = f'new_{file_type}_normal.json'
    output_epo = f'new_{file_type}_epo.json'
    output_seo = f'new_{file_type}_seo.json'
    load_data(file_name, word_dict, rel_dict, output, output_normal, output_epo, output_seo)

In addition, in the build script, change the file paths to point to the newly generated files;
do the same for the paths in the two test folders.

5. Training
python run.py --train=True --dataset=NYT

The default parameters are:

{
    "bert_model": "cased_L-12_H-768_A-12",
    "max_len": 100,
    "learning_rate": 1e-5,
    "batch_size": 6,
    "epoch_num": 100,
}

Model structure: [figure omitted]
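
Since the figure is not reproduced here, the following is a rough conceptual Keras sketch of what the summary shows, not the repo's model.py: BERT token embeddings feed a subject tagger (start/end sigmoids per token) and relation-specific object taggers (start/end sigmoids per token, one pair per relation). The Input is a stand-in for the BERT output, the subject-conditioning step is omitted, and num_rels = 24 for NYT is my assumption:

# Conceptual sketch of the CASREL tagging heads only, NOT the repo's model.py.
from keras.layers import Input, Dense
from keras.models import Model

seq_len, hidden, num_rels = 100, 768, 24   # num_rels = 24 for NYT is an assumption

tokens = Input(shape=(seq_len, hidden))                     # stand-in for BERT token embeddings
sub_start = Dense(1, activation='sigmoid')(tokens)          # subject start tagger
sub_end = Dense(1, activation='sigmoid')(tokens)            # subject end tagger
obj_start = Dense(num_rels, activation='sigmoid')(tokens)   # per-relation object start taggers
obj_end = Dense(num_rels, activation='sigmoid')(tokens)     # per-relation object end taggers

sketch = Model(tokens, [sub_start, sub_end, obj_start, obj_end])
sketch.summary()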

6. Prediction and evaluation
python run.py --dataset=NYT

[evaluation output figure omitted]

The results are basically consistent with those reported in the paper.
The extracted triples can also be inspected; for example:

"text": "But that spasm of irritation by a master intimidator was minor compared with what Bobby Fischer , the erratic former world chess champion , dished out in March at a news conference in Reykjavik , Iceland .",
"triple_list_gold": [
    {
        "subject": "Fischer",
        "relation": "/people/person/nationality",
        "object": "Iceland"
    },
    {
        "subject": "Fischer",
        "relation": "/people/deceased_person/place_of_death",
        "object": "Reykjavik"
    },
    {
        "subject": "Iceland",
        "relation": "/location/location/contains",
        "object": "Reykjavik"
    },
    {
        "subject": "Iceland",
        "relation": "/location/country/capital",
        "object": "Reykjavik"
    }
],
"triple_list_pred": [
    {
        "subject": "Fischer",
        "relation": "/people/person/nationality",
        "object": "Iceland"
    },
    {
        "subject": "Iceland",
        "relation": "/location/location/contains",
        "object": "Reykjavik"
    },
    {
        "subject": "Iceland",
        "relation": "/location/country/capital",
        "object": "Reykjavik"
    }
],
"new": [],
"lack": [
    {
        "subject": "Fischer",
        "relation": "/people/deceased_person/place_of_death",
        "object": "Reykjavik"
    }
]
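
The new and lack fields are simply set differences between the predicted and gold triple lists; a minimal sketch of that logic (field names taken from the output above, the helper functions are mine):

# "new" = predicted but not in gold; "lack" = gold but not predicted.
def as_set(triples):
    return {(t['subject'], t['relation'], t['object']) for t in triples}

def diff(gold, pred):
    new = as_set(pred) - as_set(gold)    # spurious predictions
    lack = as_set(gold) - as_set(pred)   # missed gold triples
    return new, lack

gold = [{'subject': 'Fischer',
         'relation': '/people/deceased_person/place_of_death',
         'object': 'Reykjavik'}]
print(diff(gold, []))  # -> (set(), {('Fischer', '/people/deceased_person/place_of_death', 'Reykjavik')})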
7. Possible issues when running the code

Problems recorded while running the code:

Traceback (most recent call last):
  File "/opt/data/private/code/CasRel/run.py", line 40, in <module>
    subject_model, object_model, hbt_model = E2EModel(bert_config_path, bert_checkpoint_path, LR, num_rels)
  File "/opt/data/private/code/CasRel/model.py", line 15, in E2EModel
    bert_model = load_trained_model_from_checkpoint(bert_config_path, bert_checkpoint_path, seq_len=None)
  File "/opt/data/private/pyenvs/cas_rel_env/lib/python3.7/site-packages/keras_bert/loader.py", line 169, in load_trained_model_from_checkpoint
    **kwargs)
  File "/opt/data/private/pyenvs/cas_rel_env/lib/python3.7/site-packages/keras_bert/loader.py", line 58, in build_model_from_config
    **kwargs)
  File "/opt/data/private/pyenvs/cas_rel_env/lib/python3.7/site-packages/keras_bert/bert.py", line 126, in get_model
    adapter_activation=gelu,
TypeError: get_encoders() got an unexpected keyword argument 'use_adapter'

Judging from the traceback, this looks like a version mismatch between keras-bert and keras-transformer: keras_bert passes adapter-related keyword arguments to get_encoders(), which the installed keras-transformer does not accept.
Reference: https://github.com/weizhepei/CasRel/issues/54

8. Does it support Chinese?

Two places need to be changed: first, the pre-trained BERT; second, the triple extraction part.

Hi, @fresh382227905. To make the model support Chinese, you may need
to change the pre-trained BERT and the triple extraction part (due to
the different tokenization between English and Chinese) with minor
revisions. You can also refer to @longlongman’s great work : )

Reference: https://github.com/weizhepei/CasRel/issues/23
There is also the following comment:

I agree with @Phoeby2618. I tried: (1) splitting the Chinese into a space-separated, English-like format and using the HBTokenizer from the code; (2) keeping the original Chinese text, using the native Tokenizer plus [unused1], and also changing ' '.join(sub.split('[unused1]')) in the metric function accordingly; (3) keeping the original Chinese text, using the native Tokenizer without [unused1], with the metric changed as above.
The first two give similar results. In the last case, the predicted relation/entity count is always 0, so apparently [unused1] cannot simply be dropped; I have not yet figured out why.

Reference: https://github.com/weizhepei/CasRel/issues/50
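
To make the [unused1] detail in that comment concrete: in the English pipeline it serves as a word-boundary placeholder in the decoded entity string, and the metric code recovers normal spacing with ' '.join(sub.split('[unused1]')). A tiny illustration with a made-up string:

# [unused1] acts as a word-boundary marker; splitting on it restores spaces.
sub = 'Bobby[unused1]Fischer'            # illustrative string, not actual model output
print(' '.join(sub.split('[unused1]')))  # -> 'Bobby Fischer'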

There is also a project that works with Chinese directly:
https://github.com/longlongman/CasRel-pytorch-reimplement
It requires word segmentation of the input. Running both implementations on the demo example (results on the CMED dataset):
CasRel-pytorch:

correct_num: 4927, predict_num: 8899, gold_num: 10610
epoch  39, eval time: 49.97s, f1: 0.51, precision: 0.55, recall: 0.46
saving the model, epoch:  39, best f1: 0.51, precision: 0.55, recall: 0.46

CasRel:

correct_num:4863.0000000001
predict_num:8697.0000000001
gold_num:10475.0000000001
f1: 0.5073, precision: 0.5592, recall: 0.4642, best f1: 0.5093

The two implementations produce very similar results.
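
For reference, the f1/precision/recall line is just the standard formulas over the three counters (the trailing .0000000001 presumably comes from a small constant added to avoid division by zero). Recomputing from the CasRel log above:

# Recompute precision / recall / F1 from the counters in the CasRel log above.
correct_num, predict_num, gold_num = 4863, 8697, 10475

precision = correct_num / predict_num                 # 0.5592
recall = correct_num / gold_num                       # 0.4642
f1 = 2 * precision * recall / (precision + recall)    # 0.5073
print('f1: %.4f, precision: %.4f, recall: %.4f' % (f1, precision, recall))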
