AI4S Cup - LLM Challenge - Extracting a "Gene-Disease-Drug" Knowledge Graph with Large Language Models - Rank 4 Solution

Article link: https://blog.csdn.net/m0_37733448/article/details/138671439

Background

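Two months ago my teammates and I entered the information extraction competition on the Bohrium platform, in which large language models are used to extract gene-disease-drug information. The results were announced in early May, so this is a quick post to record the process.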

Competition task

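The competition has three subtasks, each requiring extraction of the corresponding triples or pairs. The early work was mostly data cleaning; only after that came prompt tuning, swapping pretrained models, and hyperparameter tuning.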

Subtask 1

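225 PubMed abstracts (./task1) and the triples extracted from them (./task1/train_triad.xlsx).

The task is mainly to extract gene-disease relations [REG, LOF, GOF, COM]. The leaderboard A test set has 25 abstracts and the leaderboard B test set has 50. With so few samples, the leaderboard B results carried a great deal of uncertainty.

The provided train_triad.xlsx appears to have been edited by hand, so some entries do not match the original text exactly. We later used fuzz (fuzzy string matching) to find and correct these anomalous entries.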
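
The fuzzy-matching cleanup described above can be sketched roughly as follows. This is not the code we ran: it uses the standard-library `difflib` as a stand-in for fuzz, and the function names `best_span` and `repair_entity` as well as the 0.8 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def best_span(entity: str, text: str) -> Tuple[str, float]:
    """Return the word span of `text` most similar to `entity`, with its score in [0, 1]."""
    words = text.split()
    n = max(1, len(entity.split()))
    best, best_score = "", 0.0
    # Slide windows of about the entity's word count over the abstract.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for i in range(len(words) - size + 1):
            candidate = " ".join(words[i:i + size]).strip(".,;:()")
            score = SequenceMatcher(None, entity.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


def repair_entity(entity: str, text: str, threshold: float = 0.8) -> Optional[str]:
    """Keep exact matches, replace near misses with the span from the text,
    and return None for entities that need manual review."""
    if entity.lower() in text.lower():
        return entity
    span, score = best_span(entity, text)
    return span if score >= threshold else None


abstract = "Mutations in SHROOM3 cause neural tube defects in mice."
print(repair_entity("neural-tube defects", abstract))  # near miss, replaced by the text span
print(repair_entity("BRCA1", abstract))                # no close span, flagged as None
```

Entities that come back as `None` are the ones worth checking by hand rather than correcting automatically.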

Subtask 2

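500 PubMed articles and 2,901 manually annotated "compound-disease association pairs" drawn from them (./task2).

Subtask 2 extracts pairs: deciding whether a chemical and a disease are associated. The leaderboard A test set has 224 articles and leaderboard B has 50.

The biggest data issue in subtask 2 is case mismatch: entities only match the text after the case is normalized. Some entities also appear only in the title. As I recall, this processing recovered successful matches for roughly one fifth of the entities.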
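
A minimal sketch of that normalization, assuming each sample carries a title, an abstract, and a list of annotated entity strings (the function name `match_entities` and the sample values are hypothetical):

```python
from typing import List, Tuple


def match_entities(entities: List[str], title: str, abstract: str) -> Tuple[List[str], List[str]]:
    """Split entities into those found in title + abstract and those still missing.

    The comparison is case-insensitive, and the title is searched as well,
    because some annotated entities only occur there.
    """
    haystack = f"{title} {abstract}".casefold()
    found, missing = [], []
    for entity in entities:
        (found if entity.casefold() in haystack else missing).append(entity)
    return found, missing


title = "Lithium-induced nephrogenic diabetes insipidus"
abstract = "We report a patient treated with lithium carbonate."
found, missing = match_entities(["lithium", "Diabetes Insipidus", "aspirin"], title, abstract)
print(found)    # entities recovered after normalization
print(missing)  # entities that still do not appear in the text
```

Here "Diabetes Insipidus" would be missed by a case-sensitive search of the abstract alone, since it differs in case and appears only in the title.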

Subtask 3

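8,642 drugs and the 4,075 drug-drug interactions among them, extracted from 533 description texts from the DrugBank database (./task3).

Relations involved: [effect, advise, mechanism, int]

The leaderboard A test set has 224 texts and leaderboard B has 50. The data is in better shape than in subtasks 1 and 2.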

Prompt tuning

Modifying the prompts on top of the baseline improved results by about 1%.

prompt1 = "You are a genetic disease expert. In this Gene-Disease relation extraction task, you need to follow 3 steps. You need to extract the [gene, function change, disease] triplet from the text, such as: [SHROOM3, LOF, Neural tube defects]. The second element in the triple means the regulation that the gene produces to the disease. Types of regulations are: LOF and GOF, which indicate loss or gain of function; REG, which indicates a general regulatory relationship; COM, which indicates that the functional change between genes and diseases is more complex, and it is difficult to determine whether the functional change is LOF or GOF. Please return all the relations extracted from the text in ternary format [[GENE, FUNCTION, DISEASE]]."
prompt2 = "You are a biologist. I'll give you the abstract of literature. Please identify all the [[compound,disease]] relations in the abstract, and just give me a list of all relations you recognized"
prompt3 = "You are a medicinal chemist. Now you need to identify all the drug-drug interactions from the text I provide to you, and please only write down all the drug-drug interactions in the format of [[drug, interaction, drug]]. "
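
Since the prompts ask for answers in the `[[GENE, FUNCTION, DISEASE]]` format, the model output has to be parsed back into triples before scoring. The following is only a sketch of one way to do that for subtask 1, not the parser we actually used. The name `parse_triples` is hypothetical, and the splitting on commas assumes that entity names contain no commas themselves.

```python
import re
from typing import List, Tuple

# Relation labels for subtask 1, as listed in prompt1.
RELATIONS = {"REG", "LOF", "GOF", "COM"}


def parse_triples(output: str) -> List[Tuple[str, str, str]]:
    """Extract (gene, function, disease) triples from a '[[a, b, c], ...]' string."""
    triples = []
    # Each innermost [...] group is one candidate triple.
    for inner in re.findall(r"\[([^\[\]]+)\]", output):
        parts = [p.strip().strip("'\"") for p in inner.split(",")]
        # Drop malformed groups and unknown relation labels.
        if len(parts) == 3 and parts[1].upper() in RELATIONS:
            triples.append((parts[0], parts[1].upper(), parts[2]))
    return triples


print(parse_triples("[[SHROOM3, LOF, Neural tube defects], [TP53, XYZ, cancer]]"))
```

Filtering on the known relation labels discards obviously malformed generations instead of letting them count as predictions.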

Training

Training used LoRA fine-tuning with the LLaMA-Factory repository code. Along the way we switched between gemma-7b, llama-7b, and Mistral-7B-Instruct-v0.2.

On leaderboard A, gemma-7b performed best, roughly 2 points higher than Mistral and llama.
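
For reference, the training data has to be registered as a dataset that LLaMA-Factory can load. The sketch below shows how one sample could be turned into an alpaca-style record (`instruction` / `input` / `output`) and how a dataset named `ai4s` could be declared in `dataset_info.json` under the dataset directory. The file name `ai4s.json`, the helper `build_record`, the placeholder prompt string, and the sample text are all assumptions for illustration, not our actual files.

```python
import json
from typing import Dict, List, Tuple


def build_record(prompt: str, abstract: str, triples: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Build one alpaca-style sample: prompt as instruction, abstract as input,
    and the gold triples serialized in the [[a, b, c], ...] format as output."""
    target = "[" + ", ".join(f"[{g}, {f}, {d}]" for g, f, d in triples) + "]"
    return {"instruction": prompt, "input": abstract, "output": target}


PROMPT1 = "<text of prompt1>"  # placeholder for the real prompt string
records = [
    build_record(
        PROMPT1,
        "Loss of SHROOM3 function leads to neural tube defects.",  # made-up sample
        [("SHROOM3", "LOF", "Neural tube defects")],
    )
]

# Content for the data file (assumed name: data/ai4s.json).
dataset_json = json.dumps(records, ensure_ascii=False, indent=2)

# Entry to add to data/dataset_info.json so that the name "ai4s" resolves to the file.
dataset_info_entry = {"ai4s": {"file_name": "ai4s.json"}}

print(dataset_json)
print(json.dumps(dataset_info_entry, indent=2))
```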

MODEL="/mnt2/pretrained_model/LLM/gemma-7b"

CUDA_VISIBLE_DEVICES=0 python3 train_bash.py \
    --stage sft \
    --model_name_or_path $MODEL \
    --do_train True \
    --overwrite_cache True \
    --overwrite_output_dir True \
    --finetuning_type lora \
    --template gemma \
    --dataset_dir data \
    --dataset ai4s \
    --cutoff_len 1536 \
    --learning_rate 5e-05 \
    --num_train_epochs 5.0 \
    --max_samples 2000 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 10.0 \
    --logging_steps 50 \
    --save_steps 100 \
    --warmup_steps 0 \
    --flash_attn False \
    --lora_rank 8 \
    --lora_dropout 0.1 \
    --lora_target q_proj,v_proj \
    --output_dir output \
    --fp16 True \
    --val_size 0.1 \
    --evaluation_strategy steps \
    --eval_steps 100 \
    --load_best_model_at_end True \
    --report_to tensorboard \