Fine-tuning the Paddle UIE Model for Named Entity Extraction

Tom 1988

Last modified on 2024-02-22 19:35:24

First published on 2023-04-20 00:40:25

Article link: https://blog.csdn.net/xu_guo_jie/article/details/130256066

1. Create a virtual environment

As a matter of good practice, start by creating a dedicated runtime environment.

conda create -n uie python=3.10.9
conda activate uie

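2. Install the Paddle framework and paddlenlp

2.1 Install paddle following the official documentation

Reference: the "Get Started" (开始使用) page on the official PaddlePaddle (飞桨) website.

First check the CUDA version on your server. As shown below, mine is 10.2. (The "CUDA Version" field printed by nvidia-smi is the highest CUDA version the installed driver supports.)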

(PyTorch-1.8) [ma-user work]$nvidia-smi
Wed Apr 19 23:35:11 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:0E.0 Off |                    0 |
| N/A   39C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
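
If you want to read the CUDA version programmatically rather than by eye, a small regular expression over the nvidia-smi banner is enough. This is only a minimal sketch: the function name `parse_cuda_version` is made up for illustration, and the header line is copied from the output above.

```python
import re
from typing import Optional


def parse_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Return the 'CUDA Version' field from nvidia-smi's banner, or None if absent."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", nvidia_smi_output)
    return match.group(1) if match else None


# Header line copied from the nvidia-smi output shown above.
header = "| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |"
cuda_version = parse_cuda_version(header)
print(cuda_version)
```

On a real machine the same function could be fed the text returned by `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.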

Then simply copy the matching install command from the official Paddle website.
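
After running the install command, it may be worth confirming that the framework actually works before moving on. The sketch below relies on `paddle.utils.run_check()`, Paddle's built-in installation self-test; the wrapper function `check_paddle` is a hypothetical helper, written so that it reports a missing installation instead of crashing.

```python
import importlib.util


def check_paddle() -> bool:
    """Return True if paddle imports and its built-in self-check completes."""
    if importlib.util.find_spec("paddle") is None:
        print("paddle is not installed in this environment")
        return False
    try:
        import paddle

        print("paddle version:", paddle.__version__)
        print("compiled with CUDA:", paddle.device.is_compiled_with_cuda())
        paddle.utils.run_check()  # Paddle's own installation self-test
    except Exception as exc:  # broken install, driver mismatch, etc.
        print("paddle self-check failed:", exc)
        return False
    return True


paddle_ok = check_paddle()
```

If "compiled with CUDA" prints False on a GPU server, the CPU-only wheel was probably installed by mistake.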

2.2 Install paddlenlp

pip install --upgrade paddlenlp

2.2.1 Problem 1: ERROR: Failed building wheel for numpy / Failed to build numpy

-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o -MMD -MF build/temp.linux-x86_64-3.10/build/src.linux-x86_64-3.10/numpy/core/src/multiarray/scalartypes.o.d" failed with exit status 1
            [end of output]
      
        note: This error originates from a subprocess, and is likely not a problem with pip.
        ERROR: Failed building wheel for numpy
      Failed to build numpy
      ERROR: Could not build wheels for numpy, which is required to install pyproject.toml-based projects
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install backend dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip

I installed the numpy package manually and then reran the paddlenlp installation, but it still failed.

pip install numpy

Switching to a different approach succeeded: installing through python3 -m pip from the Baidu PyPI mirror.

python3 -m pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
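
To confirm that the installation really went through this time, the installed versions can be queried from the standard library. This is a minimal sketch using `importlib.metadata`; the helper name `installed_versions` is invented for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "paddlenlp"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `None` entry for numpy or paddlenlp means the pip step above still needs attention.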

3. Download the PaddleNLP source code

$git clone https://github.com/PaddlePaddle/PaddleNLP.git

4. Run training

4.1 Preprocess the annotated data

python ../PaddleNLP/model_zoo/uie/doccano.py --doccano_file ./data.json --task_type ext --save_dir ./ --splits 0.7 0.2 0.1 --schema_lang ch
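
Before training, it can help to sanity-check the generated train.txt and dev.txt. The sketch below assumes each line is a JSON object with the fields `content`, `result_list` and `prompt`, which is my understanding of the UIE training format; verify this against your own files. The sample rows are hand-written for illustration and are not real output of doccano.py.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_uie_file(path: str) -> Counter:
    """Count examples per (prompt, kind); kind is 'positive' when result_list is non-empty."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            example = json.loads(line)
            kind = "positive" if example["result_list"] else "negative"
            counts[(example["prompt"], kind)] += 1
    return counts


# Hand-written sample rows (assumed format, not real doccano.py output).
sample = [
    {"content": "我想查询2022年山东省主营业务收入数据",
     "result_list": [{"text": "2022年", "start": 4, "end": 9}], "prompt": "时间"},
    {"content": "我想查询2022年山东省主营业务收入数据",
     "result_list": [{"text": "山东省", "start": 9, "end": 12}], "prompt": "地区"},
    {"content": "我想查询主营业务收入数据", "result_list": [], "prompt": "时间"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.txt")
    with open(path, "w", encoding="utf-8") as f:
        for row in sample:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    summary = summarize_uie_file(path)

for (prompt, kind), n in sorted(summary.items()):
    print(prompt, kind, n)
```

Pointing `summarize_uie_file` at the real ./train.txt shows at a glance whether any label in the schema has too few positive examples.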

4.2 Fine-tune the model

# Assumption: $finetuned_model was not defined in the original command; it is set
# here to match the task_path used later in the model application code.
export finetuned_model=./checkpoint/model_best

python ../PaddleNLP/model_zoo/uie/finetune.py \
    --device gpu \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 42 \
    --model_name_or_path uie-base \
    --output_dir $finetuned_model \
    --train_path ./train.txt \
    --dev_path ./dev.txt \
    --max_seq_length 512 \
    --per_device_eval_batch_size 16 \
    --per_device_train_batch_size 16 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --label_names "start_positions" "end_positions" \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir $finetuned_model \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model eval_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1
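
To get a feel for how the step-based flags interact, the arithmetic can be sketched as follows. It assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; the figure of 700 training examples is hypothetical, while the other values come from the command above.

```python
import math
from typing import Dict


def training_schedule(num_examples: int, batch_size: int, epochs: int,
                      eval_steps: int, save_steps: int) -> Dict[str, int]:
    """Estimate optimizer steps and how many step-triggered evals/saves will occur."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
        "checkpoints": total_steps // save_steps,
    }


# num_examples=700 is a made-up dataset size used only for illustration.
schedule = training_schedule(num_examples=700, batch_size=16, epochs=20,
                             eval_steps=100, save_steps=100)
print(schedule)
```

With a very small dataset the total step count can fall below `--eval_steps`, in which case no step-triggered evaluation would happen under these assumptions, so it may make sense to lower `--eval_steps` and `--save_steps` accordingly.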

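Training has succeeded when output like that in the screenshot below appears.

[Figure: screenshot of the successful training output]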

5. Use the model

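from pprint import pprint
from paddlenlp import Taskflow

# Entity types to extract: time, region, indicator name
schema = ['时间', '地区', '指标名']
# task_path points to the directory containing the fine-tuned model
ie = Taskflow('information_extraction', schema=schema, task_path="./checkpoint/model_best")
# Query: "I want to look up Shandong Province's main business revenue data for 2022"
pprint(ie("我想查询2022年山东省主营业务收入数据"))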
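
Downstream code usually wants plain label-to-text pairs rather than the full nested result. The sketch below assumes each result is a dict mapping a schema label to a list of spans with `text`, `start`, `end` and `probability` keys. The helper names are hypothetical, and the sample result is hand-written with placeholder probabilities; it is not actual model output.

```python
from typing import Dict, List


def flatten_uie_result(result: Dict[str, List[dict]],
                       min_probability: float = 0.0) -> Dict[str, List[str]]:
    """Reduce one result dict to {label: [text, ...]}, dropping low-confidence spans."""
    return {
        label: [span["text"] for span in spans
                if span.get("probability", 1.0) >= min_probability]
        for label, spans in result.items()
    }


sentence = "我想查询2022年山东省主营业务收入数据"


def make_span(text: str, probability: float) -> dict:
    """Build a span dict whose offsets are computed from the sentence itself."""
    start = sentence.find(text)
    return {"text": text, "start": start, "end": start + len(text),
            "probability": probability}


# Hand-written stand-in for ie(sentence)[0]; probabilities are placeholders.
fake_result = {
    "时间": [make_span("2022年", 0.9)],
    "地区": [make_span("山东省", 0.9)],
    "指标名": [make_span("主营业务收入", 0.4)],
}

flat = flatten_uie_result(fake_result, min_probability=0.5)
print(flat)
```

In real use, `fake_result` would be replaced by an element of the list returned by `ie(...)`, and the threshold tuned to how much noise the application can tolerate.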