Fine-tuning Llama2 on AutoDL

huoxing080808

Last modified 2024-05-28 20:55:51

Tags: deep learning, python, artificial intelligence

First published 2024-05-20 15:34:34

Article link: https://blog.csdn.net/huoxing080808/article/details/139048906

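Reference blogs: Tutorial on deploying the Llama2 (MetaAI) large model on Linux - CSDN blog

Tutorial on fine-tuning the Llama2 (MetaAI) large model on Linux with QLoRA - CSDN blog

I ran into many problems while deploying by following these tutorials. I am writing this post as a reference for others and for my own later use; if anything here is wrong, corrections are welcome.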

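1. Cannot connect to huggingface

Two variants of the error appear: [Errno 110/403] Connection timed out

The original blog uses the terminal version of AutoDL's academic acceleration; I could only connect to huggingface after switching to the notebook version: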

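import subprocess
import os

# Source AutoDL's network_turbo script in a bash subshell and print the proxy variables it sets
result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout
# Copy each VAR=value line into the environment of this notebook process
for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value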
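
To confirm that the cell above took effect, it can help to list the proxy variables now visible to the notebook kernel. This is only a minimal sketch; `proxy_vars` is a hypothetical helper name, not part of AutoDL or huggingface.

```python
import os


def proxy_vars(environ=None):
    """Return the environment variables whose name mentions 'proxy'."""
    env = os.environ if environ is None else environ
    return {k: v for k, v in env.items() if "proxy" in k.lower()}


if __name__ == "__main__":
    # An empty dict means no proxy settings were loaded into this kernel.
    print(proxy_vars())
```

If the printed dict is empty, the acceleration script probably did not run or set nothing, and huggingface requests will likely still time out.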

2. Error: no gpu is found

First, check CUDA:

# terminal: CUDA version
nvcc --version
# check GPU memory status
nvidia-smi

# python
import torch
print(torch.__version__)
print(torch.cuda.is_available())

# terminal
pip list
pip list | grep torch

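The NVIDIA CUDA version did not match torch. Because of the torch requirements of several packages (transformers, peft, accelerate, trl), torch had to be version 2.3.0, and torch 2.3.0 needs CUDA 11.8 or later to run correctly.

The instance I first chose only had CUDA 11.6, which is why no GPU was found.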
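
The comparison described above can be sketched in a few lines. The helper names are hypothetical, and the 11.8 minimum is simply the figure quoted in this post, not something looked up from torch itself. With torch installed, `torch.version.cuda` gives the CUDA version torch was built against, which can be compared in the same way.

```python
def parse_version(text):
    """Turn a version string such as '11.6' into a tuple of ints, e.g. (11, 6)."""
    return tuple(int(part) for part in text.split(".")[:2])


def cuda_is_new_enough(cuda_version, minimum="11.8"):
    """True if cuda_version meets the minimum (11.8 is the figure from this post)."""
    return parse_version(cuda_version) >= parse_version(minimum)


print(cuda_is_new_enough("11.6"))  # False
print(cuda_is_new_enough("12.1"))  # True
```

Comparing tuples of integers avoids the trap of comparing version strings alphabetically, where "11.10" would sort before "11.8".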

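I looked at several ways to fix this. The first was to downgrade the other packages so that torch could stay at 1.10.0, which only requires CUDA 11.3; but working through each package's requirements was tedious, so I gave up on it.

The second was to upgrade CUDA, but AutoDL reported an error.

The third was to rent a machine with a newer CUDA version; the downside is that the rent is higher, haha.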

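3. sentencepiece not installed

Note that after installing it you need to exit the notebook and reopen it. I had installed it and still got the error, and lost a lot of time on that.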
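
After reopening the notebook, a quick check like the following shows whether the kernel can see the package. `is_installed` is a hypothetical helper; it only relies on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("sentencepiece"))
```

If this prints False right after a successful install, the notebook may be using a different Python environment from the one pip installed into.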

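4. cuda out of memory during merge_model

I first rented a 3090 with 24 GB of GPU memory. Merging the 13B Chinese llama model with my fine-tuning result ran out of memory on it.

Conclusion: switch to a V100 with 32 GB of GPU memory.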
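
A rough back-of-the-envelope estimate is consistent with this. The sketch below assumes about 13e9 parameters held as 16-bit floats (2 bytes each) and ignores the adapter weights and any working buffers; both the parameter count and the precision are assumptions, not values taken from the merge script.

```python
def weights_gib(n_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


# Assumption: about 13e9 parameters stored as 16-bit floats (2 bytes each).
needed = weights_gib(13e9)
print(f"{needed:.1f} GiB")  # 24.2 GiB
```

Under these assumptions the weights alone slightly exceed a 24 GB card but fit on a 32 GB one with some headroom.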

Reference: [Solved] Exploring the causes behind CUDA out of memory and how to free GPU memory - CSDN blog

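5. gradio error when running llama

Error message:

Could not create share link. Please check your internet connection or our status page

Academic acceleration must not be enabled in the terminal at the same time; turn it off before running gradio.

# turn off academic acceleration in the terminal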
unset http_proxy && unset https_proxy
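
If gradio is launched from a notebook in which the proxy was set through `os.environ` (as in section 1), running `unset` in a terminal does not affect the notebook process. A Python equivalent might look like the sketch below; `clear_proxy` is a hypothetical helper, and it only removes the two lowercase variable names used in the command above.

```python
import os

PROXY_KEYS = ("http_proxy", "https_proxy")


def clear_proxy(environ=None):
    """Remove the proxy variables and return the names that were removed."""
    env = os.environ if environ is None else environ
    removed = [key for key in PROXY_KEYS if key in env]
    for key in removed:
        del env[key]
    return removed


if __name__ == "__main__":
    print(clear_proxy())
```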

# if frpc_linux_amd64 is missing
wget https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64

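6. Migrating an instance data disk across regions

# Copy from A to B: replace 66666 with B's port and region-3.autodl.com with B's host, then run this command in A's terminal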
scp -rP 66666 /root/autodl-tmp/xxxxx root@region-3.autodl.com:/root/autodl-tmp/

Note that both instances must be powered on. Starting A with a GPU attached makes the copy much faster.
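
When the copy has to be repeated for several directories, the same command can be assembled in Python. This is a sketch only: `build_scp_command` is a hypothetical helper, and the port, host and path are the placeholders from the command above, not real values.

```python
def build_scp_command(port, host, source, destination="/root/autodl-tmp/"):
    """Build the argument list for a recursive scp copy from A to B."""
    return ["scp", "-r", "-P", str(port), source, f"root@{host}:{destination}"]


# Placeholder values; replace them with instance B's real port and host.
cmd = build_scp_command(66666, "region-3.autodl.com", "/root/autodl-tmp/xxxxx")
print(" ".join(cmd))
# To actually run it on instance A, pass cmd to subprocess.run(cmd, check=True).
```

Passing the arguments as a list rather than one shell string avoids quoting problems when a path contains spaces.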

(Update) 7. wandb error

Error message: Network error (TransientError), entering retry loop

Unresolved.
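
One workaround that may be worth trying, untested in this setup, is to stop wandb from contacting its server at all. The sketch assumes online experiment tracking is not needed for the run; it uses wandb's `WANDB_MODE` environment variable and has to run before training starts.

```python
import os

# Assumption: online experiment tracking is not needed for this run.
# WANDB_MODE=offline makes wandb write its logs locally instead of uploading them.
os.environ["WANDB_MODE"] = "offline"
print(os.environ["WANDB_MODE"])
```

This avoids the network call rather than fixing the underlying connection problem.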

(Update) 8. git-lfs

You seem to have cloned a repository without having git-lfs installed. Please install git-lfs and run `git lfs install` followed by `git lfs pull` in the folder you cloned.

apt-get update && apt-get install -y git-lfs

# run in the model directory
git lfs install

git config --global http.sslVerify true

git config --global http.sslBackend schannel

Reference: git lfs pull fails with: cannot write data to tempfile “/root/8dc6d01e84acccd8a5769d5“: LFS: unexpected EOF_cannot write 3967 bytes to tmp file - CSDN blog

Unresolved.