Common AutoDL Tools & Commands

刘义申汉

Last modified 2025-03-16 01:45:06

First published 2024-04-09 11:06:29

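This article explains how to use the tar/zip commands to pack and extract files; lists conda environment management commands for creating, renaming, copying, and deleting environments; and gives the download index for all torch versions along with the torch/CUDA version correspondence. It also covers error handling on AutoDL, such as HTTP errors and GPU memory usage problems.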

The following reflects only my current understanding and may have shortcomings. Discussion is welcome!

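tar/zip commands

Reference link

Pack files or directories 👉 tar -cf <zip_name>.tar <directory_name/file_name>.

If there are several directory or file names, separate them with spaces.

Extract an archive 👉 tar -xvf <zip_name>.tar

Compress all files and folders in the current directory into a zip file: zip -r <file_name>.zip ./* (the ./* glob skips hidden files at the top level; use . instead to include them).

Extract a zip file: unzip -o -d /directory <file_name>.zip. -o overwrites existing files without prompting, and -d names the directory to extract into.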
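
When the same packing and unpacking has to happen from inside a script, Python's standard tarfile and zipfile modules can stand in for the shell commands. The following is only a minimal sketch; the helper names (make_tar, extract_tar, make_zip, extract_zip) are my own and not part of any tool mentioned here.

```python
import tarfile
import zipfile
from pathlib import Path


def make_tar(archive: Path, *sources: Path) -> None:
    """Pack files or directories into an uncompressed .tar (like `tar -cf`)."""
    with tarfile.open(archive, "w") as tar:
        for src in sources:
            # arcname keeps only the last path component inside the archive
            tar.add(src, arcname=src.name)


def extract_tar(archive: Path, dest: Path) -> None:
    """Unpack a .tar into dest (like `tar -xf <archive> -C <dest>`)."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r") as tar:
        tar.extractall(dest)


def make_zip(archive: Path, root: Path) -> None:
    """Zip every file under root recursively (hidden files included, empty dirs skipped)."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(root).as_posix())


def extract_zip(archive: Path, dest: Path) -> None:
    """Unpack a .zip into dest, overwriting existing files (like `unzip -o -d`)."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
```

Keep the archive outside the directory being zipped, otherwise make_zip would try to add the archive to itself.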

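Mirror version reference

Reference link

Common mirror sources

Newer Ubuntu releases require https sources.

Conda sources

Mirrors of Anaconda hosted in China, mainly used to speed up downloading and installing Python environments with conda.

Tsinghua mirror: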

https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free

https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/linux-64

https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/win-64

Beijing Foreign Studies University (BFSU) mirror:

https://mirrors.bfsu.edu.cn/anaconda/pkgs/main

https://mirrors.bfsu.edu.cn/anaconda/pkgs/free

https://mirrors.bfsu.edu.cn/anaconda/pkgs/main/linux-64

https://mirrors.bfsu.edu.cn/anaconda/pkgs/main/win-64

Alibaba Cloud (Aliyun) mirror:

https://mirrors.aliyun.com/anaconda/pkgs/main

https://mirrors.aliyun.com/anaconda/pkgs/free

https://mirrors.aliyun.com/anaconda/pkgs/main/linux-64

https://mirrors.aliyun.com/anaconda/pkgs/main/win-64
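
To actually use one of these mirrors, its channel URLs go into ~/.condarc. The sketch below only generates the file text so it can be reviewed before being written anywhere; build_condarc is a hypothetical helper of mine, and the choice of the main and free repositories simply follows the URLs listed above.

```python
def build_condarc(mirror_base: str, repos=("main", "free"), ssl_verify: bool = True) -> str:
    """Return .condarc text that lists <mirror_base>/pkgs/<repo> as channels."""
    base = mirror_base.rstrip("/")
    lines = ["channels:"]
    lines += [f"  - {base}/pkgs/{repo}" for repo in repos]
    # show_channel_urls makes conda print which channel each package came from
    lines.append("show_channel_urls: true")
    lines.append(f"ssl_verify: {'true' if ssl_verify else 'false'}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_condarc("https://mirrors.tuna.tsinghua.edu.cn/anaconda"))
```

Swapping the base URL for the BFSU or Aliyun one gives the equivalent configuration for those mirrors.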

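PyPI sources

Mirrors of PyPI hosted in China, mainly used to speed up downloading and installing third-party libraries with pip.

Tsinghua mirror: https://pypi.tuna.tsinghua.edu.cn/simple

USTC mirror: https://pypi.mirrors.ustc.edu.cn/simple/

Aliyun mirror: https://mirrors.aliyun.com/pypi/simple

Huazhong University of Science and Technology mirror: https://pypi.hustunique.com/

Shandong University of Technology mirror: https://pypi.sdutlinux.org/

Douban mirror: https://pypi.douban.com/simple/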
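
A mirror is selected per install with pip's -i option. As a rough sketch, the helper below (pip_install_args, a name I made up) builds the argument list for such a call and adds --trusted-host when the mirror URL is plain http, since pip otherwise ignores insecure indexes.

```python
import sys
from urllib.parse import urlparse


def pip_install_args(packages, index_url):
    """Build the argument list for `pip install -i <index_url> <packages...>`."""
    args = [sys.executable, "-m", "pip", "install", "-i", index_url]
    parsed = urlparse(index_url)
    if parsed.scheme == "http":
        # pip skips plain-http indexes unless the host is explicitly trusted
        args += ["--trusted-host", parsed.hostname]
    return args + list(packages)
```

The list can be passed to subprocess.run(..., check=True), for example subprocess.run(pip_install_args(["numpy"], "https://pypi.tuna.tsinghua.edu.cn/simple"), check=True).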

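Download index for all torch versions

https://download.pytorch.org/whl/torch_stable.html

torch and CUDA version correspondence

One CUDA version per release, as I have it noted. Each PyTorch release is usually built against several CUDA versions, so check the wheel index above for the exact builds.

Torch 1.7.0 corresponds to CUDA 11.0
Torch 1.6.0 corresponds to CUDA 10.2
Torch 1.5.0 corresponds to CUDA 10.1
Torch 1.4.0 corresponds to CUDA 10.0
Torch 1.3.0 corresponds to CUDA 9.2
Torch 1.2.0 corresponds to CUDA 9.0
Torch 1.1.0 corresponds to CUDA 9.0
Torch 1.0.0 corresponds to CUDA 8.0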
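
The wheel file names on the download index usually carry the CUDA build as a local version tag, e.g. torch-1.7.0+cu110-cp37-cp37m-linux_x86_64.whl. Assuming that naming convention holds (it is what I see on the index, not something guaranteed), a small parser can read the torch version and CUDA version straight from a link. torch_wheel_info is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote

# Matches names such as torch-1.7.0+cu110-cp37-cp37m-linux_x86_64.whl
_WHEEL = re.compile(r"torch-(?P<version>\d+\.\d+\.\d+)(?:\+(?P<tag>[a-z0-9]+))?-")


def torch_wheel_info(href: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (torch_version, cuda_version or None) for a torch wheel link, else None."""
    # Links on the index are URL-encoded ("%2B" for "+") and may carry a folder prefix
    name = unquote(href).rsplit("/", 1)[-1]
    match = _WHEEL.match(name)
    if match is None:
        return None
    tag = match.group("tag")
    cuda = None
    if tag and re.fullmatch(r"cu\d{2,}", tag):
        digits = tag[2:]
        # cu110 -> 11.0, cu92 -> 9.2: the last digit is the minor version
        cuda = f"{digits[:-1]}.{digits[-1]}"
    return match.group("version"), cuda
```

CPU-only wheels and wheels without a tag come back with None as the CUDA version.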

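conda commands

The conda environment commands I use most often.

Exporting packages with conda

Export the current environment's packages with conda list -e > requirements.txt, or with pip via pip freeze > requirements.txt. The two files use different formats (conda writes name=version=build, pip writes name==version), so install each with the tool that produced it.

Install packages in bulk with conda install --yes --file requirements.txt, or with pip via pip install -r requirements.txt.

Renaming a conda environment

Either clone the environment under a new name and delete the old one, or rename it directly (recent conda releases provide conda rename -n <old_name> <new_name>).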
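
The file written by conda list -e uses name=version=build lines, which pip cannot read. If a pip-style list is needed from such a file, a conversion along these lines may help. This is only a sketch: conda_export_to_pip is a made-up helper, and conda package names do not always match PyPI names (and some conda packages are not Python packages at all), so the result still needs a manual look.

```python
def conda_export_to_pip(lines):
    """Turn `conda list -e` lines (name=version=build) into pip-style name==version."""
    requirements = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments conda writes
        parts = line.split("=")
        if len(parts) >= 2:
            requirements.append(f"{parts[0]}=={parts[1]}")
        else:
            requirements.append(parts[0])
    return requirements
```

Typical use would be conda_export_to_pip(open("requirements.txt")) followed by writing the returned lines to a new file.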

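Copying and migrating a conda environment

Reference link

conda create --name <new_env_name> --clone <now_env_name>

When the new machine has the same platform and operating system as the current one:

① Save the current environment's package list to a txt file, then install from it on the new machine (requires network access).

conda list --explicit > requirements.txt
conda create --name <new_env> --file requirements.txt

② Pack the environment directly with conda-pack. The result is an archive of the environment's files, which is copied to the other machine and extracted there.

# Install conda-pack (either command works)
conda install -c conda-forge conda-pack
pip install conda-pack

# Pack the environment; this produces the archive my_env.tar.gz
conda pack -n <my_env>

# Extract the archive

On the new machine, extract into the envs directory: first create a folder named after the packed environment, e.g. mkdir my_env, then extract the archive into it with tar -xzvf my_env.tar.gz -C /anaconda3/envs/my_env. The conda-pack documentation also suggests running conda-unpack inside the activated environment afterwards to fix up path prefixes.

Check that the environment exists with conda info -e.

Deleting a conda environment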

conda remove -n <env_name> --all

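AutoDL shows HTTP Error

Problem encountered (CSDN): transformers/trainer.py", line 3968, in create_accelerator_and_postprocess self.accelerator = Accelerator( NameError: name 'Accelerator' is not defined

This error typically means the accelerate package is missing or too old for the installed transformers version, which can happen when package downloads fail.

I more or less solved it using this GitHub reference: in AutoDL, open the conda configuration with vim ~/.condarc so that the channels can be changed. After changing the channels, downloads finally worked.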

Problem encountered:

  File "train.py", line 781, in <module>
    main()
  File "train.py", line 744, in main
    train_result = trainer.train(model_path=model_path)
  File "/root/autodl-tmp/aliyun-race/simcse/trainers.py", line 616, in train
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval="None")  ## later commented out, 2024-03-24
  File "/root/miniconda3/envs/aliyun/lib/python3.7/site-packages/transformers/trainer.py", line 2321, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/root/autodl-tmp/aliyun-race/simcse/trainers.py", line 141, in evaluate
    results = se.eval(tasks)
  File "/root/autodl-tmp/aliyun-race/SentEval/senteval/engine.py", line 59, in eval
    self.results = {x: self.eval(x) for x in name}
  File "/root/autodl-tmp/aliyun-race/SentEval/senteval/engine.py", line 59, in <dictcomp>
    self.results = {x: self.eval(x) for x in name}
  File "/root/autodl-tmp/aliyun-race/SentEval/senteval/engine.py", line 127, in eval
    self.results = self.evaluation.run(self.params, self.batcher)  ## jumps into sts.py to do the computation
  File "/root/autodl-tmp/aliyun-race/SentEval/senteval/sts.py", line 86, in run
    enc1 = batcher(params, batch1)  ## jumps to batcher in the trainer to encode; the result is 128*768; batch1 is the list of all sentences in one batch (batch_size items)
  File "/root/autodl-tmp/aliyun-race/simcse/trainers.py", line 124, in batcher
    return pooler_output.cpu()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In my case this pointed to a bad download link or a poor network connection, and running the command below or switching sources fixed it. More generally, a device-side assert usually means an out-of-range index on the GPU (for example a token or label ID outside the expected range), and rerunning with CUDA_LAUNCH_BLOCKING=1, as the message suggests, gives an accurate stack trace. References: CSDN, plus additional references (reference 2).

Conda hangs after Solving environment HTTP Error 429

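Run: conda config --set ssl_verify false (no is also accepted). This turns off certificate checks, so treat it as a temporary workaround.

Or edit the configuration file and change the channels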

(base) $ cat .condarc
channels:
  - conda-forge
#  - defaults
channel_priority: strict
ssl_verify: false
(base) $ vim .condarc

Add the mirror channels and remove the original ones. The URL scheme can also be changed to http.

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/

SentEval installation & model path problem

Problem:

File "train.py", line 776, in <module>
    main()
  File "train.py", line 515, in main
    config = AutoConfig.from_pretrained(model_args.config_name, **config_kwargs)
  File "/root/miniconda3/envs/cluster/lib/python3.7/site-packages/transformers/models/auto/configuration_auto.py", line 360, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/miniconda3/envs/cluster/lib/python3.7/site-packages/transformers/configuration_utils.py", line 420, in get_config_dict
    use_auth_token=use_auth_token,
  File "/root/miniconda3/envs/cluster/lib/python3.7/site-packages/transformers/file_utils.py", line 1056, in cached_path
    local_files_only=local_files_only,
  File "/root/miniconda3/envs/cluster/lib/python3.7/site-packages/transformers/file_utils.py", line 1235, in get_from_cache
    "Connection error, and we cannot find the requested files in the cached path."
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

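This error appeared because my model path could not be found.

So what had the model path been set to?

Attempts: /root/autodl-tmp/autodl-clusterNS/bert-base-uncased failed; changing it to ./root failed; changing it to root failed.

Then I remembered that the directory was empty. Why was it empty?

In addition, the senteval package needs a different installation method. pip install senteval does not work; go into the SentEval folder and run python setup.py install to install it properly. Reference: CSDN

Code changes: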
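
Since the real cause here was an empty model directory, a quick check before calling from_pretrained would have saved some guessing. Below is a minimal sketch; describe_model_dir is a name I made up, and it only looks for config.json, which is the file transformers reads first from a local model folder.

```python
from pathlib import Path


def describe_model_dir(path: str) -> str:
    """Return 'missing', 'empty', 'no config.json', or 'ok' for a local model folder."""
    model_dir = Path(path).expanduser()
    if not model_dir.is_dir():
        return "missing"
    if not any(model_dir.iterdir()):
        return "empty"
    if not (model_dir / "config.json").is_file():
        return "no config.json"
    return "ok"


if __name__ == "__main__":
    print(describe_model_dir("/root/autodl-tmp/autodl-clusterNS/bert-base-uncased"))
```

Anything other than "ok" means the library will fall back to downloading by name, which is where the connection error comes from.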

data_dir: Optional[str] = field(
    default="data",
    metadata={"help": "the data dir"},
)

cache_dir: Optional[str] = field(
    default="cache_dir",  # where to store the pretrained models downloaded from huggingface.co
    metadata={"help": "Where do you want to store the pretrained models downloaded from huggingface.co"},
)

BATCH_SIZE = 512
BERT = 'bert-base-uncased'
# modified

Also write the following into the __init__.py file of simcse:

from .tool import SimCSE
from .trainers import CLTrainer,PATH_TO_SENTEVAL,PATH_TO_DATA
from .models import MLPLayer,Similarity,Pooler,cl_init,cl_forward,sentemb_forward,BertForCL,RobertaForCL,kmeans_cluster

For example, print the path.
2024-03-26 09:49: importing through the __init__ file worked.

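GPU memory usage

Later I ran into a GPU memory problem, i.e. out of memory. Following the officially suggested fix, I simply killed all the processes. Note that the command below kills every process whose command line contains "train".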

ps -ef | grep train | awk '{print $2}' | xargs kill -9

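Alternatively, release PyTorch's cached GPU memory from code (this only frees cached blocks that the current process no longer uses)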

import torch
torch.cuda.empty_cache()
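
Before killing anything, it can help to see which processes actually hold GPU memory. The sketch below parses the CSV output of nvidia-smi's --query-compute-apps query; the function names are my own, and on some container setups the process list may come back empty or with [N/A] memory values, which the parser tolerates.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used_memory' CSV rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, mem = rest.rsplit(",", 1)
        mem = mem.strip()
        rows.append({
            "pid": int(pid),
            "name": name.strip(),
            # memory is reported in MiB; "[N/A]" becomes None
            "used_mib": int(mem) if mem.isdigit() else None,
        })
    return rows


def gpu_processes():
    """Run nvidia-smi and return the parsed process list (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

With the PIDs in hand, a single stale training process can be stopped with kill <pid> instead of killing everything that matches "train".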

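Log redirection

Redirect the log to the file train.log by appending > train.log 2>&1 to your command. The full command is 👇

python <your_training_script>.py > train.log 2>&1
# Follow the log in real time 👇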
tail -f train.log
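
The same redirection can be done when the training script is launched from Python rather than from the shell. A minimal sketch, with run_with_log as a hypothetical helper name:

```python
import subprocess
import sys


def run_with_log(cmd, log_path):
    """Run cmd with stdout and stderr both written to log_path (like `cmd > log 2>&1`)."""
    with open(log_path, "w") as log:
        # stderr=subprocess.STDOUT merges the error stream into the same file
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode


if __name__ == "__main__":
    code = run_with_log([sys.executable, "-c", "print('training...')"], "train.log")
    print("exit code:", code)
```

The log file can still be followed from another terminal with tail -f while the command runs.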

Speeding up Transformers downloads

If the network is slow, AutoDL's built-in tool is enough. The command is

source /etc/network_turbo

You can also point Hugging Face at a mirror 👇

export HF_ENDPOINT=https://hf-mirror.com 
export HF_HOME=/root/autodl-tmp/cache/
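
If exporting the variables in the shell is inconvenient (for example in a notebook), they can be set from Python instead. As far as I understand, the Hugging Face libraries read these variables when they are imported, so the assignment should come first. use_hf_mirror is a made-up helper; the default values are the ones from the exports above.

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com", cache_dir="/root/autodl-tmp/cache/"):
    """Set the Hugging Face endpoint and cache location for the current process."""
    os.environ["HF_ENDPOINT"] = endpoint
    os.environ["HF_HOME"] = cache_dir


# Call this before importing transformers or huggingface_hub:
# use_hf_mirror()
# import transformers
```

The settings only affect the current Python process and anything it launches.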

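Packages can also be downloaded from any of these PyPI mirrors, for example https://pypi.mirrors.ustc.edu.cn/simple/

Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Aliyun: http://mirrors.aliyun.com/pypi/simple/
USTC (University of Science and Technology of China): https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/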

Downloading from Hugging Face

Set the mirror

Windows (cmd):

set HF_ENDPOINT=https://hf-mirror.com

Linux/macOS:

export HF_ENDPOINT=https://hf-mirror.com

Access token (use your own): hf_<your_token>

Reference: https://blog.csdn.net/TonyNotes/article/details/135828795

Download command: huggingface-cli download --resume-download THUDM/chatglm3-6b --local-dir D:\Program\Python\TransfomersLearn\Models\chatglm3-6b

Jupyter Notebook

Add a kernel

 pip install ipykernel -i https://pypi.tuna.tsinghua.edu.cn/simple
 python -m ipykernel install --name <conda_name> --display-name <kernel_name>
 python -m ipykernel install --name rstc --display-name rstc

List the installed kernels

 jupyter kernelspec list

Remove a kernel

 jupyter kernelspec remove <kernel_name>

Remove the environment

 conda remove -n <conda_name> --all
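
To check which kernels are registered from a script, jupyter kernelspec list --json prints the same list as JSON. The sketch below parses that output into a mapping from kernel name to display name. kernel_names is a hypothetical helper, and the assumed layout (a top-level kernelspecs object whose entries contain a spec with display_name) is how I understand the format, so check it against real output.

```python
import json


def kernel_names(kernelspec_json: str) -> dict:
    """Map kernel name -> display name from `jupyter kernelspec list --json` output."""
    data = json.loads(kernelspec_json)
    return {
        name: info.get("spec", {}).get("display_name", name)
        for name, info in data.get("kernelspecs", {}).items()
    }
```

The JSON text can be obtained with subprocess.run(["jupyter", "kernelspec", "list", "--json"], capture_output=True, text=True).stdout.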

Reference

https://blog.csdn.net/wuicer/article/details/126130987

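These are my study notes; I hope they help!
If anything here is wrong, please point it out. Thanks!

Keep learning. There is no end to it, and it runs deep.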