Note: GPU and CPU utilization suddenly dropping to 0 while training a model on AutoDL

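This post describes a problem the author hit while training a model from AutoDL's network shared storage: GPU and CPU utilization dropped abruptly to 0 while GPU memory stayed allocated. The suspected cause was TensorBoard's file I/O on the shared storage. Moving the program to the instance's local data disk resolved the training stalls.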


Problem

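        Our lab's server resources were tight, so I rented GPUs on AutoDL myself. AutoDL provides a network shared storage that can be mounted on different instances in the same region: when you create an instance in that region and boot it, the file storage is mounted at /root/autodl-fs on the instance, so data can be shared between instances.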

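        So when the GPU we were using gets taken by someone else, we can access the data and code on the shared storage directly from a newly rented GPU, which saves the hassle of copying files back and forth. For that reason I had been editing my model directly on the shared drive. Recently, though, while I was reproducing a model, training kept hanging at some epoch. Monitoring the server showed GPU and CPU utilization suddenly dropping to 0 while the process still held its GPU memory, and the following exceptions from the TensorBoard writer threads appeared during training (the output of several threads is interleaved):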

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self.run()
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/event_file_writer.py", line 233, in run
    self._record_writer.write(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._record_writer.write(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._record_writer.write(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._record_writer.write(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self._record_writer.write(data)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/summary/writer/record_writer.py", line 40, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self._writer.write(header + header_crc + data + footer_crc)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 766, in write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self.fs.append(self.filename, file_content, self.binary_mode)
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 160, in append
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/Feb28_22-30-35_autodl-container-b8bc118052-8d77dd6aCombined_hinet_pretrain100_debug_MSG_imgPatchSE_mean/Total_Loss_Total Loss/events.out.tfevents.1709130663.autodl-container-b8bc118052-8d77dd6a.1871.1steg'
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/Feb28_22-30-35_autodl-container-b8bc118052-8d77dd6aCombined_hinet_pretrain100_debug_MSG_imgPatchSE_mean/error_msg_average bit error/events.out.tfevents.1709130663.autodl-container-b8bc118052-8d77dd6a.1871.6steg'
    self._write(filename, file_content, "ab" if binary_mode else "a")
  File "/root/miniconda3/lib/python3.8/site-packages/tensorboard/compat/tensorflow_stub/io/gfile.py", line 164, in _write
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/Feb28_22-30-35_autodl-container-b8bc118052-8d77dd6aCombined_hinet_pretrain100_debug_MSG_imgPatchSE_mean/acc_msg_average accuracy/events.out.tfevents.1709130663.autodl-container-b8bc118052-8d77dd6a.1871.7steg'
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/Feb28_22-30-35_autodl-container-b8bc118052-8d77dd6aCombined_hinet_pretrain100_debug_MSG_imgPatchSE_mean/rs_loss_reconstruct_secret loss/events.out.tfevents.1709130663.autodl-container-b8bc118052-8d77dd6a.1871.3steg'
    with io.open(filename, mode, encoding=encoding) as f:
FileNotFoundError: [Errno 2] No such file or directory: b'runs/Feb28_22-30-35_autodl-container-b8bc118052-8d77dd6aCombined_hinet_pretrain100_debug_MSG_imgPatchSE_mean/steg_loss_embedded loss/events.out.tfevents.1709130663.autodl-container-b8bc118052-8d77dd6a.1871.2steg'

Solution

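        After a lot of head-scratching and searching online, I suspected the problem was in TensorBoard's I/O calls. I then checked AutoDL's help documentation on the network shared storage, and sure enough: the shared drive does enable sharing between instances and keeps redundant backups that protect our code (I ran into this once: I had just finished changing my code and was running the model when the server suddenly reported it was offline for maintenance and told me to contact customer service... luckily my program was on the shared storage, so nothing was lost), but its I/O performance is mediocre, which affects model training.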
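To get a rough feel for how slow a storage location is, here is a minimal sketch that times repeated open-append-close cycles. That is the pattern visible in the traceback, where gfile.py reopens the event file in "ab" mode for each write. The function name `time_appends` and its parameters are made up for illustration, and this is only a crude probe, not a proper benchmark.

```python
import os
import tempfile
import time


def time_appends(directory, n_writes=50, payload=b"x" * 1024):
    """Return the mean seconds per open-append-close cycle in `directory`."""
    os.makedirs(directory, exist_ok=True)
    # Create a throwaway probe file inside the directory under test.
    fd, path = tempfile.mkstemp(dir=directory, suffix=".probe")
    os.close(fd)
    try:
        start = time.perf_counter()
        for _ in range(n_writes):
            # Reopen in append mode on every write, mirroring the "ab" open
            # seen in the traceback.
            with open(path, "ab") as f:
                f.write(payload)
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the probe file, even if a write fails.
        os.remove(path)
    return elapsed / n_writes
```

For example, calling `time_appends("/root/autodl-fs/io_probe")` and then calling it again on a directory on the local disk lets you compare the two numbers. The `io_probe` subdirectory is a hypothetical name, and the local directory depends on your instance.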

        Afterwards, I copied the program to the instance's local data disk, and the mysterious training stalls went away.
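The copy step could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands I ran: `copy_project_to_local` is a hypothetical helper, and both paths in the usage note are placeholders. Since the event file paths in the traceback are relative (`runs/...`), they resolve against the working directory, so launching training from the local copy also moves the TensorBoard logs off the shared storage.

```python
import shutil
from pathlib import Path


def copy_project_to_local(src, dst):
    """Copy the project tree from shared storage to a local directory."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    # dirs_exist_ok (Python 3.8+) allows re-running the copy to refresh files.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

Usage would look like `copy_project_to_local("/root/autodl-fs/my_project", "/root/autodl-tmp/my_project")`, followed by changing into the destination directory before starting training. Here `my_project` is a made-up name, and `/root/autodl-tmp` is my assumption for where the local data disk is mounted; check the path on your own instance.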

     
