Solved: PytorchStreamWriter failed writing file data


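The problem

While training a model today, the run reached step 140 and then suddenly failed while saving the weights. The full error output is below: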

Traceback (most recent call last):
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/serialization.py", line 423, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/serialization.py", line 650, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:450] . PytorchStreamWriter failed writing file data/1125: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ds_train.py", line 160, in <module>
    main()
  File "ds_train.py", line 135, in main
    model_engine.save_checkpoint(f"{cfg.output_dir}")
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2890, in save_checkpoint
    self._save_checkpoint(save_dir, tag, client_state=client_state)
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 3092, in _save_checkpoint
    self.checkpoint_engine.save(state, save_path)
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 22, in save
    torch.save(state_dict, path)
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/serialization.py", line 424, in save
    return
  File "/public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/serialization.py", line 290, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 286145984 vs 286145872
terminate called after throwing an instance of 'c10::Error'
  what():  [enforce fail at inline_container.cc:325] . unexpected pos 286145984 vs 286145872
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x2b48b09ef7d7 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x2fd16e0 (0x2b487a93e6e0 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #2: mz_zip_writer_add_mem_ex_v2 + 0x723 (0x2b487a9392c3 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: caffe2::serialize::PyTorchStreamWriter::writeRecord(std::string const&, void const*, unsigned long, bool) + 0xb5 (0x2b487a941835 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: caffe2::serialize::PyTorchStreamWriter::writeEndOfFile() + 0x2c3 (0x2b487a941d43 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: caffe2::serialize::PyTorchStreamWriter::~PyTorchStreamWriter() + 0x125 (0x2b487a941ff5 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x66a353 (0x2b486ea82353 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x23c986 (0x2b486e654986 in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x23debe (0x2b486e655ebe in /public/home/dyedd/.conda/envs/diffusers/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x110632 (0x560fa3677632 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #10: <unknown function> + 0x110059 (0x560fa3677059 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #11: <unknown function> + 0x110043 (0x560fa3677043 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #12: <unknown function> + 0x110043 (0x560fa3677043 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #13: <unknown function> + 0x110043 (0x560fa3677043 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #14: <unknown function> + 0x110043 (0x560fa3677043 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #15: <unknown function> + 0x110043 (0x560fa3677043 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #16: <unknown function> + 0x177ce7 (0x560fa36dece7 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #17: PyDict_SetItemString + 0x4c (0x560fa36e1d8c in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #18: PyImport_Cleanup + 0xaa (0x560fa3754a2a in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #19: Py_FinalizeEx + 0x79 (0x560fa37ba4c9 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #20: Py_RunMain + 0x1bc (0x560fa37bd83c in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #21: Py_BytesMain + 0x39 (0x560fa37bdc29 in /public/home/dyedd/.conda/envs/diffusers/bin/python)
frame #22: __libc_start_main + 0xf5 (0x2b4852ea13d5 in /lib64/libc.so.6)
frame #23: <unknown function> + 0x1f9ad7 (0x560fa3760ad7 in /public/home/dyedd/.conda/envs/diffusers/bin/python)

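Analysis

What this error actually says is that the checkpoint file could not be written out completely: the write failed partway through, so the archive ended at a different position than expected.

I was stunned. The model had already been saved 140 times, so what was going on? Had I written a hidden bug into my program?

Still in shock, I went searching online. Sure enough, other people had hit the same error, but in their case the save directory was the problem. I went straight back to check my checkpoint path. It was fine, and the program could not have deleted the directory on its own. So we were apparently not dealing with the same problem.

Until...

I looked at my disk, and there it was. No wonder the weights could not be saved completely: the previous 140 saves had filled the disk, and there was not a byte of space left.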
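A quick way to confirm this diagnosis from Python is to ask the operating system how much space is left on the filesystem that holds the checkpoint directory. The snippet below is a minimal sketch using only the standard library; the helper name `free_space_gib` is made up for illustration, and `"."` stands in for the real output directory.

```python
import shutil

def free_space_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`.

    `free_space_gib` is a hypothetical helper name, not part of PyTorch or DeepSpeed.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (in bytes)
    return usage.free / 1024 ** 3

if __name__ == "__main__":
    # "." is a placeholder; point this at the checkpoint output directory instead.
    print(f"free space: {free_space_gib('.'):.2f} GiB")
```

If the number printed is close to zero, a failed write like the one in the traceback above is the expected outcome.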

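Sigh. In the era of large models, nothing you train is small anymore. In the past I could save thousands of times without running into this kind of trouble.

Solution

So how do we fix this?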


The answer is to free up the occupied space. You can also reduce how often checkpoints are saved, for example with DeepSpeed:

for epoch in range(cfg.num_epochs):
    model_engine.train()
    for i, data in enumerate(training_dataloader):
        if i % cfg.save_interval == 0:
            # save a checkpoint every cfg.save_interval steps
            model_engine.save_checkpoint(f"{cfg.output_dir}")
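Saving less often only postpones the moment the disk fills up, so it may also be worth checking the free space before each save and skipping the save (or stopping cleanly) when there is not enough room. The following is a sketch under assumptions: `should_save` and `required_bytes` are hypothetical names introduced here, and `required_bytes` is a rough estimate of one checkpoint's size that you would have to supply yourself.

```python
import shutil

def should_save(step, save_interval, output_dir, required_bytes):
    """Decide whether to write a checkpoint at this step.

    Returns True only when `step` falls on the save interval AND the
    filesystem holding `output_dir` has at least `required_bytes` free.
    `required_bytes` is an assumed, user-supplied size estimate.
    `output_dir` must already exist.
    """
    if step % save_interval != 0:
        return False
    free_bytes = shutil.disk_usage(output_dir).free
    return free_bytes >= required_bytes

# Possible use inside the training loop (names from the loop above):
#     if should_save(i, cfg.save_interval, cfg.output_dir, required_bytes):
#         model_engine.save_checkpoint(f"{cfg.output_dir}")
```

Note that in a multi-process run every rank generally has to take part in `save_checkpoint`, so a check like this should give the same answer on all ranks; that coordination is not handled in this sketch.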

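Coming back to freeing up the occupied space: I suggest not deleting everything. DeepSpeed writes the name of the most recent checkpoint folder into the latest file, so you only need to open that file, see which folder it names, and delete the other folders.

This is the classic "exclusive" approach (get rid of everyone else, keep the one you want), haha. It reminds me of what teacher Pink said back when I was learning JS.

I had GPT-4 write the code for this, and I have tested it thoroughly: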

import os
import shutil

def cleanup_except_latest(directory_path):
    # Read the contents of the latest file
    latest_file_path = os.path.join(directory_path, "latest")
    with open(latest_file_path, "r") as file:
        # Names of the checkpoint folders to keep
        folders_to_keep = file.read().strip().split('\n')

    # Add "latest" to the keep list
    folders_to_keep.append("latest")

    # List all files and folders in the directory
    all_items = os.listdir(directory_path)

    # Keep only the folders (plain files such as latest are never deleted)
    folders = [item for item in all_items if os.path.isdir(os.path.join(directory_path, item))]

    # Delete every folder that is not in the keep list
    for folder in folders:
        if folder not in folders_to_keep:
            folder_path = os.path.join(directory_path, folder)
            shutil.rmtree(folder_path)

    # Print the directory contents after the cleanup
    print(os.listdir(directory_path))


if __name__ == '__main__':
    cleanup_except_latest("/home/dcuuser/dxm/diffusers/train/FineTunedStableDiffusion-lora")

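When training is resumed, DeepSpeed's load_checkpoint (called without an explicit tag) looks up the checkpoint named in the latest file and continues from there, so deleting the older folders will not cause any problems.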
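If you would rather keep a couple of older checkpoints as a safety net instead of only the one named in latest, a variation is to keep the N most recently modified folders. This is a different technique from the script above, sketched under assumptions: `keep_newest_checkpoints` and `read_latest_tag` are hypothetical helper names, and "newest" is judged by folder modification time, which may not match training order if folders were copied or touched by hand.

```python
import os
import shutil

def read_latest_tag(directory_path):
    """Return the folder name stored in the latest file, or None if it is missing."""
    latest_path = os.path.join(directory_path, "latest")
    if not os.path.isfile(latest_path):
        return None
    with open(latest_path, "r") as f:
        return f.read().strip()

def keep_newest_checkpoints(directory_path, keep=2):
    """Delete all but the `keep` most recently modified subfolders.

    The folder named in the latest file is never deleted.
    Returns the names of the folders that were removed.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    protected = read_latest_tag(directory_path)
    folders = [entry for entry in os.scandir(directory_path) if entry.is_dir()]
    # Sort newest first by modification time.
    folders.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)
    removed = []
    for entry in folders[keep:]:
        if entry.name == protected:
            continue  # always keep the checkpoint that latest points to
        shutil.rmtree(entry.path)
        removed.append(entry.name)
    return removed
```

Calling `keep_newest_checkpoints(output_dir, keep=2)` right after each successful save would keep disk usage roughly constant instead of growing with every checkpoint.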
