“ValueError: Path is not a directory: output/checkpoint-1500”: a transformers + wandb pitfall (note to self)


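I was training a model with the ready-made Trainer from the huggingface transformers package and ran into a pitfall with wandb.

The error is raised during training, at the moment a checkpoint is saved.

Error message: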

Traceback (most recent call last):
  File "/ml/debert-ML/Classcification.py", line 200, in <module>
    trainer.train()
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/trainer.py", line 2193, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/trainer.py", line 2589, in _maybe_log_save_evaluate
    self.control = self.callback_handler.on_save(self.args, self.state, self.control)
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/trainer_callback.py", line 403, in on_save
    return self.call_event("on_save", args, state, control)
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/trainer_callback.py", line 414, in call_event
    result = getattr(callback, event)(
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/transformers/integrations/integration_utils.py", line 842, in on_save
    artifact.add_dir(artifact_path)
  File "sdb/anaconda3/chy/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py", line 1226, in add_dir
    raise ValueError("Path is not a directory: %s" % local_path)
ValueError: Path is not a directory: output/checkpoint-1500
wandb: ERROR Error uploading "/home/ruiy/.local/share/wandb/artifacts/staging/tmpoq_g6rdh": CommError, <Response [400]>
wandb: ERROR Uploading artifact file failed. Artifact won't be committed.

Versions of the two libraries:

transformers                  4.39.0

wandb                         0.17.0
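To compare your environment with the versions above, a small check like the following can help. This is only a convenience sketch: `installed_version` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata` module (Python 3.8+).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two libraries involved in this error.
for package in ("transformers", "wandb"):
    print(package, installed_version(package))
```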

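I could not find the cause of this error or a fix for it anywhere online, so I read the source code and found some clues.

When training reaches a configured save step, wandb tries to upload the checkpoint but cannot find the directory.

From the source, this is what happens at a save step:

First, transformers saves the checkpoint. I had capped the number of saved checkpoints at 5, and as far as I can tell the transformers save logic checks for the best results and keeps only the 5 best, deleting the new checkpoint if it is not among them.

Then the wandb callback's on_save tries to add the checkpoint directory to an artifact.

So when the checkpoint is not one of the best, the checkpoint directory is already gone and wandb cannot find it.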
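The ordering described above can be reproduced without transformers or wandb. The following is a minimal sketch of my reading of it, not the libraries' actual code: `save_checkpoint`, `prune_checkpoints` and `upload_checkpoint` are hypothetical stand-ins, and the pruning rule (keep the checkpoints with the lowest loss) is an assumption chosen to mirror the explanation.

```python
import os
import shutil
import tempfile


def save_checkpoint(output_dir, step):
    """Hypothetical stand-in for the trainer writing a checkpoint directory."""
    path = os.path.join(output_dir, f"checkpoint-{step}")
    os.makedirs(path)
    with open(os.path.join(path, "weights.bin"), "w") as f:
        f.write("dummy")
    return path


def prune_checkpoints(losses, keep):
    """Delete all but the `keep` checkpoints with the lowest loss (assumed rule)."""
    ranked = sorted(losses, key=losses.get)
    for path in ranked[keep:]:
        shutil.rmtree(path)
        losses.pop(path)


def upload_checkpoint(path):
    """Stand-in for the directory check at the top of wandb's add_dir."""
    if not os.path.isdir(path):
        raise ValueError("Path is not a directory: %s" % path)
    return sorted(os.listdir(path))


def run_demo():
    """Return the error message raised for the pruned checkpoint, if any."""
    output_dir = tempfile.mkdtemp()
    losses = {}
    error = None
    try:
        for step, loss in [(500, 0.40), (1000, 0.30), (1500, 0.90)]:
            path = save_checkpoint(output_dir, step)
            losses[path] = loss
            prune_checkpoints(losses, keep=2)  # pruning happens first
            try:
                upload_checkpoint(path)        # the upload only runs afterwards
            except ValueError as exc:
                error = str(exc)
    finally:
        shutil.rmtree(output_dir, ignore_errors=True)
    return error


if __name__ == "__main__":
    print(run_demo())
```

In this toy run the checkpoint at step 1500 has the worst loss, so it is pruned right after being written, and the later upload step fails with the same message as in the traceback.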

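How to fix it

Open the file “anaconda3/chy/lib/python3.9/site-packages/wandb/sdk/artifacts/artifact.py” in VS Code.

Find the line that raises the error (the raise ValueError inside add_dir).

Idea

Replace the raise with a printed warning, and put the rest of the method in an else branch so that training continues unaffected.

The complete method is pasted below. Locate the method in that file and replace it; you can comment out the original code instead of deleting it.

Code: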
    def add_dir(
        self,
        local_path: str,
        name: Optional[str] = None,
        skip_cache: Optional[bool] = False,
        policy: Optional[Literal["mutable", "immutable"]] = "mutable",
    ) -> None:
        """Add a local directory to the artifact.

        Arguments:
            local_path: The path of the local directory.
            name: The subdirectory name within an artifact. The name you specify appears
                in the W&B App UI nested by artifact's `type`.
                Defaults to the root of the artifact.
            skip_cache: If set to `True`, W&B will not copy/move files to the cache while uploading
            policy: "mutable" | "immutable". By default, "mutable"
                "mutable": Create a temporary copy of the file to prevent corruption during upload.
                "immutable": Disable protection, rely on the user not to delete or change the file.

        Raises:
            ArtifactFinalizedError: You cannot make changes to the current artifact
            version because it is finalized. Log a new artifact version instead.
            ValueError: Policy must be "mutable" or "immutable"
        """
        self._ensure_can_add()
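        if not os.path.isdir(local_path):
            # Original behavior: raise ValueError("Path is not a directory: %s" % local_path)
            # Warn instead of raising so that training continues; the checkpoint
            # was probably deleted because it was not among the best ones.
            print("Path is not a directory: %s (skipping, checkpoint probably already deleted)" % local_path)
        else: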
            termlog(
                "Adding directory to artifact (%s)... "
                % os.path.join(".", os.path.normpath(local_path)),
                newline=False,
            )
            start_time = time.time()

            paths = []
            for dirpath, _, filenames in os.walk(local_path, followlinks=True):
                for fname in filenames:
                    physical_path = os.path.join(dirpath, fname)
                    logical_path = os.path.relpath(physical_path, start=local_path)
                    if name is not None:
                        logical_path = os.path.join(name, logical_path)
                    paths.append((logical_path, physical_path))

            def add_manifest_file(log_phy_path: Tuple[str, str]) -> None:
                logical_path, physical_path = log_phy_path
                self._add_local_file(
                    name=logical_path,
                    path=physical_path,
                    skip_cache=skip_cache,
                    policy=policy,
                )

            num_threads = 8
            pool = multiprocessing.dummy.Pool(num_threads)
            pool.map(add_manifest_file, paths)
            pool.close()
            pool.join()

            termlog("Done. %.1fs" % (time.time() - start_time), prefix=False)
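An edit under site-packages is lost whenever wandb is reinstalled or upgraded. As an alternative sketch, the same guard can be applied from your own training script by wrapping the method at runtime. `skip_missing_dirs` is a hypothetical helper, not part of wandb, and `FakeArtifact` is a stand-in used only so the example runs without wandb installed; the commented-out line at the end shows the assumed real usage, which I have not tested against wandb 0.17.0.

```python
import functools
import os
import tempfile


def skip_missing_dirs(add_dir):
    """Wrap an add_dir-style method so a missing directory is skipped, not fatal."""

    @functools.wraps(add_dir)
    def wrapper(self, local_path, *args, **kwargs):
        if not os.path.isdir(local_path):
            print("Skipping artifact upload, not a directory: %s" % local_path)
            return None
        return add_dir(self, local_path, *args, **kwargs)

    return wrapper


class FakeArtifact:
    """Stand-in for wandb.Artifact, used only to exercise the wrapper."""

    def __init__(self):
        self.added = []

    def add_dir(self, local_path, name=None):
        if not os.path.isdir(local_path):
            raise ValueError("Path is not a directory: %s" % local_path)
        self.added.append(local_path)


# Patch the class once, before training starts.
FakeArtifact.add_dir = skip_missing_dirs(FakeArtifact.add_dir)

artifact = FakeArtifact()
existing = tempfile.mkdtemp()
artifact.add_dir(existing)                        # recorded as usual
artifact.add_dir(os.path.join(existing, "gone"))  # skipped with a message
os.rmdir(existing)

# Assumed real-world usage (untested here):
#   import wandb
#   wandb.Artifact.add_dir = skip_missing_dirs(wandb.Artifact.add_dir)
```

As with the edited method above, this only suppresses the exception. The checkpoint that was deleted is still not uploaded, and as far as I can tell the callback goes on to log the artifact without that directory's files.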
