Quickly Getting HuggingFace File Download URLs for Batch, Resumable Downloads

ehmy

Last modified 2024-03-08 17:24:06



First published 2024-03-08 16:27:03

Article link: https://blog.csdn.net/ehmy001/article/details/136565487


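This article shows how to analyze HuggingFace's `huggingface_hub` library and modify its download code so that, on Windows 11, all file URLs can be collected in bulk and handed to a download tool that supports resuming. When a network error occurs, the original method retries and may end up downloading the same data again. Modifying `_snapshot_download.py` to print every file URL before the download starts makes it possible to use a download tool's batch and resume features, which improves download efficiency.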


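I recently needed to download several open-source large models in order to compare and evaluate them. There are many model files to fetch and all of them are large; downloading them in a browser is slow and the connection may drop frequently. It is also possible to wait for the browser download to start and then copy the URL into a download tool, but that becomes very tedious when there are many files.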

The official way to download HuggingFace files

HuggingFace provides huggingface_hub, which can download all files of a given model in bulk:

Install `huggingface_hub`

pip install huggingface_hub

Write the download script and save it as `download.py`

from huggingface_hub import snapshot_download
model_id="Tele-AI/telechat-7B"
snapshot_download(repo_id=model_id, local_dir="telechat-7B",
                  local_dir_use_symlinks=False, revision="main")

Run the script to download all files in bulk

python download.py

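The problem

When a network error occurs during a download, snapshot_download retries a limited number of times. If the problem does not recover, it exits automatically. Rerunning the script after that exit starts every download over from the beginning, so everything downloaded earlier is discarded and a lot of time is wasted.
If the URLs of all files could be obtained in bulk, a download tool could be used to fetch them as a batch.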
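Download tools usually resume an interrupted transfer by asking the server only for the bytes that are still missing, typically through an HTTP `Range` header. The sketch below illustrates that idea under stated assumptions: `range_header_for` is a hypothetical helper name, the server is assumed to honor range requests, and this is not a description of how `huggingface_hub` is implemented.

```python
import os
import tempfile


def range_header_for(partial_path):
    """Return HTTP headers that ask only for the bytes not yet on disk.

    Hypothetical helper for illustration: if a non-empty partial file
    exists, request everything from its current size onward; otherwise
    return no extra headers so the whole file is requested.
    """
    if os.path.exists(partial_path):
        already_have = os.path.getsize(partial_path)
        if already_have > 0:
            return {"Range": f"bytes={already_have}-"}
    return {}


# Simulate a download that stopped after 1024 bytes.
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "model.bin.part")
    with open(partial, "wb") as fh:
        fh.write(b"\0" * 1024)
    resumed = range_header_for(partial)
    fresh = range_header_for(os.path.join(tmp, "missing.bin.part"))

print(resumed)
print(fresh)
```

A tool that works this way only needs the direct URL of each file, which is why getting the full URL list is the key step.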

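Approach

Analyze the snapshot_download code and make it print every download URL before it starts downloading, so the URLs can be copied and pasted into a download tool for batch downloading.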
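As a rough illustration of the goal, the sketch below builds a list of download URLs with the standard library only. It assumes the `https://huggingface.co/<repo_id>/resolve/<revision>/<filename>` pattern that appears in the `hf_hub_url` docstring example; `build_file_url` and `build_url_list` are hypothetical names, and the file list is made up for illustration rather than read from a real repository.

```python
from urllib.parse import quote

HF_ENDPOINT = "https://huggingface.co"  # assumed default endpoint


def build_file_url(repo_id, filename, revision="main", endpoint=HF_ENDPOINT):
    """Build a URL of the form <endpoint>/<repo_id>/resolve/<revision>/<filename>."""
    # Quote the revision completely, but keep "/" inside the file path.
    return f"{endpoint}/{repo_id}/resolve/{quote(revision, safe='')}/{quote(filename)}"


def build_url_list(repo_id, filenames, revision="main"):
    """Return one download URL per file name."""
    return [build_file_url(repo_id, name, revision) for name in filenames]


# Illustrative file names only; a real list would come from the repository.
example_files = ["config.json", "pytorch_model.bin"]
urls = build_url_list("julien-c/EsperBERTo-small", example_files)
print("\n".join(urls))
```

The printed lines can be pasted into a download tool, one URL per line. In practice the file names could be fetched from the repository, for example with `HfApi().list_repo_files(repo_id)` from `huggingface_hub`, instead of being typed by hand.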

Solution

This analysis applies only to Windows 11.

Locate the `snapshot_download` module

Searching for the file name snapshot_download turned up the file _snapshot_download.py in two locations:

C:\Users\You\.conda\pkgs\huggingface_hub-0.17.3-py311haa95532_0\Lib\site-packages\huggingface_hub
C:\PythonDir\Lib\site-packages\huggingface_hub

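Editing each of the two files and checking the output showed that the copy under C:\PythonDir\Lib\site-packages\huggingface_hub is the one that needs to be modified; the first copy was probably installed automatically when conda was set up earlier.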
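Instead of editing both copies to see which one takes effect, the active interpreter can be asked where it would import a package from. This is a minimal sketch using the standard library's `importlib.util.find_spec`; `module_location` is a hypothetical helper name, and `json` stands in for the package so the example runs even where `huggingface_hub` is not installed.

```python
import importlib.util


def module_location(name):
    """Return the file path Python would import `name` from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# `json` is used so the example runs anywhere; in the scenario above the
# name to check would be "huggingface_hub".
print(module_location("json"))
print(module_location("huggingface_hub"))  # None if the package is not installed
```

The directory containing the printed path is the one whose `_snapshot_download.py` is actually used by `python download.py`, provided the same interpreter runs both commands.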

Code analysis

The analysis shows that the call chain for downloading a file is:

_snapshot_download.py: _inner_hf_hub_download() => file_download.py: hf_hub_download()

The code in file_download.py that obtains the file's download URL is:

  url = hf_hub_url(repo_id, filename, repo_type=repo_type, revision=revision, endpoint=endpoint)

Here, the hf_hub_url function is defined as:

@validate_hf_hub_args
def hf_hub_url(
    repo_id: str,
    filename: str,
    *,
    subfolder: Optional[str] = None,
    repo_type: Optional[str] = None,
    revision: Optional[str] = None,
    endpoint: Optional[str] = None,
) -> str:
    """Construct the URL of a file from the given information.

    The resolved address can either be a huggingface.co-hosted url, or a link to
    Cloudfront (a Content Delivery Network, or CDN) for large files which are
    more than a few MBs.

    Args:
        repo_id (`str`):
            A namespace (user or an organization) name and a repo name separated
            by a `/`.
        filename (`str`):
            The name of the file in the repo.
        subfolder (`str`, *optional*):
            An optional value corresponding to a folder inside the repo.
        repo_type (`str`, *optional*):
            Set to `"dataset"` or `"space"` if downloading from a dataset or space,
            `None` or `"model"` if downloading from a model. Default is `None`.
        revision (`str`, *optional*):
            An optional Git revision id which can be a branch name, a tag, or a
            commit hash.

    Example:

    ```python
    >>> from huggingface_hub import hf_hub_url

    >>> hf_hub_url(
    ...     repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
    ... )
    'https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin'
    ```
    """
    ...  # the remainder of the docstring and the function body are not shown here

Modifying the download code

Straight to the conclusion: change two places in _snapshot_download.py.

First, add one line at the top of the file

from huggingface_hub import hf_hub_url

Next, add the output once preprocessing has finished

Insert it before the following code:

if HF_HUB_ENABLE_HF_TRANSFER:
    # when using hf_transfer we don't want extra parallelism
    # from the one hf_transfer provides
    for file in filtered_repo_files:
        _inner_hf_hub_download(file)
else:
    thread_map(
        _inner_hf_hub_download,
        filtered_repo_files,
        desc=f"Fetching {len(filtered_repo_files)} files",
        max_workers=max_workers,
        # User can use its own tqdm class or the default one from `huggingface_hub.utils`
        tqdm_class=tqdm_class or