Note: this article is compiled from material found online and only documents the author's process of learning these topics; contact the author to request removal of any infringing content.
I. References
[Hands-on tutorial] Downloading large Hugging Face models on a Linux system
[The road to AI] Using huggingface_hub to elegantly solve the Hugging Face model download problem
II. How to Download Gated Models?
1. Overview
Because of licensing requirements imposed by model publishers, some models are not publicly downloadable: you must request access on Hugging Face and be approved before you can download them. Such models are called gated models.
The basic steps for downloading a gated model are:
- Apply for access.
- Obtain an access token (needed for command-line and Python downloads).
- Download the model.
2. Applying for Access
Note that you must register and log in on the official Hugging Face website to apply for access; for network-security reasons, mirror sites generally do not support this. Approval takes anywhere from a few minutes to a few days (usually just a few minutes), and the result is sent to you by email.
3. Obtaining an Access Token
Access tokens are created at: https://huggingface.co/settings/tokens
On the token management page of your Hugging Face settings, create a New token; Read permission is sufficient. Once created, the token can be used whenever a tool needs to authenticate.
After your access request is approved, the model files become visible under Files and versions on the model page, and in a browser you can simply click to download them. To download with a tool such as huggingface-cli, however, you need an access token, which can either be registered via login or passed directly to API calls, as sketched below.
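For example, with the huggingface_hub Python API the token can be passed per call instead of logging in globally. A minimal sketch; the repo and filename are illustrative, and hf_xxx is a placeholder for your actual Read token:
from huggingface_hub import hf_hub_download

# Download one file from a gated repo by passing the token explicitly.
# "hf_xxx" is a placeholder; use a token with Read permission.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",  # a gated repo you have access to
    filename="config.json",
    token="hf_xxx",
)
print(path)  # local path of the cached file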
4. Logging In to Your Hugging Face Account
The recommended way to log in to your Hugging Face account is with the following commands:
pip install --upgrade huggingface_hub
huggingface-cli login
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.
git config --global credential.helper store
Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
5. Downloading the Model
HTTPS: embed your username and access token in the clone URL
git clone https://<hf_username>:<hf_token>@huggingface.co/meta-llama/Llama-2-7b-chat-hf
huggingface-cli: add the --token parameter
huggingface-cli download --token hf_xxx --resume-download bigscience/bloom-560m --local-dir bloom-560m
curl, wget: pass the token in the Authorization header
curl -L --header "Authorization: Bearer hf_xxx" -o model-00001-of-00002.safetensors https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
wget --header "Authorization: Bearer hf_xxx" https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
snapshot_download: call the login method first (a complete download example follows the snippet below)
import huggingface_hub
huggingface_hub.login("hf_xxx")
6. Downloading from ModelScope
For gated models, downloading from the ModelScope community is recommended. Take LLM-Research/Meta-Llama-3-8B-Instruct as an example.
git clone https://www.modelscope.cn/llm-research/meta-llama-3-8b-instruct.git
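Alternatively, ModelScope ships a Python SDK with its own snapshot_download. A minimal sketch, assuming the modelscope package is installed (pip install modelscope); the cache_dir is illustrative:
from modelscope import snapshot_download

# Downloads the full repo; by default files land under ~/.cache/modelscope.
model_dir = snapshot_download("LLM-Research/Meta-Llama-3-8B-Instruct",
                              cache_dir="./models")
print(model_dir)  # local directory containing the model files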
III. Downloading Hugging Face Models
0. Environment Setup
# Install git-lfs
sudo apt-get install git-lfs
git lfs install
# Install huggingface_hub
pip install -U huggingface_hub
# Install aria2
sudo apt install aria2
# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"
1. The hfd Method (Recommended)
hfd is a command-line script, built on Git and aria2, dedicated to downloading from Hugging Face.
How this dedicated multi-threaded downloader works:
- Step 1: git clone everything in the repository except the LFS files, automatically collecting the URLs of the LFS files;
- Step 2: download those files with aria2 using multiple threads.
1.1 Creating the Script
Create the hfd.sh file:
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
display_help() {
cat << EOF
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify a string pattern to include files for downloading.
--exclude (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
exit 1
}
MODEL_ID=$1
shift
# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done
# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}
# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}
[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs
[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
if [[ -z "$LOCAL_DIR" ]]; then
LOCAL_DIR="${MODEL_ID#*/}"
fi
if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"
if [ -d "$LOCAL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
fi
echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
ensure_ownership
while IFS= read -r file; do
truncate -s 0 "$file"
done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi
printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls
while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"
for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"
1.2 Environment Setup
Before running the script, be sure to install git-lfs and aria2; see the Environment Setup section above.
1.3 The hfd Command
# Make the script executable
chmod a+x hfd.sh
# Create the hfd command as an alias
alias hfd="$PWD/hfd.sh"
# Show usage
hfd -h
root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/360Downloads# hfd -h
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify a string pattern to include files for downloading.
--exclude (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
1.4 Downloading Models
View your access tokens at: https://huggingface.co/settings/tokens
# Download without a token
hfd bigscience/bloom-560m --tool aria2c -x 12
# Exclude certain files from the download
hfd bigscience/bloom-560m --exclude *.safetensors
# Download with a token
hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
If aria2c downloads slowly, try setting the mirror endpoint or switching to the wget download tool:
# Use wget as the download tool (-x only applies to aria2c, so it is omitted here)
hfd bigscience/bloom-560m --tool wget
An example of downloading bigscience/bloom-560m follows:
root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models# hfd bigscience/bloom-560m
Downloading to bloom-560m
Testing GIT_REFS_URL: https://hf-mirror.com/bigscience/bloom-560m/info/refs?service=git-upload-pack
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/bigscience/bloom-560m bloom-560m
Cloning into 'bloom-560m'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 111 (delta 55), reused 111 (delta 55), pack-reused 0 (from 0)
Receiving objects: 100% (111/111), 28.49 KiB | 1.58 MiB/s, done.
Resolving deltas: 100% (55/55), done.
Start Downloading lfs files, bash script:
cd bloom-560m
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors" -d "." -o "model.safetensors"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data" -d "onnx" -o "decoder_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data" -d "onnx" -o "decoder_model_merged.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data" -d "onnx" -o "decoder_with_past_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json" -d "onnx" -o "tokenizer.json"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json" -d "." -o "tokenizer.json"
Start downloading flax_model.msgpack.
[#33d36a 1.0GiB/1.0GiB(100%) CN:1]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
33d36a|OK | 47MiB/s|./flax_model.msgpack
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack successfully.
Start downloading model.safetensors.
[#0ff500 1.0GiB/1.0GiB(99%) CN:4 DL:43MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
0ff500|OK | 44MiB/s|./model.safetensors
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors successfully.
Start downloading onnx/decoder_model.onnx.
[#75f561 32KiB/691KiB(4%) CN:1 DL:141KiB ETA:4s]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
75f561|OK | 1.1MiB/s|onnx/decoder_model.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx successfully.
Start downloading onnx/decoder_model.onnx_data.
*** Download Progress Summary as of Tue Aug 6 17:46:12 2024 ***
======================================================================================================
[#3b16c1 2.4GiB/3.0GiB(80%) CN:4 DL:42MiB ETA:14s]
FILE: onnx/decoder_model.onnx_data
------------------------------------------------------------------------------------------------------
[#3b16c1 3.0GiB/3.0GiB(99%) CN:2 DL:35MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
3b16c1|OK | 42MiB/s|onnx/decoder_model.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data successfully.
Start downloading onnx/decoder_model_merged.onnx.
[#dd17b7 0B/0B CN:1 DL:0B]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
dd17b7|OK | 3.0MiB/s|onnx/decoder_model_merged.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx successfully.
Start downloading onnx/decoder_model_merged.onnx_data.
[#1b793b 3.0GiB/3.0GiB(99%) CN:3 DL:52MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
1b793b|OK | 52MiB/s|onnx/decoder_model_merged.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data successfully.
Start downloading onnx/decoder_with_past_model.onnx.
[#c371d9 0B/0B CN:1 DL:0B]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
c371d9|OK | 1.0MiB/s|onnx/decoder_with_past_model.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx successfully.
Start downloading onnx/decoder_with_past_model.onnx_data.
*** Download Progress Summary as of Tue Aug 6 17:48:32 2024 ***
======================================================================================================
[#1988c4 2.6GiB/3.0GiB(85%) CN:4 DL:48MiB ETA:9s]
FILE: onnx/decoder_with_past_model.onnx_data
------------------------------------------------------------------------------------------------------
[#1988c4 3.0GiB/3.0GiB(99%) CN:1 DL:32MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
1988c4|OK | 44MiB/s|onnx/decoder_with_past_model.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data successfully.
Start downloading onnx/tokenizer.json.
[#300f33 13MiB/13MiB(94%) CN:2 DL:4.3MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
300f33|OK | 4.1MiB/s|onnx/tokenizer.json
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json successfully.
Start downloading pytorch_model.bin.
[#0ad147 1.0GiB/1.0GiB(99%) CN:2 DL:22MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
0ad147|OK | 24MiB/s|./pytorch_model.bin
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin successfully.
Start downloading tokenizer.json.
[#283f59 7.5MiB/13MiB(54%) CN:4 DL:3.7MiB ETA:1s]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
283f59|OK | 5.5MiB/s|./tokenizer.json
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json successfully.
Download completed successfully.
2. huggingface-cli + hf_transfer
huggingface-cli and hf_transfer form the toolchain Hugging Face officially provides specifically for downloads: the former is a command-line tool, the latter a download-acceleration module.
2.1 huggingface-cli (Not Recommended)
huggingface-cli is part of the huggingface_hub library; besides downloading models and datasets, it can also log in to Hugging Face, upload models and datasets, and more.
pip install -U huggingface_hub
# Download a model (with resume support)
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m
# Download a dataset
huggingface-cli download --resume-download --repo-type dataset lavita/medical-qa-shared-task-v1-toy
Note the optional --local-dir-use-symlinks False parameter. By default, the Hugging Face toolchain stores downloads via symbolic links, so the directory given by --local-dir ends up containing only link files while the real model data lives under ~/.cache/huggingface. If you dislike this behavior, disable it with --local-dir-use-symlinks False; the sketch below shows how to tell the two layouts apart.
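A minimal Python sketch for inspecting a download directory; "bloom-560m" is the --local-dir from the example above (note that newer huggingface_hub releases deprecate this flag and write real files into local_dir by default):
import os

# Distinguish symlinked downloads (pointing into ~/.cache/huggingface)
# from real files on disk.
for name in os.listdir("bloom-560m"):
    path = os.path.join("bloom-560m", name)
    if os.path.islink(path):
        print(f"{name}: symlink -> {os.path.realpath(path)}")
    else:
        print(f"{name}: regular file")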
2.2 hf_transfer (Recommended)
hf_transfer plugs into and is compatible with huggingface-cli. It is a Rust-based module developed by Hugging Face specifically to speed up downloads; once enabled, it can saturate the bandwidth of a well-connected machine.
# Install hf_transfer
pip install -U hf-transfer
# Set the environment variable
export HF_HUB_ENABLE_HF_TRANSFER=1
# Once enabled, usage is the same as plain huggingface-cli
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m
3. The snapshot_download Method (Recommended)
Hugging Face officially provides the huggingface_hub.snapshot_download method for downloading a complete model. It supports resumable downloads, multiple threads, custom paths, proxy configuration, excluding specific files, and more, and is the recommended approach.
3.1 Downloading an Entire Repository
import time
from huggingface_hub import snapshot_download
import huggingface_hub
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# Replace "HF_TOKEN" below with a token from https://huggingface.co/settings/tokens
huggingface_hub.login("HF_TOKEN")
model_id = "LinkSoul/Chinese-Llama-2-7b"
local_dir = '/root/home'
cache_dir = local_dir + "/cache"
while True:
    try:
        snapshot_download(cache_dir=cache_dir,
                          local_dir=local_dir,
                          repo_id=model_id,
                          local_dir_use_symlinks=False,
                          resume_download=True,
                          allow_patterns=["*.model", "*.json", "*.bin",
                                          "*.py", "*.md", "*.txt"],
                          ignore_patterns=["*.safetensors", "*.msgpack",
                                           "*.h5", "*.ot"],
                          proxies={"https": "http://localhost:7777"},
                          max_workers=8,
                          etag_timeout=100
                          )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print("Download complete")
        break
Notes:
- allow_patterns: the file types to download.
- ignore_patterns: the file types to ignore.
- resume_download=True: allow resuming an interrupted download.
- etag_timeout=100: the timeout threshold in seconds (default 10); adjust it to your network conditions.
3.2 Downloading a Single Large Model File
Occasionally we only need a single model file rather than the entire repository; in that case, use the hf_hub_download() method.
import time
from huggingface_hub import hf_hub_download
model_id = "LinkSoul/Chinese-Llama-2-7b" # 模型id
local_dir = '/root/home'
cache_dir = local_dir + "/cache"
filename= "pytorch_model-00001-of-00003.bin"
while True:
    try:
        hf_hub_download(cache_dir=cache_dir,
                        local_dir=local_dir,
                        repo_id=model_id,
                        filename=filename,
                        local_dir_use_symlinks=False,
                        resume_download=True,
                        etag_timeout=100
                        )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print("Download complete")
        break
You can also download with urllib (a resumable variant follows the snippet):
import os
import urllib.request
def download_file(file_link, filename):
    # Check whether the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")
# Downloading a model file from Hugging Face
file_url = "https://huggingface.co/LinkSoul/Chinese-Llama-2-7b/resolve/main/pytorch_model-00001-of-00003.bin"
filename = "pytorch_model-00001-of-00003.bin"
download_file(file_url, filename)
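Note that urlretrieve cannot resume a partial download. If resuming matters, here is a minimal sketch of a Range-based variant (assumes the server supports HTTP Range requests, which Hugging Face's file endpoints do; the function name is illustrative):
import os
import urllib.error
import urllib.request

def download_file_resumable(file_link, filename):
    # Resume from however many bytes are already on disk.
    existing = os.path.getsize(filename) if os.path.isfile(filename) else 0
    req = urllib.request.Request(file_link,
                                 headers={"Range": f"bytes={existing}-"})
    try:
        with urllib.request.urlopen(req) as resp, open(filename, "ab") as f:
            while chunk := resp.read(1 << 20):  # read 1 MiB at a time
                f.write(chunk)
    except urllib.error.HTTPError as e:
        if e.code == 416:  # range not satisfiable: file already complete
            print("File already fully downloaded.")
        else:
            raise
    else:
        print("File downloaded successfully.")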
4. The git clone Method
Public models can be downloaded either from the ModelScope community or from Hugging Face. Take google-bert/bert-base-chinese as an example.
# First clone the basic files, skipping the large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-bert/bert-base-chinese
# Then download each large LFS file individually
wget https://huggingface.co/google-bert/bert-base-chinese/resolve/main/pytorch_model.bin
5. Summary
- On Linux/macOS/Windows, huggingface-cli is the default recommendation; when your connection to the outside network is good (little packet loss), try huggingface-cli + hf_transfer (optional).
- When the connection is poor, first run GIT_LFS_SKIP_SMUDGE=1 git clone, then fetch the large files with a mature third-party multi-threaded downloader: the hfd script + aria2c on Linux, IDM on Windows. The benefit of a third-party tool is that a download of hundreds of GB of models or datasets can be left running overnight and be finished by morning, instead of being found the next day broken at 10% and needing to be resumed.
- For the occasional small file, just visit the mirror site and download it in the browser.
IV. FAQ
Q: Authorization failed
Cause: unknown.
Solution: download the model from the ModelScope community instead.
Q: requests.exceptions.ProxyError
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='cdn-lfs.hf-mirror.com', port=443): Max retries exceeded with url:
...
(Caused by ProxyError('Cannot connect to proxy.', TimeoutError('timed out')))"), '(Request ID: ad23a5b4-0fea-4a93-8c35-ce9ece121f1e)')
Cause: the mirror endpoint was not set.
Solution: set the mirror endpoint.
# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"