Accelerated download methods for Hugging Face and ModelScope models/datasets

Note: this article is compiled from materials found online and only records the author's process of learning these topics; contact me for removal in case of infringement.

I. References

[Hands-on tutorial] Downloading large Hugging Face models on a Linux system

How to quickly download Hugging Face models: a summary of all methods

[The road to AI] Elegantly solving large-model downloads with huggingface_hub

II. Environment setup

# Install git-lfs
sudo apt-get install git-lfs
git lfs install

# Install huggingface_hub
pip install -U huggingface_hub
# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"
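Before going further, it can help to check that the required command-line tools are installed. A minimal sketch (the function name missing_tools is my own; the tool list mirrors what the hfd script later checks):

```python
import shutil

def missing_tools(tools=("git", "git-lfs", "aria2c", "curl")):
    """Return the subset of the given tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

missing = missing_tools()
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required tools are installed.")
```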

III. Downloading Hugging Face models

Download files from the Hub

1. hfd (recommended)

Huggingface Model Downloader

hfd is a command-line script, built on Git and aria2, dedicated to downloading from Hugging Face.

How the dedicated multi-threaded downloader hfd works:

  • Step 1: git clone everything in the repository except the LFS files, and automatically collect the URLs of the LFS files;
  • Step 2: download those files with aria2 using multiple threads.
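The per-file URLs fetched in Step 2 follow the Hub's resolve pattern. A small illustration of how such a URL is assembled (build_lfs_url is a name of my own; the endpoint/repo/revision layout matches the aria2c commands shown later in the download log):

```python
def build_lfs_url(endpoint: str, repo_id: str, filename: str,
                  revision: str = "main") -> str:
    """Assemble the direct download URL for one file in a Hub repo."""
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"

print(build_lfs_url("https://hf-mirror.com", "bigscience/bloom-560m",
                    "pytorch_model.bin"))
# → https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin
```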

1.1 Create the script

Create a file named hfd.sh:

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]    

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}" 
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" "$LOCAL_DIR" && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}" 
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"
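The --include/--exclude options above rely on bash glob matching ([[ "$file" == $PATTERN ]]). Python's fnmatch module behaves much the same way, which is convenient for previewing what a pattern will select (a standalone sketch, not part of the script):

```python
from fnmatch import fnmatch

files = ["model.safetensors", "pytorch_model.bin", "vae/config.json"]

# --exclude *.safetensors: keep the files that do NOT match the glob
kept = [f for f in files if not fnmatch(f, "*.safetensors")]
print(kept)      # ['pytorch_model.bin', 'vae/config.json']

# --include vae/*: keep only the files that match
included = [f for f in files if fnmatch(f, "vae/*")]
print(included)  # ['vae/config.json']
```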

1.2 Prerequisites

Be sure to install aria2 before running the script.

1.3 The hfd command

# Make the script executable
chmod a+x hfd.sh

# Create an alias for the command
alias hfd="$PWD/hfd.sh"

# Show the help
hfd -h
root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/360Downloads# hfd -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

1.4 Download a model

View your access tokens at: https://huggingface.co/settings/tokens

# Download without a token
hfd bigscience/bloom-560m --tool aria2c -x 12

# Exclude files from the download
hfd bigscience/bloom-560m --exclude *.safetensors

# Download with a token
hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN

If aria2c downloads slowly, try setting a mirror endpoint or switching to the wget download tool:

# Use wget as the download tool
hfd bigscience/bloom-560m --tool wget

An example of downloading bigscience/bloom-560m:

root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models# hfd bigscience/bloom-560m
Downloading to bloom-560m
Testing GIT_REFS_URL: https://hf-mirror.com/bigscience/bloom-560m/info/refs?service=git-upload-pack
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/bigscience/bloom-560m bloom-560m
Cloning into 'bloom-560m'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 111 (delta 55), reused 111 (delta 55), pack-reused 0 (from 0)
Receiving objects: 100% (111/111), 28.49 KiB | 1.58 MiB/s, done.
Resolving deltas: 100% (55/55), done.

Start Downloading lfs files, bash script:
cd bloom-560m
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors" -d "." -o "model.safetensors"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data" -d "onnx" -o "decoder_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data" -d "onnx" -o "decoder_model_merged.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data" -d "onnx" -o "decoder_with_past_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json" -d "onnx" -o "tokenizer.json"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json" -d "." -o "tokenizer.json"
Start downloading flax_model.msgpack.
[#33d36a 1.0GiB/1.0GiB(100%) CN:1]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
33d36a|OK  |    47MiB/s|./flax_model.msgpack

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack successfully.
Start downloading model.safetensors.
[#0ff500 1.0GiB/1.0GiB(99%) CN:4 DL:43MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
0ff500|OK  |    44MiB/s|./model.safetensors

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors successfully.
Start downloading onnx/decoder_model.onnx.
[#75f561 32KiB/691KiB(4%) CN:1 DL:141KiB ETA:4s]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
75f561|OK  |   1.1MiB/s|onnx/decoder_model.onnx

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx successfully.
Start downloading onnx/decoder_model.onnx_data.
 *** Download Progress Summary as of Tue Aug  6 17:46:12 2024 ***
======================================================================================================
[#3b16c1 2.4GiB/3.0GiB(80%) CN:4 DL:42MiB ETA:14s]
FILE: onnx/decoder_model.onnx_data
------------------------------------------------------------------------------------------------------

[#3b16c1 3.0GiB/3.0GiB(99%) CN:2 DL:35MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
3b16c1|OK  |    42MiB/s|onnx/decoder_model.onnx_data

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data successfully.
Start downloading onnx/decoder_model_merged.onnx.
[#dd17b7 0B/0B CN:1 DL:0B]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
dd17b7|OK  |   3.0MiB/s|onnx/decoder_model_merged.onnx

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx successfully.
Start downloading onnx/decoder_model_merged.onnx_data.
[#1b793b 3.0GiB/3.0GiB(99%) CN:3 DL:52MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1b793b|OK  |    52MiB/s|onnx/decoder_model_merged.onnx_data

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data successfully.
Start downloading onnx/decoder_with_past_model.onnx.
[#c371d9 0B/0B CN:1 DL:0B]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
c371d9|OK  |   1.0MiB/s|onnx/decoder_with_past_model.onnx

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx successfully.
Start downloading onnx/decoder_with_past_model.onnx_data.
 *** Download Progress Summary as of Tue Aug  6 17:48:32 2024 ***
======================================================================================================
[#1988c4 2.6GiB/3.0GiB(85%) CN:4 DL:48MiB ETA:9s]
FILE: onnx/decoder_with_past_model.onnx_data
------------------------------------------------------------------------------------------------------

[#1988c4 3.0GiB/3.0GiB(99%) CN:1 DL:32MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
1988c4|OK  |    44MiB/s|onnx/decoder_with_past_model.onnx_data

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data successfully.
Start downloading onnx/tokenizer.json.
[#300f33 13MiB/13MiB(94%) CN:2 DL:4.3MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
300f33|OK  |   4.1MiB/s|onnx/tokenizer.json

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json successfully.
Start downloading pytorch_model.bin.
[#0ad147 1.0GiB/1.0GiB(99%) CN:2 DL:22MiB]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
0ad147|OK  |    24MiB/s|./pytorch_model.bin

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin successfully.
Start downloading tokenizer.json.
[#283f59 7.5MiB/13MiB(54%) CN:4 DL:3.7MiB ETA:1s]
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
283f59|OK  |   5.5MiB/s|./tokenizer.json

Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json successfully.
Download completed successfully.

2. huggingface-cli + hf_transfer

huggingface-cli and hf_transfer are the tool chain Hugging Face officially provides for downloads: the former is a command-line tool, the latter a download-acceleration module.

2.1 huggingface-cli (not recommended)

huggingface-cli belongs to the huggingface_hub library; besides downloading models and data, it can also log in to Hugging Face, upload models and datasets, and more.

pip install -U huggingface_hub
# Download a model (with resume support)
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m

# Download a dataset
huggingface-cli download --resume-download --repo-type dataset lavita/medical-qa-shared-task-v1-toy

Note the optional --local-dir-use-symlinks False parameter: by default the Hugging Face tool chain stores downloads via symbolic links, so the directory given by --local-dir contains only "link files" while the real model is stored under ~/.cache/huggingface. If you dislike this behavior, pass --local-dir-use-symlinks False to disable it.
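For reference, the real files behind those symlinks live in per-repo folders under ~/.cache/huggingface/hub, named after the repo id. A sketch of that naming scheme (cache_folder_name is my own helper mirroring huggingface_hub's documented cache layout, not a library API):

```python
import os

def cache_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Folder name used in the Hugging Face hub cache for one repo."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")

print(os.path.join("~/.cache/huggingface/hub",
                   cache_folder_name("bigscience/bloom-560m")))
# → ~/.cache/huggingface/hub/models--bigscience--bloom-560m
```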

2.2 hf_transfer (recommended)

hf_transfer plugs into and is compatible with huggingface-cli; it is a module Hugging Face officially developed in Rust specifically to speed up downloads, and on a machine with ample bandwidth it can saturate the link once enabled.

# Install hf_transfer
pip install -U hf-transfer

# Set the environment variable
export HF_HUB_ENABLE_HF_TRANSFER=1

# Once enabled, usage is the same as huggingface-cli
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m

3. snapshot_download (recommended)

snapshot_download

Download files from the Hub

Hugging Face officially provides the huggingface_hub.snapshot_download method to download a complete model. It supports resumable downloads, multi-threading, custom paths, proxy configuration, excluding specific files, and more; recommended.

3.1 Download an entire repository

import time
import os
import huggingface_hub
from huggingface_hub import snapshot_download

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Get your token from https://huggingface.co/settings/tokens
huggingface_hub.login("HF_TOKEN")

model_id = "LinkSoul/Chinese-Llama-2-7b"
local_dir = '/root/home'
cache_dir = local_dir + "/cache"

while True:
    try:
        snapshot_download(
            cache_dir=cache_dir,
            local_dir=local_dir,
            repo_id=model_id,
            local_dir_use_symlinks=False,
            resume_download=True,
            allow_patterns=["*.model", "*.json", "*.bin",
                            "*.py", "*.md", "*.txt"],
            ignore_patterns=["*.safetensors", "*.msgpack",
                            "*.h5", "*.ot"],
            proxies={"https": "http://localhost:7777"},
            max_workers=8,
            etag_timeout=100,
        )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print('Download complete')
        break

Notes

  • allow_patterns: the file types to download.
  • ignore_patterns: the file types to skip.
  • resume_download=True: allow resuming interrupted downloads.
  • etag_timeout=100: the timeout threshold in seconds (default 10); adjust it for your network conditions.
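The bare while True loop above retries immediately and forever on every exception. A gentler pattern (my own sketch; do_download stands in for the snapshot_download call) caps the number of attempts and waits between them:

```python
import time

def download_with_retry(do_download, max_retries=5, base_delay=5):
    """Call do_download(), retrying with a growing delay on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            do_download()
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt == max_retries:
                raise
            time.sleep(base_delay * attempt)
        else:
            print("Download complete")
            return
```

Used, for example, as download_with_retry(lambda: snapshot_download(repo_id=model_id, local_dir=local_dir)).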

3.2 Download a single large model file

Occasionally we only need a single model file rather than the whole repository; in that case use the hf_hub_download() method.

import time
from huggingface_hub import hf_hub_download

model_id = "LinkSoul/Chinese-Llama-2-7b"  # model id
local_dir = '/root/home'
cache_dir = local_dir + "/cache"
filename = "pytorch_model-00001-of-00003.bin"

while True:
    try:
        hf_hub_download(
            cache_dir=cache_dir,
            local_dir=local_dir,
            repo_id=model_id,
            filename=filename,
            local_dir_use_symlinks=False,
            resume_download=True,
            etag_timeout=100,
        )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print('Download complete')
        break

You can also download with urllib:

import os
import urllib.request


def download_file(file_link, filename):
    # Check whether the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")

# Download a model file from Hugging Face
file_url = "https://huggingface.co/LinkSoul/Chinese-Llama-2-7b/resolve/main/pytorch_model-00001-of-00003.bin"
filename = "pytorch_model-00001-of-00003.bin"

download_file(file_url, filename)
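This urlretrieve pattern can be exercised entirely offline with a file:// URL, which urlretrieve also accepts (purely a local demonstration; real downloads use the https URL shown above):

```python
import pathlib
import tempfile
import urllib.request

# Write a small local "remote" file, then fetch it via urlretrieve,
# just as download_file does for an https link.
src = pathlib.Path(tempfile.mkdtemp()) / "weights.bin"
src.write_bytes(b"fake model weights")

dst = str(src) + ".downloaded"
urllib.request.urlretrieve(src.as_uri(), dst)

print(open(dst, "rb").read())  # b'fake model weights'
```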

4. git clone

Public models can be downloaded either from the ModelScope community or from Hugging Face. Take google-bert/bert-base-chinese as an example.

# First fetch the small files, skipping the large LFS files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-bert/bert-base-chinese

# Then download the LFS files one by one
sudo wget https://huggingface.co/google-bert/bert-base-chinese/resolve/main/pytorch_model.bin


5. Summary

  • On Linux/macOS/Windows, huggingface-cli is the default recommendation; when your connection to the outside network is good (little packet loss), try huggingface-cli + hf_transfer (optional).
  • With a poor connection, first run GIT_LFS_SKIP_SMUDGE=1 git clone, then fetch the large files with a mature third-party multi-threaded downloader: the hfd script + aria2c on Linux, IDM on Windows. The benefit of a third-party tool is that a download of hundreds of GB of models or datasets can be left running overnight and be finished the next morning, instead of being found dead at 10% and needing to be resumed by hand.
  • For the occasional small file, just open the mirror site in a browser and download it directly.

IV. How to download gated models

1. Overview

Due to publishers' licensing requirements, some models cannot be downloaded publicly; you must apply for access on Hugging Face and be approved before you can download them. Such models are called gated models.

The basic steps for downloading a gated model are:

  1. Apply for access.
  2. Obtain an access token (used by the command line and Python methods).
  3. Download the model.

2. Apply for access

Note that you must register and log in on the official Hugging Face site to apply; for network-security reasons, mirror sites generally do not support this.

Approval usually takes anywhere from a few minutes to a few days (typically just minutes), and you will be notified of the result by email.


3. Get an access token

Access tokens are managed at: https://huggingface.co/settings/tokens

On the token management page of your Hugging Face settings, create a New token; Read permission is sufficient. Once created, it can be used when calling the tools.

After your application is approved, the model files appear under Files and versions on the model page; in a browser you can simply click to download them. To download with a tool such as huggingface-cli, however, you need the access token.


4. Log in to your Hugging Face account

The following commands are the recommended way to log in to your Hugging Face account.

pip install --upgrade huggingface_hub
huggingface-cli login
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login

    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful

5. Download the model

HTTPS:

git clone https://<hf_username>:<hf_token>@huggingface.co/meta-llama/Llama-2-7b-chat-hf

huggingface-cli: add the --token parameter

huggingface-cli download --token hf_xxx --resume-download bigscience/bloom-560m --local-dir bloom-560m

curl, wget: add the token in a header

curl -L --header "Authorization: Bearer hf_xxx" -o model-00001-of-00002.safetensors https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
wget --header "Authorization: Bearer hf_xxx" https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
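The same Authorization header can be set from Python's standard library as well; a sketch (hf_xxx is a placeholder token, as in the commands above):

```python
import urllib.request

url = ("https://huggingface.co/meta-llama/Llama-2-7b-chat-hf"
       "/resolve/main/model-00001-of-00002.safetensors")
req = urllib.request.Request(url, headers={"Authorization": "Bearer hf_xxx"})

print(req.get_header("Authorization"))  # Bearer hf_xxx

# To actually fetch the file (urlretrieve cannot send headers):
# import shutil
# with urllib.request.urlopen(req) as r, open("model.safetensors", "wb") as f:
#     shutil.copyfileobj(r, f)
```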

snapshot_download: call the login method

import huggingface_hub
huggingface_hub.login("hf_xxx")

6. Download from ModelScope

LLM-Research/Meta-Llama-3-8B-Instruct

For gated models, downloading from the ModelScope community is recommended. Take LLM-Research/Meta-Llama-3-8B-Instruct as an example.

git clone https://www.modelscope.cn/llm-research/meta-llama-3-8b-instruct.git


V. FAQ

Q: Authorization failed

Cause: unknown.

Solution: download the model from the ModelScope community instead.

Q:requests.exceptions.ProxyError

requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='cdn-lfs.hf-mirror.com', port=443): Max retries exceeded with url: 
...
(Caused by ProxyError('Cannot connect to proxy.', TimeoutError('timed out')))"), '(Request ID: ad23a5b4-0fea-4a93-8c35-ce9ece121f1e)')

Cause: the mirror endpoint is not set.

Solution: set the mirror endpoint.

# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"