Note: this article is compiled from material found online and only documents the author's process of learning these topics; contact the author to request removal of any infringing content.
I. References
[Hands-on tutorial] Downloading large Hugging Face models on a Linux system
[The road to AI] Using huggingface_hub to elegantly solve the Hugging Face model download problem
II. How to Download Gated Models?
1. Overview
Because of licensing requirements imposed by model publishers, some models are not publicly downloadable: you must request access on Hugging Face and be approved before you can download them. Such models are called gated models.
The basic steps for downloading a gated model are:
- Apply for access.
- Obtain an access token (needed for command-line and Python downloads).
- Download the model.
2. Applying for Access
Note that you must register and log in on the official Hugging Face website to apply for access; for network-security reasons, mirror sites generally do not support this. Approval takes anywhere from a few minutes to a few days (usually just a few minutes), and the result is sent to you by email.
3. Obtaining an Access Token
Access tokens are created at: https://huggingface.co/settings/tokens
On the token management page of your Hugging Face settings, create a New token; Read permission is sufficient. Once created, the token can be used whenever a tool needs to authenticate.
After your access request is approved, the model files become visible under Files and versions on the model page, and in a browser you can simply click to download them. To download with a tool such as huggingface-cli, however, you need an access token, which can either be registered via login or passed directly to API calls, as sketched below.
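For example, with the huggingface_hub Python API the token can be passed per call instead of logging in globally. A minimal sketch; the repo and filename are illustrative, and hf_xxx is a placeholder for your actual Read token:
from huggingface_hub import hf_hub_download

# Download one file from a gated repo by passing the token explicitly.
# "hf_xxx" is a placeholder; use a token with Read permission.
path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",  # a gated repo you have access to
    filename="config.json",
    token="hf_xxx",
)
print(path)  # local path of the cached file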
4. Logging In to Your Hugging Face Account
The recommended way to log in to your Hugging Face account is with the following commands:
pip install --upgrade huggingface_hub
huggingface-cli login
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login
_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|
_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|
_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|
_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.
git config --global credential.helper store
Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
5. Downloading the Model
HTTPS: embed your username and access token in the clone URL
git clone https://<hf_username>:<hf_token>@huggingface.co/meta-llama/Llama-2-7b-chat-hf
huggingface-cli: add the --token parameter
huggingface-cli download --token hf_xxx --resume-download bigscience/bloom-560m --local-dir bloom-560m
curl, wget: pass the token in the Authorization header
curl -L --header "Authorization: Bearer hf_xxx" -o model-00001-of-00002.safetensors https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
wget --header "Authorization: Bearer hf_xxx" https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/model-00001-of-00002.safetensors
snapshot_download: call the login method first (a complete download example follows the snippet below)
import huggingface_hub
huggingface_hub.login("hf_xxx")
6. Downloading from ModelScope
For gated models, downloading from the ModelScope community is recommended. Take LLM-Research/Meta-Llama-3-8B-Instruct as an example.
git clone https://www.modelscope.cn/llm-research/meta-llama-3-8b-instruct.git
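Alternatively, ModelScope ships a Python SDK with its own snapshot_download. A minimal sketch, assuming the modelscope package is installed (pip install modelscope); the cache_dir is illustrative:
from modelscope import snapshot_download

# Downloads the full repo; by default files land under ~/.cache/modelscope.
model_dir = snapshot_download("LLM-Research/Meta-Llama-3-8B-Instruct",
                              cache_dir="./models")
print(model_dir)  # local directory containing the model files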
III. Downloading Hugging Face Models
0. Environment Setup
# Install git-lfs
sudo apt-get install git-lfs
git lfs install
# Install huggingface_hub
pip install -U huggingface_hub
# Install aria2
sudo apt install aria2
# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"
1. The hfd Method (Recommended)
hfd is a command-line script, built on Git and aria2, dedicated to downloading from Hugging Face.
How this dedicated multi-threaded downloader works:
- Step 1: git clone everything in the repository except the LFS files, automatically collecting the URLs of the LFS files;
- Step 2: download those files with aria2 using multiple threads.
1.1 Creating the Script
Create the hfd.sh file:
#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT
display_help() {
cat << EOF
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify a string pattern to include files for downloading.
--exclude (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
exit 1
}
MODEL_ID=$1
shift
# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://huggingface.co"}
while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2 ;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2 ;;
        --hf_username) HF_USERNAME="$2"; shift 2 ;;
        --hf_token) HF_TOKEN="$2"; shift 2 ;;
        --tool) TOOL="$2"; shift 2 ;;
        -x) THREADS="$2"; shift 2 ;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2 ;;
        *) shift ;;
    esac
done
# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}
# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}
[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs
[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help
if [[ -z "$LOCAL_DIR" ]]; then
LOCAL_DIR="${MODEL_ID#*/}"
fi
if [[ "$DATASET" == 1 ]]; then
MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"
if [ -d "$LOCAL_DIR/.git" ]; then
printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
REPO_URL="$HF_ENDPOINT/$MODEL_ID"
GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
if [ "$response" == "401" ] || [ "$response" == "403" ]; then
if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
exit 1
fi
REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
elif [ "$response" != "200" ]; then
printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
fi
echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"
GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }
ensure_ownership
while IFS= read -r file; do
truncate -s 0 "$file"
done <<< $(git lfs ls-files | cut -d ' ' -f 3-)
fi
printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls
while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"
for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done
printf "${GREEN}Download completed successfully.\n${NC}"
1.2 Environment Setup
Before running the script, be sure to install git-lfs and aria2; see the Environment Setup section above.
1.3 The hfd Command
# Make the script executable
chmod a+x hfd.sh
# Create the hfd command as an alias
alias hfd="$PWD/hfd.sh"
# Show usage
hfd -h
root@notebook-1813389960667746306-scnlbe5oi5-79366:/public/home/scnlbe5oi5/360Downloads# hfd -h
Usage:
hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]
Description:
Downloads a model or dataset from Hugging Face using the provided repo ID.
Parameters:
repo_id The Hugging Face repo ID in the format 'org/repo_name'.
--include (Optional) Flag to specify a string pattern to include files for downloading.
--exclude (Optional) Flag to specify a string pattern to exclude files from downloading.
include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
--hf_username (Optional) Hugging Face username for authentication. **NOT EMAIL**.
--hf_token (Optional) Hugging Face token for authentication.
--tool (Optional) Download tool to use. Can be aria2c (default) or wget.
-x (Optional) Number of download threads for aria2c. Defaults to 4.
--dataset (Optional) Flag to indicate downloading a dataset.
--local-dir (Optional) Local directory path where the model or dataset will be stored.
Example:
hfd bigscience/bloom-560m --exclude *.safetensors
hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
hfd lavita/medical-qa-shared-task-v1-toy --dataset
1.4 Downloading Models
View your access tokens at: https://huggingface.co/settings/tokens
# Download without a token
hfd bigscience/bloom-560m --tool aria2c -x 12
# Exclude certain files from the download
hfd bigscience/bloom-560m --exclude *.safetensors
# Download with a token
hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
If aria2c downloads slowly, try setting the mirror endpoint or switching to the wget download tool:
# Use wget as the download tool (-x only applies to aria2c, so it is omitted here)
hfd bigscience/bloom-560m --tool wget
An example of downloading bigscience/bloom-560m follows:
root@notebook-1813389960667746306-scnlbe5oi5-12495:/public/home/scnlbe5oi5/Downloads/models# hfd bigscience/bloom-560m
Downloading to bloom-560m
Testing GIT_REFS_URL: https://hf-mirror.com/bigscience/bloom-560m/info/refs?service=git-upload-pack
GIT_LFS_SKIP_SMUDGE=1 git clone https://hf-mirror.com/bigscience/bloom-560m bloom-560m
Cloning into 'bloom-560m'...
remote: Enumerating objects: 111, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 111 (delta 55), reused 111 (delta 55), pack-reused 0 (from 0)
Receiving objects: 100% (111/111), 28.49 KiB | 1.58 MiB/s, done.
Resolving deltas: 100% (55/55), done.
Start Downloading lfs files, bash script:
cd bloom-560m
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack" -d "." -o "flax_model.msgpack"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors" -d "." -o "model.safetensors"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx" -d "onnx" -o "decoder_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data" -d "onnx" -o "decoder_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx" -d "onnx" -o "decoder_model_merged.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data" -d "onnx" -o "decoder_model_merged.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx" -d "onnx" -o "decoder_with_past_model.onnx"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data" -d "onnx" -o "decoder_with_past_model.onnx_data"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json" -d "onnx" -o "tokenizer.json"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin" -d "." -o "pytorch_model.bin"
aria2c --console-log-level=error --file-allocation=none -x 4 -s 4 -k 1M -c "https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json" -d "." -o "tokenizer.json"
Start downloading flax_model.msgpack.
[#33d36a 1.0GiB/1.0GiB(100%) CN:1]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
33d36a|OK | 47MiB/s|./flax_model.msgpack
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/flax_model.msgpack successfully.
Start downloading model.safetensors.
[#0ff500 1.0GiB/1.0GiB(99%) CN:4 DL:43MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
0ff500|OK | 44MiB/s|./model.safetensors
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/model.safetensors successfully.
Start downloading onnx/decoder_model.onnx.
[#75f561 32KiB/691KiB(4%) CN:1 DL:141KiB ETA:4s]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
75f561|OK | 1.1MiB/s|onnx/decoder_model.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx successfully.
Start downloading onnx/decoder_model.onnx_data.
*** Download Progress Summary as of Tue Aug 6 17:46:12 2024 ***
======================================================================================================
[#3b16c1 2.4GiB/3.0GiB(80%) CN:4 DL:42MiB ETA:14s]
FILE: onnx/decoder_model.onnx_data
------------------------------------------------------------------------------------------------------
[#3b16c1 3.0GiB/3.0GiB(99%) CN:2 DL:35MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
3b16c1|OK | 42MiB/s|onnx/decoder_model.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx_data successfully.
Start downloading onnx/decoder_model_merged.onnx.
[#dd17b7 0B/0B CN:1 DL:0B]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
dd17b7|OK | 3.0MiB/s|onnx/decoder_model_merged.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx successfully.
Start downloading onnx/decoder_model_merged.onnx_data.
[#1b793b 3.0GiB/3.0GiB(99%) CN:3 DL:52MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
1b793b|OK | 52MiB/s|onnx/decoder_model_merged.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_model_merged.onnx_data successfully.
Start downloading onnx/decoder_with_past_model.onnx.
[#c371d9 0B/0B CN:1 DL:0B]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
c371d9|OK | 1.0MiB/s|onnx/decoder_with_past_model.onnx
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx successfully.
Start downloading onnx/decoder_with_past_model.onnx_data.
*** Download Progress Summary as of Tue Aug 6 17:48:32 2024 ***
======================================================================================================
[#1988c4 2.6GiB/3.0GiB(85%) CN:4 DL:48MiB ETA:9s]
FILE: onnx/decoder_with_past_model.onnx_data
------------------------------------------------------------------------------------------------------
[#1988c4 3.0GiB/3.0GiB(99%) CN:1 DL:32MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
1988c4|OK | 44MiB/s|onnx/decoder_with_past_model.onnx_data
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/decoder_with_past_model.onnx_data successfully.
Start downloading onnx/tokenizer.json.
[#300f33 13MiB/13MiB(94%) CN:2 DL:4.3MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
300f33|OK | 4.1MiB/s|onnx/tokenizer.json
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/onnx/tokenizer.json successfully.
Start downloading pytorch_model.bin.
[#0ad147 1.0GiB/1.0GiB(99%) CN:2 DL:22MiB]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
0ad147|OK | 24MiB/s|./pytorch_model.bin
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/pytorch_model.bin successfully.
Start downloading tokenizer.json.
[#283f59 7.5MiB/13MiB(54%) CN:4 DL:3.7MiB ETA:1s]
Download Results:
gid |stat|avg speed |path/URI
======+====+===========+=======================================================
283f59|OK | 5.5MiB/s|./tokenizer.json
Status Legend:
(OK):download completed.
Downloaded https://hf-mirror.com/bigscience/bloom-560m/resolve/main/tokenizer.json successfully.
Download completed successfully.
2. huggingface-cli + hf_transfer
huggingface-cli and hf_transfer form the toolchain Hugging Face officially provides specifically for downloads: the former is a command-line tool, the latter a download-acceleration module.
2.1 huggingface-cli (Not Recommended)
huggingface-cli is part of the huggingface_hub library; besides downloading models and datasets, it can also log in to Hugging Face, upload models and datasets, and more.
pip install -U huggingface_hub
# Download a model (with resume support)
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m
# Download a dataset
huggingface-cli download --resume-download --repo-type dataset lavita/medical-qa-shared-task-v1-toy
Note the optional --local-dir-use-symlinks False parameter. By default, the Hugging Face toolchain stores downloads via symbolic links, so the directory given by --local-dir ends up containing only link files while the real model data lives under ~/.cache/huggingface. If you dislike this behavior, disable it with --local-dir-use-symlinks False; the sketch below shows how to tell the two layouts apart.
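A minimal Python sketch for inspecting a download directory; "bloom-560m" is the --local-dir from the example above (note that newer huggingface_hub releases deprecate this flag and write real files into local_dir by default):
import os

# Distinguish symlinked downloads (pointing into ~/.cache/huggingface)
# from real files on disk.
for name in os.listdir("bloom-560m"):
    path = os.path.join("bloom-560m", name)
    if os.path.islink(path):
        print(f"{name}: symlink -> {os.path.realpath(path)}")
    else:
        print(f"{name}: regular file")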
2.2 hf_transfer (Recommended)
hf_transfer plugs into and is compatible with huggingface-cli. It is a Rust-based module developed by Hugging Face specifically to speed up downloads; once enabled, it can saturate the bandwidth of a well-connected machine.
# Install hf_transfer
pip install -U hf-transfer
# Set the environment variable
export HF_HUB_ENABLE_HF_TRANSFER=1
# Once enabled, usage is the same as plain huggingface-cli
huggingface-cli download --resume-download bigscience/bloom-560m --local-dir bloom-560m
3. The snapshot_download Method (Recommended)
Hugging Face officially provides the huggingface_hub.snapshot_download method for downloading a complete model. It supports resumable downloads, multiple threads, custom paths, proxy configuration, excluding specific files, and more, and is the recommended approach.
3.1 Downloading an Entire Repository
import time
from huggingface_hub import snapshot_download
import huggingface_hub
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
# Replace "HF_TOKEN" below with a token from https://huggingface.co/settings/tokens
huggingface_hub.login("HF_TOKEN")
model_id = "LinkSoul/Chinese-Llama-2-7b"
local_dir = '/root/home'
cache_dir = local_dir + "/cache"
while True:
    try:
        snapshot_download(cache_dir=cache_dir,
                          local_dir=local_dir,
                          repo_id=model_id,
                          local_dir_use_symlinks=False,
                          resume_download=True,
                          allow_patterns=["*.model", "*.json", "*.bin",
                                          "*.py", "*.md", "*.txt"],
                          ignore_patterns=["*.safetensors", "*.msgpack",
                                           "*.h5", "*.ot"],
                          proxies={"https": "http://localhost:7777"},
                          max_workers=8,
                          etag_timeout=100
                          )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print("Download complete")
        break
Notes:
- allow_patterns: the file types to download.
- ignore_patterns: the file types to ignore.
- resume_download=True: allow resuming an interrupted download.
- etag_timeout=100: the timeout threshold in seconds (default 10); adjust it to your network conditions.
3.2 Downloading a Single Large Model File
Occasionally we only need a single model file rather than the entire repository; in that case, use the hf_hub_download() method.
import time
from huggingface_hub import hf_hub_download
model_id = "LinkSoul/Chinese-Llama-2-7b" # 模型id
local_dir = '/root/home'
cache_dir = local_dir + "/cache"
filename= "pytorch_model-00001-of-00003.bin"
while True:
    try:
        hf_hub_download(cache_dir=cache_dir,
                        local_dir=local_dir,
                        repo_id=model_id,
                        filename=filename,
                        local_dir_use_symlinks=False,
                        resume_download=True,
                        etag_timeout=100
                        )
    except Exception as e:
        print(e)
        # time.sleep(5)
    else:
        print("Download complete")
        break
You can also download with urllib (a resumable variant follows the snippet):
import os
import urllib.request
def download_file(file_link, filename):
    # Check whether the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")
# Downloading a model file from Hugging Face
file_url = "https://huggingface.co/LinkSoul/Chinese-Llama-2-7b/resolve/main/pytorch_model-00001-of-00003.bin"
filename = "pytorch_model-00001-of-00003.bin"
download_file(file_url, filename)
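Note that urlretrieve cannot resume a partial download. If resuming matters, here is a minimal sketch of a Range-based variant (assumes the server supports HTTP Range requests, which Hugging Face's file endpoints do; the function name is illustrative):
import os
import urllib.error
import urllib.request

def download_file_resumable(file_link, filename):
    # Resume from however many bytes are already on disk.
    existing = os.path.getsize(filename) if os.path.isfile(filename) else 0
    req = urllib.request.Request(file_link,
                                 headers={"Range": f"bytes={existing}-"})
    try:
        with urllib.request.urlopen(req) as resp, open(filename, "ab") as f:
            while chunk := resp.read(1 << 20):  # read 1 MiB at a time
                f.write(chunk)
    except urllib.error.HTTPError as e:
        if e.code == 416:  # range not satisfiable: file already complete
            print("File already fully downloaded.")
        else:
            raise
    else:
        print("File downloaded successfully.")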
4. The git clone Method
Public models can be downloaded either from the ModelScope community or from Hugging Face. Take google-bert/bert-base-chinese as an example.
# First clone the basic files, skipping the large files
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/google-bert/bert-base-chinese
# Then download each large LFS file individually
wget https://huggingface.co/google-bert/bert-base-chinese/resolve/main/pytorch_model.bin
5. Summary
- On Linux/macOS/Windows, huggingface-cli is the default recommendation; when your connection to the outside network is good (little packet loss), try huggingface-cli + hf_transfer (optional).
- When the connection is poor, first run GIT_LFS_SKIP_SMUDGE=1 git clone, then fetch the large files with a mature third-party multi-threaded downloader: the hfd script + aria2c on Linux, IDM on Windows. The benefit of a third-party tool is that a download of hundreds of GB of models or datasets can be left running overnight and be finished by morning, instead of being found the next day broken at 10% and needing to be resumed.
- For the occasional small file, just visit the mirror site and download it in the browser.
IV. FAQ
Q: Authorization failed
Cause: unknown.
Solution: download the model from the ModelScope community instead.
Q: requests.exceptions.ProxyError
requests.exceptions.ProxyError: (MaxRetryError("HTTPSConnectionPool(host='cdn-lfs.hf-mirror.com', port=443): Max retries exceeded with url:
...
(Caused by ProxyError('Cannot connect to proxy.', TimeoutError('timed out')))"), '(Request ID: ad23a5b4-0fea-4a93-8c35-ce9ece121f1e)')
Cause: the mirror endpoint was not set.
Solution: set the mirror endpoint.
# Set the mirror endpoint
export HF_ENDPOINT="https://hf-mirror.com"