microsoft/BitNet: extremely fast CPU inference, faster than GPU and NPU setups, with lower energy consumption


AI learning & discussion QQ group: 873673497
Website: turingevo.com
Email: wmx@turingevo.com
GitHub: https://github.com/turingevo
Hugging Face: https://huggingface.co/turingevo
Date        Paper / Release
10/21/2024  1-bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs
10/17/2024  bitnet.cpp 1.0 released
03/21/2024  The Era of 1-bit LLMs: Training Tips, Code, FAQ
02/27/2024  The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
10/17/2023  BitNet: Scaling 1-bit Transformers for Large Language Models

The key is the b1.58 quantization: weights are constrained to the ternary values {-1, 0, +1}, so floating-point matrix multiplications collapse into integer additions and subtractions, which is both faster and more energy efficient on CPUs.
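A tiny worked example (my own illustration, not taken from the papers above): for one output element with ternary weights $w = (+1, -1, 0)$ and activations $x = (x_1, x_2, x_3)$,

$$ y = w \cdot x = (+1)x_1 + (-1)x_2 + 0 \cdot x_3 = x_1 - x_2 $$

so the dot product needs only additions and subtractions plus one shared scaling factor, and no per-weight floating-point multiply.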

System

OS: Ubuntu 22.04
CPU: x86_64

1 Download the source code

Source repository: https://github.com/microsoft/BitNet

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

2 Install dependencies

Using conda:

conda create -p ./venv python=3.9
conda activate ./venv
pip install -r requirements.txt

3 Download a model

Model                       Parameters  CPU   Kernel
bitnet_b1_58-large          0.7B        x86   I2_S, TL2
                                        ARM   I2_S, TL1
bitnet_b1_58-3B             3.3B        x86   I2_S, TL2
                                        ARM   I2_S, TL1
Llama3-8B-1.58-100B-tokens  8.0B        x86   I2_S, TL2
                                        ARM   I2_S, TL1

Here the Llama3-8B-1.58-100B-tokens model is downloaded to /media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens.
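If you do not have the weights locally yet, they can be fetched with huggingface-cli. This is a sketch; it assumes the Hugging Face repo id HF1BitLLM/Llama3-8B-1.58-100B-tokens and the cache directory used in this post:

# download the 1.58-bit Llama3-8B weights into the local cache directory
huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens \
    --local-dir /media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens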

4 Download and configure the compiler

LLVM releases: https://github.com/llvm/llvm-project/releases
I downloaded LLVM-19.1.3-Linux-X64 and unpacked it to /media/wmx/ws1/software/LLVM-19.1.3-Linux-X64.

Add the clang compiler paths to BitNet/CMakeLists.txt:

set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED true)
set(THREADS_PREFER_PTHREAD_FLAG ON)
# add the following below the 4 lines above:

###------------------- wmx add llvm clang
set(CMAKE_CXX_COMPILER /media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/bin/clang++)
set(CMAKE_C_COMPILER /media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/bin/clang)

include_directories(/media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/include)
link_directories(
    /media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/lib 
    /media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/lib/x86_64-unknown-linux-gnu/ 
)

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libc++")
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -stdlib=libc++ -lc++ -lc++abi")
###------------------- wmx add llvm clang
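Note that the snippet above does two things: it selects the clang/clang++ binaries, and it points the build at LLVM's headers and libc++. If only the compiler selection is needed, an alternative sketch (not tested in this setup) is to rely on CMake reading the CC and CXX environment variables on a fresh configure before running setup_env.py:

# point CMake at the LLVM toolchain via environment variables (compiler selection only;
# the include/link directories and -stdlib=libc++ flags above would still need to be set)
export CC=/media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/bin/clang
export CXX=/media/wmx/ws1/software/LLVM-19.1.3-Linux-X64/bin/clang++
# CC/CXX are only honored when a build directory is configured for the first time,
# so remove any existing build/ directory before re-running setup_env.py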

5 Add the script wmx_test.sh

BitNet/wmx_test.sh


# activate the conda environment
conda activate ./venv

export LLVM_BASE=/media/wmx/ws1/software/LLVM-19.1.3-Linux-X64
export LD_LIBRARY_PATH=$LLVM_BASE/lib:$LLVM_BASE/lib/x86_64-unknown-linux-gnu:$LD_LIBRARY_PATH

# /media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens
# /media/wmx/soft1/huggingface_cache/bitnet_b1_58-large
raw_model=/media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens

# build the project & convert/quantize the model
setup_env(){
    echo "setup_env"
    python setup_env.py -md $raw_model -q i2_s
}



# ------------------ run_inference ----------------------------------

# NOTE: setup_env above converts $raw_model; this path points at a previously
# converted bitnet_b1_58-large GGUF (use $raw_model/ggml-model-i2_s.gguf to run that one)
model_path=/media/wmx/soft1/huggingface_cache/bitnet_b1_58-large/ggml-model-i2_s.gguf

p1="Daniel went back to the the the garden. \
    Mary travelled to the kitchen. Sandra journeyed to the kitchen.\
    Sandra went to the hallway. John went to the bedroom. Mary went back to the garden.\
    Where is Mary?\nAnswer:" 

p2="Introduce Mr. Trump"
run()
{
    echo "run_inference prompt is :\n $1 \n "
    python run_inference.py -m  $model_path \
    -n 6 \
    -temp 0 \
    -p "$1"
}

### usage :
# source wmx_test.sh

# setup_env

# run "$p1"
# run "$p2"

Make the script executable:

chmod a+x wmx_test.sh

6 Build the project

Running setup_env does three things:
1 build the project
2 convert the model to GGUF
3 quantize the model to i2_s

source wmx_test.sh
setup_env

Output:

(base) wmx@wmx-ubuntu:/media/wmx/soft1/AI-model/BitNet$ source wmx_test.sh 
(/media/wmx/soft1/AI-model/BitNet/venv) wmx@wmx-ubuntu:/media/wmx/soft1/AI-model/BitNet$ setup_env
setup_env
INFO:root:Compiling the code using CMake.
INFO:root:Loading model from directory /media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens.
INFO:root:GGUF model save at /media/wmx/soft1/huggingface_cache/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf

7 Inference

source wmx_test.sh
run "$p1"

Output:

.................................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   288.00 MiB
llama_new_context_with_model: KV self size  =  288.00 MiB, K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.00 MiB
llama_new_context_with_model: graph nodes  = 870
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 2

system_info: n_threads = 2 (n_threads_batch = 2) / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 

sampler seed: 4294967295
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy 
generate: n_ctx = 2048, n_batch = 1, n_predict = 6, n_keep = 1

 Daniel went back to the the the garden.     Mary travelled to the kitchen. Sandra journeyed to the kitchen.    Sandra went to the hallway. John went to the bedroom. Mary went back to the garden.    Where is Mary?
Answer: The garden.
Daniel went

llama_perf_sampler_print:    sampling time =       0.07 ms /    63 runs   (    0.00 ms per token, 913043.48 tokens per second)
llama_perf_context_print:        load time =     117.87 ms
llama_perf_context_print: prompt eval time =     564.99 ms /    57 tokens (    9.91 ms per token,   100.89 tokens per second)
llama_perf_context_print:        eval time =      48.44 ms /     5 runs   (    9.69 ms per token,   103.22 tokens per second)
llama_perf_context_print:       total time =     614.03 ms /    62 tokens
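Note that the log above reports n_threads = 2 out of 32 available cores, so this run only uses a small fraction of the CPU. run_inference.py accepts a thread count; a hedged example (assuming the -t/--threads option exposed by BitNet's run_inference.py):

# run the same prompt with 16 threads and a longer completion
python run_inference.py -m $model_path -p "$p1" -n 128 -temp 0 -t 16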